[Dovecot] Degeneration of CPU Performance
Hello everybody!
We have an interesting issue with Dovecot's behavior here. First, the scenario: we have 2 servers running with the same load, one with our old (out-of-date) POP3 solution and the other with Dovecot. We noticed that Dovecot is consuming more CPU, and that this consumption grows day by day. When we start Dovecot, it runs at 40%-45% CPU while our old solution runs at 30%-35%. This is quite acceptable, so no problem here. The problem is that one day later Dovecot jumps to 45%-55% CPU, while the old POP3 solution stays at the same level as the day before (30%-35%). I attached a graph with this information.
In this graph the green line is dovecot CPU consumption and the blue line is the old solution.
In this graph we restarted Dovecot on Monday morning. The restart shows up as a big drop in CPU consumption (green line). After that, it stays between 40% and 45%; one day later it runs at 45%-55%, and the next day it reaches 50%-60%! You can see that the blue line (old POP3 solution) keeps the same behavior throughout, so both servers have the same load. We have other monitoring data that confirms it.
I looked for bug reports about this, but I didn't find anything. The question is: is there any Dovecot problem or misconfiguration that could cause this degeneration of CPU performance? Any suggestions about what the cause could be?
I attached the dovecot -n output. Two important pieces of information: we store the mailboxes on NFS and we are only using the POP3 protocol.
Thanks in advance!
-- Thiago Monaco Papageorgiou <thiago.monaco@corp.terra.com.br>
Terra Networks Brasil S/A Tel: +55 (51) 3284-4274
# 1.1.6: /usr/local/dovecot/etc/dovecot.conf
# OS: Linux 2.6.9-78.0.1.ELsmp i686 CentOS release 4.7 (Final)
syslog_facility: local2
protocols: pop3
listen: *:1110
ssl_disable: yes
disable_plaintext_auth: no
login_dir: /usr/local/dovecot/var/run/dovecot/login
login_executable: /usr/local/dovecot/libexec/dovecot/pop3-login
login_user: nobody
login_log_format_elements: user=<%u> idperm=<%i> command=%t method=%m rip=%r lip=%l %c
login_process_per_connection: no
login_process_size: 32
login_processes_count: 100
login_max_processes_count: 500
login_max_connections: 100
mail_uid: popmgr
mail_gid: popmgr
mail_cache_min_mail_count: 100
mmap_disable: yes
mail_nfs_storage: yes
mail_nfs_index: yes
lock_method: dotlock
maildir_copy_preserve_filename: yes
mail_executable: /usr/local/dovecot/libexec/dovecot/pop3
mail_plugins: hashdir
mail_plugin_dir: /usr/local/dovecot/lib/dovecot/pop3
mail_log_prefix: %Us: user %u (%e):
pop3_uidl_format: %f
pop3_client_workarounds: outlook-no-nuls oe-ns-eoh
auth default:
  mechanisms: plain trrproxy_v1
  worker_max_count: 64
  passdb:
    driver: trrpam
    args: extra_fields=short_name,quota_caixa,mail_folders_limit,id_perm,mail_imap %s_dovecot
  userdb:
    driver: prefetch
plugin:
  quota: trrquota
  uidlist_create_form: 1
Thiago Monaco Papageorgiou wrote:
We have 2 servers running with the same load, one with our old (out-of-date) POP3 solution and the other with Dovecot. We noticed that Dovecot is consuming more CPU, and that this consumption grows day by day. When we start Dovecot, it runs at 40%-45% CPU while our old solution runs at 30%-35%. This is quite acceptable, so no problem here. The problem is that one day later Dovecot jumps to 45%-55% CPU, while the old POP3 solution stays at the same level as the day before (30%-35%). I attached a graph with this information.
Are your users leaving mail on the server? If not, you may find it advantageous to disable the indexing, since it's of no real use for "drive by collect" mail.
In this graph the green line is dovecot CPU consumption and the blue line is the old solution.
In this graph we restarted Dovecot on Monday morning. The restart shows up as a big drop in CPU consumption (green line). After that, it stays between 40% and 45%; one day later it runs at 45%-55%, and the next day it reaches 50%-60%! You can see that the blue line (old POP3 solution) keeps the same behavior throughout, so both servers have the same load. We have other monitoring data that confirms it.
This, to me, is consistent with Dovecot spending a lot of time indexing the overnight deliveries as everyone logs on in the morning.
Are you using dovecot deliver as your LDA?
Is there any Dovecot problem or misconfiguration that could cause this degeneration of CPU performance? Any suggestions about what the cause could be?
It could be that Dovecot is doing work you don't need it to. By which I mean building indices.
I attached the dovecot -n output. Two important pieces of information: we store the mailboxes on NFS and we are only using the POP3 protocol.
As I said above, if your users are using this service ONLY to collect mail, not to store it, then the indexes Dovecot tries to maintain are a waste of effort.
You can read more about POP3 configuration in the wiki at: http://wiki.dovecot.org/POP3Server
The MailLocation page also has some notes about index file placement, including:
""" If you really want to, you can also disable the index files completely by appending :INDEX=MEMORY. """
-- Curtis Maloney cmaloney@cardgate.net
Hi Curtis, thanks for your reply. My answers are below:
Curtis Maloney wrote:
Are your users leaving mail on the server? If not, you may find it advantageous to disable the indexing, since it's of no real use for "drive by collect" mail.
Yes, we have some users who leave their messages on the server (webmail users), and we have users who download messages to local clients before erasing them.
This, to me, is consistent with Dovecot spending a lot of time indexing the overnight deliveries as everyone logs on in the morning.
Are you using dovecot deliver as your LDA?
We don't use dovecot LDA solution.
As I said above, if your users are using this service ONLY to collect mail, not to store it, then the indexes Dovecot tries to maintain are a waste of effort.
You can read more about POP3 configuration in the wiki at: http://wiki.dovecot.org/POP3Server
The MailLocation page also has some notes about index file placement, including:
""" If you really want to, you can also disable the index files completely by appending :INDEX=MEMORY. """
We are already doing that. All our mailbox locations already have this string appended. We have already tried with and without indexes, and also with and without the cache, but the performance degeneration still happens.
We have no relevant issues with memory, I/O, or network consumption. Everything seems to be fine (we monitor these and more).
We don't mind if Dovecot uses more CPU than our current solution; Dovecot is more reliable and secure. The problem is that performance gets worse every day. We can't restart Dovecot every 2 days because it is consuming too much CPU.
Thanks again for the help.
-- Thiago Monaco Papageorgiou <thiago.monaco@corp.terra.com.br>
Terra Networks Brasil S/A Tel: +55 (51) 3284-4274
On Thu, 7 May 2009, Thiago Monaco Papageorgiou wrote:
I looked for some bugs about this, but I didn't find anything. The question
v1.1.6 is pretty old, though.
I attached the dovecot -n output. Two important pieces of information: we store the mailboxes on NFS and we are only using the POP3 protocol.
What's hashdir? Maybe this is a performance killer?
Did you check things like memory consumption, and whether there are processes that consume lots of CPU, e.g. via "top" or a similar command? Maybe your server spends the time on the network?
Bye,
Steffen Kaiser
On 5/8/2009, Steffen Kaiser (skdovecot@smail.inf.fh-brs.de) wrote:
I looked for some bugs about this, but I didn't find anything. The question
v1.1.6 is pretty old, though.
Yeah, and a lot of the fixes since then were NFS related...
I'd upgrade and see if it fixes it (can't hurt in any case)...
--
Best regards,
Charles
Hi Steffen, thanks for your reply. My answers are below:
Steffen Kaiser wrote:
v1.1.6 is pretty old, though.
Right! Updating is my next step.
What's hashdir? Maybe this is a performance killer?
It runs inside the mail process, which is killed at the end of the session, so I don't think it could cause the performance degeneration.
Did you check things like memory consumption, and whether there are processes that consume lots of CPU, e.g. via "top" or a similar command? Maybe your server spends the time on the network?
We monitor memory, I/O, and network consumption, and everything seems to be fine.
This is a snapshot of top:

%CPU %MEM     TIME+  COMMAND
60.4  1.8 495:12.52  dovecot
 4.2  0.6   0:27.53  dovecot-auth
 1.0  0.0   2:04.48  pop3-login <- lots of

After reboot:

%CPU %MEM     TIME+  COMMAND
 4.9  2.8   2:20.13  dovecot-auth
 4.3  0.1   2:16.56  dovecot
 1.0  0.0   0:00.03  pop3
This is a rough way to measure the consumption, but I think it is still valid since the difference is so big. I will take some better measurements of CPU usage and send them here soon.
I was wondering whether there is some kind of structure that Dovecot keeps in memory: it would have a small growth rate and not make much difference to memory consumption, but would make the dovecot process do a lot more work each day. Maybe a structure that needs updating after each access...
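For the better CPU measurements mentioned above, one option is to sample the per-process counters that Linux exposes in /proc/<pid>/stat instead of reading top by eye. The following is only a minimal Python sketch, assuming the standard Linux layout in which utime and stime are fields 14 and 15, counted in clock ticks. The sample lines, PID 4242, and the tick rate of 100 are made up for illustration.

```python
import os


def parse_cpu_ticks(stat_text):
    """Return (utime, stime) in clock ticks from the text of /proc/<pid>/stat."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split on the last ')' before tokenizing the remaining fields.
    rest = stat_text.rsplit(")", 1)[1].split()
    # 'rest' starts at field 3 (state), so utime (14) and stime (15)
    # sit at indexes 11 and 12.
    return int(rest[11]), int(rest[12])


def cpu_percent(before, after, interval_s, ticks_per_s=None):
    """Convert two (utime, stime) samples into user% and system% of one CPU."""
    if ticks_per_s is None:
        ticks_per_s = os.sysconf("SC_CLK_TCK")
    scale = 100.0 / (ticks_per_s * interval_s)
    return (after[0] - before[0]) * scale, (after[1] - before[1]) * scale


# Hypothetical samples taken one second apart (PID and counters are made up).
first = "4242 (dovecot) S 1 4242 4242 0 -1 4194560 120 0 0 0 1000 2000 0 0 15 0 1 0 100 0 0"
second = "4242 (dovecot) S 1 4242 4242 0 -1 4194560 120 0 0 0 1010 2050 0 0 15 0 1 0 100 0 0"

user_pct, system_pct = cpu_percent(
    parse_cpu_ticks(first), parse_cpu_ticks(second), interval_s=1.0, ticks_per_s=100
)
print(f"user {user_pct:.1f}%  system {system_pct:.1f}%")
```

On a live system the two samples would come from reading /proc/<pid>/stat for the dovecot master process twice, some seconds apart. Logging the result periodically would show whether the day-by-day growth is in user or system time.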
Thanks everyone! I really appreciate your help!
-- Thiago Monaco Papageorgiou <thiago.monaco@corp.terra.com.br>
Terra Networks Brasil S/A Tel: +55 (51) 3284-4274
On Fri, 2009-05-08 at 17:11 -0300, Thiago Monaco Papageorgiou wrote:
This is a snapshot of top:
%CPU %MEM     TIME+  COMMAND
60.4  1.8 495:12.52  dovecot
Oh, the dovecot master process is eating the CPU. That's interesting. This is the first time I've heard of it doing that. What does it show when you run strace -tt -p <pid> on it for a second or so?
login_process_per_connection: no
login_process_size: 32
login_processes_count: 100
login_max_connections: 100
I don't think you should have that many login processes. login_processes_count should be about the same as the number of CPUs/cores you have. Rather, increase login_max_connections if you need to.
login_process_per_connection: no
login_process_size: 32
login_processes_count: 100
login_max_connections: 100
See http://wiki.dovecot.org/LoginProcess
"The maximum number of users logging in at the same time (+ SSL/TLS proxying connections) is login_max_connections * login_max_processes_count."
Is it possible you are hitting the "max num of users" boundary? Dovecot also needs one login process for the lifetime of each SSL/TLS connection.
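To put numbers on the wiki formula quoted above, here is a small Python sketch that plugs in the values from the dovecot -n output posted earlier in the thread. The "initial capacity" figure is only one reading of the settings (login_processes_count processes, each accepting up to login_max_connections connections), not something the wiki states.

```python
# Values copied from the dovecot -n output posted earlier in the thread.
login_processes_count = 100      # login processes started up front
login_max_processes_count = 500  # upper bound on login processes
login_max_connections = 100      # connections handled per login process


def max_simultaneous_logins(max_connections, max_processes):
    """Wiki formula: connections per login process times number of processes."""
    return max_connections * max_processes


# Hard ceiling according to the wiki formula.
ceiling = max_simultaneous_logins(login_max_connections, login_max_processes_count)
# Assumed capacity of the processes started at launch (interpretation, see above).
initial = max_simultaneous_logins(login_max_connections, login_processes_count)

print(f"ceiling: {ceiling} simultaneous logins")
print(f"capacity of the initially started login processes: {initial}")
```

Comparing these figures with the peak number of concurrent logins from the monitoring data would show whether the limit is anywhere close to being reached.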
Bye,
Steffen Kaiser
On Thu, 2009-05-07 at 15:40 -0300, Thiago Monaco Papageorgiou wrote:
Hello everybody!
We have an interesting issue with Dovecot's behavior here. First, the scenario: we have 2 servers running with the same load, one with our old (out-of-date) POP3 solution and the other with Dovecot. We noticed that Dovecot is consuming more CPU, and that this consumption grows day by day. When we start Dovecot, it runs at 40%-45% CPU while our old solution runs at 30%-35%. This is quite acceptable, so no problem here. The problem is that one day later Dovecot jumps to 45%-55% CPU, while the old POP3 solution stays at the same level as the day before (30%-35%). I attached a graph with this information.
I see you're using NFS and Linux. We've seen something similar.
Try to find out where this CPU time is being spent: in the kernel, or in userland. 'top' will tell you; just start it and look at the second or third line, where it says 'CPU'. 'us' is user time, 'sy' is system time, and 'si' is software interrupt time. The latter two are time spent in the kernel.
If all CPU is used by 'us' then it's really dovecot that is eating cycles. If it is 'sy' or 'si' it's the kernel.
In that case, you might want to upgrade the kernel. Upgrade to 2.6.27.10 at least, 2.6.27.latest preferably; many NFS bugs have been fixed there. I have no idea whether there is a drop-in CentOS kernel >= 2.6.27.10, so you might have to compile your own kernel.
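The same user/kernel split that top shows can also be computed from two samples of the aggregate "cpu" line in /proc/stat, which makes it easy to log over several days. This is a minimal Python sketch, assuming the usual column order (user, nice, system, idle, iowait, irq, softirq, steal); older kernels report fewer columns, and the sketch treats the missing ones as zero. The two sample lines are invented for illustration.

```python
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Older kernels report fewer columns; count the missing ones as zero.
    values += [0] * (len(CPU_FIELDS) - len(values))
    return dict(zip(CPU_FIELDS, values))


def cpu_split(before, after):
    """Return the us/sy/si percentages between two 'cpu' line samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = {name: b[name] - a[name] for name in CPU_FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {
        "us": 100.0 * delta["user"] / total,
        "sy": 100.0 * delta["system"] / total,
        "si": 100.0 * delta["softirq"] / total,
    }


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only; not called in this demo)."""
    with open(path) as handle:
        return handle.readline()


# Invented samples: 200 user, 400 system, 300 idle, 100 softirq ticks elapsed.
sample_before = "cpu 1000 0 500 8000 100 50 350"
sample_after = "cpu 1200 0 900 8300 100 50 450"
split = cpu_split(sample_before, sample_after)
print(split)
```

If 'sy' plus 'si' grows from day to day while 'us' stays flat, that would point at the kernel (for example the NFS client) rather than at Dovecot itself.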
Mike.
participants (6)
- Charles Marcus
- Curtis Maloney
- Miquel van Smoorenburg
- Steffen Kaiser
- Thiago Monaco Papageorgiou
- Timo Sirainen