[Dovecot] lmtp-proxying in 2.1 slower than in 2.0.14 ?
We upgraded our two dovecot directors from v2.0.14 to dovecot-ee 2.1.10.3 this week, and since then mail seems to be flowing a lot slower than before. The backend mailstores are untouched, still on v2.0.14. Since the upgrade we've been hitting process_limit for lmtp a lot, and we're struggling with large queues on the incoming mailservers that use an LMTP virtual transport towards our two directors.
I seem to remember 2.1 has new lmtp-proxying code. Is there anything in it that needs to be tuned differently from v2.0? I'm a bit skeptical about just increasing the process_limit for LMTP proxying, as I doubt running many hundreds of simultaneous deliveries would work that much better against the backend storage.
###### doveconf -n ##########
# 2.1.10.3: /etc/dovecot/dovecot.conf
# OS: Linux 2.6.18-194.32.1.el5 x86_64 Red Hat Enterprise Linux Server release 5.5 (Tikanga)
default_client_limit = 4000
director_mail_servers = 192.168.42.7 192.168.42.8 192.168.42.9 192.168.42.10 192.168.42.28 192.168.42.29
director_servers = 192.168.42.15 192.168.42.17
disable_plaintext_auth = no
listen = *
lmtp_proxy = yes
managesieve_notify_capability = mailto
managesieve_sieve_capability = fileinto reject envelope encoded-character vacation subaddress comparator-i;ascii-numeric relational regex imap4flags copy include variables body enotify environment mailbox date ihave
passdb {
  args = proxy=y nopassword=y
  driver = static
}
protocols = imap pop3 lmtp sieve
service anvil {
  client_limit = 6247
}
service auth {
  client_limit = 8292
  unix_listener auth-userdb {
    user = dovecot
  }
}
service director {
  fifo_listener login/proxy-notify {
    mode = 0666
  }
  inet_listener {
    port = 5515
  }
  unix_listener director-userdb {
    mode = 0600
  }
  unix_listener login/director {
    mode = 0666
  }
}
service imap-login {
  executable = imap-login director
  process_limit = 4096
  process_min_avail = 4
  service_count = 0
  vsz_limit = 256 M
}
service lmtp {
  inet_listener lmtp {
    address = *
    port = 24
  }
  process_limit = 100
}
service managesieve-login {
  executable = managesieve-login director
  inet_listener sieve {
    address = *
    port = 4190
  }
  process_limit = 50
}
service pop3-login {
  executable = pop3-login director
  process_limit = 2048
  process_min_avail = 4
  service_count = 0
  vsz_limit = 256 M
}
ssl_cert =
################
-jf
On 1.2.2013, at 19.00, Jan-Frode Myklebust janfrode@tanso.net wrote:
We upgraded our two dovecot directors from v2.0.14 to dovecot-ee 2.1.10.3 this week, and since then mail seems to be flowing a lot slower than before. The backend mailstores are untouched, still on v2.0.14. Since the upgrade we've been hitting process_limit for lmtp a lot, and we're struggling with large queues on the incoming mailservers that use an LMTP virtual transport towards our two directors.
I seem to remember 2.1 has new lmtp-proxying code. Is there anything in it that needs to be tuned differently from v2.0? I'm a bit skeptical about just increasing the process_limit for LMTP proxying, as I doubt running many hundreds of simultaneous deliveries would work that much better against the backend storage.
Hmm. The main difference is that v2.1 writes temporary files to mail_temp_dir. If that's in tmpfs (and probably even if it isn't), it should still be pretty fast..
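For reference, the setting involved is mail_temp_dir (default /tmp). Pointing it at tmpfs on the director would look something like this; note that /dev/shm as the tmpfs mount point is an assumption, adjust for your system:

# director's dovecot.conf; /dev/shm is an assumed tmpfs mount point
mail_temp_dir = /dev/shm/dovecot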
Have you checked if there's an increase in disk I/O usage, or system cpu usage?
Or actually .. It could simply be that in v2.0.15 service lmtp { client_limit } default was changed to 1 (from default_client_limit=1000). This is important with the backend, because writing to message store can be slow, but proxying should be able to handle more than 1 client per process, even with the new temporary file writing. So you could see if it helps to set lmtp { client_limit = 100 } or something.
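In config terms that suggestion would be something like the following on the directors (a sketch; the exact number is illustrative, not a tested recommendation):

service lmtp {
  # allow each lmtp proxy process to serve many clients instead of 1
  client_limit = 100
}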
On Fri, Feb 1, 2013 at 11:00 PM, Timo Sirainen tss@iki.fi wrote:
On 1.2.2013, at 19.00, Jan-Frode Myklebust janfrode@tanso.net wrote:
Have you checked if there's an increase in disk I/O usage, or system cpu usage?
On the directors, CPU usage and load averages seem to have gone down by about 50% since the upgrade. On the backend mailstores running 2.0.14 I see no effect (but these are quite busy, so less LMTP might just have led to better response from other services).
Or actually .. It could simply be that in v2.0.15 service lmtp { client_limit } default was changed to 1 (from default_client_limit=1000). This is important with the backend, because writing to message store can be slow, but proxying should be able to handle more than 1 client per process, even with the new temporary file writing. So you could see if it helps to set lmtp { client_limit = 100 } or something.
My backend lmtp services are configured with client_limit = 1 and process_limit = 25, and there are 6 backends, i.e. a maximum of 150 backend LMTP processes if LMTP were spread evenly between the backends, which it won't be, since the backends are weighted differently (2x 50, 2x 75 and 2x 100).
I assume each director will proxy at most process_limit*client_limit connections to my backends. Will it be OK to have a much higher process_limit*client_limit on the directors than on the backends? Is it a problem if the directors are configured to seemingly handle a lot more simultaneous connections than the backends can?
-jf
On 2.2.2013, at 12.59, Jan-Frode Myklebust janfrode@tanso.net wrote:
Or actually .. It could simply be that in v2.0.15 service lmtp { client_limit } default was changed to 1 (from default_client_limit=1000). This is important with the backend, because writing to message store can be slow, but proxying should be able to handle more than 1 client per process, even with the new temporary file writing. So you could see if it helps to set lmtp { client_limit = 100 } or something.
My backend lmtp services are configured with client_limit = 1 and process_limit = 25, and there are 6 backends, i.e. a maximum of 150 backend LMTP processes if LMTP were spread evenly between the backends, which it won't be, since the backends are weighted differently (2x 50, 2x 75 and 2x 100).
I assume each director will proxy at most process_limit*client_limit connections to my backends. Will it be OK to have a much higher process_limit*client_limit on the directors than on the backends? Is it a problem if the directors are configured to seemingly handle a lot more simultaneous connections than the backends can?
Best to keep the bottleneck closest to the MTA. If the director can handle more connections than the backends, the MTA is uselessly waiting for the extra LMTP connections to time out. So I'd keep the directors' process_limit*client_limit somewhat close to what the backends can handle (somewhat more is probably OK too). Anyway, if a backend reaches its limit it logs a warning about it and then simply doesn't accept new connections until one of the existing ones finishes.
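To put the numbers from this thread together (a rough sketch, using the limits quoted above and client_limit = 1 on both sides):

backend capacity:  6 backends  x 25 processes  x 1 client = 150 concurrent deliveries
director capacity: 2 directors x 100 processes x 1 client = 200 proxied connections

So at process_limit = 100 the two directors already accept somewhat more connections than the backends will take, which matches the advice above; raising the directors' limits much further would mostly add connections waiting on full backends.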
I think there must be some bug I'm hitting here. One of my directors is still running with "client_limit = 1, process_limit = 100" for the lmtp service, and now it's logging:
master: Warning: service(lmtp): process_limit (100) reached, client connections are being dropped
Checking "sudo netstat -anp|grep ":24 " I see 287 ports in TIME_WAIT, one in CLOSE_WAIT and the listening "0.0.0.0:24". No active connections. There are 100 lmtp-processes running. When trying to connect to the lmtp-port I immediately get dropped:
$ telnet localhost 24
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Connection closed by foreign host.
Is there maybe some counter that's getting out of sync, or some back-off penalty algorithm that kicks in once it first hits the process limit?
-jf
On 5.2.2013, at 11.57, Jan-Frode Myklebust janfrode@tanso.net wrote:
I think there must be some bug I'm hitting here. One of my directors is still running with "client_limit = 1, process_limit = 100" for the lmtp service, and now it's logging:
master: Warning: service(lmtp): process_limit (100) reached, client connections are being dropped
Checking "sudo netstat -anp|grep ":24 " I see 287 ports in TIME_WAIT, one in CLOSE_WAIT and the listening "0.0.0.0:24". No active connections. There are 100 lmtp-processes running.
Sounds like the LMTP processes are hanging for some reason. http://hg.dovecot.org/dovecot-2.1/rev/63117ab893dc might show something interesting, although I'm pretty sure it will just say that the processes are hanging in the DATA command.
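Assuming that patch is applied, the state should then be visible in the process title, e.g. with (same placeholder convention as the commands below):

ps ww -p <pid of lmtp process>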
Other interesting things to check:
gdb -p <pid of lmtp process>
(gdb) bt full
strace -tt -p <pid of lmtp process> (for a few seconds to see if anything is happening)
If the lmtp proxy is hanging, it should have a timeout (30 secs by default) and it should log about it when that triggers. (Although maybe not to the error log.)
When trying to connect to the lmtp-port I immediately get dropped:
$ telnet localhost 24
Trying 127.0.0.1...
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
Connection closed by foreign host.
This happens when the master process notices that all the service processes are full.
Is there maybe some counter that's getting out of sync, or some back-off penalty algorithm that kicks in once it first hits the process limit?
Shouldn't be, but the proctitle patch should make it clearer. Strange anyway, I haven't heard of anything like this happening before.