[Dovecot] director lmtp -> smtp problem
Hi Timo & Dovecot users,
We have a 2-node director setup that front-ends 4 backend nodes sharing a clustered filesystem (GFS). All nodes run Dovecot 2.0.18. We have approximately 40k users, but typically only a few thousand are active at any time.
The director nodes run sendmail, which delivers mail "locally" over LMTP to the director, which in turn relays it over SMTP to the real (backend) servers, which also run sendmail. Why sendmail? Because procmail is used for mail filtering and as the delivery agent.
Here's the problem, on the director:
Mar 14 20:40:08 imapdir2 dovecot: lmtp(10692): Connect from local
Mar 14 20:40:38 imapdir2 dovecot: lmtp(10692): Panic: file lmtp-proxy.c: line 376 (lmtp_proxy_output_timeout): assertion failed: (proxy->data_input->eof)
Mar 14 20:40:38 imapdir2 dovecot: lmtp(10692): Error: Raw backtrace: /usr/lib/dovecot/libdovecot.so.0(+0x3d99a) [0x7f79156c499a] -> /usr/lib/dovecot/libdovecot.so.0(+0x3d9e6) [0x7f79156c49e6] -> /usr/lib/dovecot/libdovecot.so.0(i_error+0) [0x7f791569df8f] -> dovecot/lmtp() [0x406e77] -> /usr/lib/dovecot/libdovecot.so.0(io_loop_handle_timeouts+0xd4) [0x7f79156d0044] -> /usr/lib/dovecot/libdovecot.so.0(io_loop_handler_run+0x5b) [0x7f79156d0c3b] -> /usr/lib/dovecot/libdovecot.so.0(io_loop_run+0x28) [0x7f79156cfca8] -> /usr/lib/dovecot/libdovecot.so.0(master_service_run+0x13) [0x7f79156bdfc3] -> dovecot/lmtp(main+0x154) [0x403f84] -> /lib64/libc.so.6(__libc_start_main+0xfd) [0x7f7914ef8cdd] -> dovecot/lmtp() [0x403d69]
Mar 14 20:40:38 imapdir2 sendmail[6905]: q2D8KodI018432: SYSERR(root): timeout writing message to localhost: Broken pipe
Most mail goes through OK, but some messages do not and end up queued until they run into the queue time limit.
As far as I can tell, every affected message fails when the following conversation takes place between sendmail (on the director), the Dovecot LMTP proxy, and sendmail on the backend node (SMTP):
(names mangled to protect the guilty)
(first, sendmail -> director LMTP)
[root@imapdir2 ~]# sendmail -qIq2EFZt1p004708 -v
Running /var/spool/mqueue/qd2/q2EFZt1p004708 (sequence 1 of 1)
<ntssdfwe@pobox2.uvm.edu>... Connecting to /var/lib/dovecot/lmtp-socket via cyrusv2...
220 imapdir2.uvm.edu Dovecot LMTP ready
LHLO imapdir2.uvm.edu
250-imapdir2.uvm.edu
250-8BITMIME
250-ENHANCEDSTATUSCODES
250 PIPELINING
MAIL From:<owner-dasfqeiuasd*Abrasf*-Gorpwe**UVM*-EDU@LIST.UVM.EDU>
250 2.1.0 OK
RCPT To:<ntssdfwe>
DATA
250 2.1.5 OK
354 OK
timeout writing message to localhost: Broken pipe
<ntssdfwe@pobox2.uvm.edu>... Deferred
Closing connection to localhost
The conversation between the director (LMTP) and the backend (sendmail SMTP) goes like this:
250-penguinc.uvm.edu Hello imapdir2.uvm.edu [132.198.100.150], pleased to meet you
250-ENHANCEDSTATUSCODES
250-PIPELINING
250-8BITMIME
250-SIZE 10485760
250-ETRN
250-AUTH DIGEST-MD5 CRAM-MD5
250-DELIVERBY
250 HELP
MAIL FROM:<owner-dasfqeiuasd*Abrasf*-Gorpwe**UVM*-EDU@LIST.UVM.EDU>
250 2.1.0 <owner-dasfqeiuasd*Abrasf*-Gorpwe**UVM*-EDU@LIST.UVM.EDU>... Sender ok
RCPT TO:<ntssdfwe>
552 5.2.2 User ntssdfwe mailbox is full
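At this point Dovecot should pass the failed RCPT TO status back to sendmail over LMTP, but instead it sits there (waiting for a timeout to expire?) and eventually panics; in the log above, the panic comes exactly 30 seconds after the connection was opened (20:40:08 to 20:40:38).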
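For reference, RFC 2033 (LMTP) has the server send, after the terminating "." of DATA, exactly one reply per recipient that was accepted at RCPT time, in RCPT order. Since the proxy had already answered "250 2.1.5 OK" and "354 OK" before the backend refused the recipient, one would expect it to read the message through to the end and then report the backend's 552 for that recipient. The Python sketch below models only that reply bookkeeping; the function name, the dict layout and the 451 fallback text are hypothetical, and it is not how lmtp-proxy.c is actually structured.

```python
from typing import Dict, List, Tuple

# Hypothetical fallback for a recipient the backend never answered for
# (wording invented for this sketch, not taken from Dovecot).
NO_BACKEND_REPLY = (451, "4.4.0 No reply from backend")


def lmtp_data_replies(
    accepted: List[str],
    backend_status: Dict[str, Tuple[int, str]],
) -> List[str]:
    """Build the replies an LMTP server owes its client after DATA's final '.'.

    accepted: recipients that got a 2xx reply to RCPT TO, in the order sent.
    backend_status: final (code, text) reported by the backend per recipient.
    """
    replies = []
    for rcpt in accepted:
        # RFC 2033: one reply per accepted recipient, in RCPT order, even if
        # the backend rejected it after the proxy had already said 250.
        code, text = backend_status.get(rcpt, NO_BACKEND_REPLY)
        replies.append(f"{code} {text}")
    return replies


if __name__ == "__main__":
    # The failing case from the transcript: proxy said 250 to RCPT and 354 to
    # DATA, then the backend refused the recipient with 552.
    print(lmtp_data_replies(
        ["ntssdfwe"],
        {"ntssdfwe": (552, "5.2.2 User ntssdfwe mailbox is full")},
    ))  # -> ['552 5.2.2 User ntssdfwe mailbox is full']
```

With the 552 delivered as the per-recipient reply, sendmail would bounce the message instead of deferring it until the queue time limit.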
doveconf -n output:
# 2.0.18: /etc/dovecot/dovecot.conf
# OS: Linux 2.6.32-220.4.2.el6.x86_64 x86_64 Red Hat Enterprise Linux Server release 6.2 (Santiago)
base_dir = /var/run/dovecot/
default_client_limit = 6000
default_process_limit = 10240
director_mail_servers = penguina.uvm.edu penguinb.uvm.edu penguinc.uvm.edu penguind.uvm.edu
director_servers = imapdir1.uvm.edu imapdir2.uvm.edu
lmtp_proxy = yes
login_trusted_networks = [REDACTED]
passdb {
  args = proxy=y nopassword=y protocol=smtp
  driver = static
}
service anvil {
  client_limit = 40000
}
service auth {
  client_limit = 45960
  unix_listener auth-userdb {
    group = mail
    mode = 0660
    user = dovecot
  }
}
service director {
  fifo_listener login/proxy-notify {
    mode = 0666
  }
  inet_listener {
    port = 9090
  }
  unix_listener director-userdb {
    mode = 0600
  }
  unix_listener login/director {
    mode = 0666
  }
}
service imap-login {
  executable = imap-login director
  service_count = 0
}
service imap {
  process_limit = 10240
  vsz_limit = 1 G
}
service lmtp {
  client_limit = 1
  inet_listener lmtp {
    port = 24
  }
  unix_listener /var/lib/dovecot/lmtp-socket {
    group = root
    mode = 0600
    user = root
  }
}
service pop3-login {
  executable = pop3-login director
  service_count = 0
}
service pop3 {
  process_limit = 5000
}
shutdown_clients = no
ssl_cert = <[REDACTED].pem
ssl_key = <[REDACTED].key
userdb {
  driver = passwd
}
verbose_proctitle = yes
version_ignore = yes
protocol lmtp {
  auth_socket_path = director-userdb
}
protocol imap {
  mail_max_userip_connections = 100
}
Hope you can help,
Jim Lawson
Hi,
On 15.3.2012, at 3.24, Jim Lawson wrote:
> We have a 2-node director setup which front-ends for 4 nodes which share a clustered filesystem (GFS). All nodes run Dovecot 2.0.18.
> ..
> Mar 14 20:40:38 imapdir2 dovecot: lmtp(10692): Panic: file lmtp-proxy.c: line 376 (lmtp_proxy_output_timeout): assertion failed: (proxy->data_input->eof)
I pretty much rewrote the LMTP proxying code in v2.1, so there's a very good chance that it's already been fixed.
On 3/15/12 6:02 AM, Timo Sirainen wrote:
> I pretty much rewrote the LMTP proxying code in v2.1, so there's a very good chance that it's already been fixed.
I'll give it a shot. For the purposes of doing a rolling upgrade, is it reasonable to expect a 2.0.18 director to peer with a 2.1.1 director for the duration, or should I split-brain them during the upgrade?
Jim
On Thu, 2012-03-15 at 07:50 -0400, Jim Lawson wrote:
> I'll give it a shot. For the purposes of doing a rolling upgrade, is it reasonable to expect a 2.0.18 director to peer with a 2.1.1 director for the duration, or should I split-brain them during the upgrade?
I'm almost certain that v2.1.1 speaks a protocol compatible with v2.0. The current hg version has some extra features, but it doesn't use them until all of the directors have been upgraded to the new version.
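The behaviour described here amounts to a feature gate keyed on the oldest director in the ring: newer protocol features stay off while any peer is still on the old version. A minimal Python sketch of that idea follows; the `(major, minor)` version encoding, the `(2, 1)` threshold and the function name are hypothetical and do not describe the director's real handshake.

```python
from typing import Iterable, Tuple

Version = Tuple[int, int]  # (major, minor); hypothetical encoding


def new_features_enabled(ring_versions: Iterable[Version], required: Version) -> bool:
    """Turn on newer director protocol features only when every ring member has them."""
    versions = list(ring_versions)
    if not versions:
        # No peers known: nothing to stay compatible with.
        return True
    # The oldest director in the ring decides what the whole ring may use.
    return min(versions) >= required


if __name__ == "__main__":
    # Mid-upgrade ring (one 2.0 director, one newer): keep the old behaviour.
    print(new_features_enabled([(2, 0), (2, 1)], required=(2, 1)))  # -> False
    # Every director upgraded: the new features may be used.
    print(new_features_enabled([(2, 1), (2, 1)], required=(2, 1)))  # -> True
```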
On 3/15/12 8:25 AM, Timo Sirainen wrote:
> I'm almost certain that v2.1.1 speaks a protocol compatible with v2.0. The current hg version has some extra features, but it doesn't use them until all of the directors have been upgraded to the new version.
Trying with v2.1.2 (peer is v2.0.18):
Mar 15 13:15:53 imapdir2 dovecot: director: Panic: file director.c: line 295 (director_sync): assertion failed: (!dir->ring_synced || (dir->left == NULL && dir->right == NULL))
Mar 15 13:15:53 imapdir2 dovecot: director: Fatal: master: service(director): child 513 killed with signal 6 (core not dumped)
Mar 15 13:15:53 imapdir2 dovecot: director: Error: Director 132.198.100.149:9090/right disconnected
That's OK: I can run them split-brained (with iptables rules that prevent the directors from talking to each other) while I move users around. It will mean poor GFS performance for the duration, but that's better than an outage.
The good news is that the LMTP problem I wrote about above appears to be fixed. Thanks!!!
Jim
On 15.3.2012, at 19.23, Jim Lawson wrote:
>> I'm almost certain that v2.1.1 speaks a protocol compatible with v2.0. The current hg version has some extra features, but it doesn't use them until all of the directors have been upgraded to the new version.
> Trying with v2.1.2 (peer is v2.0.18):
> Mar 15 13:15:53 imapdir2 dovecot: director: Panic: file director.c: line 295 (director_sync): assertion failed: (!dir->ring_synced || (dir->left == NULL && dir->right == NULL))
This points to a more generic problem. How did this happen? You have two directors, stopped & upgraded one, started it up and it crashed?
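The failed assertion reads as a plain boolean precondition on director_sync(): either the ring has not yet been marked synced, or this director has neither a left nor a right neighbour connected. The Python transcription below is only a way to reason about which states trip it; `DirectorState` is a hypothetical stand-in for the C-side `dir` structure, carrying just the three fields the assertion reads, and it says nothing about how the upgraded director got into that state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DirectorState:
    # Hypothetical mirror of the three fields read by the assertion in director.c.
    ring_synced: bool
    left: Optional[str]   # left peer address, or None when not connected
    right: Optional[str]  # right peer address, or None when not connected


def director_sync_precondition(d: DirectorState) -> bool:
    """Python form of: !dir->ring_synced || (dir->left == NULL && dir->right == NULL)."""
    return (not d.ring_synced) or (d.left is None and d.right is None)


if __name__ == "__main__":
    # Ring already marked synced while a right neighbour is connected: fails.
    print(director_sync_precondition(
        DirectorState(True, None, "132.198.100.149:9090")))  # -> False
    # Ring not yet synced: holds regardless of connections.
    print(director_sync_precondition(DirectorState(False, "a", "b")))  # -> True
```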
On 3/15/12 1:52 PM, Timo Sirainen wrote:
> This points to a more generic problem. How did this happen? You have two directors, stopped & upgraded one, started it up and it crashed?
That's correct. The configs are identical on both directors (the same as the one I sent in my original message).
Jim