[Dovecot] imap-login crashes after upgrade to Dovecot 1.2.8 Linux 64-bit
Hi all,
We moved our Dovecot installation in two stages: first to have all folders and inboxes on NetApps via NFS (with indexes kept local), then from a physical 4-way UltraSPARC Solaris 8 server running Dovecot 1.2.5 (32-bit) to a virtualised Ubuntu 8.04 64-bit server with a single virtual processor (in VMware ESX 3.5) running Dovecot 1.2.8 (64-bit).
Since the change, many of our users have experienced random disconnections when proxied in imap-login (either using the internal SSL proxy, or to an external IMAP server) and the error logs contain things like:
Dec 18 00:27:58 imap-login: Info: Disconnected: Connection queue full (auth failed, 1 attempts): user=<user0>, method=PLAIN, rip=<ip0>, lip=134.225.32.156, TLS
Dec 18 00:27:58 imap-login: Panic: file ioloop.c: line 39 (io_add): assertion failed: (fd >= 0)
Dec 18 00:27:58 IMAP 9486 <user1> <ip1> : Info: Connection closed bytes=224/1753
Dec 18 00:27:58 IMAP 10116 <user2> <ip2> : Info: Connection closed bytes=82/707
...
Dec 18 00:27:58 IMAP 10296 <user3> <ip3> : Info: Disconnected in IDLE bytes=44/435
Dec 18 00:27:58 IMAP 10298 <user3> <ip3> : Info: Disconnected in IDLE bytes=263/1662
Dec 18 00:27:58 dovecot: Error: child 5124 (login) killed with signal 6 (core dumped) (latest ip=<ip0>)
Dec 18 00:27:58 imap-login: Info: Aborted login (no auth attempts): rip=<ip0>, lip=134.225.32.156, TLS: Disconnected
and a backtrace gives:
(gdb) bt
#0  0x00007ff1030ad095 in raise () from /lib/libc.so.6
#1  0x00007ff1030aeaf0 in abort () from /lib/libc.so.6
#2  0x00000000004112d5 in default_fatal_finish (type=<value optimized out>, status=0) at failures.c:160
#3  0x0000000000411333 in i_internal_fatal_handler (type=LOG_TYPE_PANIC, status=0, fmt=<value optimized out>, args=<value optimized out>) at failures.c:443
#4  0x0000000000410996 in i_panic (format=0x6 <Address 0x6 out of bounds>) at failures.c:207
#5  0x000000000041420e in io_add (fd=-1, condition=IO_READ, callback=0x404db0 <client_input>, context=0x6af000) at ioloop.c:39
#6  0x00000000004059db in client_auth_failed (client=0x6af000, nodelay=true) at client-authenticate.c:103
#7  0x0000000000405e07 in client_handle_args (client=0x6af000, args=<value optimized out>, success=true, nodelay_r=0x7fff571009af) at client-authenticate.c:198
#8  0x0000000000406214 in sasl_callback (_client=0x6af000, reply=SASL_SERVER_REPLY_SUCCESS, data=0x0, args=0x6361d0) at client-authenticate.c:277
#9  0x000000000040eb13 in auth_client_input_ok (conn=0x75b6c8, args=<value optimized out>) at auth-server-request.c:196
#10 0x000000000040dbf3 in auth_client_input (conn=0x75b6c8) at auth-server-connection.c:136
#11 0x0000000000414aa8 in io_loop_handler_run (ioloop=<value optimized out>) at ioloop-epoll.c:208
#12 0x0000000000413b9d in io_loop_run (ioloop=0x6552b0) at ioloop.c:335
#13 0x0000000000408e81 in main (argc=2, argv=0x7fff57100c58, envp=0x7fff57100c70) at main.c:494
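Frame #5 is the interesting one: io_add() is being handed fd=-1 from client_auth_failed(). A minimal, self-contained C sketch of that pattern (illustrative only - function names are borrowed from the backtrace, this is not the actual Dovecot source):

/* Illustrative sketch of the crash pattern, not the real Dovecot code:
 * the "connection queue full" path drops the client and sets its fd to
 * -1, but the auth-failure path still calls io_add() on it, which hits
 * the (fd >= 0) assertion and aborts with signal 6, as in the logs. */
#include <assert.h>
#include <stdio.h>

struct client {
        int fd;                 /* becomes -1 once the connection is dropped */
};

static void io_add(int fd)
{
        assert(fd >= 0);        /* the failing assertion (ioloop.c:39) */
        printf("fd %d registered for reading\n", fd);
}

static void client_destroy(struct client *client)
{
        client->fd = -1;        /* connection queue full: drop the client */
}

static void client_auth_failed(struct client *client)
{
        /* bug pattern: re-register the fd without checking whether the
         * client was already destroyed */
        io_add(client->fd);
}

int main(void)
{
        struct client client = { .fd = 0 };
        client_destroy(&client);        /* queue full -> client dropped */
        client_auth_failed(&client);    /* panic: assertion failed: (fd >= 0) */
        return 0;
}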
Up until this afternoon we had "login_process_per_connection = yes", and login_max_connections at the default (256).
This evening I tried with "login_process_per_connection = no", with no problems except more CPU load and needing to increase login_max_processes_count. (Strictly speaking, the max fds set with "ulimit -n" would also need increasing, though I didn't do that, as it would have meant killing and restarting the master process - come back, Solaris plimit!)
Just after midnight, I switched back to "login_process_per_connection = no" and set "login_max_connections = 32" and managed to get core dumps enabled, hence the above!
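For reference, the login-related settings at that point were roughly as follows (a dovecot.conf sketch; the values are the ones mentioned in this thread, everything else was left at its default):

# Dovecot 1.2.x login process settings (sketch)
login_process_per_connection = no    # one login process serves many connections
login_max_connections = 32           # deliberately low, to reproduce the crash
login_max_processes_count = 128      # upper bound on login processes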
Apart from this problem, the virtual machine has been coping extremely well, so I don't think it's CPU- or I/O-load related.
I'm not sure whether the problem is caused by the switch from Dovecot 1.2.5 to 1.2.8, Solaris to Linux, physical to virtual, or 32-bit to 64-bit!
Perhaps we changed too much at once, but it did mean we kept the service going while power was turned off in our main machine room last weekend ...
Best Wishes, Chris
--
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
Christopher Wakelin, c.d.wakelin@reading.ac.uk
IT Services Centre, The University of Reading,  Tel: +44 (0)118 378 8439
Whiteknights, Reading, RG6 2AF, UK              Fax: +44 (0)118 975 3094
On 18/12/2009 01:01, Chris Wakelin wrote:
Up until this afternoon we had "login_process_per_connection = yes", and login_max_connections at the default (256).
I meant, of course, "login_process_per_connection = no" ...
This evening I tried with "login_process_per_connection = no" with no problems, except more CPU load and needing to increase
... and "login_process_per_connection = yes" - it's too late at night :(
Best Wishes, Chris
On Dec 17, 2009, at 8:01 PM, Chris Wakelin wrote:
Dec 18 00:27:58 imap-login: Info: Disconnected: Connection queue full (auth failed, 1 attempts): user=<user0>, method=PLAIN, rip=<ip0>, lip=134.225.32.156, TLS
Dec 18 00:27:58 imap-login: Panic: file ioloop.c: line 39 (io_add): assertion failed: (fd >= 0)
I guess the crash is caused by the "connection queue full". I'll see about fixing the crash, but you shouldn't let it get full in any case. So increase your login_max_connections value.
On 18/12/2009 01:11, Timo Sirainen wrote:
On Dec 17, 2009, at 8:01 PM, Chris Wakelin wrote:
Dec 18 00:27:58 imap-login: Info: Disconnected: Connection queue full (auth failed, 1 attempts): user=<user0>, method=PLAIN, rip=<ip0>, lip=134.225.32.156, TLS
Dec 18 00:27:58 imap-login: Panic: file ioloop.c: line 39 (io_add): assertion failed: (fd >= 0)
I guess the crash is caused by the "connection queue full". I'll see about fixing the crash, but you shouldn't let it get full in any case. So increase your login_max_connections value.
That's puzzling, I thought it was supposed to fire up another imap-login process (up to login_max_processes_count)? We have something in the region of 1000 concurrent connections and login_max_processes_count * login_max_connections = 128 * 256 should be enough for 32000+ of them!
Would it be better to have login_max_connections = 1024 (or even higher)?
I'm also confused as to why we never saw the problem on the Solaris box (with more connections as it was during term time).
At the moment I'm running with "login_process_per_connection = yes", but I'm a little concerned about the increased CPU load (I could probably add more virtual CPUs, but there are apparently some issues with 64-bit Linux on multiple CPUs in VMware ESX 3.5.x).
Best Wishes, Chris
On Fri, 2009-12-18 at 14:56 +0000, Chris Wakelin wrote:
I guess the crash is caused by the "connection queue full". I'll see about fixing the crash, but you shouldn't let it get full in any case. So increase your login_max_connections value.
That's puzzling, I thought it was supposed to fire up another imap-login process (up to login_max_processes_count)?
It's supposed to, yes, but it's been buggy for a long time and I haven't figured out how to fix it. v2.0 should finally fix it..
We have something in the region of 1000 concurrent connections and login_max_processes_count * login_max_connections = 128 * 256 should be enough for 32000+ of them!
Would it be better to have login_max_connections = 1024 (or even higher)?
Yes. If you have only one CPU/core you shouldn't have more than 2 login processes.
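In config terms, something like this (a sketch with illustrative values based on the advice above, not a tested configuration):

# dovecot.conf sketch for a single CPU/core (Dovecot 1.2.x)
login_process_per_connection = no
login_processes_count = 2        # at most ~2 login processes per CPU/core
login_max_processes_count = 2
login_max_connections = 1024     # high enough that the queue never fills
# Note: each login process then needs more than login_max_connections
# file descriptors, so "ulimit -n" must be raised accordingly.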
I'm also confused as to why we never saw the problem on the Solaris box (with more connections as it was during term time).
It was probably doing a better job than Linux of distributing connections to processes.
On 18/12/2009 15:08, Timo Sirainen wrote:
We have something in the region of 1000 concurrent connections and login_max_processes_count * login_max_connections = 128 * 256 should be enough for 32000+ of them!
Would it be better to have login_max_connections = 1024 (or even higher)?
Yes. If you have only one CPU/core you shouldn't have more than 2 login processes.
I'm also confused as to why we never saw the problem on the Solaris box (with more connections as it was during term time).
It was probably doing a better job than Linux of distributing connections to processes.
Ah! That makes sense, as the Solaris box had four processors and 4 * 256 would have probably been enough most of the time. It wouldn't surprise me if the Solaris kernel is better at this as well.
Best Wishes, Chris
On Fri, 2009-12-18 at 01:01 +0000, Chris Wakelin wrote:
Dec 18 00:27:58 imap-login: Info: Disconnected: Connection queue full (auth failed, 1 attempts): user=<user0>, method=PLAIN, rip=<ip0>, lip=134.225.32.156, TLS
Dec 18 00:27:58 imap-login: Panic: file ioloop.c: line 39 (io_add): assertion failed: (fd >= 0)
BTW. This crash is now also fixed: http://hg.dovecot.org/dovecot-1.2/rev/d1548d794f72