[Dovecot] Dovecot 1.0.rc7 ioloop-poll.c assertion failed after SIGHUP
Anybody else seen this one? We're running Dovecot 1.0.rc7 on a different port on our live server (we're generally live with 1.0.beta5) and after a HUP to reload the user database (in a passwd-file) this morning we got:-
dovecot: Sep 22 07:45:05 Info: IMAP 14482 someuser xxx.xxx.xxx.xxx : Disconnected in IDLE dovecot: Sep 22 07:45:05 Warning: imap-login: SSL_read() syscall failed: No route to host [xxx.xxx.xxx.xxx] dovecot: Sep 22 07:45:05 Warning: SIGHUP received - reloading configuration dovecot: Sep 22 07:45:05 Error: Login process died too early - shutting down dovecot: Sep 22 07:45:05 Panic: file ioloop-poll.c: line 105 (io_loop_handle_remove): assertion failed: (index >= 0 && (unsigned int) index < ctx->fds_count)
and Dovecot had died. "someuser" was me, and I'd left my Thunderbird client logged in (via IMAPS) by mistake when I hibernated my PC last night.
I'd seen exactly the same assert on a test box the day before yesterday, (this time without a machine hibernating):-
dovecot: Sep 19 19:00:45 Info: imap-login: Aborted login: user=<testuser>, method=PLAIN, rip=yyy.yyy.yyy.yyy, lip=yyy.yyy.yyy.yyy, secured dovecot: Sep 19 19:07:42 Warning: SIGHUP received - reloading configuration dovecot: Sep 19 19:07:42 Error: Login process died too early - shutting down dovecot: Sep 19 19:07:42 Panic: file ioloop-poll.c: line 105 (io_loop_handle_remove): assertion failed: (index >= 0 && (unsigned int) index < ctx->fds_count)
Where yyy.yyy.yyy.yyy is actually the test box (I was logging in from the server itself using "openssl s_client").
The thing is, I was hoping to upgrade everyone (~20,000 users) to 1.0.rc7 next Tuesday, but now I'm not so sure!
Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
Chris Wakelin wrote:
Anybody else seen this one? We're running Dovecot 1.0.rc7 on a different port on our live server (we're generally live with 1.0.beta5) and after a HUP to reload the user database (in a passwd-file) this morning we got:-
dovecot: Sep 22 07:45:05 Info: IMAP 14482 someuser xxx.xxx.xxx.xxx : Disconnected in IDLE dovecot: Sep 22 07:45:05 Warning: imap-login: SSL_read() syscall failed: No route to host [xxx.xxx.xxx.xxx] dovecot: Sep 22 07:45:05 Warning: SIGHUP received - reloading configuration dovecot: Sep 22 07:45:05 Error: Login process died too early - shutting down dovecot: Sep 22 07:45:05 Panic: file ioloop-poll.c: line 105 (io_loop_handle_remove): assertion failed: (index >= 0 && (unsigned int) index < ctx->fds_count)
and Dovecot had died. "someuser" was me, and I'd left my Thunderbird client logged in (via IMAPS) by mistake when I hibernated my PC last night.
I'd seen exactly the same assert on a test box the day before yesterday, (this time without a machine hibernating):-
dovecot: Sep 19 19:00:45 Info: imap-login: Aborted login: user=<testuser>, method=PLAIN, rip=yyy.yyy.yyy.yyy, lip=yyy.yyy.yyy.yyy, secured dovecot: Sep 19 19:07:42 Warning: SIGHUP received - reloading configuration dovecot: Sep 19 19:07:42 Error: Login process died too early - shutting down dovecot: Sep 19 19:07:42 Panic: file ioloop-poll.c: line 105 (io_loop_handle_remove): assertion failed: (index >= 0 && (unsigned int) index < ctx->fds_count)
Where yyy.yyy.yyy.yyy is actually the test box (I was logging in from the server itself using "openssl s_client").
The thing is, I was hoping to upgrade everyone (~20,000 users) to 1.0.rc7 next Tuesday, but now I'm not so sure!
Chris
I've had several goes at trying to cause this "on-demand", but I can only get it to fail occasionally. My best recipe so far is to
- Login via SSL and open INBOX (I'm using "openssl s_client" as my "debug" IMAP client).
- IDLE then break the connection (with CTRL-C)
- Replace the userdb passwd-file (mv userdb userdb.temp;cp userdb.temp userdb) to clobber any caching
- Send SIGHUP to the Dovecot master process.
This fails sometimes:-
dovecot: Sep 25 10:38:06 Info: IMAP:6196:someuser:yyy.yyy.yyy.yyy: Disconnected in IDLE dovecot: Sep 25 10:38:11 Warning: SIGHUP received - reloading configuration dovecot: Sep 25 10:38:12 Error: invalid I/O fd 29, callback 17d2c dovecot: Sep 25 10:38:12 Error: login: fd_read() failed: Resource temporarily unavailable dovecot: Sep 25 10:38:12 Error: Login process died too early - shutting down dovecot: Sep 25 10:38:13 Panic: file ioloop-poll.c: line 105 (io_loop_handle_remove): assertion failed: (index >= 0 && (unsigned int) index < ctx->fds_count)
I once managed to get an "invalid I/O fd" without the crash as well, just to make things more interesting :)
I'm guessing it's some sort of race condition on a login-process ending (using SSL forces it to persist while it proxies the IMAP connection) and the forcing of a (slow?) reload of the userdb. It looks like the assertion failure is caused by the login process dying, rather than the other way round.
I should have said these are Solaris 8 (sparc) boxes, and we have "login_process_per_connection = no"
Our upgrade to 1.0rc7 is on ice for now, until we get this sorted out. (We're actually on 1.0beta7 on the "live" service, not beta5 as I said above.)
Best Wishes, Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
On Mon, 2006-09-25 at 11:16 +0100, Chris Wakelin wrote:
- Replace the userdb passwd-file (mv userdb userdb.temp;cp userdb.temp userdb) to clobber any caching
Or you could just "touch userdb".
dovecot: Sep 25 10:38:11 Warning: SIGHUP received - reloading configuration dovecot: Sep 25 10:38:12 Error: invalid I/O fd 29, callback 17d2c
Could you do:
gdb dovecot x 0x17d2c
dovecot: Sep 25 10:38:12 Error: login: fd_read() failed: Resource temporarily unavailable
Hmm. I guess this could be fixed by simply returning from the function if it returns EAGAIN..
dovecot: Sep 25 10:38:12 Error: Login process died too early - shutting down dovecot: Sep 25 10:38:13 Panic: file ioloop-poll.c: line 105 (io_loop_handle_remove): assertion failed: (index >= 0 && (unsigned int) index < ctx->fds_count)
Could you get gdb backtrace from this? Since it's dovecot master process, it should write core to /var/run/dovecot/ (assuming you had ulimit -c high enough).
Our upgrade to 1.0rc7 is on ice for now, until we get this sorted out. (We're actually on 1.0beta7 on the "live" service, not beta5 as I said above.)
I don't think I've broken anything related to this since beta7, but it might work a bit differently so that you just don't see this problem with it..
Timo Sirainen wrote:
On Mon, 2006-09-25 at 11:16 +0100, Chris Wakelin wrote:
- Replace the userdb passwd-file (mv userdb userdb.temp;cp userdb.temp userdb) to clobber any caching
Or you could just "touch userdb".
Probably doesn't prevent disk caching, I thought. Though just "cp userdb.temp userdb" should work!
dovecot: Sep 25 10:38:11 Warning: SIGHUP received - reloading configuration dovecot: Sep 25 10:38:12 Error: invalid I/O fd 29, callback 17d2c
Could you do:
gdb dovecot x 0x17d2c
(gdb) x 0x17d2c
0x17d2c
dovecot: Sep 25 10:38:12 Error: login: fd_read() failed: Resource temporarily unavailable
Hmm. I guess this could be fixed by simply returning from the function if it returns EAGAIN..
dovecot: Sep 25 10:38:12 Error: Login process died too early - shutting down dovecot: Sep 25 10:38:13 Panic: file ioloop-poll.c: line 105 (io_loop_handle_remove): assertion failed: (index >= 0 && (unsigned int) index < ctx->fds_count)
Could you get gdb backtrace from this? Since it's dovecot master process, it should write core to /var/run/dovecot/ (assuming you had ulimit -c high enough).
... Program terminated with signal 6, Abort. ... Reading symbols from /usr/platform/SUNW,Ultra-250/lib/libc_psr.so.1...done. Loaded symbols for /usr/platform/SUNW,Ultra-250/lib/libc_psr.so.1
#0 0xff21fbe8 in _libc_kill () from /usr/lib/libc.so.1 (gdb) bt #0 0xff21fbe8 in _libc_kill () from /usr/lib/libc.so.1 #1 0xff1b598c in abort () from /usr/lib/libc.so.1 #2 0x1f874 in default_fatal_handler (status=196248, format=0xffbef5e8 "", args=0x1f84c) at failures.c:120
The backtrace for the live server was very similar.
I'm concerned though, that the crash on the test version on the live server and the first crash on the test server didn't involve fd_read, so perhaps there's more than one issue here?
Is there a way to get imap-login to dump core when it dies? (We're running with the default login_chroot=yes, if that makes any difference.)
Best Wishes, Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
On Mon, 2006-09-25 at 11:45 +0100, Chris Wakelin wrote:
Timo Sirainen wrote:
On Mon, 2006-09-25 at 11:16 +0100, Chris Wakelin wrote:
- Replace the userdb passwd-file (mv userdb userdb.temp;cp userdb.temp userdb) to clobber any caching
Or you could just "touch userdb".
Probably doesn't prevent disk caching, I thought. Though just "cp userdb.temp userdb" should work!
Disk caching? Kernel caches the file in both cases anyway.
#0 0xff21fbe8 in _libc_kill () from /usr/lib/libc.so.1 (gdb) bt #0 0xff21fbe8 in _libc_kill () from /usr/lib/libc.so.1 #1 0xff1b598c in abort () from /usr/lib/libc.so.1 #2 0x1f874 in default_fatal_handler (status=196248, format=0xffbef5e8 "", args=0x1f84c) at failures.c:120
This unfortunately didn't help, since it didn't show what called io_remove()..
Anyway, I didn't find any reason why these errors could happen from the code, so I guess it was caused by memory corruption on SIGHUP which I just fixed:
http://dovecot.org/list/dovecot-cvs/2006-September/006420.html
Is there a way to get imap-login to dump core when it dies? (We're running with the default login_chroot=yes, if that makes any difference.)
Unfortunately it's not really possible without setting login_chroot=no and also running all dovecot processes under a single uid (including the master). Maybe I should add some login_insecure_debug-setting which would drop privileges before execing, make login directory writable by dovecot and disable chrooting..
On Mon, 2006-09-25 at 15:12 +0300, Timo Sirainen wrote:
Anyway, I didn't find any reason why these errors could happen from the code, so I guess it was caused by memory corruption on SIGHUP which I just fixed:
http://dovecot.org/list/dovecot-cvs/2006-September/006420.html
Oops, forgot some debugging code in there. Removed in this patch:
http://dovecot.org/list/dovecot-cvs/2006-September/006422.html
Timo Sirainen wrote:
Anyway, I didn't find any reason why these errors could happen from the code, so I guess it was caused by memory corruption on SIGHUP which I just fixed:
http://dovecot.org/list/dovecot-cvs/2006-September/006420.html
Thanks Timo, I've applied the patch and haven't managed to make it crash since. But it was tricky to get it to crash in the first place - I hate intermittent problems :(
Best Wishes, Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
participants (2)
-
Chris Wakelin
-
Timo Sirainen