[Dovecot] Dovecot 1.0.rc7 ioloop-poll.c assertion failed after SIGHUP

Chris Wakelin

22 Sep 2006

1:24 p.m.

Anybody else seen this one? We're running Dovecot 1.0.rc7 on a different port on our live server (we're generally live with 1.0.beta5) and after a HUP to reload the user database (in a passwd-file) this morning we got:-

...

dovecot: Sep 22 07:45:05 Info: IMAP 14482 someuser xxx.xxx.xxx.xxx : Disconnected in IDLE
dovecot: Sep 22 07:45:05 Warning: imap-login: SSL_read() syscall failed: No route to host [xxx.xxx.xxx.xxx]
dovecot: Sep 22 07:45:05 Warning: SIGHUP received - reloading configuration
dovecot: Sep 22 07:45:05 Error: Login process died too early - shutting down
dovecot: Sep 22 07:45:05 Panic: file ioloop-poll.c: line 105 (io_loop_handle_remove): assertion failed: (index >= 0 && (unsigned int) index < ctx->fds_count)

and Dovecot had died. "someuser" was me, and I'd left my Thunderbird client logged in (via IMAPS) by mistake when I hibernated my PC last night.

I'd seen exactly the same assertion failure on a test box the day before yesterday (this time without a machine hibernating):-

...

dovecot: Sep 19 19:00:45 Info: imap-login: Aborted login: user=<testuser>, method=PLAIN, rip=yyy.yyy.yyy.yyy, lip=yyy.yyy.yyy.yyy, secured
dovecot: Sep 19 19:07:42 Warning: SIGHUP received - reloading configuration
dovecot: Sep 19 19:07:42 Error: Login process died too early - shutting down
dovecot: Sep 19 19:07:42 Panic: file ioloop-poll.c: line 105 (io_loop_handle_remove): assertion failed: (index >= 0 && (unsigned int) index < ctx->fds_count)

Where yyy.yyy.yyy.yyy is actually the test box (I was logging in from the server itself using "openssl s_client").

The thing is, I was hoping to upgrade everyone (~20,000 users) to 1.0.rc7 next Tuesday, but now I'm not so sure!

Chris

--
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
Christopher Wakelin, c.d.wakelin@reading.ac.uk
IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439
Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094

Chris Wakelin

25 Sep 2006

1:16 p.m.

I've had several goes at trying to cause this "on-demand", but I can only get it to fail occasionally. My best recipe so far is to:

1. Log in via SSL and open INBOX (I'm using "openssl s_client" as my "debug" IMAP client).
2. IDLE, then break the connection (with CTRL-C).
3. Replace the userdb passwd-file (mv userdb userdb.temp;cp userdb.temp userdb) to clobber any caching.
4. Send SIGHUP to the Dovecot master process.

This fails sometimes:-

dovecot: Sep 25 10:38:06 Info: IMAP:6196:someuser:yyy.yyy.yyy.yyy: Disconnected in IDLE
dovecot: Sep 25 10:38:11 Warning: SIGHUP received - reloading configuration
dovecot: Sep 25 10:38:12 Error: invalid I/O fd 29, callback 17d2c
dovecot: Sep 25 10:38:12 Error: login: fd_read() failed: Resource temporarily unavailable
dovecot: Sep 25 10:38:12 Error: Login process died too early - shutting down
dovecot: Sep 25 10:38:13 Panic: file ioloop-poll.c: line 105 (io_loop_handle_remove): assertion failed: (index >= 0 && (unsigned int) index < ctx->fds_count)

I once managed to get an "invalid I/O fd" without the crash as well, just to make things more interesting :)

I'm guessing it's some sort of race condition on a login-process ending (using SSL forces it to persist while it proxies the IMAP connection) and the forcing of a (slow?) reload of the userdb. It looks like the assertion failure is caused by the login process dying, rather than the other way round.
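
For readers unfamiliar with the check in the panic message, here is a minimal sketch of how a poll()-based I/O loop might track descriptors, and why removing a descriptor that is no longer registered (or was never registered) would trip an assertion of the same shape. All names below (poll_ctx, ctx_add, ctx_remove, MAX_FD) are hypothetical; this is not Dovecot's ioloop-poll.c, only an illustration under the assumption that the loop keeps an fd-to-slot index table.

```c
#include <assert.h>
#include <poll.h>

/* Hypothetical sketch; not taken from Dovecot's source. */
#define MAX_FD 64

struct poll_ctx {
    struct pollfd fds[MAX_FD];  /* dense array that would be passed to poll() */
    unsigned int fds_count;     /* number of used slots in fds[] */
    int fd_index[MAX_FD];       /* fd -> slot in fds[], or -1 if not registered */
};

void ctx_init(struct poll_ctx *ctx)
{
    ctx->fds_count = 0;
    for (int i = 0; i < MAX_FD; i++)
        ctx->fd_index[i] = -1;
}

void ctx_add(struct poll_ctx *ctx, int fd)
{
    assert(fd >= 0 && fd < MAX_FD);
    assert(ctx->fd_index[fd] == -1);    /* not already registered */
    ctx->fd_index[fd] = (int)ctx->fds_count;
    ctx->fds[ctx->fds_count].fd = fd;
    ctx->fds[ctx->fds_count].events = POLLIN;
    ctx->fds[ctx->fds_count].revents = 0;
    ctx->fds_count++;
}

/* Returns 1 if ctx_remove() would succeed, 0 if it would hit the assertion. */
int ctx_can_remove(const struct poll_ctx *ctx, int fd)
{
    if (fd < 0 || fd >= MAX_FD)
        return 0;
    int index = ctx->fd_index[fd];
    return index >= 0 && (unsigned int)index < ctx->fds_count;
}

void ctx_remove(struct poll_ctx *ctx, int fd)
{
    assert(fd >= 0 && fd < MAX_FD);
    int index = ctx->fd_index[fd];
    /* Same shape as the check quoted in the panic message. */
    assert(index >= 0 && (unsigned int)index < ctx->fds_count);

    /* Move the last slot into the hole so fds[] stays dense. */
    unsigned int last = ctx->fds_count - 1;
    if ((unsigned int)index != last) {
        ctx->fds[index] = ctx->fds[last];
        ctx->fd_index[ctx->fds[index].fd] = index;
    }
    ctx->fd_index[fd] = -1;
    ctx->fds_count--;
}
```

In this sketch a second ctx_remove() on the same fd, or a removal after the bookkeeping has been overwritten, finds index == -1 (or a stale value) and aborts, which would be consistent with the panic appearing only after something else has already gone wrong.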

I should have said these are Solaris 8 (sparc) boxes, and we have "login_process_per_connection = no"

Our upgrade to 1.0rc7 is on ice for now, until we get this sorted out. (We're actually on 1.0beta7 on the "live" service, not beta5 as I said above.)

Best Wishes, Chris

Timo Sirainen

1:26 p.m.

On Mon, 2006-09-25 at 11:16 +0100, Chris Wakelin wrote:

...

Replace the userdb passwd-file (mv userdb userdb.temp;cp userdb.temp userdb) to clobber any caching

Or you could just "touch userdb".

...

dovecot: Sep 25 10:38:11 Warning: SIGHUP received - reloading configuration
dovecot: Sep 25 10:38:12 Error: invalid I/O fd 29, callback 17d2c

Could you do:

gdb dovecot
x 0x17d2c

...

dovecot: Sep 25 10:38:12 Error: login: fd_read() failed: Resource temporarily unavailable

Hmm. I guess this could be fixed by simply returning from the function if it returns EAGAIN..
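
A minimal sketch of the "return early on EAGAIN" idea, for illustration only. Dovecot's fd_read() receives a passed file descriptor; a plain read() on a non-blocking descriptor stands in for it here, and the names handle_input, input_result and make_nonblocking_pipe are hypothetical rather than anything from the Dovecot source.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

enum input_result {
    INPUT_RETRY_LATER,  /* nothing to read yet; wait for the next I/O event */
    INPUT_GOT_DATA,
    INPUT_EOF,
    INPUT_FAILED        /* a real error; the caller may tear the peer down */
};

/* Hypothetical input handler: treats EAGAIN as "try again later", not as a failure. */
enum input_result handle_input(int fd, char *buf, size_t size, ssize_t *n_out)
{
    ssize_t n = read(fd, buf, size);

    if (n > 0) {
        *n_out = n;
        return INPUT_GOT_DATA;
    }
    *n_out = 0;
    if (n == 0)
        return INPUT_EOF;
    if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
        return INPUT_RETRY_LATER;
    return INPUT_FAILED;
}

/* Helper for trying the handler out: a pipe whose read end is non-blocking. */
int make_nonblocking_pipe(int fds[2])
{
    if (pipe(fds) < 0)
        return -1;
    int flags = fcntl(fds[0], F_GETFL);
    if (flags < 0 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;
    return 0;
}
```

Calling handle_input() on the empty pipe yields INPUT_RETRY_LATER instead of an error, so a caller written this way would not log "Resource temporarily unavailable" or treat the peer as dead.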

...

dovecot: Sep 25 10:38:12 Error: Login process died too early - shutting down
dovecot: Sep 25 10:38:13 Panic: file ioloop-poll.c: line 105 (io_loop_handle_remove): assertion failed: (index >= 0 && (unsigned int) index < ctx->fds_count)

Could you get gdb backtrace from this? Since it's dovecot master process, it should write core to /var/run/dovecot/ (assuming you had ulimit -c high enough).
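
As a side note on "ulimit -c high enough": the shell's ulimit -c corresponds to the RLIMIT_CORE resource limit, which a process can inspect and raise (up to the hard limit) itself. The sketch below uses only the standard getrlimit()/setrlimit() calls; the function names are hypothetical and nothing here is claimed to be what Dovecot does.

```c
#include <assert.h>
#include <sys/resource.h>

/* Returns 1 if the soft core-size limit is nonzero, 0 if core dumps are
 * disabled by the limit, or -1 if the limit could not be read. */
int core_dumps_enabled(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    return rl.rlim_cur != 0;
}

/* Raises the soft limit to the hard limit (roughly what "ulimit -c" with the
 * maximum allowed value does). Returns 0 on success, -1 on failure. */
int raise_core_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

If the hard limit itself is 0, raise_core_limit() succeeds but core_dumps_enabled() still reports 0, in which case the limit has to be lifted by whatever starts the process.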

...

Our upgrade to 1.0rc7 is on ice for now, until we get this sorted out. (We're actually on 1.0beta7 on the "live" service, not beta5 as I said above.)

I don't think I've broken anything related to this since beta7, but it might work a bit differently so that you just don't see this problem with it..

Chris Wakelin

1:45 p.m.

Timo Sirainen wrote:

...

On Mon, 2006-09-25 at 11:16 +0100, Chris Wakelin wrote:

...

Replace the userdb passwd-file (mv userdb userdb.temp;cp userdb.temp userdb) to clobber any caching

Or you could just "touch userdb".

Probably doesn't prevent disk caching, I thought. Though just "cp userdb.temp userdb" should work!

...

...
dovecot: Sep 25 10:38:11 Warning: SIGHUP received - reloading configuration
dovecot: Sep 25 10:38:12 Error: invalid I/O fd 29, callback 17d2c

Could you do:

gdb dovecot
x 0x17d2c

(gdb) x 0x17d2c
0x17d2c <login_process_input>: 0x9de3bf50
(gdb)

...

...
dovecot: Sep 25 10:38:12 Error: login: fd_read() failed: Resource temporarily unavailable

Hmm. I guess this could be fixed by simply returning from the function if it returns EAGAIN..

...
dovecot: Sep 25 10:38:12 Error: Login process died too early - shutting down
dovecot: Sep 25 10:38:13 Panic: file ioloop-poll.c: line 105 (io_loop_handle_remove): assertion failed: (index >= 0 && (unsigned int) index < ctx->fds_count)

Could you get gdb backtrace from this? Since it's dovecot master process, it should write core to /var/run/dovecot/ (assuming you had ulimit -c high enough).

...
Program terminated with signal 6, Abort.
...
Reading symbols from /usr/platform/SUNW,Ultra-250/lib/libc_psr.so.1...done.
Loaded symbols for /usr/platform/SUNW,Ultra-250/lib/libc_psr.so.1

#0 0xff21fbe8 in _libc_kill () from /usr/lib/libc.so.1
(gdb) bt
#0 0xff21fbe8 in _libc_kill () from /usr/lib/libc.so.1
#1 0xff1b598c in abort () from /usr/lib/libc.so.1
#2 0x1f874 in default_fatal_handler (status=196248, format=0xffbef5e8 "", args=0x1f84c) at failures.c:120

The backtrace for the live server was very similar.

I'm concerned though, that the crash on the test version on the live server and the first crash on the test server didn't involve fd_read, so perhaps there's more than one issue here?

Is there a way to get imap-login to dump core when it dies? (We're running with the default login_chroot=yes, if that makes any difference.)

Best Wishes, Chris

Timo Sirainen

3:12 p.m.

On Mon, 2006-09-25 at 11:45 +0100, Chris Wakelin wrote:

...

Timo Sirainen wrote:

...
On Mon, 2006-09-25 at 11:16 +0100, Chris Wakelin wrote:

...

Replace the userdb passwd-file (mv userdb userdb.temp;cp userdb.temp userdb) to clobber any caching

Or you could just "touch userdb".

Probably doesn't prevent disk caching, I thought. Though just "cp userdb.temp userdb" should work!

Disk caching? Kernel caches the file in both cases anyway.

...

#0 0xff21fbe8 in _libc_kill () from /usr/lib/libc.so.1
(gdb) bt
#0 0xff21fbe8 in _libc_kill () from /usr/lib/libc.so.1
#1 0xff1b598c in abort () from /usr/lib/libc.so.1
#2 0x1f874 in default_fatal_handler (status=196248, format=0xffbef5e8 "", args=0x1f84c) at failures.c:120

This unfortunately didn't help, since it didn't show what called io_remove()..

Anyway, I didn't find any reason why these errors could happen from the code, so I guess it was caused by memory corruption on SIGHUP which I just fixed:

http://dovecot.org/list/dovecot-cvs/2006-September/006420.html

...

Is there a way to get imap-login to dump core when it dies? (We're running with the default login_chroot=yes, if that makes any difference.)

Unfortunately it's not really possible without setting login_chroot=no and also running all dovecot processes under a single uid (including the master). Maybe I should add some login_insecure_debug-setting which would drop privileges before execing, make login directory writable by dovecot and disable chrooting..

Timo Sirainen

3:15 p.m.

On Mon, 2006-09-25 at 15:12 +0300, Timo Sirainen wrote:

...

Anyway, I didn't find any reason why these errors could happen from the code, so I guess it was caused by memory corruption on SIGHUP which I just fixed:

http://dovecot.org/list/dovecot-cvs/2006-September/006420.html

Oops, forgot some debugging code in there. Removed in this patch:

http://dovecot.org/list/dovecot-cvs/2006-September/006422.html

Chris Wakelin

6:56 p.m.

Timo Sirainen wrote:

...

Anyway, I didn't find any reason why these errors could happen from the code, so I guess it was caused by memory corruption on SIGHUP which I just fixed:

http://dovecot.org/list/dovecot-cvs/2006-September/006420.html

Thanks Timo, I've applied the patch and haven't managed to make it crash since. But it was tricky to get it to crash in the first place - I hate intermittent problems :(

Best Wishes, Chris
