We upgraded from 1.0-beta7 to 1.0-rc17 on Tuesday evening and looking at today's logs, just four imap processes out of 90,000+ died. I managed to get core files (the server is running Solaris 8 on Sparc, and Dovecot was compiled with gcc 3.3.2). I hope they're useful!
(I should also say that performance between the two versions is roughly similar according to Solaris' process accounting stats, looking at CPU usage and disk blocks accessed.)
Best Wishes, Chris
- Here's the last log entry for each
dovecot: Jan 11 11:39:59 Info: IMAP 5751 xxxxxxxx xxx.xxx.xxx.xxx : Opened Sent (18 msgs) dovecot: Jan 11 11:39:59 Error: child 5751 (imap) killed with signal 11
dovecot: Jan 11 12:28:49 Info: IMAP 12765 xxxxxxxx xxx.xxx.xxx.xxx : Disconnected: Internal error occurred. Refer to server log for more information. [2007-01-11 12:28:49] dovecot: Jan 11 12:28:49 Error: child 12765 (imap) killed with signal 10
dovecot: Jan 11 13:01:49 Info: IMAP 14234 xxxxxxxx xxx.xxx.xxx.xxx : Opened List Reading (124 msgs) dovecot: Jan 11 13:01:54 Error: child 14234 (imap) killed with signal 10
dovecot: Jan 11 19:13:55 Info: IMAP 18786 xxxxxxxx xxx.xxx.xxx.xxx : Opened Sent (45 msgs) dovecot: Jan 11 19:13:57 Error: child 18786 (imap) killed with signal 11
(I have extra logging lines in cmd-select.c, hence the "Opened" lines)
- backtraces!
pid 5751
#0 0xff1c2838 in t_delete () from /usr/lib/libc.so.1 #1 0xff1c23e4 in realfree () from /usr/lib/libc.so.1 #2 0xff1c1f48 in _malloc_unlocked () from /usr/lib/libc.so.1 #3 0xff1c1ce0 in malloc () from /usr/lib/libc.so.1 #4 0xff1b617c in calloc () from /usr/lib/libc.so.1 #5 0x89c18 in pool_system_malloc (pool=0xb66f0, size=156) at mempool-system.c:67 #6 0x84ab8 in i_malloc (size=156) at imem.c:9 #7 0x43b2c in mbox_md5_init () at mbox-md5.c:94 #8 0x3df20 in mbox_sync_parse_match_mail (mbox=0xcee58, view=0xcdde8, seq=2) at mbox-sync-parse.c:557 #9 0x3966c in mbox_file_seek (mbox=0xcee58, view=0xcdde8, seq=2, deleted_r=0xffbeee47) at mbox-file.c:195 #10 0x3b628 in mbox_mail_seek (mail=0xd0bf8) at mbox-mail.c:69 #11 0x3b988 in mbox_mail_get_stream (_mail=0xd0bf8, hdr_size=0x0, body_size=0x0) at mbox-mail.c:184 #12 0x70458 in mail_get_stream (mail=0x0, hdr_size=0x0, body_size=0x0) at mail.c:107 #13 0x5321c in index_mail_get_header_stream (_mail=0xd0bf8, _headers=0xd1368) at index-mail-headers.c:638 #14 0x70434 in mail_get_header_stream (mail=0x0, headers=0xd1368) at mail.c:98 #15 0x529ec in index_mail_headers_get_envelope (mail=0xd0bf8) at index-mail-headers.c:408 #16 0x515c8 in index_mail_get_special (_mail=0xd0bf8, field=MAIL_FETCH_IMAP_ENVELOPE) at index-mail.c:706 #17 0x70474 in mail_get_special (mail=0xd0bf8, field=MAIL_FETCH_IMAP_ENVELOPE) at mail.c:114 #18 0x254d0 in fetch_envelope (ctx=0xd0bf8, mail=0xd0bf8, context=0x0) at imap-fetch.c:450 #19 0x24e80 in imap_fetch (ctx=0xc4a58) at imap-fetch.c:291 #20 0x1f62c in cmd_fetch (cmd=0xc27a4) at cmd-fetch.c:163 #21 0x23334 in client_handle_input (cmd=0xc27a4) at client.c:331 #22 0x232b4 in client_handle_input (cmd=0xc27a4) at client.c:388 #23 0x23430 in _client_input (context=0xc2760) at client.c:428 #24 0x87924 in io_loop_handler_run (ioloop=0xbf838) at ioloop-poll.c:199 #25 0x87184 in io_loop_run (ioloop=0xbf838) at ioloop.c:281 #26 0x2c658 in main (argc=0, argv=0xffbef5ec, envp=0xffbef5fc) at main.c:280
pid 12765
#0 0xff1c27f0 in t_delete () from /usr/lib/libc.so.1 #1 0xff1c23e4 in realfree () from /usr/lib/libc.so.1 #2 0xff1c2c24 in _free_unlocked () from /usr/lib/libc.so.1 #3 0xff1c2b74 in free () from /usr/lib/libc.so.1 #4 0x5b550 in mail_index_free (_index=0xcac54) at mail-index.c:62 #5 0x552f0 in index_list_free (list=0xcac50) at index-storage.c:84 #6 0x555fc in destroy_unrefed (all=true) at index-storage.c:162 #7 0x379ac in mbox_free (_storage=0xc2598) at mbox-storage.c:369 #8 0x711a4 in mail_storage_destroy (_storage=0xc1a2c) at mail-storage.c:204 #9 0x2cbd8 in namespace_deinit (namespaces=0xc1a10) at namespace.c:156 #10 0x22a00 in client_destroy (client=0xc2760, reason=0x0) at client.c:90 #11 0x87924 in io_loop_handler_run (ioloop=0xbf838) at ioloop-poll.c:199 #12 0x87184 in io_loop_run (ioloop=0xbf838) at ioloop.c:281 #13 0x2c658 in main (argc=0, argv=0xffbef604, envp=0xffbef614) at main.c:280
pid 14234
#0 0xff1c2838 in t_delete () from /usr/lib/libc.so.1 #1 0xff1c23e4 in realfree () from /usr/lib/libc.so.1 #2 0xff1c2cb8 in cleanfree () from /usr/lib/libc.so.1 #3 0xff1c1dec in _malloc_unlocked () from /usr/lib/libc.so.1 #4 0xff1c1ce0 in malloc () from /usr/lib/libc.so.1 #5 0xff1b617c in calloc () from /usr/lib/libc.so.1 #6 0x89c18 in pool_system_malloc (pool=0xb66f0, size=28) at mempool-system.c:67 #7 0x86d7c in timeout_add (msecs=1740000, callback=0x1f920 <idle_timeout>, context=0xc4a18) at ioloop.c:146 #8 0x1fda0 in cmd_idle (cmd=0xc27a4) at cmd-idle.c:245 #9 0x23334 in client_handle_input (cmd=0xc27a4) at client.c:331 #10 0x232b4 in client_handle_input (cmd=0xc27a4) at client.c:388 #11 0x23430 in _client_input (context=0xc2760) at client.c:428 #12 0x87924 in io_loop_handler_run (ioloop=0xbf838) at ioloop-poll.c:199 #13 0x87184 in io_loop_run (ioloop=0xbf838) at ioloop.c:281 #14 0x2c658 in main (argc=0, argv=0xffbef5fc, envp=0xffbef60c) at main.c:280
pid 18786
#0 0xff1c2974 in t_splay () from /usr/lib/libc.so.1 #1 0xff1c27e0 in t_delete () from /usr/lib/libc.so.1 #2 0xff1c23e4 in realfree () from /usr/lib/libc.so.1 #3 0xff1c2cb8 in cleanfree () from /usr/lib/libc.so.1 #4 0xff1c1dec in _malloc_unlocked () from /usr/lib/libc.so.1 #5 0xff1c1ce0 in malloc () from /usr/lib/libc.so.1 #6 0xff1b617c in calloc () from /usr/lib/libc.so.1 #7 0x89c18 in pool_system_malloc (pool=0xb66f0, size=28) at mempool-system.c:67 #8 0x7fa58 in buffer_create_dynamic (pool=0xb66f0, init_size=20) at buffer.c:109 #9 0x56924 in index_transaction_init (t=0xcf090, ibox=0xd03a8, flags=0) at index-transaction.c:13 #10 0x38fa0 in mbox_transaction_begin (box=0xcf090, flags=0) at mbox-transaction.c:16 #11 0x71954 in mailbox_transaction_begin (box=0xcf090, flags=0) at mail-storage.c:497 #12 0x29f30 in imap_sync_init (client=0xc2760, box=0xd03a8, imap_flags=0, flags=28) at imap-sync.c:47 #13 0x2a534 in cmd_sync (cmd=0xc27a4, flags=28, imap_flags=0, tagline=0x93120 "OK Fetch completed.") at imap-sync.c:243 #14 0x1f470 in cmd_fetch_finish (ctx=0xc4a50) at cmd-fetch.c:103 #15 0x1f66c in cmd_fetch (cmd=0xc27a4) at cmd-fetch.c:173 #16 0x226b4 in cmd_uid (cmd=0xc27a4) at cmd-uid.c:19 #17 0x23334 in client_handle_input (cmd=0xc27a4) at client.c:331 #18 0x232b4 in client_handle_input (cmd=0xc27a4) at client.c:388 #19 0x23430 in _client_input (context=0xc2760) at client.c:428 #20 0x87924 in io_loop_handler_run (ioloop=0xbf838) at ioloop-poll.c:199 #21 0x87184 in io_loop_run (ioloop=0xbf838) at ioloop.c:281 #22 0x2c658 in main (argc=0, argv=0xffbef5ec, envp=0xffbef5fc) at main.c:280
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
On Thu, 2007-01-11 at 21:32 +0000, Chris Wakelin wrote:
dovecot: Jan 11 11:39:59 Info: IMAP 5751 xxxxxxxx xxx.xxx.xxx.xxx : Opened Sent (18 msgs) dovecot: Jan 11 11:39:59 Error: child 5751 (imap) killed with signal 11
dovecot: Jan 11 12:28:49 Info: IMAP 12765 xxxxxxxx xxx.xxx.xxx.xxx : Disconnected: Internal error occurred. Refer to server log for more information. [2007-01-11 12:28:49] dovecot: Jan 11 12:28:49 Error: child 12765 (imap) killed with signal 10
Didn't it log the internal error?
#0 0xff1c2838 in t_delete () from /usr/lib/libc.so.1 #1 0xff1c23e4 in realfree () from /usr/lib/libc.so.1 #2 0xff1c1f48 in _malloc_unlocked () from /usr/lib/libc.so.1 .. #0 0xff1c27f0 in t_delete () from /usr/lib/libc.so.1 #1 0xff1c23e4 in realfree () from /usr/lib/libc.so.1 .. #0 0xff1c2838 in t_delete () from /usr/lib/libc.so.1 #1 0xff1c23e4 in realfree () from /usr/lib/libc.so.1 .. #0 0xff1c2974 in t_splay () from /usr/lib/libc.so.1 #1 0xff1c27e0 in t_delete () from /usr/lib/libc.so.1 #2 0xff1c23e4 in realfree () from /usr/lib/libc.so.1 ..
These are all from malloc/free, which means there's memory corruption, which is bad. :( It can't really be debugged from the backtraces, since the actual bug could have occurred long before it crashed.
Do these happen for different users or only for a couple of users? Do the crashed users have unusually large mailboxes?
Could you run Dovecot with this: http://dovecot.org/patches/debug/memdebug-bof.diff I don't think it'll affect performance much. Only the imap binary needs to be compiled with it.
There's also this patch: http://dovecot.org/patches/debug/memdebug-delayed.diff But it does hurt performance and probably makes Dovecot use a lot more memory, so it's useful only if you can run it only for a couple of users who have crashing problems (make mail_executable to be a script which checks the username and based on that executes different imap binary).
On Tue, 2007-01-16 at 17:23 +0200, Timo Sirainen wrote:
http://dovecot.org/patches/debug/memdebug-bof.diff http://dovecot.org/patches/debug/memdebug-delayed.diff
BTW. Those check different things. The first checks only buffer under/overflows, and the latter checks only double-free()s. So if one doesn't show anything, the other probably then will.
Also could you show me your Dovecot settings (dovecot -n)?
Timo Sirainen wrote:
On Thu, 2007-01-11 at 21:32 +0000, Chris Wakelin wrote:
dovecot: Jan 11 11:39:59 Info: IMAP 5751 xxxxxxxx xxx.xxx.xxx.xxx : Opened Sent (18 msgs) dovecot: Jan 11 11:39:59 Error: child 5751 (imap) killed with signal 11 dovecot: Jan 11 12:28:49 Info: IMAP 12765 xxxxxxxx xxx.xxx.xxx.xxx : Disconnected: Internal error occurred. Refer to server log for more information. [2007-01-11 12:28:49] dovecot: Jan 11 12:28:49 Error: child 12765 (imap) killed with signal 10
Didn't it log the internal error?
Sorry, yes, it was: Corrupted Index file <path>/.imap/Gloria/dovecot.index: Record [2].uid=0
I've had four of these between 11th-15th January, all for different folders/users, not always "Record [2]" but all of them had "uid=0".
...
These are all from malloc/free, which means there's memory corruption, which is bad. :( It can't really be debugged from the backtraces, since the actual bug could have occurred long before it crashed.
Do these happen for different users or only for a couple of users? Do the crashed users have unusually large mailboxes?
They seem to happen only the once for 3 or 4 different users each day. I guess they're caused by corrupt indexes left over from the earlier version of Dovecot. I don't think the folders are particularly large. I'll look at restoring the relevant index files and folders from backup for some of the users (i.e. before they were fixed by the new Dovecot) and seeing if I can reproduce the crashes. We only backup some of the indexes though (those not stored on a separate partition).
Could you run Dovecot with this: http://dovecot.org/patches/debug/memdebug-bof.diff I don't think it'll affect performance much. Only the imap binary needs to be compiled with it.
There's also this patch: http://dovecot.org/patches/debug/memdebug-delayed.diff But it does hurt performance and probably makes Dovecot use a lot more memory, so it's useful only if you can run it only for a couple of users who have crashing problems (make mail_executable to be a script which checks the username and based on that executes different imap binary).
I'd rather not do these to the live server, but if I manage to reproduce the crashes as above, I'll certainly do it.
Best Wishes, Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
On Tue, 2007-01-16 at 16:24 +0000, Chris Wakelin wrote:
Timo Sirainen wrote:
On Thu, 2007-01-11 at 21:32 +0000, Chris Wakelin wrote:
dovecot: Jan 11 11:39:59 Info: IMAP 5751 xxxxxxxx xxx.xxx.xxx.xxx : Opened Sent (18 msgs) dovecot: Jan 11 11:39:59 Error: child 5751 (imap) killed with signal 11 dovecot: Jan 11 12:28:49 Info: IMAP 12765 xxxxxxxx xxx.xxx.xxx.xxx : Disconnected: Internal error occurred. Refer to server log for more information. [2007-01-11 12:28:49] dovecot: Jan 11 12:28:49 Error: child 12765 (imap) killed with signal 10
Didn't it log the internal error?
Sorry, yes, it was: Corrupted Index file <path>/.imap/Gloria/dovecot.index: Record [2].uid=0
I've had four of these between 11th-15th January, all for different folders/users, not always "Record [2]" but all of them had "uid=0".
I'm beginning to think that this has something to do with creating index files and a race condition somewhere in it. I've done a couple of fixes that should change this error message to something else:
http://dovecot.org/list/dovecot-cvs/2007-January/007375.html http://dovecot.org/list/dovecot-cvs/2007-January/007377.html http://dovecot.org/list/dovecot-cvs/2007-January/007385.html
The last patch is the most important one, if my guess is correct. You should then start seeing "Record field cache points outside record size". I'm still not sure why that happens though.
Hmm. Actually I think that would also explain the memory corruption. Great, one less thing to worry about. Now please just tell me that my fix works? ;)
Timo Sirainen wrote:
I'm beginning to think that this has something to do with creating index files and a race condition somewhere in it. I've done a couple of fixes that should change this error message to something else:
http://dovecot.org/list/dovecot-cvs/2007-January/007375.html http://dovecot.org/list/dovecot-cvs/2007-January/007377.html http://dovecot.org/list/dovecot-cvs/2007-January/007385.html
The last patch is the most important one, if my guess is correct. You should then start seeing "Record field cache points outside record size". I'm still not sure why that happens though.
Hmm. Actually I think that would also explain the memory corruption. Great, one less thing to worry about. Now please just tell me that my fix works? ;)
I've been foraging among our backup indexes and finding tapes that haven't gone off site yet. I've had three cases from today of the "Corrupted Index file" but only two crashed (one with segfault, one with bus error). I'll need to pick up tapes tomorrow morning from our other on-site tape store to get the index files. Some of the folders themselves don't appear to have changed recently, but I'll get the backup copies of them too, to be sure. I think I'll have three cases to play with as a beginning.
If I'm right, and it's the processing of these index files causing the memory corruption, I should be able to reproduce it using the recovered files on my test server. I can then try your debugging patches and the latest fixes.
Best Wishes, Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
Chris Wakelin wrote:
If I'm right, and it's the processing of these index files causing the memory corruption, I should be able to reproduce it using the recovered files on my test server. I can then try your debugging patches and the latest fixes.
Best Wishes, Chris
Just thought I'd update you with what I've managed!
Well I've got one folder plus index directory that causes a segfault crash both times I've tried on the test machine (just "SELECT"/"UID FETCH * BODY[]"/"LOGOUT"):
#0 0xff1c2974 in t_splay () from /usr/lib/libc.so.1 #1 0xff1c27e0 in t_delete () from /usr/lib/libc.so.1 #2 0xff1c23e4 in realfree () from /usr/lib/libc.so.1 #3 0xff1c2cb8 in cleanfree () from /usr/lib/libc.so.1 #4 0xff1c1dec in _malloc_unlocked () from /usr/lib/libc.so.1 #5 0xff1c1ce0 in malloc () from /usr/lib/libc.so.1 #6 0xff1b617c in calloc () from /usr/lib/libc.so.1 #7 0x89c18 in pool_system_malloc (pool=0xb66f0, size=28) at mempool-system.c:67 #8 0x86d7c in timeout_add (msecs=1000, callback=0x55660 <index_removal_timeout>, context=0x0) at ioloop.c:146 #9 0x55734 in index_storage_unref (index=0xb6800) at index-storage.c:192 #10 0x55d54 in index_storage_mailbox_free (box=0xc8708) at index-storage.c:395 #11 0x38efc in mbox_storage_close (box=0x0) at mbox-storage.c:1086 #12 0x715f8 in mailbox_close (_box=0xc2778) at mail-storage.c:373 #13 0x20e00 in cmd_logout (cmd=0xc27a4) at cmd-logout.c:18 #14 0x23334 in client_handle_input (cmd=0xc27a4) at client.c:331 #15 0x232b4 in client_handle_input (cmd=0xc27a4) at client.c:388 #16 0x23430 in _client_input (context=0xc2760) at client.c:428 #17 0x87924 in io_loop_handler_run (ioloop=0xbf838) at ioloop-poll.c:199 #18 0x87184 in io_loop_run (ioloop=0xbf838) at ioloop.c:281 #19 0x2c658 in main (argc=0, argv=0xffbef5dc, envp=0xffbef5ec) at main.c:280
but no "Corrupted Index" error message like there was on the original.
I've tried adding the memdebug*.diff patches, but hit problems with the lack of a walkcontext function (after originally having problems with Makefile.am and a broken autoreconf on the Solaris box - I ran autoreconf on a Linux box in the end!)
I tried the same folder on a Linux installation of 1.0-rc17, but that was 64-bit, whereas my Solaris version is 32-bit, and so the index was just invalid. I'm now trying to compile a 32-bit Linux version to try instead.
I've tried to apply the patches you suggested from CVS, but they fail. You've not been building nightly CVS snapshots, so I guess I'll have to use cvs and autoreconf on the Linux box to make my own!
Best Wishes, Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
Chris Wakelin wrote:
Well I've got one folder plus index directory that causes a segfault crash both times I've tried on the test machine (just "SELECT"/"UID FETCH * BODY[]"/"LOGOUT"):
...
I tried the same folder on a Linux installation of 1.0-rc17, but that was 64-bit, whereas my Solaris version is 32-bit, and so the index was just invalid. I'm now trying to compile a 32-bit Linux version to try instead.
Well the 32-bit version doesn't crash on Linux (looks the same as the 64-bit version), but the memdebug-delayed.diff patched version on Linux gives a warning:
dovecot: Jan 17 18:55:27 Warning: IMAP 15442 xxxxxxxx xxx.xxx.xxx.xxx : Timeout leak: 0x80b1ff0
Right, dinnertime! Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
On Wed, 2007-01-17 at 18:59 +0000, Chris Wakelin wrote:
I tried the same folder on a Linux installation of 1.0-rc17, but that was 64-bit, whereas my Solaris version is 32-bit, and so the index was just invalid. I'm now trying to compile a 32-bit Linux version to try instead.
Well the 32-bit version doesn't crash on Linux (looks the same as the 64-bit version), but the memdebug-delayed.diff patched version on Linux gives a warning:
Was your Solaris a Sparc or x86? If it was Sparc, the index files use different endianess and I think Dovecot just silently overwrites the index without using it.
I'm pretty sure anyway that this fixes it: http://dovecot.org/list/dovecot-cvs/2007-January/007395.html
dovecot: Jan 17 18:55:27 Warning: IMAP 15442 xxxxxxxx xxx.xxx.xxx.xxx : Timeout leak: 0x80b1ff0
Yes, that's normal.
Timo Sirainen wrote:
Was your Solaris a Sparc or x86? If it was Sparc, the index files use different endianess and I think Dovecot just silently overwrites the index without using it.
:lightbulb: that explains it :)
I'm pretty sure anyway that this fixes it: http://dovecot.org/list/dovecot-cvs/2007-January/007395.html
dovecot: Jan 17 18:55:27 Warning: IMAP 15442 xxxxxxxx xxx.xxx.xxx.xxx : Timeout leak: 0x80b1ff0
Yes, that's normal.
...
I fixed the nightly snapshot building. Now it's again being built every night (if there have been any changes). Please test it before I'll make rc18 release. http://dovecot.org/nightly/dovecot-latest.tar.gz
dovecot-20070117.tar.gz seems to have fixed the reproducible crash, but I did get a couple of extra compiler warnings (gcc 3.3.2):-
mail-index.c: In function mail_index_parse_extensions': mail-index.c:342: warning: comparison between signed and unsigned mail-index.c: In function
mail_index_map_clone':
mail-index.c:1242: warning: comparison between signed and unsigned
Best Wishes, Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
Chris Wakelin wrote:
dovecot-20070117.tar.gz seems to have fixed the reproducible crash, but I did get a couple of extra compiler warnings (gcc 3.3.2):-
mail-index.c: In function
mail_index_parse_extensions': mail-index.c:342: warning: comparison between signed and unsigned mail-index.c: In function
mail_index_map_clone': mail-index.c:1242: warning: comparison between signed and unsigned
Hi Timo,
I've got three reproducible rc17 crashes, all fixed in dovecot-20070117.tar.gz, and I've managed to get the same crashes in Solaris 10 on Sparc. However, Solaris 10 has walkcontext() so I've been able to use versions with the memdebug-* patches. The memdebug-delayed.diff version doesn't crash or log anything interesting, but the memdebug-bof.diff version does:
#0 0xff154dd8 in t_splay () from /lib/libc.so.1 #1 0xff154c68 in t_delete () from /lib/libc.so.1 #2 0xff15487c in realfree () from /lib/libc.so.1 #3 0xff155104 in cleanfree () from /lib/libc.so.1 #4 0xff15425c in _malloc_unlocked () from /lib/libc.so.1 #5 0xff15414c in malloc () from /lib/libc.so.1 #6 0xff140f10 in calloc () from /lib/libc.so.1 #7 0x00080644 in pool_system_malloc (pool=0xacd8c, size=28) at mempool-system.c:67 #8 0x0007dafc in timeout_add (msecs=1000, callback=0x5086c <index_removal_timeout>, context=0x0) at ioloop.c:146 #9 0x00050924 in index_storage_unref (index=0xacc00) at index-storage.c:192 #10 0x00050bb4 in index_storage_mailbox_free (box=0xbee08) at index-storage.c:395 #11 0x000368f0 in mbox_storage_close (box=0xbee08) at mbox-storage.c:1086 #12 0x0006a14c in mailbox_close (_box=0xb8e78) at mail-storage.c:373 #13 0x00020760 in cmd_logout (cmd=0xb8ea4) at cmd-logout.c:18 #14 0x000229d8 in client_handle_input (cmd=0xb8ea4) at client.c:331 #15 0x00022950 in client_handle_input (cmd=0xb8ea4) at client.c:388 #16 0x00022b30 in _client_input (context=0xb8e60) at client.c:428 #17 0x0007e5d8 in io_loop_handler_run (ioloop=0xb5f38) at ioloop-poll.c:199 #18 0x0007deb8 in io_loop_run (ioloop=0xb5f38) at ioloop.c:281 #19 0x0002ae64 in main (argc=-4196408, argv=0xac000, envp=0xacc00) at main.c:280
for two of them and:
#0 0xff154dd8 in t_splay () from /lib/libc.so.1 #1 0xff154c68 in t_delete () from /lib/libc.so.1 #2 0xff1548a8 in realfree () from /lib/libc.so.1 #3 0xff155078 in _free_unlocked () from /lib/libc.so.1 #4 0xff154fb4 in free () from /lib/libc.so.1 #5 0x00077320 in _buffer_free (_buf=0xacfd0) at buffer.c:123 #6 0x00069860 in mail_storage_deinit () at array.h:80 #7 0x0002ae9c in main (argc=0, argv=0xac000, envp=0xacc00) at main.c:247
for the other.
Is this what you expected? Each case was LOGIN, SELECT, UID FETCH * BODY[], LOGOUT.
Best Wishes, Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
On Thu, 2007-01-18 at 17:42 +0000, Chris Wakelin wrote:
I've got three reproducible rc17 crashes, all fixed in dovecot-20070117.tar.gz, and I've managed to get the same crashes in Solaris 10 on Sparc. However, Solaris 10 has walkcontext() so I've been able to use versions with the memdebug-* patches. The memdebug-delayed.diff version doesn't crash or log anything interesting, but the memdebug-bof.diff version does:
memdebug-bof.diff had the possibility to notice it, but it should have logged an error in that case. Your backtraces show that it crashed so it didn't detect the overflow. Anyway, I think that's still somewhat expected because it wasn't exactly ordinary buffer overflow.
On Wed, 2007-01-17 at 22:21 +0000, Chris Wakelin wrote:
mail-index.c: In function
mail_index_parse_extensions': mail-index.c:342: warning: comparison between signed and unsigned mail-index.c: In function
mail_index_map_clone': mail-index.c:1242: warning: comparison between signed and unsigned
gcc 4.1.2 doesn't give these anymore and since the warnings are kind of wrong anyway, I'll leave them as they are.
it did not take long:
dovecot: Jan 16 23:26:08 Info: imap-login: Login: user=<rainer>, method=PLAIN, rip=192.168.1.232, lip=192.168.1.1, TLS dovecot: Jan 16 23:32:16 Info: imap-login: Login: user=<rainer>, method=PLAIN, rip=192.168.1.232, lip=192.168.1.1, TLS dovecot: Jan 16 23:37:14 Error: IMAP(rainer): Fixed index file /home/rainer/Maildir/.INBOX.dovecot/dovecot.index: first_recent_uid_lowwater 330 -> 329 dovecot: Jan 16 23:37:14 Error: IMAP(rainer): Corrupted index file /home/rainer/Maildir/.INBOX.dovecot/dovecot.index: Record field keywords points outside record size (12 < 12+2) dovecot: Jan 16 23:37:14 Error: IMAP(rainer): Corrupted index file /home/rainer/Maildir/.INBOX.dovecot/dovecot.index: Record field keywords points outside record size (12 < 12+2) dovecot: Jan 16 23:37:14 Error: IMAP(rainer): Corrupted index file /home/rainer/Maildir/.INBOX.dovecot/dovecot.index: Record field keywords points outside record size (12 < 12+2) dovecot: Jan 16 23:37:14 Error: IMAP(rainer): Corrupted index file /home/rainer/Maildir/.INBOX.dovecot/dovecot.index: Record field keywords points outside record size (12 < 12+2) dovecot: Jan 16 23:37:14 Error: child 1239 (imap) killed with signal 11 dovecot: Jan 16 23:39:15 Info: imap-login: Login: user=<rainer>, method=PLAIN, rip=192.168.1.232, lip=192.168.1.1, TLS dovecot: Jan 16 23:39:15 Error: IMAP(rainer): Corrupted index file /home/rainer/Maildir/.INBOX.dovecot/dovecot.index: Record field keywords points outside record size (12 < 12+2) dovecot: Jan 16 23:39:15 Error: IMAP(rainer): Corrupted index file /home/rainer/Maildir/.INBOX.dovecot/dovecot.index: Record field keywords points outside record size (12 < 12+2)
my setup: linux 2.4.34, postfix, amavisd, dovecot, thunderbird-1.5.0.9 on XP, with rules moving incoming mails depending on "subject" and "from" into various subfolders.
Timo Sirainen wrote:
On Tue, 2007-01-16 at 16:24 +0000, Chris Wakelin wrote:
Timo Sirainen wrote:
On Thu, 2007-01-11 at 21:32 +0000, Chris Wakelin wrote:
dovecot: Jan 11 11:39:59 Info: IMAP 5751 xxxxxxxx xxx.xxx.xxx.xxx : Opened Sent (18 msgs) dovecot: Jan 11 11:39:59 Error: child 5751 (imap) killed with signal 11 dovecot: Jan 11 12:28:49 Info: IMAP 12765 xxxxxxxx xxx.xxx.xxx.xxx : Disconnected: Internal error occurred. Refer to server log for more information. [2007-01-11 12:28:49] dovecot: Jan 11 12:28:49 Error: child 12765 (imap) killed with signal 10 Didn't it log the internal error? Sorry, yes, it was: Corrupted Index file <path>/.imap/Gloria/dovecot.index: Record [2].uid=0
I've had four of these between 11th-15th January, all for different folders/users, not always "Record [2]" but all of them had "uid=0".
I'm beginning to think that this has something to do with creating index files and a race condition somewhere in it. I've done a couple of fixes that should change this error message to something else:
http://dovecot.org/list/dovecot-cvs/2007-January/007375.html http://dovecot.org/list/dovecot-cvs/2007-January/007377.html http://dovecot.org/list/dovecot-cvs/2007-January/007385.html
The last patch is the most important one, if my guess is correct. You should then start seeing "Record field cache points outside record size". I'm still not sure why that happens though.
Hmm. Actually I think that would also explain the memory corruption. Great, one less thing to worry about. Now please just tell me that my fix works? ;)
-- in diesem Sinne, Rainer
On Tue, 2007-01-16 at 23:48 +0100, Rainer Ophoven wrote:
dovecot: Jan 16 23:37:14 Error: child 1239 (imap) killed with signal 11
Can you get a gdb backtrace from this?
dovecot: Jan 16 23:39:15 Error: IMAP(rainer): Corrupted index file /home/rainer/Maildir/.INBOX.dovecot/dovecot.index: Record field keywords points outside record size (12 < 12+2)
my setup: linux 2.4.34, postfix, amavisd, dovecot, thunderbird-1.5.0.9 on XP, with rules moving incoming mails depending on "subject" and "from" into various subfolders.
Do you use Dovecot's deliver? Does anything ever delete the index files? Do you have mmap_disable=yes? If not, try if setting it helps? Do you have any keywords set in your maildir (and does dovecot-keywords file contain anything)?
I've been doing pretty heavy stress testing in my own computer the whole day, but I haven't managed to cause this bug to happen with me. Luckily I've found and fixed a few other bugs. :)
Timo Sirainen wrote:
On Tue, 2007-01-16 at 23:48 +0100, Rainer Ophoven wrote:
dovecot: Jan 16 23:37:14 Error: child 1239 (imap) killed with signal 11
Can you get a gdb backtrace from this? unfortunately there's no core in user's home-directory.
dovecot: Jan 16 23:39:15 Error: IMAP(rainer): Corrupted index file /home/rainer/Maildir/.INBOX.dovecot/dovecot.index: Record field keywords points outside record size (12 < 12+2)
my setup: linux 2.4.34, postfix, amavisd, dovecot, thunderbird-1.5.0.9 on XP, with rules moving incoming mails depending on "subject" and "from" into various subfolders.
Do you use Dovecot's deliver? no Does anything ever delete the index files? not not that I would know of. Do you have mmap_disable=yes? no - I'll enable it. If not, try if setting it helps? Do you have any keywords set in your maildir (and does dovecot-keywords file contain anything)?
this is the contents:
0 $Forwarded 1 $MDNSent 2 Junk 3 NonJunk
I've been doing pretty heavy stress testing in my own computer the whole day, but I haven't managed to cause this bug to happen with me. Luckily I've found and fixed a few other bugs. :)
-- in diesem Sinne, Rainer
participants (3)
-
Chris Wakelin
-
Rainer Ophoven
-
Timo Sirainen