[Dovecot] segfault in RC15
Hi all, I've seen a few segfaults in RC15. It's hard for me to reproduce, but I was able to get a core when it happened with one of our customers.
RC15, FreeBSD 4.10 x86, NFS/NetApp. It's a SquirrelMail/webmail client. I can't reproduce it, and I can't turn on dovecot.rawlog because I don't know in advance which customer will hit this. I see it about 5 times per hour, with thousands of logins per hour.
Here's the backtrace..
Cor
Core was generated by `imap'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib/libc.so.4...done.
Reading symbols from /usr/local/lib/dovecot/imap/lib01_quota_plugin.so...done.
Reading symbols from /usr/lib/librpcsvc.so.2...done.
Reading symbols from /usr/local/lib/dovecot/imap/lib02_imap_quota_plugin.so...done.
Reading symbols from /usr/libexec/ld-elf.so.1...done.
#0  maildir_save_file_get_path (_t=0x80d72c0, seq=37) at maildir-save.c:242
242             i_assert(seq >= ctx->first_seq);
(gdb) bt full
#0  maildir_save_file_get_path (_t=0x80d72c0, seq=37) at maildir-save.c:242
        _t = (struct mailbox_transaction_context *) 0x80d72c0
        seq = 37
        ctx = (struct maildir_save_context *) 0x0
        mf = (struct maildir_filename *) 0x80d72c0
#1  0x80682ca in maildir_mail_get_virtual_size (_mail=0x80e9440) at maildir-mail.c:145
        mail = (struct index_mail *) 0x80e9440
        mbox = (struct maildir_mailbox *) 0x80d8c40
        data = (struct index_mail_data *) 0x80e94a4
        path = 0x80e94a4 ""
        fname = 0x80d72c0 "@\214\r\bà¶\f\b\004"
        virtual_size = 580366801855675066
        flags = 19
#2  0x8097b22 in mail_get_virtual_size (mail=0x80e9440) at mail.c:68
        mail = (struct mail *) 0x80d72c0
#3  0x805cf4d in fetch_rfc822_size (ctx=0x80de088, mail=0x80e9440, context=0x0) at imap-fetch-body.c:839
        ctx = (struct imap_fetch_context *) 0x80d72c0
        size = 135099072
#4  0x805b395 in imap_fetch (ctx=0x80de088) at imap-fetch.c:265
        ctx = (struct imap_fetch_context *) 0x80de088
        handlers = (struct imap_fetch_context_handler *) 0x80de1a8
        ret = 1
#5  0x8056e0b in cmd_fetch (cmd=0x80db044) at cmd-fetch.c:171
        cmd = (struct client_command_context *) 0x80db044
        client = (struct client *) 0x80db000
        ctx = (struct imap_fetch_context *) 0x80de088
        args = (struct imap_arg *) 0x80dc048
        search_arg = (struct mail_search_arg *) 0x80de050
        messageset = 0x25
        ret = 135099072
#6  0x805955a in cmd_uid (cmd=0x80db044) at cmd-uid.c:19
        cmd = (struct client_command_context *) 0x80db044
        cmd_name = 0x80dc0f8 "FETCH"
#7  0x8059f45 in client_handle_input (cmd=0x80db044) at client.c:382
        cmd = (struct client_command_context *) 0x80db044
        client = (struct client *) 0x80db000
#8  0x805a01e in _client_input (context=0x80db000) at client.c:433
        client = (struct client *) 0x80db000
        cmd = (struct client_command_context *) 0x80db044
        ret = 2
#9  0x80a9608 in io_loop_handler_run (ioloop=0x80d7000) at ioloop-poll.c:199
        ctx = (struct ioloop_handler_context *) 0x80cb0a0
        pollfd = (struct pollfd *) 0x2
        tv = {tv_sec = 0, tv_usec = 888475}
        io = (struct io *) 0x80cb4a0
        t_id = 2
        msecs = 135099072
        ret = 0
        call = 135099072
#10 0x80a901d in io_loop_run (ioloop=0x80d7000) at ioloop.c:281
        ioloop = (struct ioloop *) 0x80d7000
#11 0x8060f1d in main (argc=1, argv=0xbfbff624, envp=0xbfbff62c) at main.c:280
No locals.

Ok, the other segfaults are the same problem.. here's another one. It looks like the filename gets corrupted or something..
Cor
Core was generated by `imap'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /usr/lib/libc.so.4...done.
Reading symbols from /usr/local/lib/dovecot/imap/lib01_quota_plugin.so...done.
Reading symbols from /usr/lib/librpcsvc.so.2...done.
Reading symbols from /usr/local/lib/dovecot/imap/lib02_imap_quota_plugin.so...done.
Reading symbols from /usr/libexec/ld-elf.so.1...done.
#0  maildir_save_file_get_path (_t=0x80d7300, seq=55) at maildir-save.c:242
242             i_assert(seq >= ctx->first_seq);
(gdb) bt full
#0  maildir_save_file_get_path (_t=0x80d7300, seq=55) at maildir-save.c:242
        _t = (struct mailbox_transaction_context *) 0x80d7300
        seq = 55
        ctx = (struct maildir_save_context *) 0x0
        mf = (struct maildir_filename *) 0x80d7300
#1  0x80682ca in maildir_mail_get_virtual_size (_mail=0x80e9840) at maildir-mail.c:145
        mail = (struct index_mail *) 0x80e9840
        mbox = (struct maildir_mailbox *) 0x80d8c40
        data = (struct index_mail_data *) 0x80e98a4
        path = 0x80e98a4 ""
        fname = 0x80d7300 "@\214\r\b��\f\b\004"
        virtual_size = 580366801855675066
        flags = 19
#2  0x8097b22 in mail_get_virtual_size (mail=0x80e9840) at mail.c:68
        mail = (struct mail *) 0x80d7300
#3  0x805cf4d in fetch_rfc822_size (ctx=0x80de088, mail=0x80e9840, context=0x0) at imap-fetch-body.c:839
        ctx = (struct imap_fetch_context *) 0x80d7300
        size = 135099136
#4  0x805b395 in imap_fetch (ctx=0x80de088) at imap-fetch.c:265
        ctx = (struct imap_fetch_context *) 0x80de088
        handlers = (struct imap_fetch_context_handler *) 0x80de1a8
        ret = 1
#5  0x8056e0b in cmd_fetch (cmd=0x80db044) at cmd-fetch.c:171
        cmd = (struct client_command_context *) 0x80db044
        client = (struct client *) 0x80db000
        ctx = (struct imap_fetch_context *) 0x80de088
        args = (struct imap_arg *) 0x80dc048
        search_arg = (struct mail_search_arg *) 0x80de050
        messageset = 0x37
        ret = 135099136
#6  0x805955a in cmd_uid (cmd=0x80db044) at cmd-uid.c:19
        cmd = (struct client_command_context *) 0x80db044
        cmd_name = 0x80dc0f8 "FETCH"
#7  0x8059f45 in client_handle_input (cmd=0x80db044) at client.c:382
        cmd = (struct client_command_context *) 0x80db044
        client = (struct client *) 0x80db000
#8  0x805a01e in _client_input (context=0x80db000) at client.c:433
        client = (struct client *) 0x80db000
        cmd = (struct client_command_context *) 0x80db044
        ret = 2
#9  0x80a9608 in io_loop_handler_run (ioloop=0x80d7000) at ioloop-poll.c:199
        ctx = (struct ioloop_handler_context *) 0x80cb0a0
        pollfd = (struct pollfd *) 0x2
        tv = {tv_sec = 0, tv_usec = 926624}
        io = (struct io *) 0x80cb4a0
        t_id = 2
        msecs = 135099136
        ret = 0
        call = 135099136
#10 0x80a901d in io_loop_run (ioloop=0x80d7000) at ioloop.c:281
        ioloop = (struct ioloop *) 0x80d7000
#11 0x8060f1d in main (argc=1, argv=0xbfbff5f0, envp=0xbfbff5f8) at main.c:280

On Sun, 2006-12-10 at 15:07 +0100, Cor Bosman wrote:
Hi all, I've seen a few segfaults in RC15. It's hard for me to reproduce, but I was able to get a core when it happened with one of our customers.
RC15, FreeBSD 4.10 x86, NFS/NetApp
Are the index files on NFS also?
See if this patch makes it give corrupted index file errors instead:
http://dovecot.org/list/dovecot-cvs/2006-December/007059.html
See if this patch makes it give corrupted index file errors instead:
http://dovecot.org/list/dovecot-cvs/2006-December/007059.html
It does:
Dec 10 18:55:44 dovecot2.xs4all.nl dovecot: IMAP(XXX): Corrupted Index file /var/run/dovecot-index/l/ld/XXX/.INBOX/dovecot.index: Record [27].uid=0
Dec 10 20:07:28 dovecot2.xs4all.nl dovecot: IMAP(XXX): Corrupted Index file /var/run/dovecot-index/d/di/XXX/.INBOX/dovecot.index: Record [1659].uid=0
Dec 10 20:07:33 dovecot2.xs4all.nl dovecot: IMAP(XXX): Corrupted Index file /var/run/dovecot-index/d/di/XXX/.INBOX/dovecot.index: Record [1659].uid=0
Cor
On Sun, 2006-12-10 at 20:37 +0100, Cor Bosman wrote:
See if this patch makes it give corrupted index file errors instead:
http://dovecot.org/list/dovecot-cvs/2006-December/007059.html
It does:
Dec 10 18:55:44 dovecot2.xs4all.nl dovecot: IMAP(XXX): Corrupted Index file /var/run/dovecot-index/l/ld/XXX/.INBOX/dovecot.index: Record [27].uid=0
Dec 10 20:07:28 dovecot2.xs4all.nl dovecot: IMAP(XXX): Corrupted Index file /var/run/dovecot-index/d/di/XXX/.INBOX/dovecot.index: Record [1659].uid=0
Dec 10 20:07:33 dovecot2.xs4all.nl dovecot: IMAP(XXX): Corrupted Index file /var/run/dovecot-index/d/di/XXX/.INBOX/dovecot.index: Record [1659].uid=0
Could you send me those broken index and log files? The attached patch copies them to /tmp when the index gets corrupted.
The index and log files don't contain any message contents, so it's safe to send them.
BTW, http://dovecot.org/tools/idxview.c and http://dovecot.org/tools/logview.c can be used to look into them if you want to do it yourself, although I think those can only be compiled against CVS HEAD.
Does the patch that detects the corrupt indices cause proper handling of this situation? In other words, customers don't notice this anymore and the indices get discarded and rebuilt?
Cor
On 11.12.2006, at 13.24, Cor Bosman wrote:
Does the patch that detects the corrupt indices cause proper handling of this situation? In other words, customers don't notice this anymore and the indices get discarded and rebuilt?
They'll still get some "out of sync" error and get disconnected, I think.
The next login would work again anyway.
But looking at the index files you sent me, they weren't corrupted at all, so the problem is that the in-memory views of the indexes somehow get broken. Do you use mmap_disable=yes? You could try switching that setting the other way. What about lock_method?
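For reference, here's a minimal standalone sketch of the two kernel locking primitives that the lock_method=fcntl and lock_method=flock settings correspond to. This is an illustration only, not Dovecot's actual locking code, and the file name used is just an example:

/* Illustration only: the primitives behind lock_method=fcntl and
 * lock_method=flock. Not Dovecot code. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <unistd.h>

static int lock_with_fcntl(int fd)
{
        struct flock fl;

        memset(&fl, 0, sizeof(fl));
        fl.l_type = F_WRLCK;    /* exclusive (write) lock */
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;           /* 0 = lock the whole file */
        return fcntl(fd, F_SETLKW, &fl);  /* blocks until granted */
}

static int lock_with_flock(int fd)
{
        return flock(fd, LOCK_EX);        /* exclusive lock, blocks */
}

int main(void)
{
        int fd = open("dovecot.index", O_RDWR);  /* example path */

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (lock_with_fcntl(fd) < 0)      /* or: lock_with_flock(fd) */
                perror("lock");
        close(fd);
        return 0;
}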
Did this problem start only with rc15? What version were you using
before?
Does the patch that detects the corrupt indices cause proper handling of this situation? In other words, customers don't notice this anymore and the indices get discarded and rebuilt?
They'll still get some "out of sync" error and get disconnected, I think.
The next login would work again anyway.
But looking at the index files you sent me, they weren't corrupted at all, so the problem is that the in-memory views of the indexes somehow get broken. Do you use mmap_disable=yes? You could try switching that setting the other way. What about lock_method?
mmap_disable = yes
lock_method = fcntl
I guess we can turn mmap_disable off, since our indexes are on local FS.
Did this problem start only with rc15? What version were you using
before?
We were using beta8 before; it was the only one that didn't have some problem we couldn't live with. We've tried some RCs, but every time we had some major problem. RC15 seems workable, but with a few issues.
I'll turn mmap_disable off. Anything else I can do?
Cor
Hi Timo, turning mmap_disable off still gives me these corrupted index errors. Since you say they're actually OK, deleting them should have no impact.
Cor
Dec 10 18:55:44 dovecot2.xs4all.nl dovecot: IMAP(XXX): Corrupted Index file /var/run/dovecot-index/l/ld/XXX/.INBOX/dovecot.index: Record [27].uid=0

Have you tried yet if rc17 fixes these?
No, I still see them. I removed all index files beforehand, so these are all fresh rc17 indexes..
Jan 10 20:07:31 dovecot1.xs4all.nl dovecot: IMAP(XXXXXXXX): Corrupted Index file /var/run/dovecot-index/p/po/XXXXXXXX/.INBOX/dovecot.index: Record [18].uid=0
I see it about 5-10 times per hour, which is quite rare on our busy servers. Index files are on a local FS.
Cor
Dec 10 18:55:44 dovecot2.xs4all.nl dovecot: IMAP(XXX): Corrupted Index file /var/run/dovecot-index/l/ld/XXX/.INBOX/dovecot.index: Record [27].uid=0

Have you tried yet if rc17 fixes these?
Hi Timo, as I said before, they're not fixed. I don't get coredumps anymore, which makes this very hard for me to debug. As we noticed before, the index files seem fine (you checked that). Would it be possible to make a patch that doesn't copy the files, but dumps the in-memory index someplace? It must be something very unlikely, as I see this on a tiny fraction of our connections. Something like 1 in every 100000. Could it be some race condition on the lock file, and a second process reads a partial index?
Cor
On Fri, 2007-01-12 at 17:08 +0100, Cor Bosman wrote:
Dec 10 18:55:44 dovecot2.xs4all.nl dovecot: IMAP(XXX): Corrupted Index file /var/run/dovecot-index/l/ld/XXX/.INBOX/dovecot.index: Record [27].uid=0

Have you tried yet if rc17 fixes these?
Hi Timo, as I said before, they're not fixed.
I found two more view syncing bugs:
http://dovecot.org/list/dovecot-cvs/2007-January/007366.html
http://dovecot.org/list/dovecot-cvs/2007-January/007368.html
Maybe those would help?
I don't get coredumps anymore, which makes this very hard for me to debug.
You mean Dovecot just doesn't do that, or you did it on purpose for some reason? You should get core dumps as long as:
- you have a home directory set and user can write there
- mail_drop_priv_before_exec=yes
- ulimit -c was high enough before dovecot was started
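To double-check the last point, here's a tiny test program that prints the core file size limit a process sees; it gives the same information as running "ulimit -c", just from inside a process. This is only an illustration, not part of Dovecot:

/* Prints the RLIMIT_CORE soft/hard limits of the current process.
 * If the soft limit is 0, no core file is written when it crashes.
 * Illustration only, not part of Dovecot. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/resource.h>

int main(void)
{
        struct rlimit rl;

        if (getrlimit(RLIMIT_CORE, &rl) < 0) {
                perror("getrlimit");
                return 1;
        }
        printf("core limit: soft=%llu hard=%llu (0 means no core dumps)\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
        return 0;
}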
As we noticed before, the index files seem fine (you checked that). Would it be possible to make a patch that doesn't copy the files, but dumps the in-memory index someplace?
Yea, I could do that if the above patches don't help. Although the easiest way to debug this would be looking into the core file with gdb..
It must be something very unlikely, as I see this on a tiny fraction of our connections. Something like 1 in every 100000. Could it be some race condition on the lock file, and a second process reads a partial index?
Some of the reasons that I can think of are:
1. messages_count is too high in the header. I suppose this could have happened because of locking problems, if the index gets read after the file header is updated, but before the message record is updated.
2. Something to do with appending/copying messages; newly appended messages have uid=0 temporarily in memory.
3. The memory area is completely broken and is filled with zeroes.
Hmm. I'm beginning to think that 1. might be the most likely problem. Have you tried changing lock_method? Try switching to flock (or to fcntl).
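To make reason 1 a bit more concrete: if a reader maps the index after the writer has already increased messages_count in the header but before the new record itself is written, the extra record still reads back as zeroes, which is exactly the "Record [n].uid=0" symptom. Here's a simplified standalone sketch of that situation and the check that catches it; the structures are hypothetical, not the real mail-index ones:

/* Simplified illustration of reason 1: a half-written index as seen by a
 * reader. The structures here are hypothetical, not Dovecot's real ones. */
#include <stdint.h>
#include <stdio.h>

struct fake_index_header {
        uint32_t messages_count;        /* already bumped by the writer */
};

struct fake_index_record {
        uint32_t uid;                   /* still 0: record not written yet */
        uint32_t flags;
};

/* Returns -1 if the mapped view looks like a partially written index,
 * which is what the "Record [n].uid=0" error is catching. */
static int check_view(const struct fake_index_header *hdr,
                      const struct fake_index_record *recs)
{
        uint32_t seq, prev_uid = 0;

        for (seq = 1; seq <= hdr->messages_count; seq++) {
                uint32_t uid = recs[seq - 1].uid;

                if (uid == 0 || uid <= prev_uid) {
                        fprintf(stderr, "Record [%u].uid=%u looks corrupted\n",
                                (unsigned)seq, (unsigned)uid);
                        return -1;
                }
                prev_uid = uid;
        }
        return 0;
}

int main(void)
{
        /* The writer meant to append a third message: the header already
         * says 3, but the third record is still all zeroes. */
        struct fake_index_header hdr = { 3 };
        struct fake_index_record recs[3] = { { 10, 0 }, { 11, 0 }, { 0, 0 } };

        return check_view(&hdr, recs) < 0 ? 1 : 0;
}

That's also why trying a different lock_method is worth it: if the locking isn't working as expected, a reader can see exactly this half-written state.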
I found two more view syncing bugs:
http://dovecot.org/list/dovecot-cvs/2007-January/007366.html
http://dovecot.org/list/dovecot-cvs/2007-January/007368.html
Will try asap.
I don't get coredumps anymore, which makes this very hard for me to debug.
You mean Dovecot just doesn't do that, or you did it on purpose for some reason? You should get core dumps as long as:
I believe you added some code that catches these gracefully? I really don't see any coredumps. The user can write, exec=yes, and ulimit -c is set. I used to get coredumps. All I see now are these log entries, not even a signal notice.
possible to make a patch that doesn't copy the files, but dumps the in-memory index someplace?
Yea, I could do that if the above patches don't help. Although the easiest way to debug this would be looking into the core file with gdb..
It must be something very unlikely, as I see this on a tiny fraction of our connections. Something like 1 in every 100000. Could it be some race condition on the lock file, and a second process reads a partial index?
Hmm. I'm beginning to think that 1. might be the most likely problem. Have you tried changing lock_method? Try switching to flock (or to fcntl).
Ok, will do.
Cor
On Sun, 2007-01-14 at 15:08 +0100, Cor Bosman wrote:
You mean Dovecot just doesn't do that, or you did it on purpose for some reason? You should get core dumps as long as:
I believe you added some code that catches these gracefully? I really don't see any coredumps. The user can write, exec=yes, and ulimit -c is set. I used to get coredumps. All I see now are these log entries, not even a signal notice.
Oh, right, of course. :) You can make it crash easily though:

diff -u -r1.44.2.3 mail-index-view.c
--- src/lib-index/mail-index-view.c	28 Dec 2006 20:54:56 -0000	1.44.2.3
+++ src/lib-index/mail-index-view.c	14 Jan 2007 17:15:32 -0000
@@ -222,6 +222,7 @@
 	if (rec->uid == 0) {
 		mail_index_set_error(view->index, "Corrupted Index file %s: "
 			"Record [%u].uid=0", view->index->filepath, seq);
+		abort();
 		mail_index_mark_corrupted(view->index);
 		return -1;
 	}
Hi Timo, I added your patches; they didn't help. Then I changed the locking method from fcntl to flock, which also didn't help. I'm still seeing the problem. I added the abort(), so now I also have coredumps..
(gdb) bt
#0  0x4820afc4 in kill () from /usr/lib/libc.so.4
#1  0x4824c93e in abort () from /usr/lib/libc.so.4
#2  0x8090a4a in _view_lookup_full (view=0x80e1400, seq=45, map_r=0xbfbff31c, rec_r=0xbfbff35c) at mail-index-view.c:225
#3  0x808d97a in _tview_lookup_full (view=0x80e1400, seq=45, map_r=0xbfbff31c, rec_r=0xbfbff35c) at mail-index-transaction-view.c:76
#4  0x80910f4 in mail_index_lookup_full (view=0x80e1400, seq=45, map_r=0xbfbff31c, rec_r=0xbfbff35c) at mail-index-view.c:514
#5  0x80910d8 in mail_index_lookup (view=0x80e1400, seq=45, rec_r=0xbfbff35c) at mail-index-view.c:507
#6  0x807fb4d in index_mail_set_seq (_mail=0x80e9840, seq=45) at index-mail.c:780
#7  0x8098355 in mail_set_seq (mail=0x80e9840, seq=45) at mail.c:26
#8  0x808286e in index_storage_search_next (_ctx=0x80d7680, mail=0x80e9840) at index-search.c:975
#9  0x8099219 in mailbox_search_next (ctx=0x80d7680, mail=0x80e9840) at mail-storage.c:490
#10 0x805b6bc in imap_fetch (ctx=0x80de088) at imap-fetch.c:270
#11 0x8056f63 in cmd_fetch (cmd=0x80db044) at cmd-fetch.c:163
#12 0x80596b6 in cmd_uid (cmd=0x80db044) at cmd-uid.c:19
#13 0x8059f9a in client_handle_input (cmd=0x80db044) at client.c:332
#14 0x805a0c9 in client_handle_input (cmd=0x80db044) at client.c:389
#15 0x805a16e in _client_input (context=0x80db000) at client.c:429
#16 0x80aa20c in io_loop_handler_run (ioloop=0x80d7000) at ioloop-poll.c:199
#17 0x80a9c1d in io_loop_run (ioloop=0x80d7000) at ioloop.c:281
#18 0x8061249 in main (argc=1, argv=0xbfbff5f0, envp=0xbfbff5f8) at main.c:280
I can send you the binary and the core file if you're interested..
Cor