[Dovecot] A couple of dovecot issues I've noted recently...
Timo et al.,
I haven't been reading the list in a while, but wanted to report back that we've transitioned to dovecot 1.0.15 and things look to be working very nicely. The platform is Solaris "Nevada" running on x86. I have been saving up a couple of issues to bring to your attention.
On our server, we gather all core files, so we get a good sense of the different things that are causing crashes. One problem we see occasionally is a crash in which the internal fatal-error handling code goes horribly wrong. Here is an example. The complete stack trace is enormously long-- it appears that we recurse until we run out of stack entirely. In gdb this looks like repeated calls like this:
...
#14507 0x080b22b6 in i_internal_fatal_handler (status=83, fmt=0x0, args=0x76b95d8 "\020") at failures.c:410
#14508 0x080b1e1e in i_fatal_status (status=83, format=0x80cdd64 "mem_block_alloc(): Out of memory when allocating %u bytes") at failures.c:201
#14509 0x080b1652 in mem_block_alloc (min_size=0) at data-stack.c:192
#14510 0x080b1811 in t_malloc_real (size=16, permanent=true) at data-stack.c:244
#14511 0x080b1890 in t_malloc0 (size=16) at data-stack.c:279
#14512 0x080b8715 in pool_datastack_create () at mempool-datastack.c:51
#14513 0x080bda2b in t_str_new (initial_size=512) at str.c:17
#14514 0x080b21f3 in internal_handler (log_type=70 'F', format=0x80cdd64 "mem_block_alloc(): Out of memory when allocating %u bytes", args=0x76b96d8 "\020") at failures.c:383
#14515 0x080b22b6 in i_internal_fatal_handler (status=83, fmt=0x0, args=0x76b96d8 "\020") at failures.c:410
#14516 0x080b1e1e in i_fatal_status (status=83, format=0x80cdd64 "mem_block_alloc(): Out of memory when allocating %u bytes") at failures.c:201
#14517 0x080b1652 in mem_block_alloc (min_size=0) at data-stack.c:192
#14518 0x080b1811 in t_malloc_real (size=16, permanent=true) at data-stack.c:244
#14519 0x080b1890 in t_malloc0 (size=16) at data-stack.c:279
#14520 0x080b8715 in pool_datastack_create () at mempool-datastack.c:51
#14521 0x080bda2b in t_str_new (initial_size=512) at str.c:17
#14522 0x080b21f3 in internal_handler (log_type=70 'F', format=0x80cdd64 "mem_block_alloc(): Out of memory when allocating %u bytes", args=0x76b97d8 "\020") at failures.c:383
#14523 0x080b22b6 in i_internal_fatal_handler (status=83, fmt=0x0, args=0x76b97d8 "\020") at failures.c:410
#14524 0x080b1e1e in i_fatal_status (status=83, format=0x80cdd64 "mem_block_alloc(): Out of memory when allocating %u bytes") at failures.c:201
#14525 0x080b1652 in mem_block_alloc (min_size=0) at data-stack.c:192
#14526 0x080b1811 in t_malloc_real (size=16, permanent=true) at data-stack.c:244
#14527 0x080b1890 in t_malloc0 (size=16) at data-stack.c:279
#14528 0x080b8715 in pool_datastack_create () at mempool-datastack.c:51
#14529 0x080bda2b in t_str_new (initial_size=512) at str.c:17
#14530 0x080b21f3 in internal_handler (log_type=70 'F', format=0x80cdd64 "mem_block_alloc(): Out of memory when allocating %u bytes", args=0x76b98d8 "\020") at failures.c:383
#14531 0x080b22b6 in i_internal_fatal_handler (status=83, fmt=0x0, args=0x76b98d8 "\020") at failures.c:410
#14532 0x080b1e1e in i_fatal_status (status=83, format=0x80cdd64 "mem_block_alloc(): Out of memory when allocating %u bytes") at failures.c:201
#14533 0x080b1652 in mem_block_alloc (min_size=0) at data-stack.c:192
#14534 0x080b1811 in t_malloc_real (size=16, permanent=true) at data-stack.c:244
#14535 0x080b1890 in t_malloc0 (size=16) at data-stack.c:279
#14536 0x080b8715 in pool_datastack_create () at mempool-datastack.c:51
#14537 0x080bda2b in t_str_new (initial_size=512) at str.c:17
#14538 0x080b21f3 in internal_handler (log_type=70 'F', ...
At the top of the stack, I see the following-- unfortunately I can't use gdb to view it, since gdb itself core dumps after about 50,000 frames. In dbx:
[327610] i_internal_fatal_handler(status = 83, fmt = (nil), args = 0x8047798 "^P"), line 410 in "failures.c"
[327611] i_fatal_status(status = 83, format = 0x80cdd64 "mem_block_alloc(): Out of memory when allocating %u bytes", ...), line 201 in "failures.c"
[327612] mem_block_alloc(min_size = 134510504U), line 192 in "data-stack.c"
[327613] t_malloc_real(size = 0, permanent = 0), line 244 in "data-stack.c"
[327614] t_buffer_get(size = 0), line 319 in "data-stack.c"
[327615] _vstrconcat(str1 = (nil), args = 0x8047834 "\x9d?^L^H", ret_len = 0x8047808), line 229 in "strfuncs.c"
[327616] t_strconcat(str1 = 0x100e77f0 "tmp/compiler/venus/handoff/build20.2/CDROM/sol-x86/kits/j2sdk_1.4.2_08/patches/J2SE_Solaris_8_x86_Recommended/108994-44/SUNWcsr/.svn/prop-base", ...), line 335 in "strfuncs.c"
[327617] mbox_list_next(ctx = 0x80f6348), line 273 in "mbox-list.c"
[327618] mbox_mailbox_list_next(_ctx = 0x8047928), line 209 in "mbox-list.c"
[327619] mail_storage_mailbox_list_next(ctx = 0x8047948), line 317 in "mail-storage.c"
[327620] cmd_list_continue(cmd = 0x80eff2c), line 184 in "cmd-list.c"
[327621] client_destroy(client = (nil), reason = 0x80479d8 "øy^D^Hc±^F^Hèþ^N^HÃx^L^H^Hz^D^Hm¨^K^H^P\x8b^N^H^B"), line 94 in "client.c"
[327622] idle_timeout(context = (nil)), line 522 in "client.c"
[327623] io_loop_handle_timeouts(ioloop = (nil), update_run_now = 1U), line 294 in "ioloop.c"
[327624] io_loop_handler_run(ioloop = 0x80ed6e0), line 159 in "ioloop-poll.c"
[327625] io_loop_run(ioloop = (nil)), line 329 in "ioloop.c"
[327626] main(argc = 1, argv = 0x8047afc, envp = 0x8047b04), line 290 in "main.c"
I wondered if anyone else has seen similar problems? It seems like once we fall into i_internal_fatal_handler(), dovecot should really try hard to avoid further dynamic memory allocation, as heap corruption seems to make that a bit of a disaster.
As for the crash in mbox_list_next, we see that one too with some frequency.
A final problem, and this one affects me personally, is that we see some errors like this:
dovecot: Jul 09 11:52:10 Error: IMAP(dp): FETCH for mailbox mail/tmp-inbox UID 92086 got too little data: 6054 vs 6057
dovecot: Jul 09 11:52:10 Info: IMAP(dp): Disconnected: Disconnected
This is frustrating because I wind up with "poisonous" messages in my inbox which cause dovecot to disconnect me when I try to read them. Across all of our users, we see this happen about 1000 times per month.
-dp
-- Daniel Price - Solaris Kernel Engineering - dp@eng.sun.com - blogs.sun.com/dp
On Wed, 2008-07-09 at 17:09 -0700, Dan Price wrote:
[327616] t_strconcat(str1 = 0x100e77f0 "tmp/compiler/venus/handoff/build20.2/CDROM/sol-x86/kits/j2sdk_1.4.2_08/patches/J2SE_Solaris_8_x86_Recommended/108994-44/SUNWcsr/.svn/prop-base", ...), line 335 in "strfuncs.c"
[327617] mbox_list_next(ctx = 0x80f6348), line 273 in "mbox-list.c"
You don't have a separate mail directory, so users sometimes list their entire home directory which can contain a huge directory tree, causing out of memory?
I wondered if anyone else has seen similar problems? It seems like once we fall into i_internal_fatal_handler(), dovecot should really try hard to avoid further dynamic memory allocation,
In other fatal handlers there was actually code to avoid infinitely recursing, but it looks like I forgot to add it to internal_handler(). Fixed: http://hg.dovecot.org/dovecot-1.0/rev/c311c66a2d1b
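The general shape of the guard is something like this (a minimal sketch of the idea only, not the actual changeset above; the names are illustrative):

#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void fatal_handler(int status, const char *fmt, va_list args)
{
	static bool recursed = false;

	if (recursed) {
		/* A second fatal error arrived while we were still reporting
		   the first one: no more allocation or formatting, just bail. */
		static const char msg[] = "fatal handler recursed, aborting\n";

		(void)write(STDERR_FILENO, msg, sizeof(msg) - 1);
		abort();
	}
	recursed = true;

	/* This path may allocate (formatting, buffering) and can therefore
	   fail again under OOM; the guard above keeps that from recursing. */
	vfprintf(stderr, fmt, args);
	fputc('\n', stderr);
	exit(status);
}

static void my_fatal(int status, const char *fmt, ...)
{
	va_list args;

	va_start(args, fmt);
	fatal_handler(status, fmt, args);
	va_end(args);
}

int main(void)
{
	my_fatal(83, "mem_block_alloc(): Out of memory when allocating %u bytes", 16u);
	return 0;
}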
Also v1.1+ sets aside an "out of memory" buffer which is used when logging the "out of memory" error.
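Roughly the idea is this (a sketch of the concept only, not the actual v1.1 code): the buffer used for the out-of-memory message is set aside before it's needed, so reporting the error doesn't itself have to allocate anything:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Buffer reserved up front, so logging an out-of-memory error
   needs no further heap or data-stack allocation. */
static char oom_log_buf[256];

static void log_out_of_memory(size_t size)
{
	snprintf(oom_log_buf, sizeof(oom_log_buf),
		 "mem_block_alloc(): Out of memory when allocating %zu bytes",
		 size);
	(void)write(STDERR_FILENO, oom_log_buf, strlen(oom_log_buf));
	(void)write(STDERR_FILENO, "\n", 1);
}

int main(void)
{
	log_out_of_memory(1024 * 1024);
	return 0;
}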
as heap corruption seems to make that a bit of a disaster.
What heap corruption?
A final problem, and this one affects me personally, is that we see some errors like this:
dovecot: Jul 09 11:52:10 Error: IMAP(dp): FETCH for mailbox mail/tmp-inbox UID 92086 got too little data: 6054 vs 6057
dovecot: Jul 09 11:52:10 Info: IMAP(dp): Disconnected: Disconnected
This is frustrating because I wind up with "poisonous" messages in my inbox which cause dovecot to disconnect me when I try to read them. Across all of our users, we see this happen about 1000 times per month.
v1.1 fixes these automatically, although it still disconnects the first time. Deleting dovecot.index.cache fixes this of course.
But as for why it happens in the first place, that's a bit difficult to guess. Is the difference always 3 bytes? Anything else than Dovecot modifying the mboxes?
On Sun 20 Jul 2008 at 06:05PM, Timo Sirainen wrote:
On Wed, 2008-07-09 at 17:09 -0700, Dan Price wrote:
You don't have a separate mail directory, so users sometimes list their entire home directory which can contain a huge directory tree, causing out of memory?
I rely on users to set their mail directory in their IMAP client appropriately-- this is because we have a historical mix of users with ~/Mail, ~/mail, and in some weird cases other dir names in their home directories where they store their mail.
I don't know why (it seems obvious now) but I hadn't worked out that it was a recursive directory walk which might be causing this. I can now hunt down these users and help them. Thanks!
Also v1.1+ sets aside an "out of memory" buffer which is used when logging the "out of memory" error.
Thanks. I've applied your patch and will roll it out as soon as I can.
What heap corruption?
Sorry, never mind.
A final problem, and this one affects me personally, is that we see some errors like this:
dovecot: Jul 09 11:52:10 Error: IMAP(dp): FETCH for mailbox mail/tmp-inbox UID 92086 got too little data: 6054 vs 6057 ...
But as for why it happens in the first place, that's a bit difficult to guess. Is the difference always 3 bytes? Anything else than Dovecot modifying the mboxes?
Seems to be 3 in most cases. Take a look; I've marked the ones which aren't offset by 3:
$ grep 'got too little' * | awk '{print $6, $17, $19}' | sort | uniq -c
    6 IMAP(dp): 2490 2493 3
    1 IMAP(dp): 5193 5196 3
    3 IMAP(dp): 6054 6057 3
  678 IMAP(helen): 1499 1710 211   <--
   22 IMAP(mimi): 2573 2690 117    <--
  208 IMAP(rav): 5651 5654 3
   69 IMAP(rav): 7549 7552 3
   22 IMAP(will): 3032 3035 3
  397 IMAP(will): 3213 3216 3
    6 IMAP(will): 3861 3864 3
  106 IMAP(will): 444 447 3
My first step will be to blow away the dovecot index files for all of these users. Thanks so much for your help.
-dp
-- Daniel Price - Solaris Kernel Engineering - dp@eng.sun.com - blogs.sun.com/dp
On Jul 21, 2008, at 11:42 PM, Dan Price wrote:
On Sun 20 Jul 2008 at 06:05PM, Timo Sirainen wrote:
On Wed, 2008-07-09 at 17:09 -0700, Dan Price wrote:
You don't have a separate mail directory, so users sometimes list their entire home directory which can contain a huge directory tree, causing out of memory?

I rely on users to set their mail directory in their IMAP client appropriately-- this is because we have a historical mix of users with ~/Mail, ~/mail, and in some weird cases other dir names in their home directories where they store their mail.
You could create a namespace for each of them: http://wiki.dovecot.org/Namespaces
And for the few weird cases just force the users to move their mails
to one of the standard directories.. :)
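For example, something along these lines might work (a hypothetical sketch only-- the exact location values depend on each user's layout, and the syntax should be checked against the wiki page above for the Dovecot version in use):

# Hypothetical dovecot.conf sketch, not a tested configuration.
namespace private {
  separator = /
  prefix =
  inbox = yes
  location = mbox:~/mail:INBOX=/var/mail/%u
}

namespace private {
  separator = /
  prefix = Mail/
  location = mbox:~/Mail
}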
I don't know why (it seems obvious now) but I hadn't worked out that it was a recursive directory walk which might be causing this. I can now hunt down these users and help them. Thanks!
That reminds me I should some day add code to detect and not follow
loops.
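The usual approach is to remember each visited directory's (st_dev, st_ino) pair and stop descending when a directory shows up twice. A rough sketch (illustrative only, not Dovecot code):

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

struct seen_dir { dev_t dev; ino_t ino; };

static struct seen_dir *seen;
static size_t seen_count, seen_alloc;

/* Returns 1 if this directory was already visited (or can't be tracked),
   0 otherwise; in the latter case it is recorded as visited. */
static int already_seen(dev_t dev, ino_t ino)
{
	size_t i;

	for (i = 0; i < seen_count; i++) {
		if (seen[i].dev == dev && seen[i].ino == ino)
			return 1;
	}
	if (seen_count == seen_alloc) {
		size_t new_alloc = seen_alloc == 0 ? 64 : seen_alloc * 2;
		struct seen_dir *tmp = realloc(seen, new_alloc * sizeof(*tmp));

		if (tmp == NULL)
			return 1;
		seen = tmp;
		seen_alloc = new_alloc;
	}
	seen[seen_count].dev = dev;
	seen[seen_count].ino = ino;
	seen_count++;
	return 0;
}

static void walk(const char *path)
{
	struct stat st;
	struct dirent *d;
	DIR *dir;
	char sub[4096];

	if (stat(path, &st) < 0 || !S_ISDIR(st.st_mode))
		return;
	if (already_seen(st.st_dev, st.st_ino))
		return; /* loop (or already-visited directory): don't descend */

	if ((dir = opendir(path)) == NULL)
		return;
	while ((d = readdir(dir)) != NULL) {
		if (strcmp(d->d_name, ".") == 0 || strcmp(d->d_name, "..") == 0)
			continue;
		snprintf(sub, sizeof(sub), "%s/%s", path, d->d_name);
		walk(sub);
	}
	closedir(dir);
}

int main(int argc, char **argv)
{
	walk(argc > 1 ? argv[1] : ".");
	printf("visited %zu unique directories\n", seen_count);
	return 0;
}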
A final problem, and this one affects me personally, is that we see some errors like this:
dovecot: Jul 09 11:52:10 Error: IMAP(dp): FETCH for mailbox mail/tmp-inbox UID 92086 got too little data: 6054 vs 6057 ...
But as for why it happens in the first place, that's a bit
difficult to guess. Is the difference always 3 bytes? Anything else than Dovecot modifying the mboxes?

Seems to be 3 in most cases. Take a look; I've marked the ones which aren't offset by 3:
Hmm. I don't really have any ideas why those could happen, other than
something else besides Dovecot modifying the mboxes..
My first step will be to blow away the dovecot index files for all of these users. Thanks so much for your help.
Deleting dovecot.index.cache file is enough. v1.1 also does this
automatically.