Timo et al.,
I haven't been reading the list in a while, but wanted to report back that we've transitioned to dovecot 1.0.15 and things look to be working very nicely. The platform is Solaris "Nevada" running on x86. I have been saving up a couple of issues to bring to your attention.
On our server we collect all core files, so we get a good sample of the different things that cause crashes. One problem we see on an occasional basis is a crash in which the internal fatal-error handling code itself goes horribly wrong. Here is an example. The complete stack trace is enormous; it appears that we recurse until we run entirely out of stack. In gdb this looks like the same cycle of calls repeated over and over:
...
#14507 0x080b22b6 in i_internal_fatal_handler (status=83, fmt=0x0, args=0x76b95d8 "\020") at failures.c:410
#14508 0x080b1e1e in i_fatal_status (status=83, format=0x80cdd64 "mem_block_alloc(): Out of memory when allocating %u bytes") at failures.c:201
#14509 0x080b1652 in mem_block_alloc (min_size=0) at data-stack.c:192
#14510 0x080b1811 in t_malloc_real (size=16, permanent=true) at data-stack.c:244
#14511 0x080b1890 in t_malloc0 (size=16) at data-stack.c:279
#14512 0x080b8715 in pool_datastack_create () at mempool-datastack.c:51
#14513 0x080bda2b in t_str_new (initial_size=512) at str.c:17
#14514 0x080b21f3 in internal_handler (log_type=70 'F', format=0x80cdd64 "mem_block_alloc(): Out of memory when allocating %u bytes", args=0x76b96d8 "\020") at failures.c:383
#14515 0x080b22b6 in i_internal_fatal_handler (status=83, fmt=0x0, args=0x76b96d8 "\020") at failures.c:410
#14516 0x080b1e1e in i_fatal_status (status=83, format=0x80cdd64 "mem_block_alloc(): Out of memory when allocating %u bytes") at failures.c:201
#14517 0x080b1652 in mem_block_alloc (min_size=0) at data-stack.c:192
#14518 0x080b1811 in t_malloc_real (size=16, permanent=true) at data-stack.c:244
#14519 0x080b1890 in t_malloc0 (size=16) at data-stack.c:279
#14520 0x080b8715 in pool_datastack_create () at mempool-datastack.c:51
#14521 0x080bda2b in t_str_new (initial_size=512) at str.c:17
#14522 0x080b21f3 in internal_handler (log_type=70 'F', format=0x80cdd64 "mem_block_alloc(): Out of memory when allocating %u bytes", args=0x76b97d8 "\020") at failures.c:383
#14523 0x080b22b6 in i_internal_fatal_handler (status=83, fmt=0x0, args=0x76b97d8 "\020") at failures.c:410
#14524 0x080b1e1e in i_fatal_status (status=83, format=0x80cdd64 "mem_block_alloc(): Out of memory when allocating %u bytes") at failures.c:201
#14525 0x080b1652 in mem_block_alloc (min_size=0) at data-stack.c:192
#14526 0x080b1811 in t_malloc_real (size=16, permanent=true) at data-stack.c:244
#14527 0x080b1890 in t_malloc0 (size=16) at data-stack.c:279
#14528 0x080b8715 in pool_datastack_create () at mempool-datastack.c:51
#14529 0x080bda2b in t_str_new (initial_size=512) at str.c:17
#14530 0x080b21f3 in internal_handler (log_type=70 'F', format=0x80cdd64 "mem_block_alloc(): Out of memory when allocating %u bytes", args=0x76b98d8 "\020") at failures.c:383
#14531 0x080b22b6 in i_internal_fatal_handler (status=83, fmt=0x0, args=0x76b98d8 "\020") at failures.c:410
#14532 0x080b1e1e in i_fatal_status (status=83, format=0x80cdd64 "mem_block_alloc(): Out of memory when allocating %u bytes") at failures.c:201
#14533 0x080b1652 in mem_block_alloc (min_size=0) at data-stack.c:192
#14534 0x080b1811 in t_malloc_real (size=16, permanent=true) at data-stack.c:244
#14535 0x080b1890 in t_malloc0 (size=16) at data-stack.c:279
#14536 0x080b8715 in pool_datastack_create () at mempool-datastack.c:51
#14537 0x080bda2b in t_str_new (initial_size=512) at str.c:17
#14538 0x080b21f3 in internal_handler (log_type=70 'F', ...
At the outermost frames of the stack, where the recursion started, I see the following. Unfortunately I can't view this with gdb; gdb itself core dumps after printing about 50,000 frames. In dbx:
[327610] i_internal_fatal_handler(status = 83, fmt = (nil), args = 0x8047798 "^P"), line 410 in "failures.c"
[327611] i_fatal_status(status = 83, format = 0x80cdd64 "mem_block_alloc(): Out of memory when allocating %u bytes", ...), line 201 in "failures.c"
[327612] mem_block_alloc(min_size = 134510504U), line 192 in "data-stack.c"
[327613] t_malloc_real(size = 0, permanent = 0), line 244 in "data-stack.c"
[327614] t_buffer_get(size = 0), line 319 in "data-stack.c"
[327615] _vstrconcat(str1 = (nil), args = 0x8047834 "\x9d?^L^H", ret_len = 0x8047808), line 229 in "strfuncs.c"
[327616] t_strconcat(str1 = 0x100e77f0 "tmp/compiler/venus/handoff/build20.2/CDROM/sol-x86/kits/j2sdk_1.4.2_08/patches/J2SE_Solaris_8_x86_Recommended/108994-44/SUNWcsr/.svn/prop-base", ...), line 335 in "strfuncs.c"
[327617] mbox_list_next(ctx = 0x80f6348), line 273 in "mbox-list.c"
[327618] mbox_mailbox_list_next(_ctx = 0x8047928), line 209 in "mbox-list.c"
[327619] mail_storage_mailbox_list_next(ctx = 0x8047948), line 317 in "mail-storage.c"
[327620] cmd_list_continue(cmd = 0x80eff2c), line 184 in "cmd-list.c"
[327621] client_destroy(client = (nil), reason = 0x80479d8 "øy^D^Hc±^F^Hèþ^N^HÃx^L^H^Hz^D^Hm¨^K^H^P\x8b^N^H^B"), line 94 in "client.c"
[327622] idle_timeout(context = (nil)), line 522 in "client.c"
[327623] io_loop_handle_timeouts(ioloop = (nil), update_run_now = 1U), line 294 in "ioloop.c"
[327624] io_loop_handler_run(ioloop = 0x80ed6e0), line 159 in "ioloop-poll.c"
[327625] io_loop_run(ioloop = (nil)), line 329 in "ioloop.c"
[327626] main(argc = 1, argv = 0x8047afc, envp = 0x8047b04), line 290 in "main.c"
Has anyone else seen similar problems? It seems that once we fall into i_internal_fatal_handler(), dovecot should try hard to avoid any further dynamic memory allocation: if the fatal error was itself an allocation failure, or the heap is corrupt, every new allocation attempt just re-enters the handler, which is exactly the loop shown above.
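For what it's worth, the pattern I'd expect here is a reentrancy guard plus preallocated storage, so the handler never touches the allocator that just failed. Below is a minimal sketch of that idea in plain C. It is not dovecot's code; fatal_handler(), in_fatal, and buf are names I made up for illustration:

    #include <signal.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical fatal handler: guards against re-entry and formats
     * into static storage, so it never allocates. */
    static void fatal_handler(const char *fmt, ...)
    {
        static volatile sig_atomic_t in_fatal = 0;
        static char buf[1024];   /* preallocated: no heap, no data stack */
        va_list args;

        if (in_fatal) {
            /* A second failure while reporting the first: emit a fixed
             * message and die rather than recursing. */
            static const char msg[] = "fatal handler re-entered, aborting\n";
            (void)write(STDERR_FILENO, msg, sizeof(msg) - 1);
            abort();
        }
        in_fatal = 1;

        va_start(args, fmt);
        vsnprintf(buf, sizeof(buf), fmt, args);
        va_end(args);

        (void)write(STDERR_FILENO, buf, strlen(buf));
        abort();
    }

    int main(void)
    {
        /* Simulate the failure from the trace above. */
        fatal_handler("mem_block_alloc(): Out of memory when allocating %u bytes\n", 16u);
        return 0;   /* never reached */
    }

In the traces above, the re-entry happens because internal_handler() builds its message via t_str_new(), which lands right back in the data-stack allocator that just reported the out-of-memory condition.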
As for the crash in mbox_list_next, we see that one too with some frequency.
A final problem, and this one affects me personally, is that we see some errors like this:
dovecot: Jul 09 11:52:10 Error: IMAP(dp): FETCH for mailbox mail/tmp-inbox UID 92086 got too little data: 6054 vs 6057
dovecot: Jul 09 11:52:10 Info: IMAP(dp): Disconnected: Disconnected
This is frustrating because I wind up with "poisonous" messages in my inbox that cause dovecot to disconnect me every time I try to read them. Across all of our users, we see this happen about 1,000 times per month.
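In case it helps triage: my guess, and it is only a guess, is that the two numbers are the bytes dovecot actually read for the message versus the size it had cached, and a 3-byte delta smells like a line-ending (LF vs. CRLF) discrepancy somewhere rather than random corruption. Here is a trivial sketch of the check I mean; the filename argument and the interpretation of the two numbers are my assumptions, not anything I've confirmed in the dovecot source:

    #include <stdio.h>

    /* Count a message file's physical size (raw bytes) and its virtual
     * size (as if every bare LF were CRLF).  If a cached virtual size is
     * ever compared against physical bytes, the mismatch shows up as a
     * small "X vs Y" delta like the one in the log above. */
    int main(int argc, char **argv)
    {
        FILE *f;
        long physical = 0, virtual_size = 0;
        int c, prev = 0;

        if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL) {
            fprintf(stderr, "usage: %s <message-file>\n", argv[0]);
            return 1;
        }
        while ((c = getc(f)) != EOF) {
            physical++;
            virtual_size++;
            if (c == '\n' && prev != '\r')
                virtual_size++;   /* bare LF counted as two bytes (CRLF) */
            prev = c;
        }
        fclose(f);
        printf("physical=%ld virtual=%ld delta=%ld\n",
               physical, virtual_size, virtual_size - physical);
        return 0;
    }

Running something like that against a copy of the offending message would at least tell us whether the 6054 vs 6057 gap matches the number of bare LFs in it.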
-dp
-- Daniel Price - Solaris Kernel Engineering - dp@eng.sun.com - blogs.sun.com/dp