We have been running rc15 since Sunday with no trouble, but today one user's imap process crashed.
rc15 on Solaris 5.9; the server had approx. 6GB of free memory at the time of the crash and was about 70% idle, with an average load of 4 on the 8 available processors. The user's INBOX was quite modest: approx. 13MB containing 320 messages.
The core dump was accompanied by these log entries:
Nov 23 2006 12:27:00 [local3.error] IMAP(USER): Corrupted index cache file imapindex/USER/.imap/INBOX/dovecot.index.cache: invalid field header size
12:28:23 [local3.error] IMAP(USER): mremap_anon(1164206080) failed: Not enough space
12:28:23 [local3.error] IMAP(USER): mremap_anon(8192) failed: Invalid argument
12:28:24 [local3.error] child 17510 (imap) killed with signal 10
12:28:25 [local3.info] imap-login: Login: user=<USER>, method=PLAIN, rip=000.177.00.21, pid=19178
We have been observing a few of the 'invalid field header size' messages while running rc15, but unlike this user, the others seem to recover after the message has been logged.
We have never seen the 'mremap_anon' messages before. Actually, '1164206080' looks very much like a Unix timestamp, differing by 75223 seconds from the time of the crash ... spooky
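As a quick sanity check of that arithmetic, a small C program can read the value as a Unix timestamp and compare it with the crash time from the log, assuming the log times are CET (UTC+1):

#include <stdio.h>
#include <time.h>

int main(void)
{
    /* The size passed to the failing mremap_anon(), read as a time_t. */
    time_t suspect = 1164206080;

    /* Crash time from the log: Nov 23 2006 12:28:23, assumed to be CET
     * (UTC+1), i.e. 11:28:23 UTC.  timegm() is a common BSD/GNU extension. */
    struct tm crash = {0};
    crash.tm_year = 2006 - 1900;
    crash.tm_mon  = 10;                 /* November (tm_mon is 0-based) */
    crash.tm_mday = 23;
    crash.tm_hour = 11;
    crash.tm_min  = 28;
    crash.tm_sec  = 23;
    time_t crash_utc = timegm(&crash);

    printf("suspect value as a date : %s", asctime(gmtime(&suspect)));
    printf("difference from crash   : %ld seconds\n",
           (long)(crash_utc - suspect));        /* prints 75223 */
    return 0;
}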
I will start running the supplied memory-debugging code for this user, and I enclose a backtrace from the crash.
The interesting part of the backtrace is probably that hdr (and therefore cache->hdr) is 0xffffffff, so hdr->continued_record_count makes no sense (Cannot access memory at address 0xb), while cache->index->map->records_count is 320.
Still, on this and another server running rc15 (Solaris 5.8), we have thousands of happy users being served very quickly by Dovecot.
hmk
On Thu, 2006-11-23 at 16:09 +0100, Hans Morten Kind wrote:
12:27:00 [local3.error] IMAP(USER): Corrupted index cache file imapindex/USER/.imap/INBOX/dovecot.index.cache: invalid field header size
12:28:23 [local3.error] IMAP(USER): mremap_anon(1164206080) failed: Not enough space
Hmh. I don't really get how this is possible. Or do you use 32bit file offsets instead of 64bit (which should be the default)? The configure output shows that, at least.
I added a couple of extra checks anyway, but with 64bit file offsets these shouldn't happen, since only 32bit offsets are ever passed to the functions:
http://dovecot.org/list/dovecot-cvs/2006-December/006995.html
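Without claiming this is what the committed change does, the kind of extra check meant here is roughly the following sketch: a 32-bit offset/size pair read from dovecot.index.cache is validated against the real file size before it is used for a read or an allocation (the function name below is hypothetical, not Dovecot's):

#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical sketch, not the actual patch: reject a 32-bit offset/size
 * pair taken from the cache file if it cannot possibly be valid.  With
 * 64-bit off_t the addition below cannot overflow for 32-bit inputs. */
static int cache_range_looks_valid(int fd, uint32_t offset, uint32_t size)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return 0;
    if (size == 0 || size > 16U * 1024 * 1024)
        return 0;               /* arbitrary sanity cap, for the sketch only */
    if ((off_t)offset + (off_t)size > st.st_size)
        return 0;               /* range points past the end of the file */
    return 1;
}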
And BTW, mail process sizes are by default limited to 256MB of virtual memory, so that's why it gave the "not enough space" error. Except it should have given ENOMEM, not ENOSPC, but that's probably some Solaris thing.
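For context, a per-process cap like that is normally put in place with setrlimit() before the process starts serving, so a runaway allocation fails with ENOMEM instead of eating the machine. A minimal sketch (how Dovecot applies mail_process_size may differ in detail):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/resource.h>

int main(void)
{
    /* Cap the data segment at 256MB; on many systems this is what malloc()
     * and anonymous memory growth are charged against.  Sketch only. */
    struct rlimit rl = { 256UL * 1024 * 1024, 256UL * 1024 * 1024 };

    if (setrlimit(RLIMIT_DATA, &rl) < 0) {
        perror("setrlimit");
        return 1;
    }

    /* An allocation far beyond the cap now fails cleanly instead of
     * growing the process without bound. */
    void *p = malloc(1164206080);       /* the size from the log */
    if (p == NULL)
        printf("malloc failed: %s\n", strerror(errno));  /* typically ENOMEM */
    free(p);
    return 0;
}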
We have never seen the 'mremap_anon' messages before. Actually, '1164206080' looks very much like a Unix timestamp, differing by 75223 seconds from the time of the crash ... spooky
Yes, it is..
I will start running the supplied memory-debugging code for this user, and I enclose a backtrace from the crash.
I don't think it will help. Probably something in the cache file is treated as a file offset/size and, in a way I can't see right now, gets passed to file_cache_read().
As for why broken cache files are seen, I guess you're using NFS and multiple machines can access the same user's mailbox? So some cache inconsistency problems probably..
The interesting part of the backtrace is probably that hdr (and therefore cache->hdr) is 0xffffffff, so hdr->continued_record_count makes no sense (Cannot access memory at address 0xb), while cache->index->map->records_count is 320.
This is fixed now:
http://dovecot.org/list/dovecot-cvs/2006-December/006998.html
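One possibly useful observation about that value: in a 32-bit process an unchecked MAP_FAILED return, i.e. (void *)-1, is exactly 0xffffffff, and reading a struct member at a small offset through such a pointer wraps around to a tiny address (offset 12 gives exactly 0xb), which would match the "Cannot access memory at address 0xb" above. A minimal illustration of the failure mode and the check that avoids it (the struct below is made up, not Dovecot's header layout):

#include <stdio.h>
#include <sys/mman.h>

/* Made-up header layout, only to illustrate the pointer arithmetic. */
struct fake_cache_header {
    unsigned int version;
    unsigned int indexid;
    unsigned int file_seq;
    unsigned int continued_record_count;    /* offset 12 in this fake layout */
};

int main(void)
{
    /* A zero-length mapping is invalid, so this mmap() fails and returns
     * MAP_FAILED, which is (void *)-1 == 0xffffffff in a 32-bit process. */
    void *map = mmap(NULL, 0, PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0);

    if (map == MAP_FAILED) {
        printf("mmap failed, MAP_FAILED = %p\n", map);
        /* Skipping this check and doing
         *     struct fake_cache_header *hdr = map;
         *     return hdr->continued_record_count;
         * would compute 0xffffffff + 12, which wraps to 0x0000000b in a
         * 32-bit process -- i.e. gdb's "address 0xb". */
        return 1;
    }
    return 0;
}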
Hmh. I don't really get how this is possible. Or do you use 32bit file offsets instead of 64bit (which should be the default)?
No, we are not doing anything fancy with the indexes; configure reports:
File offsets ........................ : 64bit
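For anyone unsure what that configure line refers to: on a 32-bit platform, 64bit file offsets come from the standard large-file-support macro, which is easy to verify with a trivial program (this is generic LFS usage, nothing Dovecot-specific):

/* Built with _FILE_OFFSET_BITS=64 -- which is what "File offsets: 64bit"
 * implies -- off_t is 8 bytes even in a 32-bit process. */
#define _FILE_OFFSET_BITS 64

#include <stdio.h>
#include <sys/types.h>

int main(void)
{
    printf("sizeof(off_t) = %u bytes\n",
           (unsigned int)sizeof(off_t));    /* expect 8 */
    return 0;
}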
And BTW, mail process sizes are by default limited to 256MB of virtual memory, so that's why it gave the "not enough space" error. Except it should have given ENOMEM, not ENOSPC, but that's probably some Solaris thing.
This error has only shown up once on rc15, and we are still very happy! But do you think we would gain anything by adjusting mail_process_size to e.g. 512?
[ ... ] I don't think it will help. Probably something in the cache file is treated as a file offset/size and, in a way I can't see right now, gets passed to file_cache_read().
http://dovecot.org/list/dovecot-cvs/2006-December/006998.html
http://dovecot.org/list/dovecot-cvs/2006-December/006995.html
I am now running with these patches and have removed the memdebug code.
Thanks for all your help and efforts
hmk
On 5.12.2006, at 17.44, Hans Morten Kind wrote:
And BTW, mail process sizes are by default limited to 256MB of virtual memory, so that's why it gave the "not enough space" error. Except it should have given ENOMEM, not ENOSPC, but that's probably some Solaris thing.

This error has only shown up once on rc15, and we are still very happy! But do you think we would gain anything by adjusting mail_process_size to e.g. 512?
No. It shouldn't ever use that much memory. If it does, it's probably a bug, like in this case, and it's better to fail.
On Tue, Dec 05, 2006 at 04:44:16PM +0100, Hans Morten Kind wrote:
http://dovecot.org/list/dovecot-cvs/2006-December/006998.html
http://dovecot.org/list/dovecot-cvs/2006-December/006995.html
I am now running with these patches and have removed the memdebug code.
But when running with these patches and without the memdebug code, we started dumping cores again. This is Solaris 2.9 with approx. 1500 concurrent users on POP and IMAP, with and without SSL.
All the dumps were like the first one included: 'assertion failed' on '!client->disconnected' in imap/client.c:415.
Then I linked Dovecot with libefence and, voilà, we trapped several crashes like the second backtrace included.
(gdb) p data->rec->flags
Cannot access memory at address 0xfe4e5fdc
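For readers who haven't used it: Electric Fence works by placing each allocation right up against an inaccessible guard page, so a read or write just past the end of a buffer faults immediately at the offending instruction instead of silently returning garbage, which is what makes it trap latent bugs like the one above. A tiny illustration (the out-of-bounds read is deliberate):

/* Compile with something like:  cc -g overrun.c -lefence
 * (assuming libefence is installed); the out-of-bounds read below then
 * dies with SIGSEGV on that exact line instead of going unnoticed. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int *arr = malloc(4 * sizeof(int));
    int i;

    if (arr == NULL)
        return 1;
    for (i = 0; i < 4; i++)
        arr[i] = i;

    /* One element past the end: under Electric Fence this touches the
     * guard page placed directly after the allocation and crashes here. */
    printf("%d\n", arr[4]);

    free(arr);
    return 0;
}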
hmk
On Sun, Dec 03, 2006 at 05:29:41PM +0200, Timo Sirainen wrote:
As for why broken cache files are seen, I guess you're using NFS and multiple machines can access the same user's mailbox? So some cache inconsistency problems probably..
I don't think I answered this.
No, Dovecot and the users can only access their mailboxes from one server. NFS is involved in that several SMTP servers running Exim deliver the mail, but Dovecot and the indexes are kept on one server only. The indexes are not on an NFS volume; the folders are kept in home directories on NFS devices, but Dovecot only accesses them from the one server that holds the indexes locally.
hmk