Re: [Dovecot] dovecot-1.2.8 imap crash (with backtrace)
I found this post from a couple of weeks ago:
http://www.mail-archive.com/dovecot@dovecot.org/msg24127.html
Any update on this? I'm experiencing the same issue. Out of a user population of 4000, I get about 20 of these dumps a day, so it's not a major issue, but I thought I'd follow up.
Here is an example:
I can provide a more detailed debug log from the core dump if necessary, but since the above email already had it I figured I'd wait.
Note this is *not* the same issue as:
- maildir: v1.2.7 and v1.2.8 caused assert-crashes in
maildir_uidlist_records_drop_expunges()
I applied those patches to my 1.2.8 installation before 1.2.9 was released and that immediately fixed the expunge crash, but the array_delete bug is still there.
dovecot -n
bash-3.2# /usr/sbin/dovecot -n
# 1.2.8: /etc/dovecot.conf
# OS: Linux 2.6.18-164.6.1.el5 x86_64 CentOS release 5.4 (Final)
listen: *
ssl_cert_file: /rci/local/certs/imapd-pop3d.pem
ssl_key_file: /rci/local/certs/imapd-pop3d.pem
disable_plaintext_auth: yes
login_dir: /var/run/dovecot/login
login_executable(default): /usr/libexec/dovecot/imap-login
login_executable(imap): /usr/libexec/dovecot/imap-login
login_executable(pop3): /usr/libexec/dovecot/pop3-login
login_max_processes_count: 2048
max_mail_processes: 2048
first_valid_uid: 100
mail_location: maildir:~/Maildir:INDEX=/rci/nqu%h/dovecot:CONTROL=/rci/nqu%h/dovecot
mmap_disable: yes
mail_nfs_storage: yes
mail_nfs_index: yes
mail_drop_priv_before_exec: yes
mail_executable(default): /rci/local/etc/convert-courier-imap.sh
mail_executable(imap): /rci/local/etc/convert-courier-imap.sh
mail_executable(pop3): /rci/local/etc/convert-courier-pop.sh
mail_plugin_dir(default): /usr/lib64/dovecot/imap
mail_plugin_dir(imap): /usr/lib64/dovecot/imap
mail_plugin_dir(pop3): /usr/lib64/dovecot/pop3
pop3_uidl_format(default): %08Xu%08Xv
pop3_uidl_format(imap): %08Xu%08Xv
pop3_uidl_format(pop3): UID%u-%v
namespace:
  type: private
  separator: .
  prefix: INBOX.
  inbox: yes
  list: yes
  subscriptions: yes
auth default:
  passdb:
    driver: pam
    args: dovecot
  userdb:
    driver: passwd
--
David Halik
System Administrator
OIT-CSS Rutgers University
dhalik@jla.rutgers.edu
On Thu, 2009-12-17 at 12:49 -0500, David Halik wrote:
I applied those patches to my 1.2.8 installation before 1.2.9 was released and that immediately fixed the expunge crash, but the array_delete bug is still there.
Do you also see the "duplicate file entry" before the crash?
maildir:~/Maildir:INDEX=/rci/nqu%h/dovecot:CONTROL=/rci/nqu%h/dovecot
Are the index/control files on NFS? Are there multiple different servers accessing mail data?
On 12/17/2009 01:07 PM, Timo Sirainen wrote:
On Thu, 2009-12-17 at 12:49 -0500, David Halik wrote:
I applied those patches to my 1.2.8 installation before 1.2.9 was released and that immediately fixed the expunge crash, but the array_delete bug is still there.
Do you also see the "duplicate file entry" before the crash?
Yes, the duplicate file entry is always reported immediately before the crash, just as Ralf reported. You can see it in this example pastebin I took from one of our users:
maildir:~/Maildir:INDEX=/rci/nqu%h/dovecot:CONTROL=/rci/nqu%h/dovecot
Are the index/control files on NFS? Are there multiple different servers accessing mail data?
Correct. All index, control files, and mail storage are located on NFS, and there are multiple load-balanced servers accessing the NFS data. We're currently running with:
mmap_disable = yes
dotlock_use_excl = yes
fsync_disable = no
mail_nfs_storage = yes
mail_nfs_index = yes
Just wanted to update you that I upgraded all of our servers to 1.2.9 and I'm still seeing the array_delete panic:
Dec 21 12:10:16 gehenna11.rutgers.edu dovecot: IMAP(user1): Panic: file maildir-uidlist.c: line 403 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
Dec 21 12:15:12 gehenna19.rutgers.edu dovecot: IMAP(user2): Panic: file maildir-uidlist.c: line 403 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
I also started receiving a good deal of these, but only from one user so far:
Dec 21 12:16:42 gehenna14.rutgers.edu dovecot: IMAP(user3): Panic: file cmd-list.c: line 242 (ns_get_listed_prefix): assertion failed: (match == IMAP_MATCH_YES)
Dec 21 12:18:20 gehenna14.rutgers.edu dovecot: IMAP(user3): Panic: file cmd-list.c: line 242 (ns_get_listed_prefix): assertion failed: (match == IMAP_MATCH_YES)
Dec 21 12:18:20 gehenna14.rutgers.edu dovecot: IMAP(user3): Panic: file cmd-list.c: line 242 (ns_get_listed_prefix): assertion failed: (match == IMAP_MATCH_YES)
Dec 21 12:19:57 gehenna14.rutgers.edu dovecot: IMAP(user3): Panic: file cmd-list.c: line 242 (ns_get_listed_prefix): assertion failed: (match == IMAP_MATCH_YES)
Let me know if you need full backtraces from the core dump.
I'm seeing both of these dumps on multiple users now with 1.2.9, so I went ahead and did backtraces for them both.
maildir_uidlist_records_array_delete panic: http://pastebin.com/f20614d8
ns_get_listed_prefix panic: http://pastebin.com/f1420194c
We've started seeing the maildir_uidlist_records_array_delete assert crash as well. It always seems to be preceded by a 'Stale NFS file handle' error from the same user on a different connection.
Dec 22 10:12:20 oh-popmap5p dovecot: imap: user=<apbao>, rip=a.a.a.a, pid=2439: fdatasync(/home11/apbao/Maildir/dovecot-uidlist) failed: Stale NFS file handle
Dec 22 10:12:20 oh-popmap5p dovecot: imap: user=<apbao>, rip=a.a.a.a, pid=2439: /home11/apbao/Maildir/dovecot-uidlist: next_uid was lowered (2642 -> 2641, hdr=2641)
Dec 22 11:17:26 cc-popmap2p dovecot: imap: user=<apbao>, rip=b.b.b.b, pid=28088: Panic: file maildir-uidlist.c: line 403 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
Dec 22 11:17:26 cc-popmap2p dovecot: imap: user=<apbao>, rip=b.b.b.b, pid=28088: Raw backtrace: imap [0x4d8986] -> imap [0x4d97b0] -> imap(i_fatal+0) [0x4d8c7a] -> imap [0x44f2cc] -> imap [0x44f814] -> imap [0x4500a2] -> imap(maildir_uidlist_refresh+0x9d) [0x450686] -> imap [0x44bff1] -> imap [0x44c0a8] -> imap [0x44c178] -> imap(maildir_storage_sync_init+0x7c) [0x44c6e6] -> imap(mailbox_sync_init+0x44) [0x489922] -> imap(imap_sync_init+0xab) [0x42e02b] -> imap [0x42f107] -> imap(cmd_sync_delayed+0x1c6) [0x42f663] -> imap(client_handle_input+0x119) [0x4244d4] -> imap(client_input+0xb4) [0x424594] -> imap(io_loop_handler_run+0x17d) [0x4e5020] -> imap(io_loop_run+0x3b) [0x4e4214] -> imap(main+0xa6) [0x4300af] -> /lib64/libc.so.6(__libc_start_main+0xf4) [0x3c4ea1d994] -> imap [0x419aa9]
Dec 22 11:17:26 cc-popmap2p dovecot: dovecot: child 28088 (imap) killed with signal 6 (core dumped)
Dec 22 13:16:49 cc-popmap3p dovecot: imap: user=<ndunn>, rip=x.x.x.x, pid=3908: fdatasync(/home2/ndunn/Maildir/dovecot-uidlist) failed: Stale NFS file handle
Dec 22 13:25:16 cc-popmap3p dovecot: imap: user=<ndunn>, rip=y.y.y.y, pid=3228: Panic: file maildir-uidlist.c: line 403 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
Dec 22 13:25:16 cc-popmap3p dovecot: imap: user=<ndunn>, rip=y.y.y.y, pid=3228: Raw backtrace: imap [0x4d8986] -> imap [0x4d97b0] -> imap(i_fatal+0) [0x4d8c7a] -> imap [0x44f2cc] -> imap [0x44f814] -> imap [0x4500a2] -> imap(maildir_uidlist_refresh+0x9d) [0x450686] -> imap [0x44bff1] -> imap [0x44c0a8] -> imap [0x44c178] -> imap(maildir_storage_sync_init+0x7c) [0x44c6e6] -> imap(mailbox_sync_init+0x44) [0x489922] -> imap(imap_sync_init+0xab) [0x42e02b] -> imap [0x42f107] -> imap(cmd_sync_delayed+0x1c6) [0x42f663] -> imap(client_handle_input+0x119) [0x4244d4] -> imap(client_input+0xb4) [0x424594] -> imap(io_loop_handler_run+0x17d) [0x4e5020] -> imap(io_loop_run+0x3b) [0x4e4214] -> imap(main+0xa6) [0x4300af] -> /lib64/libc.so.6(__libc_start_main+0xf4) [0x3e5021d994] -> imap [0x419aa9]
Dec 22 13:25:16 cc-popmap3p dovecot: dovecot: child 3228 (imap) killed with signal 6 (core dumped)
I will note that we did not start seeing this crash until we took 'noac' out of our NFS mount options, as discussed on this list late last week. On the other hand, load on our NFS server (as measured in IOPS/sec) has dropped by a factor of 10.
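The trade-off here comes down to NFS attribute caching. A hypothetical /etc/fstab sketch (server name and paths are made up; see nfs(5) for the exact semantics of noac and actimeo):

```shell
# With noac: attribute caching disabled. Every stat() goes to the NFS
# server, so all clients always see fresh metadata, at the cost of far
# more GETATTR traffic (roughly the 10x IOPS difference noted above).
#nfsserver:/export/mail  /home  nfs  rw,hard,intr,noac  0 0

# Without noac: attributes are cached for a tunable window (actimeo),
# which slashes server load but lets two clients briefly disagree about
# a file's state -- the window in which stale-handle races can surface.
nfsserver:/export/mail   /home  nfs  rw,hard,intr,actimeo=3  0 0
```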
-Brad
I just double checked and we're seeing the same thing here, although we were never running with "noac" to begin with. Our poor NFS server would probably melt. ;)
This is immediately before the crash:
Dec 22 13:09:20 gehenna14.rutgers.edu dovecot: IMAP(user): fdatasync(/rci/nqu/rci/u5/bruben/dovecot/.INBOX/dovecot-uidlist) failed: Stale NFS file handle
Dec 22 13:09:20 gehenna14.rutgers.edu dovecot: IMAP(user): /rci/nqu/rci/u5/bruben/dovecot/.INBOX/dovecot-uidlist: next_uid was lowered (1507 -> 1506, hdr=1506)
Dec 22 13:12:06 gehenna14.rutgers.edu dovecot: IMAP(user): /rci/nqu/rci/u5/bruben/dovecot/.INBOX/dovecot-uidlist: Duplicate file entry at line 3: 1249348137.M369064P26267V04240006I01ED7E09_0.gehenna5.rutgers.edu,S=15857:2,S (uid 1208 -> 1508)
On 22.12.2009, at 16.42, David Halik wrote:
I just double checked and we're seeing the same thing here, although we were never running with "noac" to begin with. Our poor NFS server would probably melt. ;)
This is immediately before the crash:
Dec 22 13:09:20 gehenna14.rutgers.edu dovecot: IMAP(user): fdatasync(/rci/nqu/rci/u5/bruben/dovecot/.INBOX/dovecot-uidlist) failed: Stale NFS file handle
The interesting thing is that this function is called only after dovecot-uidlist.lock has been created, i.e. while the uidlist is locked, meaning nothing should have deleted dovecot-uidlist while it was being written to, so this error just shouldn't happen. So apparently the dotlocking just isn't working properly for your NFS servers. See if dotlock_use_excl=no helps?
I switched all of our servers to dotlock_use_excl=no last night, but we're still seeing the errors:
Dec 23 08:22:04 gehenna17.rutgers.edu dovecot: imap-login: Login: user=<user1>, method=PLAIN, rip=xxx.xxx.xxx.xxx, lip=xxx.xxx.xxx.xxx, TLS
Dec 23 08:36:21 gehenna11.rutgers.edu dovecot: imap-login: Login: user=<user1>, method=PLAIN, rip=xxx.xxx.xxx.xxx, lip=xxx.xxx.xxx.xxx, TLS
Dec 23 09:17:09 gehenna18.rutgers.edu dovecot: imap-login: Login: user=<user1>, method=PLAIN, rip=xxx.xxx.xxx.xxx, lip=xxx.xxx.xxx.xxx, TLS
Dec 23 10:59:37 gehenna11.rutgers.edu dovecot: IMAP(user1): fdatasync(/rci/nqu/rci/u1/user1/dovecot/.INBOX/dovecot-uidlist) failed: Stale NFS file handle
Dec 23 10:59:37 gehenna11.rutgers.edu dovecot: IMAP(user1): /rci/nqu/rci/u1/user1/dovecot/.INBOX/dovecot-uidlist: next_uid was lowered (512 -> 511, hdr=511)
Dec 23 11:05:43 gehenna11.rutgers.edu dovecot: IMAP(user1): Maildir /rci/u1/user1/Maildir: Expunged message reappeared, giving a new UID (old uid=1, file=1231949492.M325236P5922V03E80009I01B6DCA5_0.gehenna9.rutgers.edu,S=4398:2,S)
Dec 23 11:05:44 gehenna18.rutgers.edu dovecot: IMAP(user1): /rci/nqu/rci/u1/user1/dovecot/.INBOX/dovecot-uidlist: Duplicate file entry at line 4: 1231949492.M325236P5922V03E80009I01B6DCA5_0.gehenna9.rutgers.edu,S=4398:2,S (uid 1 -> 514)
Dec 23 11:05:44 gehenna18.rutgers.edu dovecot: IMAP(user1): Panic: file maildir-uidlist.c: line 403 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
It must be related to locking though, because we can see that the user is logged in on more than one server, and the issues occur when the two dovecot instances fight over the uidlist.
I'm seeing much more of the second error, though. A few users are triggering it on almost every connection. This doesn't appear to be locking related, since as far as I can tell the user is logged in on only one server, and there are also no NFS or duplicate file errors:
Dec 23 11:18:32 gehenna17.rutgers.edu dovecot: imap-login: Login: user=<user2>, method=PLAIN, rip=xxx.xxx.xxx.xxx, lip=xxx.xxx.xxx.xxx, TLS
Dec 23 11:18:32 gehenna17.rutgers.edu dovecot: IMAP(user2): Panic: file cmd-list.c: line 242 (ns_get_listed_prefix): assertion failed: (match == IMAP_MATCH_YES)
Dec 23 11:18:32 gehenna17.rutgers.edu dovecot: IMAP(user2): Raw backtrace: /usr/libexec/dovecot/imap [0x49ed50] -> /usr/libexec/dovecot/imap [0x49edb3] -> /usr/libexec/dovecot/imap [0x49e416] -> /usr/libexec/dovecot/imap [0x41c7c0] -> /usr/libexec/dovecot/imap [0x41c9c2] -> /usr/libexec/dovecot/imap(cmd_list_full+0x49a) [0x41d60a] -> /usr/libexec/dovecot/imap(cmd_list+0xb) [0x41d8ab] -> /usr/libexec/dovecot/imap [0x4208cc] -> /usr/libexec/dovecot/imap [0x420982] -> /usr/libexec/dovecot/imap(client_handle_input+0x3f) [0x420abf] -> /usr/libexec/dovecot/imap(client_input+0x5f) [0x42160f] -> /usr/libexec/dovecot/imap(io_loop_handler_run+0xf8) [0x4a6bd8] -> /usr/libexec/dovecot/imap(io_loop_run+0x1d) [0x4a5ced] -> /usr/libexec/dovecot/imap(main+0x620) [0x428ef0] -> /lib64/libc.so.6(__libc_start_main+0xf4) [0x38af41d994] -> /usr/libexec/dovecot/imap [0x419ac9]
Dec 23 11:18:33 gehenna17.rutgers.edu dovecot: imap-login: Login: user=<user2>, method=PLAIN, rip=xxx.xxx.xxx.xxx, lip=xxx.xxx.xxx.xxx, TLS
Dec 23 11:18:33 gehenna17.rutgers.edu dovecot: IMAP(user2): Panic: file cmd-list.c: line 242 (ns_get_listed_prefix): assertion failed: (match == IMAP_MATCH_YES)
Dec 23 11:18:33 gehenna17.rutgers.edu dovecot: IMAP(user2): Raw backtrace: /usr/libexec/dovecot/imap [0x49ed50] -> /usr/libexec/dovecot/imap [0x49edb3] -> /usr/libexec/dovecot/imap [0x49e416] -> /usr/libexec/dovecot/imap [0x41c7c0] -> /usr/libexec/dovecot/imap [0x41c9c2] -> /usr/libexec/dovecot/imap(cmd_list_full+0x49a) [0x41d60a] -> /usr/libexec/dovecot/imap(cmd_list+0xb) [0x41d8ab] -> /usr/libexec/dovecot/imap [0x4208cc] -> /usr/libexec/dovecot/imap [0x420982] -> /usr/libexec/dovecot/imap(client_handle_input+0x3f) [0x420abf] -> /usr/libexec/dovecot/imap(client_input+0x5f) [0x42160f] -> /usr/libexec/dovecot/imap(io_loop_handler_run+0xf8) [0x4a6bd8] -> /usr/libexec/dovecot/imap(io_loop_run+0x1d) [0x4a5ced] -> /usr/libexec/dovecot/imap(main+0x620) [0x428ef0] -> /lib64/libc.so.6(__libc_start_main+0xf4) [0x38af41d994] -> /usr/libexec/dovecot/imap [0x419ac9]
On Wed, 2009-12-23 at 11:37 -0500, David Halik wrote:
Dec 23 11:05:43 gehenna11.rutgers.edu dovecot: IMAP(user1): Maildir /rci/u1/user1/Maildir: Expunged message reappeared, giving a new UID (old uid=1, file=1231949492.M325236P5922V03E80009I01B6DCA5_0.gehenna9.rutgers.edu,S=4398:2,S)
The above causes:
Dec 23 11:05:44 gehenna18.rutgers.edu dovecot: IMAP(user1): /rci/nqu/rci/u1/user1/dovecot/.INBOX/dovecot-uidlist: Duplicate file entry at line 4: 1231949492.M325236P5922V03E80009I01B6DCA5_0.gehenna9.rutgers.edu,S=4398:2,S (uid 1 -> 514)
Although in my tests it doesn't.
Dec 23 11:05:44 gehenna18.rutgers.edu dovecot: IMAP(user1): Panic: file maildir-uidlist.c: line 403 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
This is hopefully fixed by: http://hg.dovecot.org/dovecot-1.2/rev/d00abdaf9df2
Dec 23 11:18:32 gehenna17.rutgers.edu dovecot: IMAP(user2): Panic: file cmd-list.c: line 242 (ns_get_listed_prefix): assertion failed: (match == IMAP_MATCH_YES)
So you get a core dump? Could you print a few values:
fr 6 (or whatever starts giving the right values)
p *ctx
p *ctx.ns
p ctx.patterns[0]
p ctx.patterns[1]
p ctx.patterns[2]
p match
p ns_prefix
p p
The ctx.patterns values are the most important ones I think.
Dec 23 11:05:44 gehenna18.rutgers.edu dovecot: IMAP(user1): Panic: file maildir-uidlist.c: line 403 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
This is hopefully fixed by: http://hg.dovecot.org/dovecot-1.2/rev/d00abdaf9df2
Thanks for the help as always. I'm currently patching 1.2.9 with this and pushing the changes out to our dovecot servers. I'll update you later today if we stop seeing the panic.
Dec 23 11:18:32 gehenna17.rutgers.edu dovecot: IMAP(user2): Panic: file cmd-list.c: line 242 (ns_get_listed_prefix): assertion failed: (match == IMAP_MATCH_YES)
So you get a core dump? Could you print a few values:
fr 6 (or whatever starts giving the right values)
p *ctx
p *ctx.ns
p ctx.patterns[0]
p ctx.patterns[1]
p ctx.patterns[2]
p match
p ns_prefix
p p
The ctx.patterns values are the most important ones I think.
The values from frames 5-7 all look the same. Here are the values from frame 6, though:
(gdb) fr 6
#6  0x000000000041c9c2 in cmd_list_continue (cmd=0x1212dcc8) at cmd-list.c:375
375         list_namespace_send_prefix(ctx, TRUE);
(gdb) p *ctx
$17 = {cmd = 0x1212dcc8, ref = 0x12132b50 "", patterns = 0x1212de48, list_flags = 16386, status_items = 0, ns = 0x1212bf70, list_iter = 0x121304f8, ns_prefixes_listed = {arr = {buffer = 0x1212de58, element_size = 8}, v = 0x1212de58, v_modifiable = 0x1212de58}, lsub = 0, lsub_no_unsubscribed = 0, inbox_found = 1, seen_inbox_namespace = 1, cur_ns_match_inbox = 1, cur_ns_send_prefix = 0, cur_ns_skip_trailing_sep = 1, used_listext = 0}
(gdb) p *ctx.ns
$18 = {next = 0x0, type = NAMESPACE_PRIVATE, sep = 46 '.', real_sep = 46 '.', sep_str = ".\000", flags = 8213, prefix = 0x1212bfd0 "INBOX.", prefix_len = 6, alias_for = 0x0, alias_chain_next = 0x0, user = 0x1212b988, owner = 0x1212b988, list = 0x1212c348, 0x1212c038}
(gdb) p ctx.patterns[0]
$19 = 0x12132b58 "Inbox"
(gdb) p ctx.patterns[1]
$20 = 0x0
(gdb) p ctx.patterns[2]
$21 = 0x1212de90 ""
(gdb) p match
No symbol "match" in current context.
(gdb) p ns_prefix
No symbol "ns_prefix" in current context.
(gdb) p p
No symbol "p" in current context.
...and here's the full trace for reference
On Wed, 2009-12-23 at 14:06 -0500, David Halik wrote:
Dec 23 11:18:32 gehenna17.rutgers.edu dovecot: IMAP(user2): Panic: file cmd-list.c: line 242 (ns_get_listed_prefix): assertion failed: (match == IMAP_MATCH_YES)
It looks like this and the previous patch you gave me fixed both panics as intended. I pushed the changes out about five hours ago and haven't seen a single core dump yet! Thanks for such a quick fix Timo, you rock as always. If I see any change in behavior I'll post, but I think that did it.
I have one or two squat panics that I'm going to send in after the holidays, but they're much less of a concern since we don't run them on our production systems yet and they're not that big of a deal.
Thanks again.
On 12/23/2009 4:01 PM, Timo Sirainen wrote:
Dec 23 11:05:44 gehenna18.rutgers.edu dovecot: IMAP(user1): Panic: file maildir-uidlist.c: line 403 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
This is hopefully fixed by: http://hg.dovecot.org/dovecot-1.2/rev/d00abdaf9df2
Timo, I came into work this morning and found three new core dumps from maildir_uidlist_records_array_delete while running the patch you gave me yesterday, so I guess it's not fixed after all. I double-checked that it is in fact running your patch, and it is. You can also tell because maildir-uidlist.c line 403 changed to 405.
Here's the short backtrace:
#0  0x0000003543030265 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003543031d10 in abort () at abort.c:88
#2  0x000000000049eddd in default_fatal_finish (type=<value optimized out>, status=0) at failures.c:160
#3  0x000000000049ee33 in i_internal_fatal_handler (type=LOG_TYPE_PANIC, status=0, fmt=<value optimized out>, args=<value optimized out>) at failures.c:443
#4  0x000000000049e496 in i_panic (format=0x6) at failures.c:207
#5  0x000000000043f064 in maildir_uidlist_records_array_delete (uidlist=<value optimized out>, rec=0x16dbd48) at maildir-uidlist.c:405
#6  0x000000000043fbf0 in maildir_uidlist_refresh (uidlist=0x16c75b0) at maildir-uidlist.c:552
#7  0x000000000043c91f in maildir_sync_context (ctx=0x16a7188, forced=false, find_uid=0x0, lost_files_r=0x7fff86f83d87) at maildir-sync.c:659
#8  0x000000000043cefe in maildir_storage_sync_init (box=0x16b64f8, flags=0) at maildir-sync.c:924
#9  0x00000000004283d7 in imap_sync_init (client=0x16b27a0, box=0x16b64f8, imap_flags=0, flags=0) at imap-sync.c:146
#10 0x000000000041bf5e in idle_sync_now (box=0x4a58, ctx=0x16b3db8) at cmd-idle.c:127
#11 0x0000000000458ecd in check_timeout (ibox=0x16b64f8) at index-mailbox-check.c:43
#12 0x00000000004a5f6b in io_loop_handle_timeouts (ioloop=0x16afb60) at ioloop.c:313
#13 0x00000000004a6bd3 in io_loop_handler_run (ioloop=0x16afb60) at ioloop-epoll.c:180
#14 0x00000000004a5d6d in io_loop_run (ioloop=0x16afb60) at ioloop.c:335
#15 0x0000000000428f20 in main (argc=<value optimized out>, argv=<value optimized out>, envp=0x7fff86f840b8) at main.c:327

And the full backtrace:
I should probably also post the messages leading up to it for reference. Note that I did not see any stale NFS messages this time, but did get the usual duplicate file messages:
Dec 24 10:43:07 gehenna11 dovecot: IMAP(user): /rci/nqu/rci/u1/user/dovecot/.INBOX/dovecot-uidlist: Duplicate file entry at line 4: 1231949492.M325236P5922V03E80009I01B6DCA5_0.gehenna9.rutgers.edu,S=4398:2,S (uid 514 -> 721)
Dec 24 10:43:08 gehenna11 dovecot: IMAP(user): Maildir /rci/u1/user/Maildir: Expunged message reappeared, giving a new UID (old uid=516, file=1231952516.M164875P24494V03E80009I01B6DCAC_0.gehenna9.rutgers.edu,S=4355:2,S)
Dec 24 10:44:38 gehenna11 dovecot: IMAP(user): /rci/nqu/rci/u1/user/dovecot/.INBOX/dovecot-uidlist: Duplicate file entry at line 6: 1233524267.M97899P9543V04240006I01B6E908_0.gehenna5.rutgers.edu,S=3039:2,S (uid 517 -> 723)
Dec 24 10:44:38 gehenna11 dovecot: IMAP(user): Panic: file maildir-uidlist.c: line 405 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
Dec 24 10:44:38 gehenna11 dovecot: IMAP(user): Raw backtrace: /usr/libexec/dovecot/imap [0x49edd0] -> /usr/libexec/dovecot/imap [0x49ee33] -> /usr/libexec/dovecot/imap [0x49e496] -> /usr/libexec/dovecot/imap [0x43f064] -> /usr/libexec/dovecot/imap(maildir_uidlist_refresh+0x2d0) [0x43fbf0] -> /usr/libexec/dovecot/imap [0x43c91f] -> /usr/libexec/dovecot/imap(maildir_storage_sync_init+0x14e) [0x43cefe] -> /usr/libexec/dovecot/imap(imap_sync_init+0x67) [0x4283d7] -> /usr/libexec/dovecot/imap [0x41bf5e] -> /usr/libexec/dovecot/imap [0x458ecd] -> /usr/libexec/dovecot/imap(io_loop_handle_timeouts+0x8b) [0x4a5f6b] -> /usr/libexec/dovecot/imap(io_loop_handler_run+0x73) [0x4a6bd3] -> /usr/libexec/dovecot/imap(io_loop_run+0x1d) [0x4a5d6d] -> /usr/libexec/dovecot/imap(main+0x620) [0x428f20] -> /lib64/libc.so.6(__libc_start_main+0xf4) [0x354301d994] -> /usr/libexec/dovecot/imap [0x419ac9]
Dec 24 10:44:38 gehenna11 dovecot: dovecot: child 19032 (imap) killed with signal 6 (core dumped)
On 12/24/2009 11:26 AM, David Halik wrote:
Dec 23 11:05:44 gehenna18.rutgers.edu dovecot: IMAP(user1): Panic: file maildir-uidlist.c: line 403 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)

This is hopefully fixed by: http://hg.dovecot.org/dovecot-1.2/rev/d00abdaf9df2
Timo, I came into work this morning and found three new core dumps from maildir_uidlist_records_array_delete while running the patch you gave me yesterday, so I guess it's not fixed after all. I double-checked that it is in fact running your patch, and it is. You can also tell because maildir-uidlist.c line 403 changed to 405.
Here's the short backtrace:
#0  0x0000003543030265 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003543031d10 in abort () at abort.c:88
#2  0x000000000049eddd in default_fatal_finish (type=<value optimized out>, status=0) at failures.c:160
#3  0x000000000049ee33 in i_internal_fatal_handler (type=LOG_TYPE_PANIC, status=0, fmt=<value optimized out>, args=<value optimized out>) at failures.c:443
#4  0x000000000049e496 in i_panic (format=0x6) at failures.c:207
#5  0x000000000043f064 in maildir_uidlist_records_array_delete (uidlist=<value optimized out>, rec=0x16dbd48) at maildir-uidlist.c:405
#6  0x000000000043fbf0 in maildir_uidlist_refresh (uidlist=0x16c75b0) at maildir-uidlist.c:552
#7  0x000000000043c91f in maildir_sync_context (ctx=0x16a7188, forced=false, find_uid=0x0, lost_files_r=0x7fff86f83d87) at maildir-sync.c:659
#8  0x000000000043cefe in maildir_storage_sync_init (box=0x16b64f8, flags=0) at maildir-sync.c:924
#9  0x00000000004283d7 in imap_sync_init (client=0x16b27a0, box=0x16b64f8, imap_flags=0, flags=0) at imap-sync.c:146
#10 0x000000000041bf5e in idle_sync_now (box=0x4a58, ctx=0x16b3db8) at cmd-idle.c:127
#11 0x0000000000458ecd in check_timeout (ibox=0x16b64f8) at index-mailbox-check.c:43
#12 0x00000000004a5f6b in io_loop_handle_timeouts (ioloop=0x16afb60) at ioloop.c:313
#13 0x00000000004a6bd3 in io_loop_handler_run (ioloop=0x16afb60) at ioloop-epoll.c:180
#14 0x00000000004a5d6d in io_loop_run (ioloop=0x16afb60) at ioloop.c:335
#15 0x0000000000428f20 in main (argc=<value optimized out>, argv=<value optimized out>, envp=0x7fff86f840b8) at main.c:327

And the full backtrace:
Hey Timo, hope you enjoyed your vacation. :) I just wanted to point out that I'm still seeing the (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL) dump even after your patch. The (match == IMAP_MATCH_YES) dump is definitely fixed, but the other dump still remains...
On 12/24/2009 11:39 AM, David Halik wrote:
I should probably also post the messages leading up to it for reference. Note that I did not see any stale NFS messages this time, but did get the usual duplicate file messages:
Dec 24 10:43:07 gehenna11 dovecot: IMAP(user): /rci/nqu/rci/u1/user/dovecot/.INBOX/dovecot-uidlist: Duplicate file entry at line 4: 1231949492.M325236P5922V03E80009I01B6DCA5_0.gehenna9.rutgers.edu,S=4398:2,S (uid 514 -> 721)
Dec 24 10:43:08 gehenna11 dovecot: IMAP(user): Maildir /rci/u1/user/Maildir: Expunged message reappeared, giving a new UID (old uid=516, file=1231952516.M164875P24494V03E80009I01B6DCAC_0.gehenna9.rutgers.edu,S=4355:2,S)
Dec 24 10:44:38 gehenna11 dovecot: IMAP(user): /rci/nqu/rci/u1/user/dovecot/.INBOX/dovecot-uidlist: Duplicate file entry at line 6: 1233524267.M97899P9543V04240006I01B6E908_0.gehenna5.rutgers.edu,S=3039:2,S (uid 517 -> 723)
Dec 24 10:44:38 gehenna11 dovecot: IMAP(user): Panic: file maildir-uidlist.c: line 405 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
Dec 24 10:44:38 gehenna11 dovecot: IMAP(user): Raw backtrace: /usr/libexec/dovecot/imap [0x49edd0] -> /usr/libexec/dovecot/imap [0x49ee33] -> /usr/libexec/dovecot/imap [0x49e496] -> /usr/libexec/dovecot/imap [0x43f064] -> /usr/libexec/dovecot/imap(maildir_uidlist_refresh+0x2d0) [0x43fbf0] -> /usr/libexec/dovecot/imap [0x43c91f] -> /usr/libexec/dovecot/imap(maildir_storage_sync_init+0x14e) [0x43cefe] -> /usr/libexec/dovecot/imap(imap_sync_init+0x67) [0x4283d7] -> /usr/libexec/dovecot/imap [0x41bf5e] -> /usr/libexec/dovecot/imap [0x458ecd] -> /usr/libexec/dovecot/imap(io_loop_handle_timeouts+0x8b) [0x4a5f6b] -> /usr/libexec/dovecot/imap(io_loop_handler_run+0x73) [0x4a6bd3] -> /usr/libexec/dovecot/imap(io_loop_run+0x1d) [0x4a5d6d] -> /usr/libexec/dovecot/imap(main+0x620) [0x428f20] -> /lib64/libc.so.6(__libc_start_main+0xf4) [0x354301d994] -> /usr/libexec/dovecot/imap [0x419ac9]
Dec 24 10:44:38 gehenna11 dovecot: dovecot: child 19032 (imap) killed with signal 6 (core dumped)
On 12/24/2009 11:26 AM, David Halik wrote:
Dec 23 11:05:44 gehenna18.rutgers.edu dovecot: IMAP(user1): Panic: file maildir-uidlist.c: line 403 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)

This is hopefully fixed by: http://hg.dovecot.org/dovecot-1.2/rev/d00abdaf9df2
Timo, I came into work this morning and found three new core dumps from maildir_uidlist_records_array_delete while running the patch you gave me yesterday, so I guess it's not fixed after all. I double-checked that it is in fact running your patch, and it is. You can also tell because maildir-uidlist.c line 403 changed to 405.
Here's the short backtrace:
#0  0x0000003543030265 in raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003543031d10 in abort () at abort.c:88
#2  0x000000000049eddd in default_fatal_finish (type=<value optimized out>, status=0) at failures.c:160
#3  0x000000000049ee33 in i_internal_fatal_handler (type=LOG_TYPE_PANIC, status=0, fmt=<value optimized out>, args=<value optimized out>) at failures.c:443
#4  0x000000000049e496 in i_panic (format=0x6) at failures.c:207
#5  0x000000000043f064 in maildir_uidlist_records_array_delete (uidlist=<value optimized out>, rec=0x16dbd48) at maildir-uidlist.c:405
#6  0x000000000043fbf0 in maildir_uidlist_refresh (uidlist=0x16c75b0) at maildir-uidlist.c:552
#7  0x000000000043c91f in maildir_sync_context (ctx=0x16a7188, forced=false, find_uid=0x0, lost_files_r=0x7fff86f83d87) at maildir-sync.c:659
#8  0x000000000043cefe in maildir_storage_sync_init (box=0x16b64f8, flags=0) at maildir-sync.c:924
#9  0x00000000004283d7 in imap_sync_init (client=0x16b27a0, box=0x16b64f8, imap_flags=0, flags=0) at imap-sync.c:146
#10 0x000000000041bf5e in idle_sync_now (box=0x4a58, ctx=0x16b3db8) at cmd-idle.c:127
#11 0x0000000000458ecd in check_timeout (ibox=0x16b64f8) at index-mailbox-check.c:43
#12 0x00000000004a5f6b in io_loop_handle_timeouts (ioloop=0x16afb60) at ioloop.c:313
#13 0x00000000004a6bd3 in io_loop_handler_run (ioloop=0x16afb60) at ioloop-epoll.c:180
#14 0x00000000004a5d6d in io_loop_run (ioloop=0x16afb60) at ioloop.c:335
#15 0x0000000000428f20 in main (argc=<value optimized out>, argv=<value optimized out>, envp=0x7fff86f840b8) at main.c:327

And the full backtrace:
On Thu, 2009-12-24 at 11:39 -0500, David Halik wrote:
Dec 24 10:43:07 gehenna11 dovecot: IMAP(user): /rci/nqu/rci/u1/user/dovecot/.INBOX/dovecot-uidlist: Duplicate file entry at line 4: 1231949492.M325236P5922V03E80009I01B6DCA5_0.gehenna9.rutgers.edu,S=4398:2,S (uid 514 -> 721)
Dec 24 10:43:08 gehenna11 dovecot: IMAP(user): Maildir /rci/u1/user/Maildir: Expunged message reappeared, giving a new UID (old uid=516, file=1231952516.M164875P24494V03E80009I01B6DCAC_0.gehenna9.rutgers.edu,S=4355:2,S)
Dec 24 10:44:38 gehenna11 dovecot: IMAP(user): /rci/nqu/rci/u1/user/dovecot/.INBOX/dovecot-uidlist: Duplicate file entry at line 6: 1233524267.M97899P9543V04240006I01B6E908_0.gehenna5.rutgers.edu,S=3039:2,S (uid 517 -> 723)
Wonder if there's a corresponding "Expunged message reappeared, giving a new UID (old uid=x)" for each "Duplicate file entry .. (uid x -> " log line? Meaning that the duplicate file entries are caused by those reappearing messages? (And the reappearing messages are probably caused by dentry caching issues.)
Dec 24 10:44:38 gehenna11 dovecot: IMAP(user): Panic: file maildir-uidlist.c: line 405 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
I couldn't previously really figure out how this could happen, even when manually causing the "expunged message reappeared" error. Could you do in gdb something like:
fr 6 (or whatever frame gives usable results)
p *rec
p count
p *recs[0]
p *recs[1]
p *recs[..up until count-1]
If count is large, the main things I want to know are:
Is the array sorted (recs[n]->uid < recs[n+1]->uid always)?
is rec->uid anywhere in the recs array?
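Both invariants are easy to state in code. Here's a small C sketch of the two checks (the struct and helper names are stand-ins for illustration, not Dovecot's actual types):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for a uidlist record; only the uid field matters here. */
struct rec { unsigned int uid; };

/* Check 1: is the array sorted, i.e. recs[n]->uid < recs[n+1]->uid always?
 * (Strict "<" also rules out duplicate uids.) */
static bool recs_sorted(struct rec *const *recs, size_t count)
{
	size_t i;

	for (i = 0; i + 1 < count; i++) {
		if (recs[i]->uid >= recs[i + 1]->uid)
			return false;
	}
	return true;
}

/* Check 2: does the given uid appear anywhere in the recs array? */
static bool recs_contain_uid(struct rec *const *recs, size_t count,
			     unsigned int uid)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (recs[i]->uid == uid)
			return true;
	}
	return false;
}
```

If the array is sorted but the uid is simply absent, a binary-search-based delete has nothing to find, which would match the assert seen in the dumps.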
If it complains about values being optimized away, recompile without -O2. I usually do that by just removing it from src/lib-storage/index/maildir/Makefile, then touching maildir-uidlist.c and running make + make install.
On 12/29/2009 6:18 PM, Timo Sirainen wrote:
Wonder if there's a corresponding "Expunged message reappeared, giving a new UID (old uid=x)" for each "Duplicate file entry .. (uid x -> " log line? Meaning that the duplicate file entries are caused by those reappearing messages? (And the reappearing messages are probably caused by dentry caching issues.)
I'm wondering if there are multiple causes going on that all spawn the same error. In some instances, yes, I do see the "Duplicate file..." message preceding the error; in some instances I don't, but then I do get the stale NFS message. The output above just happens to be from a duplicate instance.
Dec 24 10:44:38 gehenna11 dovecot: IMAP(user): Panic: file maildir-uidlist.c: line 405 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
If it complains about values being optimized away, recompile without -O2. I usually do that by just removing it from src/lib-storage/index/maildir/Makefile, then touching maildir-uidlist.c and running make + make install.
I'll definitely get back to you on this. Right now we're closed until after New Years and I don't want to go updating the dovecot package on all of our servers until we're all back at work. I did do some quick poking around and the count is optimized out, so I'll have the package rebuilt without optimization and let you know what the values are at the beginning of next week. Thanks again.
On 29.12.2009, at 19.09, David Halik wrote:
I'll definitely get back to you on this. Right now we're closed until after New Years and I don't want to go updating the dovecot package on all of our servers until we're all back at work. I did do some quick poking around and the count is optimized out, so I'll have the package rebuilt without optimization and let you know what the values are at the beginning of next week. Thanks again.
well, you can probably also get the values in a bit more difficult way:
p count = p uidlist.records.arr.buffer.used / uidlist.records.arr.element_size
p recs[n] = p *(*uidlist.records.v)[n]
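The division works because Dovecot's dynamic arrays are a typed view over a byte buffer, so the element count is just bytes-in-use over element size. A simplified model of that arithmetic (field names follow the gdb output earlier in the thread; this is not Dovecot's real array API):

```c
#include <stddef.h>

/* Simplified model of a buffer-backed dynamic array. */
struct buffer { size_t used; };			/* bytes currently in use */
struct rec_array {
	struct buffer *buffer;
	size_t element_size;			/* bytes per element */
};

/* Equivalent of: p uidlist.records.arr.buffer.used / element_size */
static size_t array_count(const struct rec_array *arr)
{
	return arr->buffer->used / arr->element_size;
}
```

For example, a buffer with 16 bytes in use and 8-byte elements (pointers on x86_64, as in the $17 dump above) holds 2 records.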
Ok, I think I've got some more info and a more accurate time line for you. I tried this on two different dumps from two different users. The count was 4 in the first example and 0 in the second. I'm guessing that's considered "small"? The links to my gdb sessions for both are below and have some of the info you were looking for. The corresponding logs are also there so you can see how each failed. I put everything on pastebin so it's a little easier to see.
By the way, I also found that the stale NFS file handle message does appear first in each instance; it was just farther back in the logs. The "Lowering uid" message also appears immediately after every stale NFS message, which in turn causes all of this some amount of time later (sometimes 5 minutes, sometimes 20) when a user performs a new action. The "file reappeared" message only occurs some of the time. Here's the chain of events in every case so far that I can see:
- fdatasync(/rci/nqu/rci/u8/user/dovecot/.INBOX/dovecot-uidlist) failed: Stale NFS file handle
- /rci/nqu/rci/u8/user/dovecot/.INBOX/dovecot-uidlist: next_uid was lowered (n -> n-1, hdr=n-1) ...a few minutes later... (may or may not be a "message reappeared" warning at this point)
- /rci/nqu/rci/u8/user/dovecot/.INBOX/dovecot-uidlist: Duplicate file entry at line 3: 1261057547.M378185P17303V03E80002I0197FB4A_0.gehenna9.rutgers.edu,S=7174:2,RS (uid i -> n+1,2,3 )
- Panic: file maildir-uidlist.c: line 405 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
One thing to note: after the "Expunged message reappeared, giving a new UID" message, the user's processes died quickly, and on more than one server simultaneously. The gdb output is from server gehenna11 in that log file. The uid in *recs[0] is also the number that you can see in the logs being lowered from 719 -> 718.
First user log: http://pastebin.com/m1718f07b First user gdb: http://pastebin.com/m40088dc8
The second user also died on more than one server. The output is also from gehenna11.
Second user log: http://pastebin.com/f3a1756f2 Second user gdb: http://pastebin.com/m59aacde4
On 12/29/2009 7:50 PM, Timo Sirainen wrote:
On 29.12.2009, at 19.09, David Halik wrote:
I'll definitely get back to you on this. Right now we're closed until after New Years and I don't want to go updating the dovecot package on all of our servers until we're all back at work. I did do some quick poking around and the count is optimized out, so I'll have the package rebuilt without optimization and let you know what the values are at the beginning of next week. Thanks again.
well, you can probably also get the values in a bit more difficult way:
p count = p uidlist.records.arr.buffer.used / uidlist.records.arr.element_size
p recs[n] = p *(*uidlist.records.v)[n]
Timo, I reinstalled all our servers with -O0 and full debugging on "-gdwarf-2 -g3". I should be able to look up anything you need done now.
Here's some more examples with the data you asked for last time, now without optimization. I also noticed that there are a few users that seem to be regulars with this problem.
#4  0x00000000004d8fd7 in i_panic (format=0x501688 "file %s: line %d (%s): assertion failed: (%s)") at failures.c:207
        args = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7fffcf52b5a0, reg_save_area = 0x7fffcf52b4e0}}
#5  0x000000000044f34e in maildir_uidlist_records_array_delete (uidlist=0x1b612330, rec=0x1b61f1f8) at maildir-uidlist.c:405
        recs = (struct maildir_uidlist_rec * const *) 0x1b607f60
        pos = (struct maildir_uidlist_rec * const *) 0x0
        idx = 0
        count = 2
        __PRETTY_FUNCTION__ = "maildir_uidlist_records_array_delete"
#6  0x000000000044f8fc in maildir_uidlist_next (uidlist=0x1b612330, line=0x1b60426a "1249562340.M302085P7316V04240006I00DB2CF1_0.gehenna7.rutgers.edu,S=19277:2,S") at maildir-uidlist.c:552
        rec = (struct maildir_uidlist_rec *) 0x1b6209f8
        old_rec = (struct maildir_uidlist_rec *) 0x1b61f1f8
        recs = (struct maildir_uidlist_rec * const *) 0x1b604264
        count = 0
        uid = 2807
(gdb) fr 5
#5  0x000000000044f34e in maildir_uidlist_records_array_delete (uidlist=0x1b612330, rec=0x1b61f1f8) at maildir-uidlist.c:405
405         i_assert(pos != NULL);
(gdb) p *recs
$1 = (struct maildir_uidlist_rec * const) 0x1b61fa60
(gdb) p count
$2 = 2
(gdb) p *recs[0]
$3 = {uid = 2735, flags = 6, filename = 0x1b61fa78 "1262739815.M913182P4280V04240006I014D0F92_0.gehenna7.rutgers.edu,S=11791:2,", extensions = 0x1b620980 "W12036"}
(gdb) p *recs[1]
$4 = {uid = 2736, flags = 0, filename = 0x1b6209a0 "1262739645.M307617P13458V04240006I014D0F91_0.gehenna10.rutgers.edu,S=5005:2,S", extensions = 0x1b6209f0 "W5124"}
(gdb)
Another user for comparison:
(gdb) fr 5
#5  0x000000000044f34e in maildir_uidlist_records_array_delete (uidlist=0x6bd7680, rec=0x6bc93d0) at maildir-uidlist.c:405
405         i_assert(pos != NULL);
(gdb) p *recs
$1 = (struct maildir_uidlist_rec * const) 0x6d5c418
(gdb) p count
$2 = 7
(gdb) p *recs[0]
$3 = {uid = 52566, flags = 4, filename = 0x6d5c438 "1262781360.M836147P9323V04240007I0148DC53_0.gehenna5.rutgers.edu,S=2157:2,a", extensions = 0x6d5c430 "W2201"}
(gdb) p *recs[1]
$4 = {uid = 52568, flags = 0, filename = 0x6cd4590 "1262628888.M94168P6001V04240006I003C41C2_0.gehenna10.rutgers.edu,S=22946:2,Sa", extensions = 0x6cd4588 "W23481"}
(gdb) p *recs[2]
$5 = {uid = 52569, flags = 0, filename = 0x6bc9328 "1041233872.000006.mbox:2,S", extensions = 0x6bc9318 "W6334"}
(gdb) p *recs[3]
$6 = {uid = 52570, flags = 0, filename = 0x6bc9370 "1041233872.000013.mbox:2,S", extensions = 0x6bc9360 "W1193"}
(gdb) p *recs[4]
$7 = {uid = 52571, flags = 0, filename = 0x6bc92e0 "1041233872.000014.mbox:2,RS", extensions = 0x6bc92d0 "W3095"}
(gdb) p *recs[5]
$8 = {uid = 52572, flags = 4, filename = 0x6d58ec8 "1262782141.M730129P15749V03E80006I0148DC54_0.gehenna9.rutgers.edu,S=7454:2,", extensions = 0x6d62160 "W7626"}
(gdb) p *recs[6]
$9 = {uid = 52573, flags = 0, filename = 0x6d62140 "1041233872.000016.mbox:2,S", extensions = 0x0}
(gdb)
And one more user:
(gdb) fr 5
#5  0x000000000044f34e in maildir_uidlist_records_array_delete (uidlist=0x10f44680, rec=0x10f67120) at maildir-uidlist.c:405
405         i_assert(pos != NULL);
(gdb) p *recs
$1 = (struct maildir_uidlist_rec * const) 0x11093760
(gdb) p count
$2 = 2
(gdb) p *recs[0]
$3 = {uid = 55108, flags = 0, filename = 0x11093778 "1262788462.M201250P7938V04240006I01EF76E0_0.gehenna10.rutgers.edu,S=1731278:2,", extensions = 0x1109d240 "W1753846"}
(gdb) p *recs[1]
$4 = {uid = 55109, flags = 4, filename = 0x1109d268 "1262788793.M851477P3866V045C0007I01EF76E3_0.gehenna8.rutgers.edu,S=19990", extensions = 0x0}
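The common thread in these dumps is that the record being deleted can't be found in the (sorted) recs array, so the lookup comes back empty and the assert fires. A rough sketch of that lookup-then-assert pattern, using the standard bsearch() over an array of record pointers (hypothetical names, not Dovecot's actual delete code):

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in record type; only uid matters for the lookup. */
struct rec { unsigned int uid; };

/* bsearch() comparator: key is a record, array members are record pointers. */
static int rec_uid_cmp(const void *key, const void *member)
{
	const struct rec *rec = key;
	const struct rec *const *recp = member;

	if (rec->uid < (*recp)->uid)
		return -1;
	if (rec->uid > (*recp)->uid)
		return 1;
	return 0;
}

/* Returns NULL when the uid is missing from the sorted array -- the
 * situation that would trip an i_assert(pos != NULL)-style check. */
static struct rec *const *find_rec(const struct rec *key,
				   struct rec *const *recs, size_t count)
{
	return bsearch(key, recs, count, sizeof(recs[0]), rec_uid_cmp);
}
```

If the reappeared/renumbered messages leave the uidlist holding a record whose uid was already dropped from the array, this search returns NULL even though the array itself is perfectly sorted.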
On 12/30/2009 12:10 PM, David Halik wrote:
Ok, I think I've got some more info and a more accurate time line for you. I tried this on two different dumps from two different users. The count was 4 in the first example and 0 in the second. I'm guessing that's considered "small"? The links to my gdb sessions for both are below and have some of the info you were looking for. The corresponding logs are also there so you can see how each failed. I put everything on pastebin so it's a little easier to see.
By the way, I also found that the stale NFS file handle message does appear first in each instance; it was just farther back in the logs. The "Lowering uid" message also appears immediately after every stale NFS message, which in turn causes all of this some amount of time later (sometimes 5 minutes, sometimes 20) when a user performs a new action. The "file reappeared" message only occurs some of the time. Here's the chain of events in every case so far that I can see:
- fdatasync(/rci/nqu/rci/u8/user/dovecot/.INBOX/dovecot-uidlist) failed: Stale NFS file handle
- /rci/nqu/rci/u8/user/dovecot/.INBOX/dovecot-uidlist: next_uid was lowered (n -> n-1, hdr=n-1) ...a few minutes later... (may or may not be a "message reappeared" warning at this point)
- /rci/nqu/rci/u8/user/dovecot/.INBOX/dovecot-uidlist: Duplicate file entry at line 3: 1261057547.M378185P17303V03E80002I0197FB4A_0.gehenna9.rutgers.edu,S=7174:2,RS (uid i -> n+1,2,3 )
- Panic: file maildir-uidlist.c: line 405 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
One thing to note: after the "Expunged message reappeared, giving a new UID" message, the user's processes died quickly, and on more than one server simultaneously. The gdb output is from server gehenna11 in that log file. The uid in *recs[0] is also the number that you can see in the logs being lowered from 719 -> 718.
First user log: http://pastebin.com/m1718f07b First user gdb: http://pastebin.com/m40088dc8
The second user also died on more than one server. The output is also from gehenna11
Second user log: http://pastebin.com/f3a1756f2 Second user gdb: http://pastebin.com/m59aacde4
On 12/29/2009 7:50 PM, Timo Sirainen wrote:
On 29.12.2009, at 19.09, David Halik wrote:
I'll definitely get back to you on this. Right now we're closed until after New Years and I don't want to go updating the dovecot package on all of our servers until we're all back at work. I did do some quick poking around and the count is optimized out, so I'll have the package rebuilt without optimization and let you know what the values are at the beginning of next week. Thanks again.

well, you can probably also get the values in a bit more difficult way:
p count = p uidlist.records.arr.buffer.used / uidlist.records.arr.element_size
p recs[n] = p *(*uidlist.records.v)[n]
Timo,
On 12/23/09 8:37 AM, "David Halik" dhalik@jla.rutgers.edu wrote:
I switched all of our servers to dotlock_use_excl=no last night, but we're still seeing the errors:
We too have set dotlock_use_excl = no. I'm not seeing the "Stale NFS file handle" message any more, but I am still seeing a crash. The crashes seem to be leaving the indexes in a bad state:
Dec 23 09:07:44 oh-popmap3p dovecot: imap: user=<cnisser>, rip=x.x.x.x, pid=30101: Panic: file maildir-uidlist.c: line 403 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
Dec 23 09:07:44 oh-popmap3p dovecot: imap: user=<cnisser>, rip=x.x.x.x, pid=30101: Raw backtrace: imap [0x4d8986] -> imap [0x4d97b0] -> imap(i_fatal+0) [0x4d8c7a] -> imap [0x44f2cc] -> imap [0x44f814] -> imap [0x4500a2] -> imap(maildir_uidlist_refresh+0x9d) [0x450686] -> imap [0x44bff1] -> imap [0x44c0a8] -> imap [0x44c178] -> imap(maildir_storage_sync_init+0x7c) [0x44c6e6] -> imap(mailbox_sync_init+0x44) [0x489922] -> imap(imap_sync_init+0xab) [0x42e02b] -> imap [0x41ccc4] -> imap [0x41cd26] -> imap [0x4733be] -> imap [0x4e4171] -> imap(io_loop_handle_timeouts+0x1d) [0x4e41ce] -> imap(io_loop_handler_run+0x86) [0x4e4f29] -> imap(io_loop_run+0x3b) [0x4e4214] -> imap(main+0xa6) [0x4300af] -> /lib64/libc.so.6(__libc_start_main+0xf4) [0x3217e1d994] -> imap [0x419aa9]
Dec 23 09:07:45 oh-popmap3p dovecot: dovecot: child 30101 (imap) killed with signal 6 (core dumped)
Dec 23 09:09:16 cc-popmap3p dovecot: imap: user=<cnisser>, rip=x.x.x.x, pid=5975: Corrupted index cache file /home16/cnisser/.imapidx/.INBOX/dovecot.index.cache: invalid record size
Dec 23 09:09:17 oh-popmap2p dovecot: imap: user=<cnisser>, rip=y.y.y.y, pid=3279: read() failed with index cache file /home16/cnisser/.imapidx/.INBOX/dovecot.index.cache: Input/output error
Dec 23 09:09:38 cc-popmap3p dovecot: imap: user=<cnisser>, rip=x.x.x.x, pid=5975: Corrupted index cache file /home16/cnisser/.imapidx/.INBOX/dovecot.index.cache: invalid record size
Dec 23 09:18:12 cc-popmap3p dovecot: imap: user=<cnisser>, rip=x.x.x.x, pid=5975: Corrupted index cache file /home16/cnisser/.imapidx/.INBOX/dovecot.index.cache: invalid record size
We're also seeing another odd error that seems to be unrelated to the crashes, but seemed like it bears reporting. Reading of uidlists and cache files seems to intermittently fail with EIO. It doesn't seem to tie in with anything else, and I don't see any corresponding NFS errors in the system log.
Dec 23 09:31:06 oh-popmap4p dovecot: imap: user=<joet>, rip=a.a.a.a, pid=7641: read(/home6/joet/Maildir/dovecot-uidlist) failed: Input/output error
Dec 23 09:53:17 cc-popmap2p dovecot: imap: user=<catm>, rip=b.b.b.b, pid=12840: read(/home3/catm/Maildir/dovecot-uidlist) failed: Input/output error
Dec 23 09:59:38 cc-popmap5p dovecot: imap: user=<kforrist>, rip=c.c.c.c, pid=13539: read() failed with index cache file /home15/kforrist/.imapidx/.INBOX/dovecot.index.cache: Input/output error
-Brad
On Wed, 2009-12-23 at 10:37 -0800, Brandon Davidson wrote:
We too have set dotlock_use_excl = no. I'm not seeing the "Stale NFS file handle" message any more, but I am still seeing a crash. The crashes seem to be leaving the indexes in a bad state:
Dec 23 09:07:44 oh-popmap3p dovecot: imap: user=<cnisser>, rip=x.x.x.x, pid=30101: Panic: file maildir-uidlist.c: line 403 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
You mean this doesn't get fixed by itself? It's not the indexes but the dovecot-uidlist file that's the problem..
pid=7641: read(/home6/joet/Maildir/dovecot-uidlist) failed: Input/output error
These just shouldn't be happening.. Or perhaps that's the same problem as ESTALE.