Timo, I reinstalled all our servers with -O0 and full debugging ("-gdwarf-2 -g3"), so I should now be able to look up anything you need.
Here are some more examples with the data you asked for last time, now without optimization. I also noticed that a few users seem to be regulars with this problem.
#4  0x00000000004d8fd7 in i_panic (format=0x501688 "file %s: line %d (%s): assertion failed: (%s)") at failures.c:207
        args = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7fffcf52b5a0, reg_save_area = 0x7fffcf52b4e0}}
#5  0x000000000044f34e in maildir_uidlist_records_array_delete (uidlist=0x1b612330, rec=0x1b61f1f8) at maildir-uidlist.c:405
        recs = (struct maildir_uidlist_rec * const *) 0x1b607f60
        pos = (struct maildir_uidlist_rec * const *) 0x0
        idx = 0
        count = 2
        __PRETTY_FUNCTION__ = "maildir_uidlist_records_array_delete"
#6  0x000000000044f8fc in maildir_uidlist_next (uidlist=0x1b612330, line=0x1b60426a "1249562340.M302085P7316V04240006I00DB2CF1_0.gehenna7.rutgers.edu,S=19277:2,S") at maildir-uidlist.c:552
        rec = (struct maildir_uidlist_rec *) 0x1b6209f8
        old_rec = (struct maildir_uidlist_rec *) 0x1b61f1f8
        recs = (struct maildir_uidlist_rec * const *) 0x1b604264
        count = 0
        uid = 2807
(gdb) fr 5
#5  0x000000000044f34e in maildir_uidlist_records_array_delete (uidlist=0x1b612330, rec=0x1b61f1f8) at maildir-uidlist.c:405
405             i_assert(pos != NULL);
(gdb) p *recs
$1 = (struct maildir_uidlist_rec * const) 0x1b61fa60
(gdb) p count
$2 = 2
(gdb) p *recs[0]
$3 = {uid = 2735, flags = 6, filename = 0x1b61fa78 "1262739815.M913182P4280V04240006I014D0F92_0.gehenna7.rutgers.edu,S=11791:2,", extensions = 0x1b620980 "W12036"}
(gdb) p *recs[1]
$4 = {uid = 2736, flags = 0, filename = 0x1b6209a0 "1262739645.M307617P13458V04240006I014D0F91_0.gehenna10.rutgers.edu,S=5005:2,S", extensions = 0x1b6209f0 "W5124"}
(gdb)
Another user for comparison:
(gdb) fr 5
#5  0x000000000044f34e in maildir_uidlist_records_array_delete (uidlist=0x6bd7680, rec=0x6bc93d0) at maildir-uidlist.c:405
405             i_assert(pos != NULL);
(gdb) p *recs
$1 = (struct maildir_uidlist_rec * const) 0x6d5c418
(gdb) p count
$2 = 7
(gdb) p *recs[0]
$3 = {uid = 52566, flags = 4, filename = 0x6d5c438 "1262781360.M836147P9323V04240007I0148DC53_0.gehenna5.rutgers.edu,S=2157:2,a", extensions = 0x6d5c430 "W2201"}
(gdb) p *recs[1]
$4 = {uid = 52568, flags = 0, filename = 0x6cd4590 "1262628888.M94168P6001V04240006I003C41C2_0.gehenna10.rutgers.edu,S=22946:2,Sa", extensions = 0x6cd4588 "W23481"}
(gdb) p *recs[2]
$5 = {uid = 52569, flags = 0, filename = 0x6bc9328 "1041233872.000006.mbox:2,S", extensions = 0x6bc9318 "W6334"}
(gdb) p *recs[3]
$6 = {uid = 52570, flags = 0, filename = 0x6bc9370 "1041233872.000013.mbox:2,S", extensions = 0x6bc9360 "W1193"}
(gdb) p *recs[4]
$7 = {uid = 52571, flags = 0, filename = 0x6bc92e0 "1041233872.000014.mbox:2,RS", extensions = 0x6bc92d0 "W3095"}
(gdb) p *recs[5]
$8 = {uid = 52572, flags = 4, filename = 0x6d58ec8 "1262782141.M730129P15749V03E80006I0148DC54_0.gehenna9.rutgers.edu,S=7454:2,", extensions = 0x6d62160 "W7626"}
(gdb) p *recs[6]
$9 = {uid = 52573, flags = 0, filename = 0x6d62140 "1041233872.000016.mbox:2,S", extensions = 0x0}
(gdb)
And one more user:
(gdb) fr 5
#5  0x000000000044f34e in maildir_uidlist_records_array_delete (uidlist=0x10f44680, rec=0x10f67120) at maildir-uidlist.c:405
405             i_assert(pos != NULL);
(gdb) p *recs
$1 = (struct maildir_uidlist_rec * const) 0x11093760
(gdb) p count
$2 = 2
(gdb) p *recs[0]
$3 = {uid = 55108, flags = 0, filename = 0x11093778 "1262788462.M201250P7938V04240006I01EF76E0_0.gehenna10.rutgers.edu,S=1731278:2,", extensions = 0x1109d240 "W1753846"}
(gdb) p *recs[1]
$4 = {uid = 55109, flags = 4, filename = 0x1109d268 "1262788793.M851477P3866V045C0007I01EF76E3_0.gehenna8.rutgers.edu,S=19990", extensions = 0x0}
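If it helps to see the failure in isolation: the pos/idx/count locals in frame 5 look like a sorted-array lookup, and in the first dump the uid being processed (2807 in frame 6) isn't among the two uids left in the array (2735, 2736). Here's a minimal standalone sketch of that kind of lookup coming back NULL; the struct and comparison function are simplified stand-ins, not Dovecot's real definitions:

#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for struct maildir_uidlist_rec. */
struct rec { unsigned int uid; };

/* Order records by uid, the order the uidlist array is expected to keep. */
static int uid_cmp(const void *p1, const void *p2)
{
        const struct rec *const *r1 = p1, *const *r2 = p2;
        if ((*r1)->uid < (*r2)->uid) return -1;
        if ((*r1)->uid > (*r2)->uid) return 1;
        return 0;
}

int main(void)
{
        /* The two records left in the first dump: uids 2735 and 2736. */
        struct rec a = { 2735 }, b = { 2736 };
        struct rec *recs[] = { &a, &b };

        /* A record with uid 2807 (the uid in frame 6) is not in the
         * array, so bsearch() returns NULL -- the pos == NULL that the
         * i_assert at maildir-uidlist.c:405 catches. */
        struct rec gone = { 2807 };
        struct rec *key = &gone;
        struct rec **pos = bsearch(&key, recs, 2, sizeof(*recs), uid_cmp);

        printf("pos = %p\n", (void *)pos);   /* prints (nil) */
        return 0;
}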
On 12/30/2009 12:10 PM, David Halik wrote:
Ok, I think I've got some more info and a more accurate timeline for you. I tried this on two different dumps from two different users. The count was 4 in the first example and 0 in the second. I'm guessing that's considered "small"? The links to my gdb sessions for both are below, with some of the info you were looking for. The corresponding logs are also there so you can see how each failed. I put everything on pastebin so it's a little easier to see.
By the way, I also found that the stale NFS file handle message does appear first in each instance; it was just farther back in the logs. The "lowering uid" message also appears immediately after every stale NFS message, which in turn causes all of this some time later (sometimes 5 minutes, sometimes 20) when the user does a new action. The "file reappeared" message only occurs some of the time. Here's the chain of events in every case so far that I can see (a rough sketch of this chain in code follows the list):
- fdatasync(/rci/nqu/rci/u8/user/dovecot/.INBOX/dovecot-uidlist) failed: Stale NFS file handle
- /rci/nqu/rci/u8/user/dovecot/.INBOX/dovecot-uidlist: next_uid was lowered (n -> n-1, hdr=n-1)
  ...a few minutes later... (there may or may not be a "message reappeared" warning at this point)
- /rci/nqu/rci/u8/user/dovecot/.INBOX/dovecot-uidlist: Duplicate file entry at line 3: 1261057547.M378185P17303V03E80002I0197FB4A_0.gehenna9.rutgers.edu,S=7174:2,RS (uid i -> n+1,2,3 )
- Panic: file maildir-uidlist.c: line 405 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
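The same chain as a commented toy model, purely illustrative -- the rollback mechanism is my reading of the logs, not confirmed Dovecot behavior:

#include <stdio.h>

int main(void)
{
        /* 1. A message was delivered and recorded, so uid 718 is in
         *    use and the counter points past it. */
        unsigned int next_uid = 719;
        unsigned int existing_uid = 718;

        /* 2. fdatasync() on dovecot-uidlist fails with a stale NFS
         *    handle and an older copy of the file gets re-read,
         *    rolling the counter back: "next_uid was lowered
         *    (719 -> 718)". */
        next_uid = 718;

        /* 3. The next message to be recorded is assigned uid 718,
         *    colliding with the record that already has it. */
        unsigned int new_uid = next_uid++;
        if (new_uid == existing_uid)
                printf("Duplicate file entry (uid %u)\n", new_uid);

        /* 4. With duplicate/reordered uids the records array is no
         *    longer consistent, and a later delete can fail to find
         *    its record: Panic at maildir-uidlist.c:405. */
        return 0;
}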
One thing to note: after the "Expunged message reappeared, giving a new UID" warning, the user's sessions died quickly, and on more than one server simultaneously. The gdb output is from server gehenna11 in that log. The uid in *recs[0] is also the number that you can see in the logs being lowered from 719 -> 718.
First user log: http://pastebin.com/m1718f07b
First user gdb: http://pastebin.com/m40088dc8
The second user also died on more than one server. The output is also from gehenna11.
Second user log: http://pastebin.com/f3a1756f2
Second user gdb: http://pastebin.com/m59aacde4
On 12/29/2009 7:50 PM, Timo Sirainen wrote:
On 29.12.2009, at 19.09, David Halik wrote:
I'll definitely get back to you on this. Right now we're closed until after New Year's, and I don't want to go updating the dovecot package on all of our servers until we're all back at work. I did do some quick poking around and the count is optimized out, so I'll have the package rebuilt without optimization and let you know what the values are at the beginning of next week. Thanks again.

Well, you can probably also get the values in a bit more difficult way:
p count = p uidlist.records.arr.buffer.used / uidlist.records.arr.element_size
p recs[n] = p *(*uidlist.records.v)[n]
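Those two expressions make sense if the array keeps a raw byte buffer plus the element size, with .v as a typed view of the same buffer. A simplified sketch of that layout (illustrative only, not the exact Dovecot struct definitions):

#include <stddef.h>
#include <stdio.h>

/* Raw storage: just bytes and how many of them are in use. */
struct buffer {
        void *data;
        size_t used;              /* bytes currently in use */
};

/* The array wrapper gdb is walking in those expressions. */
struct array {
        struct buffer *buffer;
        size_t element_size;      /* sizeof one stored element */
};

int main(void)
{
        struct buffer buf = { NULL, 16 };
        struct array arr = { &buf, sizeof(void *) };

        /* "p uidlist.records.arr.buffer.used /
         *    uidlist.records.arr.element_size" is this division;
         * the second expression, (*uidlist.records.v)[n], then
         * indexes the typed view of the same buffer. */
        size_t count = arr.buffer->used / arr.element_size;
        printf("count = %zu\n", count);   /* 16 / 8 = 2 on LP64 */
        return 0;
}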