Timo Sirainen wrote:
On Wed, 2009-09-30 at 11:06 +0100, Chris Wakelin wrote:
Sep 30 09:57:11 IMAP 25384 xxxxxxxx 134.225.xx.xx : Panic: file mail-index-sync-update.c: line 933 (mail_index_sync_map): assertion failed: (map->hdr.indexid == index->indexid || map->hdr.indexid == 0) .. #5 0x98a00 in mail_index_sync_map (_map=0x1263e4, type=MAIL_INDEX_SYNC_HANDLER_HEAD, force=false) at mail-index-sync-update.c:802
Could you also do:
fr 5 p *index p *map
OK:
(gdb) fr 5 #5 0x98a00 in mail_index_sync_map (_map=0x1263e4, type=MAIL_INDEX_SYNC_HANDLER_HEAD, force=false) at mail-index-sync-update.c:802 802 return 0; (gdb) p *index $1 = {dir = 0x1264f8 "/export/indexes/a/120/xxxxxxxx/.imap/INBOX", prefix = 0x11add0 "dovecot.index", cache = 0x127930, log = 0x1271f0, fsync_mask = 0, mode = 384, gid = -1, gid_origin = 0x1278b8 "/export/mail/a/120/xxxxxxxx/INBOX", extension_pool = 0x126540, extensions = {arr = {buffer = 0x126558, element_size = 36}, v = 0x126558, v_modifiable = 0x126558}, sync_lost_handlers = {arr = {buffer = 0x11b578, element_size = 4}, v = 0x11b578, v_modifiable = 0x11b578}, filepath = 0x1278e8 "/export/indexes/a/120/xxxxxxxx/.imap/INBOX/dovecot.index", fd = -1, map = 0x127ac0, indexid = 1254301030, inconsistency_id = 0, last_read_log_file_seq = 0, last_read_log_file_head_offset = 0, last_read_log_file_tail_offset = 0, last_read_stat = {st_dev = 0, st_pad1 = {0, 0, 0}, st_ino = 0, st_mode = 0, st_nlink = 0, st_uid = 0, st_gid = 0, st_rdev = 0, st_pad2 = {0, 0}, st_size = 0, st_atim = {tv_sec = 0, tv_nsec = 0}, st_mtim = {tv_sec = 0, tv_nsec = 0}, st_ctim = {tv_sec = 0, tv_nsec = 0}, st_blksize = 0, st_blocks = 0, st_fstype = '\000'
, st_pad4 = {0, 0, 0, 0, 0, 0, 0, 0}}, fsck_log_head_file_seq = 0, fsck_log_head_file_offset = 0, lock_type = 3, shared_lock_count = 0, excl_lock_count = 0, lock_id_counter = 2, lock_method = FILE_LOCK_METHOD_FCNTL, file_lock = 0x0, dotlock = 0x0, keywords_pool = 0x126948, keywords = {arr = {buffer = 0x11b5c8, element_size = 4}, v = 0x11b5c8, v_modifiable = 0x11b5c8}, keywords_hash = 0x126b88, keywords_ext_id = 0, modseq_ext_id = 1, module_contexts = {arr = {buffer = 0x11b5a0, element_size = 4}, v = 0x11b5a0, v_modifiable = 0x11b5a0}, error = 0x0, nodiskspace = 0, index_lock_timeout = 0, opened = 1, log_locked = 0, mmap_disable = 0, fsync_disable = 0, use_excl_dotlocks = 1, nfs_flush = 0, readonly = 0, mapping = 1, syncing = 0, need_recreate = 0, index_min_write = 1, modseqs_enabled = 0, initial_create = 0, initial_mapped = 1} (gdb) p *map $2 = {index = 0x1263a8, refcount = 2, hdr = {major_version = 7 '\a', minor_version = 2 '\002', base_header_size = 120, header_size = 120, record_size = 8, compat_flags = 0 '\000', unused = "\000\000", indexid = 1254301031, flags = 0, uid_validity = 0, next_uid = 1, messages_count = 0, unused_old_recent_messages_count = 0, seen_messages_count = 0, deleted_messages_count = 0, first_recent_uid = 1, first_unseen_uid_lowwater = 0, first_deleted_uid_lowwater = 0, log_file_seq = 2, log_file_tail_offset = 0, log_file_head_offset = 40, sync_size = 0, sync_stamp = 0, day_stamp = 0, day_first_uid = {0, 0, 0, 0, 0, 0, 0, 0}}, hdr_base = 0x127fc0, hdr_copy_buf = 0x11b640, extension_pool = 0x0, extensions = {arr = {buffer = 0x0, element_size = 0}, v = 0x0, v_modifiable = 0x0}, ext_id_map = {arr = {buffer = 0x0, element_size = 0}, v = 0x0, v_modifiable = 0x0}, keyword_idx_map = {arr = {buffer = 0x0, element_size = 0}, v = 0x0, v_modifiable = 0x0}, rec_map = 0x127b78, write_base_header = 0, write_ext_header = 0, write_atomic = 0}
I was wrong about it being Solaris 10; the mail server is still on Solaris 8 (with the latest and last patch cluster installed) and Dovecot was compiled with gcc 3.3.2.
I've got a backup of the user's INBOX available, but I haven't been able to reproduce the crash on my test box (either with it put through mboxcrypt.pl or without). I've seen no similar crashes (so far) either. Let me know if you'd like me to send the anonymised version (4MB expands to ~70MB).
Sep 30 10:37:06 auth(default): Error: Master request 15820.607 not found
Either timed out for some reason or some other strange bug..
Sep 30 10:51:27 dovecot: Error: child 29959 (login) killed with signal 11 (core not dumped) (latest ip=134.225.1.46) .. How do I get the login (presumably imap-login) processes to dump core
With Linux this would be enough:
login_executable = .../imap-login -D
With Solaris you'll also have to somehow make setuid binaries dump core. http://publib.boulder.ibm.com/httpserv/ihsdiag/coredumps.html says you could do:
coreadm -e global-setid -e proc-setid -e global
Perhaps with those global core dumps it wouldn't even be necessary to add the -D parameter. Anyway the -D's point was to make the /var/run/dovecot/login/ directory writable to it, so it could write core dump there.
(or do I need to insert gdbhelper)?
I'm not really sure anymore if that even works (anymore). :)
...
Sep 30 10:51:27 dovecot: Error: child 29959 (login) killed with signal 11 (core not dumped) (latest ip=134.225.1.46) .. passdb: driver: passwd-file args: /export/mailconfig/%Ju/%u/proxy.%Ls
You're using proxying also? Apparently there's something wrong there, perhaps with master user logins or something.. I haven't yet found out what exactly it is, but these should fix the crashes and log more information:
http://hg.dovecot.org/dovecot-1.2/rev/b1293abca645 http://hg.dovecot.org/dovecot-1.2/rev/c8e24c1b8468
Yes we're proxying to Exchange 2003 for those users that have a proxy.imap or proxy.pop3 in their mailconfig directory (we migrated most of our staff from Unix mailstore to Exchange without them noticing; all the students, most functional accounts, and staff that have too much mail are still on Unix).
I'll look at applying the patches this evening (there's probably no point doing it on the test server as I guess it's triggered by load or something odd in Exchange).
Strangely, though we haven't had any crashes since this morning:
Sep 30 00:38:32 dovecot: Error: child 26387 (login) killed with signal 11 (core not dumped) (latest ip=134.225.1.46) Sep 30 00:55:02 dovecot: Error: child 9268 (login) killed with signal 11 (core not dumped) (ip=134.225.1.46) Sep 30 09:43:07 dovecot: Error: child 13660 (login) killed with signal 11 (core not dumped) (latest ip=134.225.1.46) Sep 30 10:51:27 dovecot: Error: child 29959 (login) killed with signal 11 (core not dumped) (latest ip=134.225.1.46) Sep 30 10:51:38 dovecot: Error: child 11162 (login) killed with signal 11 (core not dumped) (latest ip=134.225.1.46) Sep 30 11:18:05 dovecot: Error: child 23772 (login) killed with signal 11 (core not dumped) (latest ip=134.225.1.46) Sep 30 11:20:16 dovecot: Error: child 2120 (login) killed with signal 11 (core not dumped) (latest ip=134.225.1.46)
and all of them had latest ip=134.225.1.46, which is our Prayer Webmail server and is used mostly by students (i.e. not people on Exchange).
Best Wishes, Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094