[Dovecot] force-resync fails to recover all messages in mdbox
To my understanding, when using mdbox, doveadm force-resync should be able to recover all the messages from the storage files alone, though of course losing all metadata except the initial delivery folder.
However, this does not seem to be the case. For me, force-resync creates only partial indices that lose messages. The message contents are of course still in the storage files, but dovecot just doesn't seem to be aware of some of them after recreating the indices.
Here is an example. I created a test mdbox by syncing a mailing list folder from an mbox location:
$ dsync -m haskell-cafe backup mdbox:~/dbox
Then I switched the location to the new mdbox:
$ /usr/sbin/dovecot -n
# 2.0.15: /etc/dovecot/dovecot.conf
# OS: Linux 3.2.0-0.bpo.1-amd64 x86_64 Debian wheezy/sid
mail_fsync = never
mail_location = mdbox:~/dbox
mail_plugins = zlib
passdb {
  driver = pam
}
plugin {
  sieve = ~/etc/sieve/dovecot.sieve
  sieve_dir = ~/etc/sieve
  zlib_save = bz2
  zlib_save_level = 9
}
protocols = " imap"
ssl_cert =
Then I checked the number of messages we had in the new location:
$ doveadm search all | wc
93236 186472 3625098
This was indeed the correct number, so the sync was fine:
$ grep -c '^From .*..:..:.. ....$' ../mbox/haskell-cafe
93237
Then I removed all the indices and rebuilt them:
$ mkdir bak
$ mv mailboxes bak
$ mv storage/dovecot.map.index* bak/
$ doveadm force-resync inbox
doveadm(la): Warning: mdbox /home/la/dbox/storage: rebuilding indexes
And what have we now:
$ doveadm search all | wc
43864 87728 1699590
Somehow dovecot lost over half of the messages!
This is really worrisome. It should always be possible to extract the plain mail content out from the storage, but currently it only _looks_ like it is possible: you get a problem in the indices, rebuild them, and then later migrate somewhere else. If you don't notice that you are missing mails, you may then eventually destroy the original mdbox storage files, thinking that their content is now safely elsewhere, when it really isn't.
Lauri
On 31.1.2012, at 17.48, Lauri Alanko wrote:
$ doveadm search all | wc
93236 186472 3625098
..
Then I removed all the indices and rebuilt them:
$ doveadm search all | wc
43864 87728 1699590
Somehow dovecot lost over half of the messages!
There may be a bug, and I just yesterday noticed something weird in the rebuilding code. I'll have to look into that. But anyway, "search all" isn't the proper way to test this. Try instead with:
doveadm fetch guid all | sort | uniq | wc
When you removed indexes Dovecot no longer knew about copies of messages.
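To make the comparison concrete, here is a minimal sketch that reports both the total number of GUID lines and the number of distinct GUIDs in one pass. The function name guid_summary is hypothetical, and it assumes the fetch output contains lines of the form "guid: <hex>", as in the output shown elsewhere in this thread.

```shell
# Hypothetical helper: summarize GUID lines read from stdin.
# Assumption: each message contributes one line containing "guid: <hex>".
guid_summary() {
    awk '
        /guid: / {
            total++                        # every GUID line counts as one message
            if (!seen[$NF]++) unique++     # count a GUID only the first time it appears
        }
        END { printf "total=%d unique=%d\n", total + 0, unique + 0 }
    '
}

# Usage (needs a working Dovecot setup):
#   doveadm fetch guid all | guid_summary
```

If total and unique differ, some messages share a GUID, and a plain message count before and after a rebuild is not a like-for-like comparison.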
Quoting "Timo Sirainen" tss@iki.fi:
Try instead with:
doveadm fetch guid all | sort | uniq | wc
When you removed indexes Dovecot no longer knew about copies of messages.
Well, well, well. This is interesting. Back with the indices created by dsync:
$ doveadm fetch guid all | grep guid: | sort | uniq -c | sort -n | tail
17 guid: 1b28b22d4b2ee2885b5b81221c41201d
17 guid: 730c692395661dd62f82088804b85652
17 guid: 865e1537fddba6698e010d0b9dbddd02
17 guid: d271b6ba8af0e7fa39c16ea8ed13abcf
17 guid: d2cd391e837cf51cc85991bde814dc54
17 guid: ebce8373da6ffb134b58aca7906d61f1
18 guid: 1222b6c222ecb53fdbbec407400cba36
18 guid: 65695586efc69adc2d7294216ea88e55
19 guid: 4288f61ebbdcd44870c670439a97693b
20 guid: 080ec72aa49e2a01c8e249fe127605f6
This would explain why rebuilding the indices reduced the number of messages. However, those guid assignments seem really weird, because:
$ doveadm fetch hdr guid 080ec72aa49e2a01c8e249fe127605f6 | grep -i '^Message-ID: '
Message-ID: 4B1ACA53.7040503@rkit.pp.ru
Message-ID: 29bf512f0912051251u74d246afxafdfb9e5ea24342c@mail.gmail.com
Message-ID: 5e0214850912051300r3ebd0e44n61a4d6e020c94f4c@mail.gmail.com
Message-ID: 4B1ACD40.3040507@btinternet.com
Message-Id: 200912052220.00317.daniel.is.fischer@web.de
Message-Id: 200912052225.28597.daniel.is.fischer@web.de
Message-ID: 20091205212848.GA23711@seas.upenn.edu
Message-Id: 200912051336.13792.hgolden@socal.rr.com
Message-Id: 200912052243.03144.daniel.is.fischer@web.de
Message-Id: 0B59A706-8C41-47B9-A858-5ACE297581E1@cs.uu.nl
Message-ID: 20091205215707.GA6161@protagoras.phil.berkeley.edu
Message-ID: 471726.55822.qm@web113106.mail.gq1.yahoo.com
Message-ID: 4B1AD7FB.8050704@btinternet.com
Message-ID: 5fdc56d70912051400h663a25a9w4f9b2e065a5b395e@mail.gmail.com
Message-Id: 1B613EE3-B4F8-4F6E-8A36-74BACF0C86FC@yandex.ru
Message-ID: 4B1ADA0E.5070207@btinternet.com
Message-Id: 36C40624-B050-4A8C-8CAF-F15D84467180@phys.washington.edu
Message-ID: SNT119-W313697775F905AE968566CC6920@phx.gbl
Message-id: alpine.DEB.2.00.0912052309170.31599@anubis.informatik.uni-halle.de
Message-ID: 29bf512f0912051423safd7842ka39c8b8b6dee1ac0@mail.gmail.com
So all these completely unrelated messages have somehow received the same guid? And that guid is stored even in the storage files themselves so they cannot be cleaned up even with force-resync? Something is _seriously_ wrong.
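For anyone wanting to see every affected GUID rather than only the tail of the list, a rough sketch along these lines prints each GUID that occurs more than once together with its count. The name shared_guids is made up for illustration, and the input is again assumed to consist of "guid: <hex>" lines.

```shell
# Hypothetical helper: print "<count> <guid>" for each GUID seen more than once.
# Assumption: stdin carries lines containing "guid: <hex>".
shared_guids() {
    awk '
        /guid: / { n[$NF]++ }              # tally occurrences per GUID
        END {
            for (g in n)
                if (n[g] > 1) print n[g], g
        }
    ' | sort -n                            # smallest counts first, like the listing above
}

# Usage (needs a working Dovecot setup):
#   doveadm fetch guid all | shared_guids
```

An empty result would mean every message has its own GUID.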
The complexity and opaqueness of the mdbox format is worrisome. It would ease my mind quite a bit if there were a simple tool that would just dump out the plain message contents that are stored inside the storage files, without involving any of dovecot's index machinery. Then I would at least know that whatever happens, as long as the storage files stay intact, I can always migrate my mails into some other format.
Lauri
On 31.1.2012, at 18.34, Lauri Alanko wrote:
Well, well, well. This is interesting. Back with the indices created by dsync:
$ doveadm fetch guid all | grep guid: | sort | uniq -c | sort -n | tail
17 guid: 1b28b22d4b2ee2885b5b81221c41201d
17 guid: 730c692395661dd62f82088804b85652
17 guid: 865e1537fddba6698e010d0b9dbddd02
..
http://hg.dovecot.org/dovecot-2.0/rev/4a0b7dec3a22 avoids force-resync deleting these duplicates. It also logs a warning about the duplicates.
http://hg.dovecot.org/dovecot-2.1/rev/2500de8f1f51 implements the mbox_md5=all setting, which avoids creating these duplicates in the first place. I thought about adding some duplicate detection to dsync as well (or anywhere in its path), but I couldn't do it without impacting performance in normal operation.
The complexity and opaqueness of the mdbox format is worrisome. It would ease my mind quite a bit if there were a simple tool that would just dump out the plain message contents that are stored inside the storage files, without involving any of dovecot's index machinery. Then I would at least know that whatever happens, as long as the storage files stay intact, I can always migrate my mails into some other format.
By using Dovecot indexes you could use e.g. "doveadm fetch" to dump them. "doveadm dump" can also dump the dbox files' metadata, but not the message contents themselves. It probably wouldn't be difficult to implement that, though. Alternatively, you could build something based on http://dovecot.org/tools/mdbox-obfuscate.pl
Participants (2): Lauri Alanko, Timo Sirainen