On 31.1.2012, at 18.34, Lauri Alanko wrote:
Well, well, well. This is interesting. Back with the indices created by dsync:
$ doveadm fetch guid all | grep guid: | sort | uniq -c | sort -n | tail
     17 guid: 1b28b22d4b2ee2885b5b81221c41201d
     17 guid: 730c692395661dd62f82088804b85652
     17 guid: 865e1537fddba6698e010d0b9dbddd02
..
http://hg.dovecot.org/dovecot-2.0/rev/4a0b7dec3a22 prevents force-resync from deleting these duplicates. It also logs a warning about the duplicates.
http://hg.dovecot.org/dovecot-2.1/rev/2500de8f1f51 implements the mbox_md5=all setting, which avoids creating these duplicates in the first place. I also thought about adding some duplicate detection to dsync (or anywhere in its path), but I couldn't do it without impacting performance in normal operation.
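For checking after the fact, outside dsync, the shell pipeline quoted above could be sketched in Python roughly as follows. This is only an illustration, not something that exists in Dovecot: the function name duplicate_guids is made up, and it assumes the input consists of "guid: <value>" lines, as the grep in the quoted command suggests.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def duplicate_guids(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (guid, count) pairs for GUIDs seen more than once.

    Hypothetical helper: it assumes each relevant input line looks like
    "guid: <value>", as in the quoted `doveadm fetch guid all` pipeline.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Keep only the "guid:" lines, like `grep guid:` does.
        if line.startswith("guid:"):
            counts[line[len("guid:"):].strip()] += 1
    # Keep GUIDs that occur more than once.
    dupes = [(guid, n) for guid, n in counts.items() if n > 1]
    # Most frequent first; ties are broken by GUID for a stable order.
    dupes.sort(key=lambda item: (-item[1], item[0]))
    return dupes
```

One way to use it would be to pass sys.stdin to the function and pipe the output of "doveadm fetch guid all" into the script. Unlike the shell version, this lists only GUIDs that occur more than once, so an empty result means no duplicates were found.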
The complexity and opaqueness of the mdbox format is worrisome. It would ease my mind quite a bit if there were a simple tool that would just dump out the plain message contents that are stored inside the storage files, without involving any of dovecot's index machinery. Then I would at least know that whatever happens, as long as the storage files stay intact, I can always migrate my mails into some other format.
By using Dovecot's indexes you could dump them with e.g. "doveadm fetch". "doveadm dump" can also dump the dbox files' metadata, but not the message contents themselves. It probably wouldn't be difficult to implement that, though. Alternatively, you could build something based on http://dovecot.org/tools/mdbox-obfuscate.pl