On 25.6.2013, at 14.14, Charles Marcus <CMarcus@Media-Brokers.com> wrote:
+ doveadm: Added "deduplicate" command to expunge message duplicates.
Hey Timo,
2 questions on this new 'deduplicate' capability of doveadm...
Obviously this could be scripted with a cron job, but I was wondering if it wouldn't make sense to do this automatically whenever messages are being moved around in the mailstore?
An interesting 'feature' of Gmail is that when you copy lots of messages from a non-Gmail account to a Gmail account through IMAP, and the folder you are copying from contains duplicate messages, Gmail silently discards the duplicates after the first one is successfully copied up...
I discovered this a long time ago the first time I encountered an anomaly where I copied an entire folder, but the number of messages on the gmail account didn't match the number in the source folder. After comparing, I discovered that there were duplicates in the source folder, which accounted for the discrepancy.
There's currently no efficient way to do that automatically in Dovecot. There are also several potential problems in deciding what counts as a duplicate. If two messages have the same Message-ID: header but different bodies, are they duplicates? What if the bodies are the same but the headers differ, e.g. in the Subject: line (maybe only by a [Dovecot] prefix)? What if only the Received: headers differ? And so on.
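To make the ambiguity concrete, here is a minimal Python sketch (not anything Dovecot does internally) that computes three candidate "identity" keys for a message and compares two copies of the same mail, one of which went through a mailing list. The function name `duplicate_keys`, the key names and the sample messages are all made up for illustration.

```python
import hashlib
from email import message_from_string


def duplicate_keys(raw):
    """Return several candidate identity keys for one plain-text message."""
    msg = message_from_string(raw)
    body = msg.get_payload()  # a str for a non-multipart message
    # All headers except Received:, in their original order.
    other_headers = [(k.lower(), v) for k, v in msg.items()
                     if k.lower() != "received"]
    return {
        "message_id": (msg.get("Message-ID") or "").strip(),
        "body_sha1": hashlib.sha1(body.encode("utf-8")).hexdigest(),
        "headers_sans_received":
            hashlib.sha1(repr(other_headers).encode("utf-8")).hexdigest(),
    }


# Hypothetical sample data: the same mail, delivered directly and via a list.
ORIGINAL = (
    "Received: from a.example by b.example\n"
    "Message-ID: <1@example.com>\n"
    "Subject: deduplicate question\n"
    "\n"
    "Same body text.\n"
)
LIST_COPY = (
    "Received: from list.example by b.example\n"
    "Message-ID: <1@example.com>\n"
    "Subject: [Dovecot] deduplicate question\n"
    "\n"
    "Same body text.\n"
)

a = duplicate_keys(ORIGINAL)
b = duplicate_keys(LIST_COPY)
# For each candidate key, do the two copies count as "the same message"?
same = {name: a[name] == b[name] for name in a}
print(same)
```

The two copies match on Message-ID and on body, but not on the headers once Received: is ignored, because the list added a Subject: prefix. Which of those answers is "right" depends on what you want, which is exactly the problem.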
Anyway, copying and pasting what I just wrote in another reply about doveadm deduplicate:
The main idea behind it is to be able to revert some (more or less) accidental duplication of emails caused by something the admin did, or possibly by some bug in Dovecot (e.g. dsync). There are two modes of operation; both work only for duplicates within the same folder:
1. Deduplicate by message GUID. These duplicates can only have been caused by copying the mail (IMAP COPY, doveadm copy) or by "doveadm import" importing old messages from e.g. a backup.
2. Deduplicate by Message-ID: header (-m parameter). I added this just because some people had asked for it previously. I'm not sure how/when it's actually useful.
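As a rough illustration of how the two modes differ, here is a small Python sketch of per-folder deduplication by a chosen key. This is not Dovecot's code: the record layout (`uid`, `guid`, `message_id`), the function name `find_duplicates`, and the choice to keep the first copy and expunge the later ones are all assumptions made for the example.

```python
def find_duplicates(messages, key):
    """Return the UIDs to expunge from ONE folder.

    messages: list of dicts in ascending UID order (hypothetical layout).
    key: "guid" or "message_id", the field used to detect duplicates.
    The first message seen with a given key is kept (an assumption).
    """
    seen = set()
    to_expunge = []
    for m in messages:
        k = m.get(key)
        if not k:
            continue  # no key available: never treat it as a duplicate
        if k in seen:
            to_expunge.append(m["uid"])
        else:
            seen.add(k)
    return to_expunge


# Hypothetical folder contents.
folder = [
    {"uid": 1, "guid": "g-a", "message_id": "<1@example.com>"},
    {"uid": 2, "guid": "g-a", "message_id": "<1@example.com>"},  # copy of uid 1
    {"uid": 3, "guid": "g-b", "message_id": "<1@example.com>"},  # delivered twice
    {"uid": 4, "guid": "g-c", "message_id": "<2@example.com>"},
]

by_guid = find_duplicates(folder, "guid")
by_message_id = find_duplicates(folder, "message_id")
print(by_guid, by_message_id)
```

In this made-up folder, GUID mode only catches the copy (uid 2), while Message-ID mode also catches uid 3, which has a different GUID but the same Message-ID: header. Whether uid 3 really is an unwanted duplicate is the kind of question raised above.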