Non-destructive deduplication

Tristan Miller psychonaut at nothingisreal.com
Mon Jan 15 16:46:45 EET 2018


Greetings.

On Sun, 11 Jun 2017 08:25:33 +0300 (EEST), Aki Tuomi
<aki.tuomi at dovecot.fi> wrote:
> > On June 10, 2017 at 10:30 PM Tristan Miller
> > <psychonaut at nothingisreal.com> wrote:
> > 
> > I use Dovecot 2.2.29.1 as my IMAP server.  Owing to a bug in my mail
> > client [1], several unique messages (mostly in my Sent folder) have
> > duplicate Message-ID headers.  Dovecot itself doesn't seem to be
> > bothered by this, though my mail client is confused by the false
> > duplicates.  (It screws up the threading display, and results in
> > data loss when I run the client's deduplication filter.)
> > 
> > I would like to change the Message-ID headers in the "duplicate"
> > message files stored in my IMAP server so that they are unique.  I
> > realize that this has disadvantages of its own, but I can't think
> > of a better alternative.
> > 
> > So this leads me to two questions:
> > 
> > 1) Will Dovecot get confused if I simply open the message files in a
> > text editor and manually change their Message-ID headers?  If so,
> > how should I go about changing the Message-ID headers?
> 
> Expunge the mails, then edit message-id's and import them back.

I finally got around to this and thought I'd report on my experiences
in case anyone else wants to do the same thing.

From a Bash shell, I made a backup copy of the entire IMAP folder
containing the messages with duplicate Message-ID headers.  I then used
"doveadm deduplicate" with the -m option to deduplicate the original
folder.  This didn't produce any logging output (even with -v), so I ran
ls in both the original and backup folders, redirected each listing to
a file, and used comm to find out which files had been deleted.  I
copied those files from the backup to a temporary directory and then
rewrote the Message-ID headers as follows:

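for f in *; do
    # Prefix each Message-ID with "dedup.XXX.", where XXX is three
    # random characters generated by pwgen.
    sed -i -e "s/^Message-ID: </Message-ID: <dedup.$(pwgen 3 1)./" "$f"
done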
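
For anyone repeating this, the ls/comm comparison described before the
loop could be sketched roughly as below.  This is only an illustration,
not the exact commands used: the function name list_deleted and the
example paths are made up, and it assumes the original and backup
directories contain plain message files.

```shell
# Hypothetical helper: print the names of files that exist in the
# backup directory ($2) but no longer exist in the original ($1).
list_deleted() {
    orig_list=$(mktemp) && backup_list=$(mktemp) || return 1
    # comm needs sorted input; LC_ALL=C keeps sort and comm consistent.
    ls "$1" | LC_ALL=C sort > "$orig_list"
    ls "$2" | LC_ALL=C sort > "$backup_list"
    # -1 hides names found only in the original, -3 hides names found
    # in both, leaving the names found only in the backup.
    LC_ALL=C comm -13 "$orig_list" "$backup_list"
    rm -f "$orig_list" "$backup_list"
}

# Example call (placeholder paths, adjust to your own layout):
# list_deleted ~/Maildir/.Sent/cur ~/Maildir-backup/.Sent/cur
```

The names it prints are the files to copy out of the backup for
editing.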

So far, so good.  The biggest problem was re-importing the revised
messages.  The usual "doveadm import" refuses to process the revised
messages, throwing a bunch of "Cached message size larger than expected"
errors.  I eventually realized that this is because the size of the
message is encoded in the filename, and Dovecot actually reads this
size and compares it against the actual size of the message.  Unless
the Message-ID headers are replaced with ones of identical length, the
actual size of the new message will be different from what's encoded
in the filename.
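
To see which files are affected before attempting an import, a check
along the following lines may help.  It is a sketch under one
assumption: that the size appears in the filename as an "S=<bytes>"
field, which is how Dovecot normally names Maildir files.  The function
name check_size is made up for this example.

```shell
# Hypothetical helper: compare the S=<bytes> field in a Maildir
# filename with the real size of the file.  Returns 1 on a mismatch.
check_size() {
    name=${1##*/}
    # Pull out the digits after ",S="; empty if the field is absent.
    recorded=$(printf '%s\n' "$name" |
        sed -n 's/.*,S=\([0-9][0-9]*\).*/\1/p')
    # Actual size in bytes (tr strips padding some wc versions add).
    actual=$(wc -c < "$1" | tr -d ' ')
    if [ -z "$recorded" ]; then
        echo "no S= field: $name"
    elif [ "$recorded" -ne "$actual" ]; then
        echo "mismatch: $name (name says $recorded, file is $actual)"
        return 1
    fi
}

# Example: report every edited file whose recorded size is now wrong.
# for f in *; do check_size "$f"; done
```

Any file reported as a mismatch will need its filename corrected before
it can be imported.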

The Dovecot website provides a Perl script for recomputing the message
size and renaming the files accordingly:
<https://www.dovecot.org/tools/maildir-size-fix.pl>.  Running this
script threw an error about trying to pass "kill" a non-numeric
argument, but it seemed to work anyway, and I was finally able to run
"doveadm import".

Regards,
Tristan

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
                  Tristan Miller
Free Software developer, ferret herder, logologist
             https://logological.org/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

