Non-destructive deduplication
Greetings.
I use Dovecot 2.2.29.1 as my IMAP server. Owing to a bug in my mail client [1], several unique messages (mostly in my Sent folder) have duplicate Message-ID headers. Dovecot itself doesn't seem to be bothered by this, though my mail client is confused by the false duplicates. (It screws up the threading display, and results in data loss when I run the client's deduplication filter.)
I would like to change the Message-ID headers in the "duplicate" message files stored in my IMAP server so that they are unique. I realize that this has disadvantages of its own, but I can't think of a better alternative.
So this leads me to two questions:
Will Dovecot get confused if I simply open the message files in a text editor and manually change their Message-ID headers? If so, how should I go about changing the Message-ID headers?
Does there exist any tool that will scan a maildir folder to find messages with duplicate Message-IDs, and automatically rewrite the Message-ID headers so that they are (probably) unique? I could probably kludge together a quick shell script, but if someone's already gone to the trouble of writing such a tool and making it relatively robust, I'd rather use that instead.
Regards, Tristan
[1] http://www.thewildbeast.co.uk/claws-mail/bugzilla/show_bug.cgi?id=3828
--
Tristan Miller
Free Software developer, ferret herder, logologist https://logological.org/
On June 10, 2017 at 10:30 PM Tristan Miller <psychonaut@nothingisreal.com> wrote:
- Will Dovecot get confused if I simply open the message files in a text editor and manually change their Message-ID headers? If so, how should I go about changing the Message-ID headers?
Expunge the mails, then edit the Message-IDs and import them back.
- Does there exist any tool that will scan a maildir folder to find messages with duplicate Message-IDs, and automatically rewrite the Message-ID headers so that they are (probably) unique? I could probably kludge together a quick shell script, but if someone's already gone to the trouble of writing such a tool and making it relatively robust, I'd rather use that instead.
https://wiki2.dovecot.org/Tools/Doveadm/Deduplicate should let you expunge duplicate messages.
Greetings.
On Sun, 11 Jun 2017 08:25:33 +0300 (EEST), Aki Tuomi <aki.tuomi@dovecot.fi> wrote:
https://wiki2.dovecot.org/Tools/Doveadm/Deduplicate should let you expunge duplicate messages.
Yeah, simply removing duplicates is pretty trivial, and any decent mail client will also have a nice way of doing this. The tricky part in my case is not removing the duplicates, but preserving them and giving them unique Message-IDs. If there's no existing way of automating this then I'll do as you suggest.
Regards, Tristan
Greetings.
On Sun, 11 Jun 2017 08:25:33 +0300 (EEST), Aki Tuomi <aki.tuomi@dovecot.fi> wrote:
Expunge the mails, then edit the Message-IDs and import them back.
I finally got around to this and thought I'd report on my experiences in case anyone else wants to do the same thing.
From a Bash shell, I made a backup copy of the entire IMAP folder containing the messages with duplicate Message-ID headers. I then used "doveadm deduplicate" with the -m option to deduplicate the original folder. This didn't produce any logging output (even with -v), so from both the original and backup folders, I ran ls and redirected the output to a file, then used comm to find out which files had been deleted. I copied those files to a temporary directory and then rewrote the Message-ID headers as follows:
# Prefix each Message-ID with "dedup." plus a short random token from pwgen
for f in *; do
    sed -i -e 's/^Message-ID: </Message-ID: <dedup.'"$(pwgen 3)"'./' "$f"
done
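The ls/comm comparison that precedes the rewrite could be sketched roughly as below. This is only an illustration: the function name list_removed and both directory arguments are hypothetical, and it assumes the filenames in the two folders are otherwise identical (a message whose flags changed in the meantime would have a different filename and show up as "removed").

```shell
# Sketch: print the names of files that exist in the backup folder but
# are no longer present in the deduplicated original folder.
# list_removed and its two arguments are illustrative names.
list_removed() {
    backup_dir=$1
    original_dir=$2
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
    # comm requires both inputs to be sorted in the same collation order.
    ls -1 "$backup_dir" | LC_ALL=C sort > "$tmp_a"
    ls -1 "$original_dir" | LC_ALL=C sort > "$tmp_b"
    # -23 suppresses lines unique to the second list and lines common to
    # both, leaving only names that appear in the backup alone.
    LC_ALL=C comm -23 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}
```

Something like `list_removed /path/to/backup/cur /path/to/original/cur` would then give the list of files to copy to the temporary directory; both paths are placeholders.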
So far, so good. The biggest problem was re-importing the revised messages. The usual "doveadm import" refuses to process the revised messages, throwing a bunch of "Cached message size larger than expected" errors. I eventually realized that this is because the size of the message is encoded in the filename, and Dovecot actually reads this size and compares it against the actual size of the message. Unless the Message-ID headers are replaced with ones of identical length, the actual size of the new message will differ from what's encoded in the filename.
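To see which files are affected before attempting an import, a check along the following lines may help. It is a sketch under the assumption that the files use Dovecot's maildir filename convention, where the byte size is recorded in an ",S=<bytes>" field; the function name check_sizes is made up for illustration, and the sketch does not look at the separate ",W=" field, which would also be stale after an edit.

```shell
# Sketch: report files whose ",S=<bytes>" filename field does not match
# the file's actual size. check_sizes is an illustrative name; the
# ",S=" field is assumed to follow Dovecot's maildir naming convention.
check_sizes() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        # Skip files that carry no size field at all.
        case $name in
            *,S=[0-9]*) ;;
            *) continue ;;
        esac
        recorded=${name#*,S=}           # drop everything through the first ",S="
        recorded=${recorded%%[!0-9]*}   # keep only the leading digits
        actual=$(wc -c < "$f" | tr -d ' ')
        if [ "$recorded" != "$actual" ]; then
            printf '%s: filename says %s, file is %s bytes\n' \
                "$name" "$recorded" "$actual"
        fi
    done
}
```

Running `check_sizes .` in the temporary directory after the sed step should list every rewritten message, since each gained a few bytes in its Message-ID header.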
The Dovecot website provides a Perl script for recomputing the message size and renaming the files accordingly: <https://www.dovecot.org/tools/maildir-size-fix.pl> Running this script threw an error about trying to pass "kill" a non-numeric argument, but it seemed to work anyway, and I was finally able to run "doveadm import".
Regards, Tristan
participants (2): Aki Tuomi, Tristan Miller