[Dovecot] Deduplicate not processing all messages - bug?
Jiří Bourek
bourek at thinline.cz
Tue Apr  1 08:48:15 UTC 2014

Judging from the lack of replies, I guess either not many people use
this feature, or it is supposed to work this way.
After a bit more research I realized that repeatedly calling doveadm
deduplicate until the message count stops changing won't be very
reliable. In a busy mailbox the loop is prone to stop prematurely: if
deduplicate removes x messages and x new messages arrive in the mailbox
in the meantime, the count is unchanged, it looks as if nothing was
done, and the loop ends.
Solving this requires knowing more about the contents of the mailbox,
which leads to avoiding deduplicate altogether. I'm thinking along the
lines of using doveadm fetch to get the guid, date.saved, mailbox-guid
and uid fields, finding duplicates by guid, preserving the message with
the oldest date.saved, and removing the rest with doveadm expunge using
mailbox-guid and uid.
I'll probably be duplicating most of doveadm deduplicate, but in the end 
it should prove more reliable.
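A rough sketch of the selection step described above, not a tested
replacement for doveadm deduplicate. It assumes the fetch output has
already been turned into tab-separated lines of guid, date.saved,
mailbox-guid and uid (for example with the tab formatter), and that
date.saved looks like "YYYY-MM-DD HH:MM:SS" so it sorts correctly as
plain text. The function name pick_duplicates is made up for this
example.

```shell
#!/bin/sh
# pick_duplicates (hypothetical helper): reads tab-separated records
#   guid <TAB> date.saved <TAB> mailbox-guid <TAB> uid
# on stdin and prints "mailbox-guid <TAB> uid" for every copy of a guid
# except the one with the oldest date.saved.
pick_duplicates() {
    # Sort by guid, then by date.saved, so the oldest copy of each guid
    # comes first. LC_ALL=C keeps the ordering bytewise and predictable.
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2 |
    awk -F '\t' '
        { key = $1 "" }                      # force string comparison
        key == "guid" { next }               # skip a header line, if any
        key == prev { print $3 "\t" $4 }     # later copy: expunge it
        { prev = key }
    '
}

# Possible usage against a live server (assumption: the tab formatter
# prints one record per line; check the output on your version first):
#
# doveadm -f tab fetch -u test "guid date.saved mailbox-guid uid" \
#         mailbox INBOX all |
#   pick_duplicates |
#   while IFS="$(printf '\t')" read -r mbox uid; do
#       doveadm expunge -u test mailbox-guid "$mbox" uid "$uid"
#   done
```

Because each expunge names an explicit mailbox-guid and uid, new mail
arriving while the script runs does not affect which messages get
removed. Note that the sketch treats every repeated guid as a duplicate,
so the fetch should be limited to the mailbox being cleaned up.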
Just my 2 cents in case someone else runs into this issue.
Jiri Bourek wrote:
> Hello,
>
> I'm trying to create automated backup recovery using "doveadm import"
> and "doveadm deduplicate". During testing I noticed that deduplicate
> only deletes some duplicates and has to be called multiple times to find
> them all. Here's what I've been trying (in shell commands):
>
> First, expunge inbox (the end result is the same even if you delete only
> some messages):
>
> # doveadm expunge -u test mailbox inbox all
> # ls /home/mailboxes/test/cur | wc -l
> 0
>
> Then import data from backup - twice, so duplicates are created (again,
> if you don't delete all messages and call import only once, resulting
> behaviour is the same.)
>
> # doveadm import -u test maildir:/home/test "" mailbox INBOX
> # doveadm import -u test maildir:/home/test "" mailbox INBOX
> # ls /home/mailboxes/test/cur | wc -l
> 1046
>
> Then try to deduplicate
>
> # doveadm deduplicate -u test mailbox INBOX
> # ls /home/mailboxes/test/cur | wc -l
> 1040
>
> And again
>
> # doveadm deduplicate -u test mailbox INBOX
> # ls /home/mailboxes/test/cur | wc -l
> 1029
>
> And so on until the message count settles at 523
>
> Each repetition removes 10 - 30 duplicates so eventually all duplicates
> are removed if "doveadm deduplicate" is called enough times in a row. I
> also noticed that when I repeat the test, import the backup again and
> call deduplicate, the steps - how many messages are removed at one time
> - are the same. That is I start with 1046 messages in the mailbox, after
> first run there's 1040, then 1029 and so on. My guess would be the
> behaviour depends on what is stored in the mailbox, but that's pretty
> much all I can figure out on my own at this time.
>
> My question is - is this intended behaviour, i.e. are you supposed to run
> doveadm deduplicate as long as the number of messages in the mailbox
> keeps changing? Or is it a bug? Tried to Google for the answer but no
> luck, so thanks for any answers.
>
> Tested on Dovecot version 2.2.9 and 2.2.12 (both from Debian repositories.)
    
    