[Dovecot] Mail deduplication

Charles Marcus CMarcus at Media-Brokers.com
Tue Apr 30 16:52:22 EEST 2013


On 2013-04-30 2:05 AM, Angel L. Mateo <amateo at um.es> wrote:
> On 30/04/13 03:28, Tim Groeneveld wrote:
>> I am wondering about mail deduplication. I am looking into the
>> possibility of separating out all of the message bodies with multiple
>> parts inside mail that is received from `dovecot` and hashing them all.
>>
>> The idea is that by hashing all of the parts inside the email, I will be
>> able to ensure that each part of the email will only be saved once.
>>
>> This means that attachments & common parts of the body will only be
>> saved once inside the storage.
>>
>> How achievable would this be with the current state of dovecot? Would it
>> even be worth doing?
>>
>     I asked the same question recently. As Timo responded at 
> http://kevat.dovecot.org/list/dovecot/2013-March/089072.html it seems 
> that this feature is production stable in recent versions of dovecot.
>
>     And I think it is worth it. My estimate (based on only about 10
> users of my organization, so it is not accurate) is that you can save
> more than 30% of total mail storage.
>
>     To configure it you need to use options:
>
> * mail_attachment_dir
> * mail_attachment_min_size
> * mail_attachment_fs
> * mail_attachment_hash
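
For reference, a minimal sketch of how those four settings might look in
dovecot.conf. The directory is an assumed path for illustration only; the
other three values are, as far as I know, the shipped defaults. External
attachment storage is, to my knowledge, only available with the sdbox and
mdbox mailbox formats, so check the documentation for your version:

```
# Sketch only: this directory is an assumed path, not taken from this thread.
mail_attachment_dir = /var/vmail/attachments
# Attachments smaller than this stay inline in the message.
mail_attachment_min_size = 128k
# "sis posix" asks for single-instance storage on a POSIX filesystem.
mail_attachment_fs = sis posix
# Hash used to decide whether two attachments are identical.
mail_attachment_hash = %{sha1}
```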

This only dedupes attachments - which, in my opinion, is the only part 
of deduplicating email that is really worth it.

Yes, you might be able to recapture a minuscule amount of storage space 
as a percentage of total mailstore size by deduping the other mime parts 
(headers, body, etc), but the complexity of doing this for each message 
part is, in my opinion, overkill, way too error-prone for my comfort 
level, and just not enough bang for the buck.

Deduping attachments on the other hand can have a dramatic impact 
(depending on your system usage and requirements), and is reliable 
enough to make it well worth it for some.

I am expecting at least a 40-60% reduction in our storage when I 
implement this on my new server soon (will report back once it is 
completed). We use a lot of large attachments, and our idiot users save 
multiple copies, sometimes resending the same one many times to 
different people (so, maybe 3 or sometimes even 10+ copies of the same 
20MB attachment in their Sent folder).
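
One rough way to sanity-check an estimate like this before migrating is to
hash every attachment in an existing mail store and compare total bytes
against unique bytes. Below is a minimal Python sketch under stated
assumptions: the function name `attachment_savings` and the way messages
are fed in are hypothetical, the 128k threshold simply mirrors
mail_attachment_min_size, and "attachment" here means any part with a
Content-Disposition of attachment, which is not necessarily how Dovecot
itself decides.

```python
import email
import hashlib
from email import policy


def attachment_savings(raw_messages, min_size=128 * 1024):
    """Return (total_bytes, unique_bytes) for attachments >= min_size.

    raw_messages: iterable of bytes objects, each one complete message.
    """
    seen = set()          # digests of attachments already counted
    total = unique = 0
    for raw in raw_messages:
        msg = email.message_from_bytes(raw, policy=policy.default)
        for part in msg.walk():
            if part.get_content_disposition() != "attachment":
                continue
            # Decoded (e.g. base64-stripped) content; None for containers.
            payload = part.get_payload(decode=True) or b""
            if len(payload) < min_size:
                continue
            total += len(payload)
            digest = hashlib.sha1(payload).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique += len(payload)
    return total, unique
```

The expected reduction for attachments is then 1 - unique/total. For
example, three copies of one 20MB attachment give 60MB total against 20MB
unique, or roughly a 67% saving for that attachment alone.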

Anyway, that's my .02

-- 

Best regards,

Charles



