Hi Guys,
I am wondering about mail deduplication. I am looking into the possibility of separating out all of the message bodies with multiple parts inside mail that is received from Dovecot and hashing them all.
The idea is that by hashing all of the parts inside the email, I will be able to ensure that each part of the email is saved only once.
This means that attachments and common parts of the body will only be saved once inside the storage.
How achievable would this be with the current state of Dovecot? Would it even be worth doing?
Thanks, Tim
On 30/04/13 03:28, Tim Groeneveld wrote:
[snip]
I asked the same question recently. As Timo responded at
http://kevat.dovecot.org/list/dovecot/2013-March/089072.html, it seems that this feature is production-stable in recent versions of Dovecot.
And I think it is worth it. My estimate (based on only about 10 users in my organization, so it is not accurate) is that you can save more than 30% of total mail storage.
To configure it you need to use options:
- mail_attachment_dir
- mail_attachment_min_size
- mail_attachment_fs
- mail_attachment_hash
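As an illustration only, the four settings might be put together as in the sketch below. The attachment path /var/vmail/attachments, the helper name write_sis_example and the output file name are assumptions made up for this example. The other three values are believed to match the documented defaults, so check them against the documentation for your Dovecot version before relying on them.

```shell
# Write an example SIS configuration fragment to a scratch file for review.
# write_sis_example, the output file name and the attachment directory are
# hypothetical; nothing here touches a live Dovecot configuration.
write_sis_example() {
    out=${1:-./sis-example.conf}
    cat > "$out" <<'EOF'
# Root directory for externally stored attachments (assumed path)
mail_attachment_dir = /var/vmail/attachments
# Only parts at least this large are stored externally
mail_attachment_min_size = 128k
# "sis posix" enables single instance storage on a POSIX filesystem
mail_attachment_fs = sis posix
# Hash used to name and compare stored attachments
mail_attachment_hash = %{sha1}
EOF
}
```

After merging such a fragment into the real configuration, `doveconf -n | grep mail_attachment` shows which values Dovecot actually picked up.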
-- Angel L. Mateo Martínez Sección de Telemática Área de Tecnologías de la Información y las Comunicaciones Aplicadas (ATICA) http://www.um.es/atica Tfo: 868889150 Fax: 868888337
On 04/30/2013 08:05 AM, Angel L. Mateo wrote:
[snip]
Hello,
Is it just working, or is it working in an optimal way? Back in October 2011 we noticed that the deduplication wasn't working as well as we expected, as some files weren't properly deduplicated (http://markmail.org/message/ymfdwng7un2mj26z). Timo, did you ever hit that bug, and did you fix it if there was anything to fix on your side?
Since we are very interested in this feature, I am very eager to hear from admins using it at a similar scale (around 80,000 mailboxes).
Thanks,
Arnaud
-- Arnaud Abélard (jabber: arnaud.abelard@univ-nantes.fr) Administrateur Système - Responsable Services Web Direction des Systèmes d'Informations Université de Nantes
ne pas utiliser: trapemail@univ-nantes.fr
Wasn't there also some issue with cleanup of attachments? Not being able to delete the last copy, or something. I did some testing of using SIS on a backup dsync destination a year (or two) ago, and got quite confused. I don't quite remember the problems I had, but I did lose confidence in it and decided that keeping the attachments together with the messages felt safest.
I would also love to hear from admins using it on a large scale (100K+ active users). Maybe we should reconsider using it.
-jf
On Tue, Apr 30, 2013 at 9:04 AM, Arnaud Abélard <arnaud.abelard@univ-nantes.fr> wrote:
[snip]
On 30/04/13 11:22, Jan-Frode Myklebust wrote:
Wasn't there also some issue with cleanup of attachments? Not being able to delete the last copy, or something.
In the tests I have done (with Dovecot 2.1.16), cleanup works well. When the last copy of the message is deleted, the attachment is deleted. But keep in mind that when using mdbox, the message has to be purged before it is really deleted.
I would also love to hear from admins using it on large scale (100K+ active users). Maybe we should reconsider using it..
I'm planning to use it on a server with 60-70K users, but it is not in production yet.
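For reference, a minimal sketch of the expunge-then-purge sequence with doveadm on mdbox. The user name and mailbox are placeholders, and the wrapper function name is made up for this example.

```shell
# Expunge everything in one mailbox, then purge so that mdbox really
# removes messages whose reference count has dropped to zero.
# expunge_and_purge is a hypothetical helper, not part of Dovecot.
expunge_and_purge() {
    user=$1
    mailbox=$2
    doveadm expunge -u "$user" mailbox "$mailbox" all &&
    doveadm purge -u "$user"
}

# Example (placeholder values):
#   expunge_and_purge user@example.com Trash
```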
-- Angel L. Mateo Martínez Sección de Telemática Área de Tecnologías de la Información y las Comunicaciones Aplicadas (ATICA) http://www.um.es/atica Tfo: 868889150 Fax: 868888337
On 30.4.2013, at 12.22, Jan-Frode Myklebust janfrode@tanso.net wrote:
[snip]
I'm not aware of any bugs in SIS, but yeah, it can be a bit complicated. If you do things like dsync where the destination is also mdbox/sdbox, it's going to keep using the same SIS directory and updating the refcounts, which you probably don't want for backups / temp directories (solution: give different parameters to the two sides of dsync, so that one side disables SIS).
On 2013-05-06 10:54 AM, Timo Sirainen tss@iki.fi wrote:
[snip]
Hey Timo - so, how will rsync be affected as a backup app? Will it maintain the deduped state in the backup target?
--
Best regards,
Charles
On 6.5.2013, at 18.03, Charles Marcus CMarcus@Media-Brokers.com wrote:
Hey Timo - so, how will rsync be affected as a backup app? Will it maintain the deduped state in the backup target?
Ideally you'd rsync from a filesystem snapshot instead of from the live filesystem, otherwise the link counts might go wrong. And you need to use the -H parameter for rsync so that it preserves hard links.
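A rough sketch of what that could look like. The snapshot mount point and backup destination are placeholders, the helper name is invented, and taking the snapshot itself (LVM, ZFS, etc.) is left out. Only the -H flag comes from the advice above; -a is an assumed typical companion option.

```shell
# Copy a mounted snapshot of the mail store to a backup directory.
#   -a  archive mode (permissions, times, symlinks, ...)
#   -H  preserve hard links, so files that share an inode in the source
#       still share one in the backup
# backup_mailstore is a hypothetical helper name.
backup_mailstore() {
    snapshot_mount=$1   # e.g. /mnt/mail-snap (placeholder)
    backup_dir=$2       # e.g. /backup/mail   (placeholder)
    rsync -aH "$snapshot_mount"/ "$backup_dir"/
}
```

Without -H, rsync would write every hard-linked file as a separate copy, so the backup could end up much larger than the deduplicated source.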
On 2013-05-06 11:23 AM, Timo Sirainen tss@iki.fi wrote:
[snip]
Understood, and figured as much - I'll be using LVM snapshots and rsnapshot (which keeps backups using hard links against previous backup snapshots, making it easy to keep backups going back years without taking up much additional space).
Thanks,
--
Best regards,
Charles
On 2013-05-06 12:06 PM, Charles Marcus CMarcus@Media-Brokers.com wrote:
[snip]
Specifically, rsnapshot uses rsync, and then some manipulation magic to rotate the snapshots.
--
Best regards,
Charles
On 2013-04-30 2:05 AM, Angel L. Mateo amateo@um.es wrote:
[snip]
To configure it you need to use options:
- mail_attachment_dir
- mail_attachment_min_size
- mail_attachment_fs
- mail_attachment_hash
This only dedupes attachments - which, in my opinion, is the only part of deduplicating email that is really worth it.
Yes, you might be able to recapture a minuscule amount of storage space as a percentage of total mailstore size by deduping the other MIME parts (headers, body, etc.), but the complexity of doing this for each message part is, in my opinion, overkill, way too error-prone for my comfort level, and just not enough bang for the buck.
Deduping attachments on the other hand can have a dramatic impact (depending on your system usage and requirements), and is reliable enough to make it well worth it for some.
I am expecting at least a 40-60% reduction in our storage when I implement this on my new server soon (will report back once it is completed). We use a lot of large attachments, and our idiot users save multiple copies, resending the same one sometimes many times to different people (so, maybe 3 or sometimes even 10+ copies of the same 20MB attachment in their Sent folder).
Anyway, that's my .02
--
Best regards,
Charles
----- Original Message -----
This only dedupes attachments - which, in my opinion, is the only part of deduplicating email that is really worth it.
[snip]
I am expecting at least a 40-60% reduction in our storage when I implement this on my new server soon.
Thanks guys for all of your messages. Maybe I was getting too excited about saving storage everywhere possible.
After thinking about it a little bit more, I have determined that just recombining the messages to send them to the client will be too intensive, and will cause extra latencies when retrieving emails.
Regards, Tim
On 2013-04-30 8:00 PM, Tim Groeneveld tim@timgws.com.au wrote:
After thinking about it a little bit more, I have determined that just recombining the messages to send them to the client will be too intensive, and will cause extra latencies when retrieving emails.
Scratching my head trying to figure out what you mean here... ?
What do you mean by 'recombining the messages'?
Again - SIS would not be doing any 'recombining' of anything at any time, and certainly would *never* cause any latency when users retrieve mail.
Also - 'retrieve mail'? Are you talking about POP here? SIS is much more useful in an IMAP environment - if yours is mixed, ok, but can't really see how it would be of much help in a POP only environment.
--
Best regards,
Charles
----- Original Message -----
Scratching my head trying to figure out what you mean here... ?
What do you mean by 'recombining the messages'?
I was thinking of splitting all of the mime parts and recombining them later when the message was requested.
All of the parts would be hashed and stored separate to the message. This would mean things like image signatures and the like would only be stored once.
From what I understand, SIS does not do this. (that being said, I have not looked too deeply into SIS at the moment, as I am currently working on the elasticsearch FTS plugin)
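To make the idea concrete, here is a small sketch of hash-addressed storage for parts that have already been extracted. It is not how Dovecot or SIS is implemented; the function name, the store layout and the use of sha1sum are all assumptions, and splitting a message into its MIME parts is not shown.

```shell
# Store a file under the hex SHA-1 of its contents, so identical content
# is written to the store only once. Prints the digest so that a message
# index could reference the part later.
# store_part is a hypothetical helper.
store_part() {
    part_file=$1
    store_dir=$2
    digest=$(sha1sum "$part_file" | cut -d ' ' -f 1)
    [ -n "$digest" ] || return 1
    mkdir -p "$store_dir"
    if [ ! -e "$store_dir/$digest" ]; then
        cp "$part_file" "$store_dir/$digest"
    fi
    printf '%s\n' "$digest"
}
```

Rebuilding a message would then mean reading every referenced digest back from the store, which is the recombining step discussed in this thread.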
Regards, Tim
On 07/05/13 02:19, Tim Groeneveld wrote:
I was thinking of splitting all of the mime parts and recombining them later when the message was requested.
All of the parts would be hashed and stored separate to the message. This would mean things like image signatures and the like would only be stored once.
From what I understand, SIS does not do this. (that being said, I have not looked too deeply into SIS at the moment, as I am currently working on the elasticsearch FTS plugin)
I think that SIS DOES exactly this. All attachments are split from the original message and stored in a common attachments directory. When the message is requested, the parts are recombined.
-- Angel L. Mateo Martínez Sección de Telemática Área de Tecnologías de la Información y las Comunicaciones Aplicadas (ATICA) http://www.um.es/atica Tfo: 868889150 Fax: 868888337
On 2013-05-07 2:22 AM, Angel L. Mateo amateo@um.es wrote:
I think that SiS DOES exactly this.
That would be incorrect. SIS does *not* split the message up into its different MIME parts (ie, headers, body, etc).
All attachments are split from the original message and stored in a common attachments directory. When the message is requested, the parts are recombined.
*Attachments*, yes (so, an image signature that was an *attachment* would be de-duped, but if it was an *embedded* graphic, I'm pretty sure it would *not* be).
--
Best regards,
Charles
On 7.5.2013, at 13.09, Charles Marcus CMarcus@Media-Brokers.com wrote:
*Attachments*, yes (so, an image signature that was an *attachment* would be de-duped, but if it was an *embedded* graphic, I'm pretty sure it would *not* be).
SIS doesn't by default care whether a MIME part is an attachment or not. It stores externally all MIME parts that are large enough and don't have Content-Type: text/. There's a hook with which plugins could implement different logic, for example not storing embedded images externally, or checking for the Content-Disposition: attachment header.
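Expressed as code, the default rule described above looks roughly like this. It is only an illustration of the rule as stated here, not Dovecot's source: the function name is invented, the 128k threshold assumes mail_attachment_min_size is set to 128k, and treating "large enough" as "at least the threshold" is also an assumption.

```shell
# Return success (0) if a MIME part would be stored externally under the
# rule above: large enough, and Content-Type not starting with "text/".
# stored_externally is a hypothetical helper; sizes are in bytes.
stored_externally() {
    size=$1
    # MIME types are case-insensitive, so compare in lower case
    content_type=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    min_size=${3:-131072}   # 128k, assumed threshold
    [ "$size" -ge "$min_size" ] || return 1
    case $content_type in
        text/*) return 1 ;;
    esac
    return 0
}

# Example: a 500 kB inline image qualifies, a 500 kB HTML body does not.
#   stored_externally 500000 image/png   && echo external
#   stored_externally 500000 text/html   || echo inline
```

Note that the rule never looks at Content-Disposition, which is why an embedded image that is large enough is handled the same way as a regular attachment.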
On 2013-05-15 8:12 AM, Timo Sirainen tss@iki.fi wrote:
[snip]
Interesting... so it actually will SIS inline images/attachments if they are large enough...
Thanks for the correction Timo...
--
Best regards,
Charles
On 30.4.2013 03:28, Tim Groeneveld wrote:
[snip]
Hi Tim,
thank you for your question. I am pleased that I can help you. I had the same problem in the past and there was no solution, so I wrote a script that computes MD5 hashes from the received date and message body. The script then compares the MD5 hashes and deletes duplicate messages. It uses doveadm for message manipulation and openssl for computing the MD5 hashes. Deduplication is done within one mailbox or across all of a user's mailboxes. The syntax is dedup <user> <mailbox>, for example:
dedup name@domain.cz INBOX
If you want to dedup all mailboxes, enter -A instead of the mailbox name:
dedup name@domain.cz -A
The script is attached. I made it for my own use, so it isn't foolproof. My advice is to work with care and make a backup ;-)
Good luck
#!/bin/bash
# Remove duplicate messages from a mailbox.
# Usage: dedup <user> <mailbox>   (use -A instead of <mailbox> for all mailboxes)

dedup_mailbox() {
    local user=$1 mailbox=$2
    local uids md5s_u=() md5s i j x A B

    uids=( $(doveadm -f flow fetch -u "$user" "uid" mailbox "$mailbox" all | cut -f 2 -d =) )
    if [ ${#uids[@]} -eq 0 ]; then
        echo "  No messages"
        return
    elif [ ${#uids[@]} -eq 1 ]; then
        echo "  Only one message"
        return
    fi

    # Build "hash,uid" entries. awk keeps only the digest, because some
    # openssl versions prefix it with "(stdin)= ".
    for (( i=0; i<${#uids[@]}; i++ )); do
        md5s_u[$i]="$(doveadm -f flow fetch -u "$user" "date.received body" mailbox "$mailbox" uid "${uids[$i]}" | openssl md5 | awk '{print $NF}'),${uids[$i]}"
        echo -en "  Compute hashes: $i/${#uids[@]}(${md5s_u[$i]})\r"
    done
    echo -en "                                                                      \r"

    # Sort so that identical hashes end up next to each other.
    md5s=( $(printf '%s\n' "${md5s_u[@]}" | sort) )

    # Keep the first message of each run of equal hashes, expunge the rest.
    x=0
    i=0
    while [ $i -lt $(( ${#md5s[@]} - 1 )) ]; do
        A=${md5s[$i]%%,*}
        for (( j=i+1; j<${#md5s[@]}; j++ )); do
            B=${md5s[$j]%%,*}
            if [ "$A" = "$B" ]; then
                doveadm expunge -u "$user" mailbox "$mailbox" uid "${md5s[$j]##*,}"
                x=$(( x + 1 ))
            else
                break
            fi
        done
        echo -en "  Expunged $x message(s) from $j/${#md5s[@]}\r"
        i=$j
    done
    echo ""
}

if [ "$2" = "-A" ]; then
    mapfile -t boxes < <(doveadm mailbox list -u "$1")
else
    boxes=( "$2" )
fi

for (( k=0; k<${#boxes[@]}; k++ )); do
    echo "${boxes[$k]}:"
    dedup_mailbox "$1" "${boxes[$k]}"
done
Tim,
oops, I read your message again, more carefully this time, and I see my mistake. You don't want to delete whole duplicate messages, only their duplicate parts. Sorry for my reply, because it is quite off-topic.
Radek
participants (7)
- Angel L. Mateo
- Arnaud Abélard
- Charles Marcus
- Jan-Frode Myklebust
- Radek Novotný
- Tim Groeneveld
- Timo Sirainen