[Dovecot] Duplicate Attachments....
Love Dovecot, many thanks for it!
But, I have a question regarding the storing of emails, with respect to efficiency...
Our business model (advertising industry) is such that our users exchange a lot of emails with attachments - most less than a megabyte, but some considerably larger. Consequently, I have been looking for a good, open source imap server that doesn't store multiple copies of the same attachment - but instead, stores a checksum, and whenever a message is stored with a duplicate attachment, the attachment is stored only once, and simply referenced by some kind of link to other emails.
This would *drastically* reduce the storage requirements for our company - imagine a message with a 10MB attachment, sent to 40 of our users, sometimes more than once. Now multiply this by 3 times per day, for 5 years...
Are there any plans for Dovecot to support this type of storage in the future? Does this require the use of an SQL DB for storing the message components?
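To put rough numbers on the scenario described above, here is a back-of-the-envelope calculation. It uses only the hypothetical figures from this message, plus one assumption that is not in the original: mail goes out 365 days a year. It is an illustration, not a measurement of any real mail store.

```python
# Hypothetical figures taken from the message above.
ATTACHMENT_MB = 10
RECIPIENTS = 40
MAILINGS_PER_DAY = 3
DAYS = 5 * 365  # assumption (not in the original): mail is sent every day


def storage_mb(copies_per_mailing: int) -> int:
    """Total megabytes stored when each mailing keeps this many copies."""
    return ATTACHMENT_MB * copies_per_mailing * MAILINGS_PER_DAY * DAYS


per_recipient = storage_mb(RECIPIENTS)  # one copy in every recipient's mailbox
single_instance = storage_mb(1)         # one shared copy per mailing

print(f"per-recipient copies: {per_recipient / 1_000_000:.2f} TB")
print(f"single-instance:      {single_instance / 1000:.2f} GB")
print(f"reduction factor:     {per_recipient // single_instance}x")
```

Under these assumptions the saving is simply the recipient count (40x), before counting any attachment that is sent more than once.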
--
Best regards,
Charles
On Thu, 2006-06-01 at 07:45 -0400, Charles Marcus wrote:
Our business model (advertising industry) is such that our users exchange a lot of emails with attachments - most less than a megabyte, but some considerably larger. Consequently, I have been looking for a good, open source imap server that doesn't store multiple copies of the same attachment - but instead, stores a checksum, and whenever a message is stored with a duplicate attachment, the attachment is stored only once, and simply referenced by some kind of link to other emails.
This would *drastically* reduce the storage requirements for our company - imagine a message with a 10MB attachment, sent to 40 of our users, sometimes more than once. Now multiply this by 3 times per day, for 5 years...
Are there any plans for Dovecot to support this type of storage in the future? Does this require the use of an SQL DB for storing the message components?
This is planned for dbox format in maybe a couple of months. I think the plan was to do this in deliver agent so that the delivered mail's attachment is shared between the mail's recipients.
I'm not sure if you're suggesting that checksum should be taken from the attachment and it be used to see if it already happens to exist, and if so use it. Actually I'm not sure if that was also what I was supposed to do anyway. :)
I think that could anyway be a good idea, but how about hash collisions? I could just ignore that since they would practically never happen. Hash + attachment size would be even safer. The only truly safe way would be to read the whole attachment from disk and compare it byte-by-byte, but that'd just slow it down unneededly.. Perhaps it should be an option.
Timo Sirainen wrote:
I think that could anyway be a good idea, but how about hash collisions? I could just ignore that since they would practically never happen. Hash + attachment size would be even safer. The only truly safe way would be to read the whole attachment from disk and compare it byte-by-byte, but that'd just slow it down unneededly.. Perhaps it should be an option.
Collisions won't be a big problem if you use something like SHA, but it would be slow. You have to generate a checksum for both sides of the comparison, meaning you have to generate it at least once per message. Generating it always means reading every byte of it.
A better solution might be for the LDA to detect which messages are being delivered locally to more than one user. It could then make the message file shared (in the case of Maildir anyway), for example by hard-linking the files. The message would then exist on disk in one copy until every client has removed it (bringing the link count to 0).
Just an idea.
- Tore
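A minimal sketch of the hard-link idea above, assuming all recipient Maildirs sit on the same filesystem. The function names and the simplified delivery straight into new/ are illustrative assumptions, not how any particular LDA does it.

```python
import os


def deliver_shared(message: bytes, maildirs: list[str], filename: str) -> list[str]:
    """Write one copy of the message, then hard-link it into the other Maildirs.

    Simplified: a real delivery agent would write to tmp/ and rename into new/,
    and os.link() only works when all Maildirs are on the same filesystem.
    """
    if not maildirs:
        return []
    paths = []
    for maildir in maildirs:
        new_dir = os.path.join(maildir, "new")
        os.makedirs(new_dir, exist_ok=True)
        paths.append(os.path.join(new_dir, filename))

    with open(paths[0], "wb") as first_copy:
        first_copy.write(message)
    for other in paths[1:]:
        os.link(paths[0], other)  # another name for the same data on disk
    return paths


def link_count(path: str) -> int:
    """Number of directory entries that currently point at this file's data."""
    return os.stat(path).st_nlink
```

Each recipient deleting "their" message just removes one name; the filesystem frees the data only when the last link is gone.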
On Thu, 2006-06-01 at 09:10 -0400, Tore André Klock wrote:
Timo Sirainen wrote:
I think that could anyway be a good idea, but how about hash collisions? I could just ignore that since they would practically never happen. Hash + attachment size would be even safer. The only truly safe way would be to read the whole attachment from disk and compare it byte-by-byte, but that'd just slow it down unneededly.. Perhaps it should be an option.
Collisions won't be a big problem if you use something like SHA, but it would be slow. You have to generate a checksum for both sides of the comparison, meaning you have to generate it at least once per message. Generating it always means reading every byte of it.
The delivered mail's every byte has to be read anyway, and for the stored attachment the filename would already contain the checksum. I don't think it takes too much extra time to calculate the attachment's checksum while it's being read.
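As a rough illustration of computing the checksum while the attachment is being read, the sketch below spools a stream to disk, hashes it on the way through, and files it under a name built from the SHA-1 and the size. The function name, the "<sha1>-<size>" naming scheme and the store directory are assumptions made for the example; this is not Dovecot's actual layout, and no locking is attempted.

```python
import hashlib
import os
import tempfile
from typing import BinaryIO


def store_attachment(stream: BinaryIO, store_dir: str, chunk_size: int = 64 * 1024) -> str:
    """Spool the stream to disk, hashing as we go; return the stored file's path."""
    os.makedirs(store_dir, exist_ok=True)
    digest = hashlib.sha1()
    size = 0
    fd, tmp_path = tempfile.mkstemp(dir=store_dir)
    with os.fdopen(fd, "wb") as tmp:
        while chunk := stream.read(chunk_size):
            digest.update(chunk)  # hash the same bytes we are writing anyway
            size += len(chunk)
            tmp.write(chunk)

    # Hypothetical naming scheme: checksum plus size, so a lookup is one stat().
    final_path = os.path.join(store_dir, f"{digest.hexdigest()}-{size}")
    if os.path.exists(final_path):
        os.unlink(tmp_path)  # an attachment with this hash and size is already stored
    else:
        os.rename(tmp_path, final_path)
    return final_path
```

Because the stored file's name already carries its checksum, the existing copy never has to be re-read to decide whether a new attachment is a duplicate.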
A better solution might be for the LDA to detect which messages are being delivered locally to more than one user.
That's what I was originally thinking instead of checksums.
It could then make the message file shared (in the case of Maildir anyway), for example by hard-linking the files. The message would then exist on disk in one copy until every client has removed it (bringing the link count to 0).
One problem that I see with hardlinking maildir files is that then you can't have Delivered-To (or similar) header separate for each user. I don't know if that's a real problem though..
Timo Sirainen wrote:
One problem that I see with hardlinking maildir files is that then you can't have Delivered-To (or similar) header separate for each user. I don't know if that's a real problem though..
I jumped the gun on this one, thinking I saw a simple solution. Too simple to be true :) A possible modification would be to only treat attachments that way, but then you're getting real close to the MIMEDefang solution anyway.
The solution I like best so far in this thread is the MIMEDefang solution Steffen is using. Oh, and maybe teaching your users to use a file server or FTP site (kidding).
- Tore
Tore André Klock wrote:
Timo Sirainen wrote:
One problem that I see with hardlinking maildir files is that then you can't have Delivered-To (or similar) header separate for each user. I don't know if that's a real problem though..
I jumped the gun on this one, thinking I saw a simple solution. Too simple to be true :) A possible modification would be to only treat attachments that way, but then you're getting real close to the MIMEDefang solution anyway.
The solution I like best so far in this thread is the MIMEDefang solution Steffen is using.
I have considered MIMEDefang, but I prefer not to alter the original message (adding a hyperlink, and descriptive text about the removed attachment). Directly supporting this in the IMAP server is the ideal way to go, imho...
Oh, and maybe teaching your users to use a file server or FTP site (kidding).
One can only dream... ;)
--
Best regards,
Charles
On 6/1/06, Tore André Klock klock@beacon.com wrote:
A better solution might be for the LDA to detect which messages are being delivered locally to more than one user. It could then make the message file shared (in the case of Maildir anyway), for example by hard-linking the files. The message would then exist on disk in one copy until every client has removed it (bringing the link count to 0).
This is almost exactly what I've been looking for and designing for official mail-outs to multiple users on the system. I was going to try putting a message in a shared box (using ACLs to control who could write to that box), and then having a runner check the box once an hour or so. The runner would check the specially crafted To: header, do a DB lookup, and then (after moving the message to a temporary processing box) start runners hardlinking the file into the correct folder of each recipient's Maildir. The idea is to create a sort of mailing list for official mail-outs to the different sections of the organisation, e.g. To: accounting-group@host, committe-group@host, with each group expanding into possibly hundreds of users based on a lookup, and only 1 copy stored on the system.
I think for what I'm doing, I might still have to implement the runner system or try and generalise the groups more, and use shared folders. But still, if a user sends the same email to 50 people, having dovecot-lda detect this, and only store one copy, would be a bonus!
Tim
Linux Counter user #273956
On Thu, 1 Jun 2006, Timo Sirainen wrote:
On Thu, 2006-06-01 at 07:45 -0400, Charles Marcus wrote:
Our business model (advertising industry) is such that our users exchange a lot of emails with attachments - most less than a megabyte, but some considerably larger. Consequently, I have been looking for a good, open source imap server that doesn't store multiple copies of the same attachment - but instead, stores a checksum, and whenever a message is stored with a duplicate attachment, the attachment is stored only once, and simply referenced by some kind of link to other emails.
We do this by mangling the mail during submission via MIMEDefang from Roaring Penguin (http://www.roaringpenguin.com/penguin/open_source_mimedefang.php), using action_replace_with_url(). The attachment is then spooled on the host and accessed via a URL (http:). BTW, the filename is the SHA1 of the content.
Bye,
-- Steffen Kaiser
Timo Sirainen wrote:
On Thu, 2006-06-01 at 07:45 -0400, Charles Marcus wrote:
I have been looking for a good, open source imap server that doesn't store multiple copies of the same attachment - but instead, stores a checksum, and whenever a message is stored with a duplicate attachment, the attachment is stored only once, and simply referenced by some kind of link to other emails.
This is planned for dbox format in maybe a couple of months. I think the plan was to do this in deliver agent so that the delivered mail's attachment is shared between the mail's recipients.
Very good to hear! Were you planning to support this with both dbox storage options ('one mail per file' and 'multiple mails per file')?
I'm not sure if you're suggesting that checksum should be taken from the attachment and it be used to see if it already happens to exist, and if so use it. Actually I'm not sure if that was also what I was supposed to do anyway. :)
That is the way I had imagined it working - but of course, what is possible in my imagination and what is possible in reality almost always collide head on with a resulting explosion on a par with a supernova... ;)
I think that could anyway be a good idea, but how about hash collisions? I could just ignore that since they would practically never happen. Hash + attachment size would be even safer.
Sounds great to me. I cannot 'imagine' the odds of both a hash collision AND an exact duplicate size at the same time, but there goes my imagination again...
The only truly safe way would be to read the whole attachment from disk and compare it byte-by-byte, but that'd just slow it down unneededly.. Perhaps it should be an option.
As one who likes options, if this isn't that hard to do, then yes - and maybe you could even have this be some kind of background process that occurs, or a nightly 'clean-up' job.
For example - store the attachments individually when they first come in, then every night at 3:00am, do a precise comparison on all of the attachments that came in that day and delete_duplicate->add_link on all duplicates found.
This tool could also be extended and used as a 'conversion' tool, to run on an existing mailstore.
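The nightly clean-up idea above could look roughly like the sketch below: group files by size, hash only the files that share a size, confirm with a byte-by-byte comparison, and replace each confirmed duplicate with a hard link. It assumes a flat directory of one-file-per-attachment on a single filesystem; the function names and the ".dedup-tmp" suffix are invented for the example, and nothing here reflects an existing Dovecot tool.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_sha1(path: str, chunk_size: int = 64 * 1024) -> str:
    """SHA-1 of a file's contents, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe(directory: str) -> int:
    """Replace byte-identical files with hard links; return how many were replaced."""
    by_size = defaultdict(list)
    for entry in sorted(os.scandir(directory), key=lambda e: e.name):
        if entry.is_file(follow_symlinks=False):
            by_size[entry.stat().st_size].append(entry.path)

    replaced = 0
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        keepers = {}  # sha1 -> first path seen with that checksum
        for path in same_size:
            keeper = keepers.setdefault(file_sha1(path), path)
            if keeper == path or os.path.samefile(keeper, path):
                continue  # first of its kind, or already linked together
            if not filecmp.cmp(keeper, path, shallow=False):
                continue  # same size and hash but different bytes: leave alone
            tmp = path + ".dedup-tmp"  # hypothetical temporary name
            os.link(keeper, tmp)
            os.replace(tmp, path)  # swap the duplicate for the link in one step
            replaced += 1
    return replaced
```

Run against an existing store, the same pass doubles as the 'conversion' tool; run nightly, it only has real work to do on files that arrived since the last pass.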
Wow, now I'm getting excited, imagining our current 150GB+ storage being reduced to 1GB or less... !!!
--
Best regards,
Charles
On Thu, 2006-06-01 at 09:18 -0400, Charles Marcus wrote:
This is planned for dbox format in maybe a couple of months. I think the plan was to do this in deliver agent so that the delivered mail's attachment is shared between the mail's recipients.
Very good to hear! Were you planning to support this with both dbox storage options ('one mail per file' and 'multiple mails per file')?
Well, the dbox format itself is built so that it always supports having multiple mails in a file. It's just that you can configure it not to put more than one message into a file. So they aren't separate options really. :)
But I guess I should prepare for this single storage extension in the dbox format already now before it becomes more widely used..
For example - store the attachments individually when they first come in, then every night at 3:00am, do a precise comparison on all of the attachments that came in that day and delete_duplicate->add_link on all duplicates found.
That could be a possibility too. Although that way delivery would use more disk I/O than really needed for the shared attachments.
For example - store the attachments individually when they first come in, then every night at 3:00am, do a precise comparison on all of the attachments that came in that day and delete_duplicate->add_link on all duplicates found.
That could be a possibility too. Although that way delivery would use more disk I/O than really needed for the shared attachments.
As long as it wasn't permanent, I really don't see that as an issue, but of course, *ideally* if it could happen at delivery time that would obviously be best.
But, I still really like the idea of being able to process an existing mailstore (since we have a huge one), rather than only processing new messages.
What are the chances of providing for both? Meaning, 'process on delivery', and/or 'delayed process' (ie nightly) / 'process existing' (process an existing mailstore)?
Maybe even complicate it, and provide a way of testing for how busy the server is, and if it is too busy when a message comes in, delay processing, but when it isn't very busy, process immediately?
;)
--
Best regards,
Charles
On Thu, June 1, 2006 8:37, Timo Sirainen said:
On Thu, 2006-06-01 at 07:45 -0400, Charles Marcus wrote:
Our business model (advertising industry) is such that our users exchange a lot of emails with attachments - most less than a megabyte, but some considerably larger. Consequently, I have been looking for a good, open source imap server that doesn't store multiple copies of the same attachment - but instead, stores a checksum, and whenever a message is stored with a duplicate attachment, the attachment is stored only once, and simply referenced by some kind of link to other emails.
This would *drastically* reduce the storage requirements for our company - imagine a message with a 10MB attachment, sent to 40 of our users, sometimes more than once. Now multiply this by 3 times per day, for 5 years...
Are there any plans for Dovecot to support this type of storage in the future? Does this require the use of an SQL DB for storing the message components?
This is planned for dbox format in maybe a couple of months. I think the plan was to do this in deliver agent so that the delivered mail's attachment is shared between the mail's recipients.
How would you know when all users have deleted an email that has a shared attachment so that you can safely delete the shared attachment file?
[As I was typing this question I thought of this]... Perhaps an integer associated with the file that represents the number of people pointing to the shared attachment. Decrement the integer when each user deletes the related email. Delete the shared attachment file when the integer reaches 0.
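The counter idea above can be sketched as follows. The ".refs" sidecar file and the function names are assumptions chosen for illustration; a real implementation would also need locking so that two deliveries or deletions cannot update the count at the same time.

```python
import os


def _count_path(attachment: str) -> str:
    """Hypothetical sidecar file that holds the reference count."""
    return attachment + ".refs"


def read_refs(attachment: str) -> int:
    """Current number of mails pointing at the attachment (0 if never counted)."""
    try:
        with open(_count_path(attachment)) as f:
            return int(f.read())
    except FileNotFoundError:
        return 0


def _write_refs(attachment: str, count: int) -> None:
    with open(_count_path(attachment), "w") as f:
        f.write(str(count))


def add_ref(attachment: str) -> int:
    """Call when a mail that uses the attachment is delivered."""
    count = read_refs(attachment) + 1
    _write_refs(attachment, count)
    return count


def release(attachment: str) -> int:
    """Call when a user deletes such a mail; removes the file at zero."""
    count = read_refs(attachment) - 1
    if count <= 0:
        os.unlink(attachment)
        if os.path.exists(_count_path(attachment)):
            os.unlink(_count_path(attachment))
        return 0
    _write_refs(attachment, count)
    return count
```

With hard links on a one-message-per-file store the filesystem keeps this count for you; an explicit counter like this is only needed when the storage format cannot rely on link counts.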
I like this concept. Is dbox ready for production in its current state? (obviously without this proposed feature)
Bill
On Thu, 2006-06-01 at 20:31, Bill Boebel wrote:
Our business model (advertising industry) is such that our users exchange a lot of emails with attachments - most less than a megabyte, but some considerably larger. Consequently, I have been looking for a good, open source imap server that doesn't store multiple copies of the same attachment - but instead, stores a checksum, and whenever a message is stored with a duplicate attachment, the attachment is stored only once, and simply referenced by some kind of link to other emails.
This would *drastically* reduce the storage requirements for our company - imagine a message with a 10MB attachment, sent to 40 of our users, sometimes more than once. Now multiply this by 3 times per day, for 5 years...
Are there any plans for Dovecot to support this type of storage in the future? Does this require the use of an SQL DB for storing the message components?
This is planned for dbox format in maybe a couple of months. I think the plan was to do this in deliver agent so that the delivered mail's attachment is shared between the mail's recipients.
How would you know when all users have deleted an email that has a shared attachment so that you can safely delete the shared attachment file?
One approach is to use maildir or a similar one-message-per-file storage format and have the delivery agent make hardlinks in the filesystem for each copy. Unix filesystem semantics ensure that the data won't go away until the last link is deleted. I think Cyrus has an option to work that way.
-- Les Mikesell lesmikesell@gmail.com
Bill Boebel wrote:
I like this concept.
As do I.
Is dbox ready for production in its current state? (obviously without this proposed feature)
Not yet on a large scale, but it is getting there. I don't know the number of people running dbox (in the lab etc.) at the moment, but every new tester is welcome at this point.
Tomi
(06.06.01 kl.07:45) Charles Marcus skrev följande till dovecot@dovecot.org:
Love Dovecot, many thanks for it!
But, I have a question regarding the storing of emails, with respect to efficiency...
Our business model (advertising industry) is such that our users exchange a lot of emails with attachments - most less than a megabyte, but some considerably larger. Consequently, I have been looking for a good, open source imap server that doesn't store multiple copies of the same attachment - but instead, stores a checksum, and whenever a message is stored with a duplicate attachment, the attachment is stored only once, and simply referenced by some kind of link to other emails.
This would *drastically* reduce the storage requirements for our company - imagine a message with a 10MB attachment, sent to 40 of our users, sometimes more than once. Now multiply this by 3 times per day, for 5 years...
Are there any plans for Dovecot to support this type of storage in the future? Does this require the use of an SQL DB for storing the message components?
I thought I'd just mention aradis which can extract and replace the attachment. The attachment itself can then be delivered to a webserver, the filesystem or a custom script.
http://robur.slu.se/jensl/aradis
(I am the author of aradis and we use it mainly for mailinglists.:-)
Cheers, Jens
'In theory, there is no difference between theory and practice.
But, in practice, there is.'
Jens Låås Email: jens.laas@data.slu.se
Department of Computer Services, SLU Phone: +46 18 67 35 15
Vindbrovägen 1
P.O. Box 7079
S-750 07 Uppsala
SWEDEN
participants (9)
- Bill Boebel
- Charles Marcus
- Jens Laas
- Les Mikesell
- Steffen Kaiser
- Timo Sirainen
- Timothy White
- Tomi Hakala
- Tore André Klock