Timo Sirainen wrote:
On Thu, 2006-06-01 at 07:45 -0400, Charles Marcus wrote:
I have been looking for a good, open source IMAP server that doesn't store multiple copies of the same attachment. Instead, it would store a checksum, and whenever a message arrives with a duplicate attachment, the attachment is stored only once and simply referenced by some kind of link from the other emails.
This is planned for dbox format in maybe a couple of months. I think the plan was to do this in deliver agent so that the delivered mail's attachment is shared between the mail's recipients.
Very good to hear! Were you planning to support this with both dbox storage options ('one mail per file' and 'multiple mails per file')?
I'm not sure if you're suggesting that checksum should be taken from the attachment and it be used to see if it already happens to exist, and if so use it. Actually I'm not sure if that was also what I was supposed to do anyway. :)
That is the way I had imagined it working - but of course, what is possible in my imagination and what is possible in reality almost always collide head on with a resulting explosion on a par with a supernova... ;)
I think that could anyway be a good idea, but how about hash collisions? I could just ignore that since they would practically never happen. Hash + attachment size would be even safer.
Sounds great to me. I cannot 'imagine' the odds of both a hash collision AND an exact duplicate size at the same time, but there goes my imagination again...
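To make the hash + size idea concrete, here is a minimal sketch. It assumes SHA-256 as the checksum and a plain in-memory dict as the index; neither choice comes from Dovecot or dbox, and `AttachmentStore` and `put` are hypothetical names used only for illustration.

```python
import hashlib


class AttachmentStore:
    """Hypothetical in-memory store keyed by (SHA-256 digest, size)."""

    def __init__(self):
        self.blobs = {}     # (digest, size) -> attachment bytes, stored once
        self.refcount = {}  # (digest, size) -> number of mails referencing it

    def put(self, data: bytes):
        # The key combines the checksum with the attachment size, so two
        # attachments only share storage if both values match.
        key = (hashlib.sha256(data).hexdigest(), len(data))
        if key not in self.blobs:
            self.blobs[key] = data
        self.refcount[key] = self.refcount.get(key, 0) + 1
        return key


store = AttachmentStore()
k1 = store.put(b"quarterly report")
k2 = store.put(b"quarterly report")  # duplicate: reuses the stored copy
k3 = store.put(b"different file")
```

Each mail would keep only the returned key as its link to the attachment; the reference count is there so a shared copy is not removed while some mail still points at it.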
The only truly safe way would be to read the whole attachment from disk and compare it byte-by-byte, but that'd just slow it down unnecessarily. Perhaps it should be an option.
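The byte-by-byte check could look roughly like the sketch below. `same_content` and its `chunk_size` parameter are hypothetical names, and it assumes both attachments are regular files on disk; it is only meant to show why this option costs a full read of both files when they really are identical.

```python
import os


def same_content(path_a, path_b, chunk_size=64 * 1024):
    """Return True only if both files hold exactly the same bytes."""
    # Cheap check first: files of different size can never be equal.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False  # first differing chunk ends the comparison
            if not a:
                return True   # both files reached end-of-file together
```

Reading in chunks keeps memory use constant no matter how large the attachment is, and a mismatch stops the comparison early.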
As one who likes options, if this isn't that hard to do, then yes - and maybe you could even have this be some kind of background process that occurs, or a nightly 'clean-up' job.
For example - store the attachments individually when they first come in, then every night at 3:00am, do a precise comparison on all of the attachments that came in that day and delete_duplicate->add_link on all duplicates found.
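A rough sketch of such a nightly pass, under several assumptions that are not part of the original discussion: attachments are stored as individual files in one directory, the "link" is a filesystem hard link (so everything must live on the same filesystem), and SHA-256 is the checksum. `file_digest` and `dedup_directory` are hypothetical names.

```python
import filecmp
import hashlib
import os


def file_digest(path, chunk_size=64 * 1024):
    """SHA-256 of a file, read in chunks to keep memory use constant."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dedup_directory(directory):
    """Replace exact duplicates with hard links; return how many were replaced."""
    first_seen = {}  # (size, digest) -> path of the copy we keep
    replaced = 0
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        key = (os.path.getsize(path), file_digest(path))
        keeper = first_seen.setdefault(key, path)
        if keeper == path or os.path.samefile(keeper, path):
            continue  # first copy, or already linked on an earlier run
        # Precise comparison: shallow=False compares the actual contents.
        if filecmp.cmp(keeper, path, shallow=False):
            tmp = path + ".tmp-link"
            os.link(keeper, tmp)    # add_link
            os.replace(tmp, path)   # delete_duplicate, swapped in atomically
            replaced += 1
    return replaced
```

Scheduling it for 3:00am would be left to something like cron. Pointing the same function at an existing attachment directory is roughly what the 'conversion' use would amount to.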
This tool could also be extended and used as a 'conversion' tool, to run on an existing mailstore.
Wow, now I'm getting excited, imagining our current 150GB+ storage being reduced to 1GB or less... !!!
--
Best regards,
Charles