On Mon, 2008-06-02 at 23:25 +0100, Ed W wrote:
Hi
- deliver: Added -c parameter to provide path to delivered mail. This allows maildir to save identical mails to multiple recipients using hard links.
Funnily enough, it was on my todo list to whip up a small Perl program to scan my maildirs and figure out whether this theoretical idea actually amounts to anything.
Algorithm would be this:
Open each message and scan for the first blank line. SHA the rest of the message and store the SHA in a hash (along with the message size). Rinse and repeat, then see whether we end up with any hashes showing a count greater than 1...
This would represent the best case we could achieve, assuming the body content is fixed and we find some way to manage the variable headers.
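
A rough sketch of the scan described above, in Python rather than Perl. It assumes SHA-1 (the text only says "SHA"), a standard maildir layout with `cur/` and `new/` subdirectories, and messages small enough to read into memory. The function names are made up for illustration.

```python
import hashlib
import os
from collections import defaultdict


def body_of(raw: bytes) -> bytes:
    """Return the bytes after the first blank line (header/body separator)."""
    hits = [(raw.find(sep), len(sep)) for sep in (b"\n\n", b"\r\n\r\n")]
    hits = [(pos, n) for pos, n in hits if pos != -1]
    if not hits:
        return b""  # no blank line found: treat the file as headers only
    pos, n = min(hits)  # earliest separator wins
    return raw[pos + n:]


def duplicate_bodies(root: str) -> dict:
    """Map (sha1 hex digest, body size) -> paths, for bodies seen more than once."""
    seen = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        # Assumption: only cur/ and new/ hold messages; skip index files and tmp/.
        if os.path.basename(dirpath) not in ("cur", "new"):
            continue
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                body = body_of(fh.read())
            seen[(hashlib.sha1(body).hexdigest(), len(body))].append(path)
    return {key: paths for key, paths in seen.items() if len(paths) > 1}


def bytes_saved(dups: dict) -> int:
    """Best-case saving if every duplicated body were stored only once."""
    return sum(size * (len(paths) - 1) for (_digest, size), paths in dups.items())
```

Calling `bytes_saved(duplicate_bodies("/path/to/Maildir"))` would give the upper bound in bytes; the path here is a placeholder.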
A somewhat faster way would be to get a list of file sizes first and not bother checksumming any files which have a unique size.
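
A minimal sketch of the size-first shortcut, again in Python with SHA-1 as an assumed choice. One caveat that is my reading, not something stated above: file size includes the headers, so as written this only finds messages that are identical in full (the hardlink case). For body-only matching the filter would need the body length instead. `identical_files` is a hypothetical name.

```python
import hashlib
import os
from collections import defaultdict


def identical_files(paths):
    """Group whole-file duplicates, hashing only files whose size is shared."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    by_digest = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # unique size: cannot match anything, so skip the checksum
        for path in group:
            h = hashlib.sha1()
            with open(path, "rb") as fh:
                # Read in 64 KiB chunks so large messages are not held in memory.
                for chunk in iter(lambda: fh.read(1 << 16), b""):
                    h.update(chunk)
            by_digest[(size, h.hexdigest())].append(path)
    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```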
Next up is to use a MIME parser and SHA each message part. Same idea: assuming we used some kind of format that stores each part individually, how much gain is this really worth in terms of storage? It looks tempting up front (condense all those duplicated jokes, etc.), but does it really bear out in practice?
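
The per-part variant could be sketched with Python's standard `email` package. This is only an illustration under assumptions: it hashes the decoded payload of each leaf part with SHA-1, so the same attachment matches even if its base64 line wrapping differs, and it ignores whatever overhead a real per-part storage format would add. The function names are hypothetical.

```python
import hashlib
from collections import Counter
from email import message_from_bytes


def part_digests(raw: bytes):
    """Yield (sha1 hex digest, size) for each non-multipart MIME part."""
    msg = message_from_bytes(raw)
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no payload of their own
        payload = part.get_payload(decode=True) or b""
        yield hashlib.sha1(payload).hexdigest(), len(payload)


def duplicate_part_savings(messages) -> int:
    """Best-case bytes saved if each distinct part were stored only once."""
    counts = Counter()
    for raw in messages:
        counts.update(part_digests(raw))
    return sum(size * (n - 1) for (_digest, size), n in counts.items() if n > 1)
```

Here `messages` is any iterable of raw message bytes, for example the contents of the files under `cur/` and `new/`.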
This is in my dbox TODO list (not near future though).
I think MS Exchange only does single-instance storage like you describe here, with delivery-time hardlinking of messages? I never analysed what that was worth (back when I had an Exchange system to fiddle with...)
No idea about Exchange, but dbmail 2.3 does single-instance storage of MIME parts.
I have a feeling that gzip compression of files would be worth more than this hardlinking (on many but not all mail systems...)
Or you could use both. The zlib plugin already supports this with maildir.
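
To compare the two ideas on a real mail store, one could estimate what plain gzip would save. The sketch below uses Python's `gzip` module at its default level. It measures generic gzip compression only and makes no claim about how the zlib plugin itself stores or configures compressed files. `gzip_savings` is a hypothetical helper.

```python
import gzip


def gzip_savings(paths):
    """Return (raw_bytes, compressed_bytes) totals for the given files."""
    raw_total = 0
    packed_total = 0
    for path in paths:
        with open(path, "rb") as fh:
            data = fh.read()
        raw_total += len(data)
        # Compress each message on its own, as a per-file scheme would.
        packed_total += len(gzip.compress(data))
    return raw_total, packed_total
```

The difference between the two totals can then be set against the best-case figure from a duplicate scan.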