Timo Sirainen wrote:
On Mon, 2008-06-02 at 23:25 +0100, Ed W wrote:
Hi
- deliver: Added -c parameter to provide path to delivered mail. This allows maildir to save identical mails to multiple recipients using hard links.
Funnily enough it was on my todo list to whip up a small perl program to go and scan my maildirs and figure out if this theoretical idea actually amounted to anything.
Algorithm would be this:
Open each message and scan for the first blank line. SHA the rest of the message and store the SHA in a hash (along with the message size). Rinse and repeat, then see if we end up with any hashes showing a count greater than 1...
This would represent the best case that we could achieve assuming body content fixed and we find some way to manage variable headers.
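A rough sketch of that scan, in Python rather than Perl. It assumes a standard Maildir layout with messages in cur/ and new/, and treats everything after the first blank line as the body. The names body_digest and scan are made up for illustration, and SHA-1 is just one choice of "SHA":

```python
import hashlib
import os
from collections import defaultdict


def body_digest(path):
    """Return (SHA-1 hex digest, size in bytes) of everything after the first blank line."""
    sha = hashlib.sha1()
    size = 0
    in_body = False
    with open(path, "rb") as f:
        for line in f:
            if in_body:
                sha.update(line)
                size += len(line)
            elif line in (b"\n", b"\r\n"):
                in_body = True  # the first blank line ends the headers
    return sha.hexdigest(), size


def scan(root):
    """Map (digest, body size) -> list of message paths found under root."""
    seen = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        # Assumption: only cur/ and new/ hold delivered messages; skip tmp/ and index files.
        if os.path.basename(dirpath) not in ("cur", "new"):
            continue
        for name in files:
            path = os.path.join(dirpath, name)
            seen[body_digest(path)].append(path)
    return seen


def potential_savings(seen):
    """Bytes saved if each duplicated body were stored only once."""
    return sum(size * (len(paths) - 1) for (_digest, size), paths in seen.items())
```

Calling scan() on the top of a mail store and keeping the entries with more than one path gives the duplicate bodies; potential_savings() gives the best-case byte count.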
Somewhat faster way would be to get a list of file sizes first and not bother checksumming any files which have a unique size.
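A minimal sketch of that shortcut, assuming whole files are being compared (the size check only rules out duplicates when the entire file must match). The function name identical_files is hypothetical:

```python
import hashlib
import os
from collections import defaultdict


def identical_files(paths):
    """Group paths with identical content, hashing only files whose size is shared."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # unique size: no possible duplicate, so skip the checksum
        by_digest = defaultdict(list)
        for path in same_size:
            with open(path, "rb") as f:
                by_digest[hashlib.sha1(f.read()).hexdigest()].append(path)
        groups.extend(g for g in by_digest.values() if len(g) > 1)
    return groups
```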
Could do, but I was trying to expand to the case where the headers are different but the body is the same (e.g. I suspect that mailing list managers might deliver emails one by one (VERP), but the body is not customised). Anyway, I just wanted to checksum the body of the message, not the whole message.
Actually the motivation for this was I was wondering about the benefit of a storage backend where the body was stored per file and the headers were stored separately (perhaps in a maildir type format). I haven't looked to see if this is what dbox does already...
I have been looking at git and brackup for backing up maildirs, and it's got me thinking a bit more about mail storage algorithms.
Ed W