On Tue, Jun 03, 2008 at 10:27:32AM +0200, Jost Krieger wrote:
On Tue, Jun 03, 2008 at 07:11:33AM +0100, Ed W wrote:
Could do, but I was trying to cover the case where the headers differ but the body is the same (e.g. I suspect that mailing list managers might deliver emails one by one (VERP), but the body is not customised). Anyway, I just wanted to checksum the body of the message, not the whole message.
That could lead to slight problems, like hardlinking totally unrelated messages, e.g. empty messages. Some headers, like From:, To:, Date:, and Subject:, should probably be identical as well.
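A minimal sketch of the body-only checksum idea, assuming the raw message is available as bytes and that the body starts after the first blank line. The function name `body_digest` and the choice of SHA-256 are illustrative assumptions, not something proposed in this thread.

```python
import hashlib
import re

# The header block ends at the first empty line (CRLF or bare LF).
_HEADER_END = re.compile(rb"\r?\n\r?\n")


def body_digest(raw_message: bytes) -> str:
    """Return a SHA-256 hex digest of the message body only."""
    match = _HEADER_END.search(raw_message)
    # No blank line means the message has headers only, so the body is empty.
    body = raw_message[match.end():] if match else b""
    return hashlib.sha256(body).hexdigest()
```

Two copies of a list posting that differ only in their headers would then share a digest, and so would any two messages with empty bodies.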
Message-ID perhaps? :-)
For some consistency, just removing *locally* generated trace headers before fingerprinting might lead to better results.
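One way to read the suggestions above is to fingerprint the body together with a few headers that should not change in transit, and to ignore trace headers such as Received:. The sketch below assumes that reading; the header list `STABLE_HEADERS` and the function name `fingerprint` are hypothetical, and real messages may need more careful normalisation (e.g. of header folding).

```python
import hashlib
import re
from email import message_from_bytes

# Assumed set of headers that identify a message; trace headers are left out.
STABLE_HEADERS = ("Message-ID", "From", "To", "Date", "Subject")
_HEADER_END = re.compile(rb"\r?\n\r?\n")


def fingerprint(raw_message: bytes) -> str:
    """Hash the selected headers plus the body, ignoring all other headers."""
    msg = message_from_bytes(raw_message)
    digest = hashlib.sha256()
    for name in STABLE_HEADERS:
        for value in msg.get_all(name, []):
            line = f"{name.lower()}:{str(value).strip()}\n"
            digest.update(line.encode("utf-8", errors="replace"))
    digest.update(b"\n")
    match = _HEADER_END.search(raw_message)
    digest.update(raw_message[match.end():] if match else b"")
    return digest.hexdigest()
```

With this, two empty messages with different Message-IDs get different fingerprints, while copies that differ only in their Received: lines still match.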
That may still leave identical messages not hard-linked, thus wasting space, e.g. if they come from MTAs that do recipient splitting, or if the messages are routed via different systems. The Received headers will be different, but the body is generally identical.
I think a better solution is what was suggested here before, i.e. to keep the (unique) message headers in a Maildir-like format, containing links to (single-instance stored) message bodies in a separate location.
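A rough sketch of such a layout, under assumptions that are not from this thread: bodies are stored once under `bodies/<sha256>`, each message gets its own header file under `headers/`, and the link is a hypothetical `X-Body-Ref:` line appended to the header file. It does not handle concurrent delivery or atomic writes.

```python
import hashlib
import re
from pathlib import Path

_HEADER_END = re.compile(rb"\r?\n\r?\n")


def store_message(raw_message: bytes, root: Path, name: str) -> Path:
    """Store headers per message and the body once, keyed by its digest."""
    match = _HEADER_END.search(raw_message)
    headers = raw_message[:match.start()] if match else raw_message.rstrip(b"\r\n")
    body = raw_message[match.end():] if match else b""

    body_dir = root / "bodies"
    header_dir = root / "headers"
    body_dir.mkdir(parents=True, exist_ok=True)
    header_dir.mkdir(parents=True, exist_ok=True)

    # Single-instance storage: write the body only if this digest is new.
    digest = hashlib.sha256(body).hexdigest()
    body_path = body_dir / digest
    if not body_path.exists():
        body_path.write_bytes(body)

    # Hypothetical link header pointing at the shared body file.
    header_path = header_dir / name
    header_path.write_bytes(headers + b"\nX-Body-Ref: " + digest.encode("ascii") + b"\n")
    return header_path
```

Storing two copies that differ only in their Received: lines then yields two header files and a single body file.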
Geert