On Mon, 2010-07-19 at 19:49 +0100, William Blunn wrote:
>> I thought about that also, but it would require calculating and using a hash of the decoded message (but not the compressed message). Could get complex.
> BTW I am not attempting to suggest a complete system for de-duplication, but rather to suggest a means by which it could be arranged that file contents became identical so that "something else" could de-duplicate them elsehow.
> I would be interested to know what the hash you mention is needed for.
If you rely on the filesystem's deduplication, nothing. But if Dovecot does SIS (single-instance storage) internally, it needs the hash to check whether the attachment is already stored.
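To illustrate why internal SIS needs a hash, here is a minimal sketch of a store that uses the hash of the decoded attachment as its lookup key. The `AttachmentStore` class, the choice of SHA-256, and the in-memory dictionaries are all assumptions made for illustration, not Dovecot's actual design.

```python
import hashlib


class AttachmentStore:
    """Hypothetical single-instance store keyed by content hash."""

    def __init__(self):
        self._blobs = {}     # hex digest -> attachment bytes
        self._refcount = {}  # hex digest -> number of messages using it

    def put(self, data: bytes) -> str:
        """Store data once; later identical puts only bump the refcount."""
        key = hashlib.sha256(data).hexdigest()
        if key not in self._blobs:
            self._blobs[key] = data
        self._refcount[key] = self._refcount.get(key, 0) + 1
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def refcount(self, key: str) -> int:
        return self._refcount.get(key, 0)

    def stored_count(self) -> int:
        return len(self._blobs)
```

Without the hash, the only way to find an existing copy would be to compare the new attachment against every stored one.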
> Also I would be interested to know why the hash of the fragment of the original message stream (regardless of base64 decodeability) would not be sufficient.
If two users have the same file but with different base64 encodings (for example, different line lengths), then the hashes of the encoded streams are different and Dovecot can't do SIS.
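A small sketch of the problem, assuming SHA-256 as the hash and two senders that wrap base64 lines at 76 and 64 characters (both choices are just illustrative): the encoded streams hash differently, while the decoded content hashes the same.

```python
import base64
import hashlib


def wrap(text: str, width: int) -> bytes:
    """Split text into CRLF-separated lines of at most `width` characters."""
    lines = [text[i:i + width] for i in range(0, len(text), width)]
    return "\r\n".join(lines).encode("ascii")


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


payload = bytes(range(256)) * 4  # stand-in for an attachment body
encoded = base64.b64encode(payload).decode("ascii")

# Same attachment, encoded by two hypothetical mail clients.
stream_a = wrap(encoded, 76)
stream_b = wrap(encoded, 64)

# Hashing the raw message fragments: the two copies look unrelated.
raw_hashes_differ = sha256_hex(stream_a) != sha256_hex(stream_b)

# Hashing the decoded content: the two copies are recognised as identical.
# b64decode() discards the CRLF characters by default (validate=False).
decoded_a = base64.b64decode(stream_a)
decoded_b = base64.b64decode(stream_b)
decoded_hashes_match = sha256_hex(decoded_a) == sha256_hex(decoded_b)
```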
>> I was thinking about adding some small header to the dbox file, so they wouldn't be completely identical.
> Though that is kind of the point. If everything in the small header can go somewhere else then the small header can go away and we can store the attachment very literally.
> What kind of things are you thinking to put in the small header?
I was thinking it would be nice to be able to compress attachments after they've already been delivered. For example, keep the attachments decoded for a few weeks and then compress them, similar to how some people do it with Maildir. This can't work without a small header, because otherwise you can't know whether the attachment was originally compressed or not.
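As a rough sketch of what such a header could carry, the example below prefixes each stored attachment with a magic value and a flags byte that records whether the store itself compressed the body. The `ATT1` magic, the one-byte flag layout, and the use of zlib are hypothetical choices for illustration only, not the dbox format.

```python
import struct
import zlib

MAGIC = b"ATT1"                   # hypothetical format marker
FLAG_COMPRESSED = 0x01            # body was compressed by the store
HEADER = struct.Struct(">4sB")    # magic (4 bytes) + flags (1 byte)


def pack(data: bytes, compress: bool = False) -> bytes:
    """Return the on-disk form: small header followed by the body."""
    flags = FLAG_COMPRESSED if compress else 0
    body = zlib.compress(data) if compress else data
    return HEADER.pack(MAGIC, flags) + body


def unpack(blob: bytes) -> bytes:
    """Return the original attachment bytes, undoing store compression."""
    magic, flags = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("not a stored attachment")
    body = blob[HEADER.size:]
    if flags & FLAG_COMPRESSED:
        return zlib.decompress(body)
    return body


def compress_later(blob: bytes) -> bytes:
    """Recompress an already-delivered attachment, e.g. after a few weeks."""
    return pack(unpack(blob), compress=True)
```

The flag is what separates an attachment that arrived as compressed data, which has to be returned untouched, from one the store compressed later, which has to be decompressed on read.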