Timo Sirainen wrote:
On Mon, 2010-07-19 at 18:30 +0100, William Blunn wrote:
Consider storing the recovery filter stack in the dbox metadata rather than the attachment file.
This has a couple of upshots:
- If one person receives a message with an attachment encoded as base64 at, say, 19 cells (76 bytes) per line, and then re-sends the same file as an attachment to someone else whose MUA encodes base64 at, say, 18 cells (72 bytes) per line, the attachment file can contain exactly the same data, allowing for deduplication even in this case.
I thought about that also, but it would require calculating and using a hash of the decoded message (but not the compressed message). Could get complex.
BTW I am not attempting to suggest a complete system for de-duplication, but rather to suggest a means by which it could be arranged that file contents became identical so that "something else" could de-duplicate them elsehow.
I would be interested to know what the hash you mention is needed for.
Also I would be interested to know why the hash of the fragment of the original message stream (regardless of base64 decodability) would not be sufficient.
And if it isn't...
    if (base64_smart_decode(&raw_data, &decoded_data, &chars_per_line) == SUCCESS) {
        // store decoded_data to attachment file
        // recovery_filter = "base64_" .concat. chars_per_line
    } else {
        // store raw_data to attachment file
        // recovery_filter = nothing
    }
    // make hash of attachment file
    // store pointer to dbox metadata including recovery_filter
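To make the pseudocode above concrete, here is a minimal sketch in Python. It is an illustration only, not Dovecot code: the names encode_lines, smart_decode, restore and attachment_hash are hypothetical, and the choice of SHA-1 is arbitrary. It assumes CRLF line endings and that every line except the last has the same length. The decoded form is kept only when re-encoding it at the detected line length reproduces the raw stream byte for byte, so the original message can always be rebuilt from the attachment file plus the recovery filter.

```python
import base64
import binascii
import hashlib


def encode_lines(data: bytes, chars_per_line: int) -> bytes:
    """Base64-encode data, wrapped at chars_per_line with CRLF line endings."""
    b64 = base64.b64encode(data)
    return b"".join(
        b64[i:i + chars_per_line] + b"\r\n"
        for i in range(0, len(b64), chars_per_line)
    )


def smart_decode(raw: bytes):
    """Return (bytes_to_store, recovery_filter).

    The decoded form is used only if it round-trips exactly; otherwise
    the raw stream is stored and the filter is None ("nothing").
    """
    chars_per_line = len(raw.split(b"\r\n", 1)[0])
    if chars_per_line > 0:
        try:
            decoded = base64.b64decode(raw.replace(b"\r\n", b""), validate=True)
        except binascii.Error:
            decoded = None
        if decoded is not None and encode_lines(decoded, chars_per_line) == raw:
            return decoded, "base64_%d" % chars_per_line
    return raw, None


def restore(stored: bytes, recovery_filter) -> bytes:
    """Rebuild the original message fragment from the stored attachment."""
    if recovery_filter is None:
        return stored
    return encode_lines(stored, int(recovery_filter.split("_", 1)[1]))


def attachment_hash(stored: bytes) -> str:
    """Hash of the attachment file contents (SHA-1 chosen arbitrarily here)."""
    return hashlib.sha1(stored).hexdigest()
```

With this arrangement, the same file sent at 76 and at 72 characters per line gives two different raw streams but one attachment file and one hash. Only the recovery filter kept in the metadata differs between the two messages.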
- Assuming we have configured Dovecot to decode base64 but not to compress, then the file in which we store the attachment data contains literally the exact same byte stream as if the attachment were saved out from the MUA. I don't know what practical use this might be, but it /sounds/ cool :-) Perhaps a suitable filesystem or backup-system could deduplicate both a file *and* its instance as a message attachment.
I was thinking about adding some small header to the dbox file, so they wouldn't be completely identical.
Though that is kind of the point. If everything in the small header can go somewhere else then the small header can go away and we can store the attachment very literally.
What kind of things are you thinking of putting in the small header?
Bill