Hi
> The idea is to have dbox and mdbox support saving attachments (or MIME parts in general) to separate files, which with some magic gives a possibility to do single instance attachment storage. Comments welcome.
This is a really interesting idea. I have previously given it some thought. My 2p:
1) Being able to ask "the server" whether it has an attachment matching a specific hash would be useful for a bunch of other reasons. The result needs to be (cryptographically) unique, so the hash needs to be a good hash (MD5/SHA or better) of the complete attachment, ideally after decoding.
2) It might be useful to be able to find attachments with a specific hash regardless of whether the attachment has been spat out separately (think of a use case where we want to be able to spot a 2KB footer gif which on its own isn't worth worrying about, but some offline scan later discovers 90% of emails contain this gif and we wish to split it off as a policy decision).
3) Storing attachments by hash may be interesting for use with specialist filesystems, e.g. an interesting direction that dbox could take might be to store the headers and message text in some (compressed?) format with high linear read rates and most attachments in some key/value storage system?
4) Many modern IMAP clients are starting to download attachments on demand. We need to be able to supply only parts of the email efficiently without needing to pull in the blobs. Stated another way, it's desirable not to have to peek inside the blobs to be able to fetch arbitrary MIME parts.
5) It's going to be easy to break signed emails... Need to be careful.
6) In many cases this isn't a performance win... It's still a *great* feature, but two disk seeks outweigh a lot of linear read speed.
7) When something gets corrupted... It's worth pondering how we can audit and find unreferenced "blobs" later.
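To make the "good hash of the complete attachment, ideally after decoding" point a bit more concrete, here is a rough Python sketch of what I mean. It is only an illustration using the Python standard library, not anything dbox does today: the function name `attachment_hashes` is made up, and picking SHA-256 is my assumption.

```python
import hashlib
from email import message_from_bytes
from typing import List, Tuple


def attachment_hashes(raw_message: bytes) -> List[Tuple[str, str]]:
    """Return (content type, SHA-256 hex digest) for every leaf MIME part.

    The digest is taken over the payload *after* transfer decoding, so the
    same file sent as base64 or quoted-printable hashes to the same value.
    """
    msg = message_from_bytes(raw_message)
    results = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # multipart containers carry no payload of their own
        payload = part.get_payload(decode=True) or b""
        digest = hashlib.sha256(payload).hexdigest()
        results.append((part.get_content_type(), digest))
    return results
```

Hashing the decoded bytes is what lets two copies of the same file match even when the sending clients encoded or line-wrapped them differently.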
Some of the use cases I have for these features (just in case you care...): We have a feature which is a bit like the opposite of one of these services for sending big attachments. When a user's email arrives we remove all attachments that meet our criteria and replace them with links to the files. This requires being able to give users a coded link which can later be decoded to refer to a specific attachment. If this change offered us additional ways to find attachments by hash or whatever, then it would be extremely useful.
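As an illustration of the "coded link" idea, something like the following would do if attachments could be looked up by hash. This is a sketch under my own assumptions, not how our service actually works: `SECRET_KEY`, `make_token` and `parse_token` are hypothetical names, and the token is simply the attachment's SHA-256 plus a truncated HMAC tag so that links can't be guessed or altered.

```python
import base64
import hashlib
import hmac
from typing import Optional

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical server-side secret


def make_token(attachment_sha256: str) -> str:
    """Encode an attachment hash plus an HMAC tag as a URL-safe token."""
    digest = bytes.fromhex(attachment_sha256)
    tag = hmac.new(SECRET_KEY, digest, hashlib.sha256).digest()[:16]
    return base64.urlsafe_b64encode(digest + tag).decode("ascii").rstrip("=")


def parse_token(token: str) -> Optional[str]:
    """Return the attachment hash if the token is authentic, else None."""
    padded = token + "=" * (-len(token) % 4)  # restore stripped padding
    try:
        raw = base64.urlsafe_b64decode(padded)
    except ValueError:
        return None
    if len(raw) != 48:  # 32-byte digest + 16-byte tag
        return None
    digest, tag = raw[:32], raw[32:]
    expected = hmac.new(SECRET_KEY, digest, hashlib.sha256).digest()[:16]
    return digest.hex() if hmac.compare_digest(tag, expected) else None
```

The decoded hash would then be handed to whatever "do you have an attachment with this hash" lookup the server ends up offering.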
Another feature we offer is a client application which compresses and reduces bandwidth when sending/receiving emails. We currently don't try to hash bits of email, but it's an idea I have been mulling over for IMAP users, where we typically see the data sent via SMTP, then uploaded to the IMAP "sent items", then often downloaded again when the client polls the sent items for new messages (durr). Being able to see if we have binary content which matches a specific hash could be extremely interesting.
I'm not sure if with your current proposal I can do 100% of the above?
For example, it's not clear if 4) is still possible? Also, without a
"guaranteed" hash we can't use the hash as a lookup key in a key/value
storage system (which implies another mapping of keys to keys is
required). Can we do an (efficient) offline scan of messages looking for
duplicated hash keys (i.e. can the server calculate hashes for all
attachment parts ahead of time)?
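
Roughly what I have in mind for the offline scan, sketched in Python. This walks raw messages rather than anything dbox-specific; `find_duplicate_parts` and the (message id, raw bytes) input shape are my own assumptions, and a real implementation would presumably read precomputed hashes rather than rehashing everything.

```python
import hashlib
from collections import defaultdict
from email import message_from_bytes
from typing import Dict, Iterable, List, Tuple


def find_duplicate_parts(
    messages: Iterable[Tuple[str, bytes]],
) -> Dict[str, List[Tuple[str, int]]]:
    """Map each digest seen more than once to its (message id, part index) locations."""
    seen = defaultdict(list)
    for msg_id, raw in messages:
        # Only leaf parts carry content; index them in walk order.
        leaves = (p for p in message_from_bytes(raw).walk() if not p.is_multipart())
        for index, part in enumerate(leaves):
            payload = part.get_payload(decode=True) or b""
            seen[hashlib.sha256(payload).hexdigest()].append((msg_id, index))
    # Keep only content that occurs in more than one place.
    return {digest: locs for digest, locs in seen.items() if len(locs) > 1}
```

The output is exactly the list you would want for the "90% of emails contain this gif" policy decision: one digest, many locations.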
Sounds extremely interesting. Look forward to seeing this develop!
Cheers
Ed W