On Tue, 2010-08-24 at 13:42 +0100, Ed W wrote:
Hi
The idea is to have dbox and mdbox support saving attachments (or MIME parts in general) to separate files, which with some magic gives a possibility to do single instance attachment storage. Comments welcome.
This is a really interesting idea. I have previously given it some thought. My 2p
- Being able to ask "the server" if it has an attachment matching a specific hash would be useful for a bunch of other reasons.
If you have a hash 351641b73feb7cf7e87e5a8c3ca9a37d7b21e525, you can see if it exists with:
ls /attachments/35/16/hashes/351641b73feb7cf7e87e5a8c3ca9a37d7b21e525
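As a rough sketch of that lookup in Python: the two-level fan-out on the first four hex digits is only read off the one example path above, and `hash_path`, `sha1_hex` and the `/attachments` root are names made up for illustration, not Dovecot settings or functions.

```python
import hashlib
from pathlib import PurePosixPath


def sha1_hex(data: bytes) -> str:
    """Return the SHA1 hex digest of the (decoded) attachment bytes."""
    return hashlib.sha1(data).hexdigest()


def hash_path(attachment_root: str, hex_digest: str) -> PurePosixPath:
    """Build <root>/<first 2 hex>/<next 2 hex>/hashes/<full digest>.

    Assumption: the directory fan-out mirrors the example path
    /attachments/35/16/hashes/3516...; the real layout may differ.
    """
    return (
        PurePosixPath(attachment_root)
        / hex_digest[0:2]
        / hex_digest[2:4]
        / "hashes"
        / hex_digest
    )
```

Checking whether an attachment is already stored is then a matter of testing whether the returned path exists.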
This result needs to be (cryptographically) unique, and hence the hash needs to be a good hash (MD5/SHA or better) of the complete attachment,
Currently it uses SHA1, but this can be changed anytime. I didn't bother to make it configurable. The hash's security isn't a huge issue since it does byte-by-byte comparison anyway.
ideally after decoding
The hash is after decoding base64, if attachment is saved decoded, and that happens if it can be re-encoded exactly as it was.
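The "can be re-encoded exactly" check could look roughly like this in Python. It is a sketch under assumptions, not the actual Dovecot logic: it assumes every line has the length of the first line, the same newline style throughout, and a trailing newline; `reencode` and `hash_attachment` are hypothetical names.

```python
import base64
import binascii
import hashlib


def reencode(decoded: bytes, line_len: int, newline: bytes) -> bytes:
    """Base64-encode and wrap at line_len, ending every line with newline."""
    encoded = base64.b64encode(decoded)
    lines = [encoded[i:i + line_len] for i in range(0, len(encoded), line_len)]
    return b"".join(line + newline for line in lines)


def hash_attachment(body: bytes) -> tuple[str, bool]:
    """Return (sha1 hex, saved_decoded).

    The decoded bytes are hashed only if re-encoding them reproduces the
    original body byte for byte; otherwise the body is hashed as it was sent.
    """
    first_line = body.split(b"\n", 1)[0]
    newline = b"\r\n" if first_line.endswith(b"\r") else b"\n"
    line_len = len(first_line.rstrip(b"\r"))
    if line_len > 0:
        try:
            decoded = base64.b64decode(body)
        except binascii.Error:
            decoded = None  # not valid base64, keep the body as-is
        if decoded is not None and reencode(decoded, line_len, newline) == body:
            return hashlib.sha1(decoded).hexdigest(), True
    return hashlib.sha1(body).hexdigest(), False
```

A body with irregular line lengths fails the round trip and is hashed in its encoded form, so the original bytes can always be returned unchanged.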
- It might be useful to be able to find attachments with a specific hash regardless of whether the attachment has been spat out separately (think of a use case where we want to be able to spot a 2KB footer gif which on its own isn't worth worrying about, but some offline scan later discovers 90% of emails contain this gif and we wish to split it off as a policy decision).
I guess that would be possible, but it would require reading and parsing all of the mail files. That could take a while. The finding part wouldn't be all that much work, but separating attachments out of already saved mails is kind of annoying.
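The finding part could be sketched like this, using Python's standard `email` package rather than anything Dovecot-specific. It assumes the messages are available as raw RFC 822 bytes; `count_attachment_hashes` is a made-up name, and it hashes every non-multipart part after transfer decoding.

```python
import email
import hashlib
from collections import Counter
from email import policy
from typing import Iterable


def count_attachment_hashes(raw_messages: Iterable[bytes]) -> Counter:
    """Count how often each decoded MIME part (by SHA1) occurs across mails."""
    counts: Counter = Counter()
    for raw in raw_messages:
        msg = email.message_from_bytes(raw, policy=policy.default)
        for part in msg.walk():
            if part.is_multipart():
                continue  # containers have no payload of their own
            payload = part.get_payload(decode=True)  # undo base64/QP
            if payload:
                counts[hashlib.sha1(payload).hexdigest()] += 1
    return counts
```

`counts.most_common(10)` would then list the parts that appear most often, such as a footer gif, but every message still has to be read and parsed once.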
- Storing attachments by hash may be interesting for use with specialist filesystems, e.g. an interesting direction that dbox could take might be to store the headers and message text in some (compressed?) format with high linear read rates and most attachments in some key/value storage system?
The attachment I/O is done via filesystem API, so this would be possible easily by just writing FS API backend for a key-value database.
- Many modern IMAP clients are starting to download attachments on demand. Need to be able to supply only parts of the email efficiently without needing to pull in the blobs. Stated another way, it's desirable not to peek inside the blobs to be able to fetch arbitrary MIME parts.
This is already done .. in theory anyway. I'm not sure yet if some prefetching code causes the attachments to be read unnecessarily. Should test it.
- It's going to be easy to break signed emails... Need to be careful
Yeah, I wasn't planning on breaking them.
- In many cases this isn't a performance win... It's still a *great* feature, but two disk seeks outweigh a lot of linear read speed.
Sure, not a performance win. But that's not what it was meant for. :) But if only >1MB (or so) attachments were stored separately that should get rid of the worst offenders without impacting performance much.
- When something gets corrupted... It's worth pondering about how we can audit and find unreferenced "blobs" later?
Dovecot logs an error when it finds something unexpected. But there's not a whole lot it can do then. And finding such broken attachments .. well, I guess this'll already do it:
doveadm fetch -A body all > /dev/null
Some of the use cases I have for these features (just in case you care...). We have a feature which is a bit like the opposite of one of these services for sending big attachments. When a user's email arrives we remove all attachments that meet our criteria and replace them with links to the files. This requires being able to give users a coded link which can later be decoded to refer to a specific attachment. If this change offered us additional ways to find attachments by hash or whatever then it would be extremely useful.
I'm not sure if this change will help much. If the attachment changes (especially in size) there will be problems..
Another feature we offer is a client application which compresses and reduces bandwidth when sending/receiving emails. We currently don't try to hash bits of email, but it's an idea I have been mulling over for IMAP users where we typically see the data sent via SMTP, then uploaded to the IMAP "sent items", then often downloaded again when the client polls the sent items for new messages (durr). Being able to see if we have binary content which matches a specific hash could be extremely interesting.
Related to that, I've been thinking of a transparent caching Dovecot proxy.
I'm not sure if with your current proposal I can do 100% of the above?
For example it's not clear if 4) is still possible? Also without a "guaranteed" hash we can't use the hash as a lookup key in a key/value storage system (which implies another mapping of keys to keys is required).
Yeah, attachment-instance-key -> attachment-key -> attachment data lookup would be the only safe way to do this.
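A toy in-memory version of that two-step lookup, in Python. Everything here is an assumption for illustration: the class and method names, the use of dicts instead of a real key-value database, and the `<hash>-<n>` attachment-key format. The byte-by-byte comparison is what keeps two different attachments with the same hash apart.

```python
import hashlib
import uuid
from typing import Callable


def _sha1(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


class AttachmentStore:
    """instance-key -> attachment-key -> data, with collision handling."""

    def __init__(self, hash_fn: Callable[[bytes], str] = _sha1) -> None:
        self.hash_fn = hash_fn
        self.instances: dict[str, str] = {}      # instance-key -> attachment-key
        self.blobs: dict[str, bytes] = {}        # attachment-key -> data
        self.by_hash: dict[str, list[str]] = {}  # hash -> attachment-keys

    def save(self, data: bytes) -> str:
        digest = self.hash_fn(data)
        candidates = self.by_hash.setdefault(digest, [])
        for key in candidates:
            if self.blobs[key] == data:  # byte-by-byte comparison
                break
        else:
            # New content (possibly a hash collision): give it its own key.
            key = f"{digest}-{len(candidates)}"
            self.blobs[key] = data
            candidates.append(key)
        instance_key = uuid.uuid4().hex  # one key per saved instance
        self.instances[instance_key] = key
        return instance_key

    def load(self, instance_key: str) -> bytes:
        return self.blobs[self.instances[instance_key]]
```

Passing a deliberately weak `hash_fn` is an easy way to see that colliding attachments still end up as separate blobs.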
Can we do an (efficient) offline scan of messages looking for duplicated hash keys (i.e. can the server calculate hashes for all attachment parts ahead of time)?
Well .. the way it works is that you have files:
hash-guid
hash2-guid2
hashes/hash
hashes/hash2
If two attachments have the same hash but different content, you'll end up with:
hash-guid1
hash-guid2
hashes/hash
Where hash-guid1 and hash-guid2 are different files, and only one of them is hard linked to hashes/hash. To find duplicates, you can stat() all files and find which have identical hash but different inode.
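The stat() scan could be sketched in Python like this. It assumes the `<hash>-<guid>` files all sit in one directory and that the hash itself contains no `-`; `find_unlinked_duplicates` is a hypothetical helper, not a doveadm command.

```python
import os
from collections import defaultdict


def find_unlinked_duplicates(directory: str) -> dict[str, dict[int, list[str]]]:
    """Group <hash>-<guid> files by hash, then by inode.

    Returns only the hashes whose files span more than one inode, i.e.
    files with an identical hash that are not hard links of each other.
    """
    by_hash: dict[str, dict[int, list[str]]] = defaultdict(lambda: defaultdict(list))
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip the hashes/ subdirectory and anything not named <hash>-<guid>.
            if not entry.is_file(follow_symlinks=False) or "-" not in entry.name:
                continue
            digest = entry.name.split("-", 1)[0]
            inode = os.stat(entry.path).st_ino
            by_hash[digest][inode].append(entry.name)
    return {h: dict(inodes) for h, inodes in by_hash.items() if len(inodes) > 1}
```

The files it reports only share a hash; their contents would still need to be compared to tell whether they really differ.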