[Dovecot] (Single instance) attachment storage
Timo Sirainen
tss at iki.fi
Tue Aug 24 17:14:44 EEST 2010
On Tue, 2010-08-24 at 13:42 +0100, Ed W wrote:
> Hi
>
> > The idea is to have dbox and mdbox support saving attachments (or MIME
> > parts in general) to separate files, which with some magic gives a
> > possibility to do single instance attachment storage. Comments welcome.
>
> This is a really interesting idea. I have previously given it some
> thought. My 2p
>
> 1) Being able to ask "the server" if it has an attachment matching a
> specific hash would be useful for a bunch of other reasons.
If you have a hash 351641b73feb7cf7e87e5a8c3ca9a37d7b21e525, you can see
if it exists with:
ls /attachments/35/16/hashes/351641b73feb7cf7e87e5a8c3ca9a37d7b21e525
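The same lookup can be sketched in code. This assumes, from the single example path above, that the two directory levels are the first and second pairs of hex digits of the hash; the function names and the `/attachments` root are only illustrative, not part of Dovecot.

```python
import os
import posixpath


def attachment_hash_path(root: str, hex_hash: str) -> str:
    """Build the hashes/ path for a hash, e.g. root/35/16/hashes/<hash>.

    Assumption: the directory levels are hex_hash[0:2] and hex_hash[2:4],
    as inferred from the one example path in this mail.
    """
    return posixpath.join(root, hex_hash[0:2], hex_hash[2:4], "hashes", hex_hash)


def attachment_exists(root: str, hex_hash: str) -> bool:
    """Return True if a file for this hash is present under root."""
    return os.path.exists(attachment_hash_path(root, hex_hash))
```

For example, `attachment_hash_path("/attachments", "351641b73feb7cf7e87e5a8c3ca9a37d7b21e525")` yields the path used in the `ls` command above.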
> This result
> needs to be (cryptographically) unique and hence the hash needs to be a
> good hash (MD5/SHA or better) of the complete attachment,
Currently it uses SHA1, but this can be changed anytime. I didn't bother
to make it configurable. The hash's security isn't a huge issue since it
does byte-by-byte comparison anyway.
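A minimal sketch of that idea, not Dovecot's actual code: the SHA1 only selects a candidate, and the decision to share storage rests on a full byte-by-byte comparison. The function names and chunk size are hypothetical.

```python
import hashlib
import os


def sha1_hex_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a: str, path_b: str, chunk_size: int = 65536) -> bool:
    """Compare two files byte by byte; the hash is never trusted alone."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files ended at the same point
                return True


def can_share_storage(new_path: str, stored_path: str) -> bool:
    """True only if the hashes match AND the contents are identical."""
    if sha1_hex_of_file(new_path) != sha1_hex_of_file(stored_path):
        return False
    return files_identical(new_path, stored_path)
```

With the byte comparison in place, a hash collision costs an extra copy on disk rather than delivering the wrong attachment.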
> ideally after decoding
The hash is computed after decoding base64 if the attachment is saved
decoded, and that happens if it can be re-encoded exactly as it was.
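The round-trip condition could be checked roughly as follows. This is a sketch under stated assumptions, not how Dovecot implements it: it assumes every line has the length of the first line and that one newline style is used throughout, and the function names are hypothetical.

```python
import base64
import binascii


def reencode_base64(raw: bytes, line_len: int, newline: bytes) -> bytes:
    """Encode raw bytes as base64, wrapped at line_len with the given newline."""
    b64 = base64.b64encode(raw)
    lines = [b64[i:i + line_len] for i in range(0, len(b64), line_len)]
    return b"".join(line + newline for line in lines)


def can_store_decoded(encoded: bytes) -> bool:
    """True if decoding and re-encoding reproduces the original bytes exactly."""
    newline = b"\r\n" if b"\r\n" in encoded else b"\n"
    first_line = encoded.split(newline, 1)[0]
    if not first_line:
        return False
    try:
        raw = base64.b64decode(encoded)
    except binascii.Error:
        return False
    # Any irregular wrapping, stray whitespace, or odd padding makes this differ.
    return reencode_base64(raw, len(first_line), newline) == encoded
```

An attachment that fails this check would have to be stored in its original encoded form, so that the message can still be returned unchanged.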
> 2) It might be useful to be able to find attachments with a specific
> hash regardless of whether the attachment has been spat out separately
> (think of a use case where we want to be able to spot a 2KB footer gif
> which on its own isn't worth worrying about, but some offline scan
> later discovers 90% of emails contain this gif and we wish to split it
> off as a policy decision).
I guess that would be possible, but it would require reading and parsing
all of the mail files. That could take a while. The finding part
wouldn't be all that much work, but separating attachments out of
already saved mails is kind of annoying.
> 3) Storing attachments by hash may be interesting for use with
> specialist filesystems, eg an interesting direction that dbox could take
> might be to store the headers and message text in some (compressed?)
> format with high linear read rates and most attachments in some
> key/value storage system?
The attachment I/O is done via a filesystem API, so this would be easily
possible by just writing an FS API backend for a key-value database.
> 4) Many modern IMAP clients are starting to download attachments on
> demand. Need to be able to supply only parts of the email efficiently
> without needing to pull in the blobs. Stated another way, it's
> desirable not to peek inside the blobs to be able to fetch arbitrary
> mime parts
This is already done .. in theory anyway. I'm not sure yet if some
prefetching code causes the attachments to be read unnecessarily. Should
test it.
> 5) It's going to be easy to break signed emails... Need to be careful
Yeah, I wasn't planning on breaking them.
> 6) In many cases this isn't a performance win... It's still a *great*
> feature, but two disk seeks outweigh a lot of linear read speed.
Sure, not a performance win. But that's not what it was meant for. :)
But if only >1MB (or so) attachments were stored separately that should
get rid of the worst offenders without impacting performance much.
> 7) When something gets corrupted... It's worth pondering about how we
> can audit and find unreferenced "blobs" later?
Dovecot logs an error when it finds something unexpected. But there's
not a whole lot it can do then. And finding such broken attachments ..
well, I guess this'll already do it:
doveadm fetch -A body all > /dev/null
> Some of the use cases I have for these features (just in case you
> care...). We have a feature which is a bit like the opposite of one of
> these services for sending big attachments. When a user's email arrives we
> remove all attachments that meet our criteria and replace them with
> links to the files. This requires being able to give users a coded link
> which can later be decoded to refer to a specific attachment. If this
> change offered us additional ways to find attachments by hash or
> whatever then it would be extremely useful
I'm not sure if this change will help much. If the attachment changes
(especially in size) there will be problems..
> Another feature we offer is a client application which compresses and
> reduces bandwidth when sending/receiving emails. We currently don't try
> and hash bits of email, but it's an idea I have been mulling over for
> IMAP users where we typically see the data sent via SMTP, then uploaded
> to the imap "sent items", then often downloaded again when the client
> polls the sent items for new messages (durr). Being able to see if we
> have binary content which matches a specific hash could be extremely
> interesting
Related to that, I've been thinking of a transparent caching Dovecot
proxy.
> I'm not sure if with your current proposal I can do 100% of the above?
> For example it's not clear if 4) is still possible? Also without a
> "guaranteed" hash we can't use the hash as a lookup key in a key/value
> storage system (which implies another mapping of keys to keys is
> required).
Yeah, attachment-instance-key -> attachment-key -> attachment data
lookup would be the only safe way to do this.
> Can we do an (efficient) offline scan of messages looking for
> duplicated hash keys (ie can the server calculate hashes for all
> attachment parts ahead of time)
Well .. the way it works is that you have files:
hash-guid
hash2-guid2
hashes/hash
hashes/hash2
If two attachments have the same hash but different content, you'll end
up with:
hash-guid1
hash-guid2
hashes/hash
Where hash-guid1 and hash-guid2 are different files, and only one of
them is hard linked to hashes/hash. To find duplicates, you can stat()
all files and find which have identical hash but different inode.
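The stat() scan can be sketched as below. It assumes the attachment files sit directly in one directory and are named `<hash>-<guid>` as in the listing above, with the hash itself containing no `-`; the function name is hypothetical.

```python
import os
from collections import defaultdict


def find_same_hash_different_inode(directory: str) -> dict:
    """Map each hash to its set of inodes, keeping only hashes with more than one.

    Files that share both hash and inode are hard links to the same data,
    so they are not reported.
    """
    inodes_by_hash = defaultdict(set)
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip the hashes/ subdirectory and anything that is not a plain file.
            if not entry.is_file(follow_symlinks=False):
                continue
            hash_part, sep, _guid = entry.name.partition("-")
            if not sep:
                continue  # name is not of the form <hash>-<guid>
            inodes_by_hash[hash_part].add(os.stat(entry.path).st_ino)
    return {h: inodes for h, inodes in inodes_by_hash.items() if len(inodes) > 1}
```

Each hash returned has at least two files with that hash stored as separate data, which is the "identical hash but different inode" case described above.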