[Dovecot] (Single instance) attachment storage
Ed W
lists at wildgooses.com
Tue Aug 24 15:42:34 EEST 2010
Hi
> The idea is to have dbox and mdbox support saving attachments (or MIME
> parts in general) to separate files, which with some magic gives a
> possibility to do single instance attachment storage. Comments welcome.
This is a really interesting idea. I have previously given it some
thought. My 2p
1) Being able to ask "the server" if it has an attachment matching a
specific hash would be useful for a bunch of other reasons. This result
needs to be (cryptographically) unique and hence the hash needs to be a
good hash (MD5/SHA or better) of the complete attachment, ideally after
decoding.
2) It might be useful to be able to find attachments with a specific
hash regardless of whether the attachment has been spat out separately
(think of a use case where we want to be able to spot a 2KB footer gif
which on its own isn't worth worrying about, but some offline scan
later discovers 90% of emails contain this gif and we wish to split it
off as a policy decision).
3) Storing attachments by hash may be interesting for use with
specialist filesystems, eg an interesting direction that dbox could take
might be to store the headers and message text in some (compressed?)
format with high linear read rates and most attachments in some
key/value storage system?
4) Many modern IMAP clients are starting to download attachments on
demand. Need to be able to supply only parts of the email efficiently
without needing to pull in the blobs. Stated another way, it's
desirable not to peek inside the blobs to be able to fetch arbitrary
mime parts
5) It's going to be easy to break signed emails... Need to be careful
6) In many cases this isn't a performance win... It's still a *great*
feature, but two disk seeks outweigh a lot of linear read speed.
7) When something gets corrupted... It's worth pondering how we
can audit and find unreferenced "blobs" later.
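
A minimal sketch of points 1, 2 and 7, assuming SHA-256 as the "good hash" and a toy in-memory store. The names attachment_hashes and BlobStore are hypothetical and are not part of Dovecot or of the proposal; the point is only that hashing the decoded payload gives the same key however the part was transfer-encoded, and that an audit for unreferenced blobs is a set difference once the references are known.

```python
import hashlib
from email import message_from_bytes
from email.message import EmailMessage


def attachment_hashes(raw_message: bytes) -> list[tuple[str, str]]:
    """Return (filename, sha256 hex) for every part that carries a filename.

    The hash is taken over the payload *after* undoing base64 or
    quoted-printable, so the transfer encoding does not affect the key.
    """
    msg = message_from_bytes(raw_message)
    found = []
    for part in msg.walk():
        if part.is_multipart() or part.get_filename() is None:
            continue
        payload = part.get_payload(decode=True) or b""
        found.append((part.get_filename(), hashlib.sha256(payload).hexdigest()))
    return found


class BlobStore:
    """Hypothetical content-addressed store keyed by SHA-256 (in memory only)."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)  # identical content is stored once
        return key

    def has(self, key: str) -> bool:
        return key in self.blobs

    def unreferenced(self, referenced: set[str]) -> set[str]:
        """Audit: keys present in the store that no message refers to."""
        return set(self.blobs) - referenced


# Small demo message with one base64-encoded attachment.
demo = EmailMessage()
demo["Subject"] = "demo"
demo.set_content("body text")
demo.add_attachment(b"GIF89a-footer", maintype="image", subtype="gif",
                    filename="footer.gif")
```

Here attachment_hashes(bytes(demo)) yields one entry for footer.gif, and the key is the same one BlobStore.put() would return for the raw bytes. A real store would sit on disk or in a key/value system and would need the reference set to come from the mailbox indexes.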
Some of the use cases I have for these features (just in case you
care...). We have a feature which is a bit like the opposite of one of
these services for sending big attachments. When a user's email arrives we
remove all attachments that meet our criteria and replace them with
links to the files. This requires being able to give users a coded link
which can later be decoded to refer to a specific attachment. If this
change offered us additional ways to find attachments by hash or
whatever then it would be extremely useful.
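
One way such a coded link could work if attachments were addressable by hash, purely as an illustration: the token carries the attachment hash plus an HMAC so the server can check it issued the link. This is an assumption for the sake of example, not a description of the existing service; SECRET, encode_link_token and decode_link_token are hypothetical names.

```python
import hashlib
import hmac
from typing import Optional

SECRET = b"change-me"  # hypothetical server-side secret, never sent to users


def _sign(attachment_hash: str, secret: bytes) -> str:
    return hmac.new(secret, attachment_hash.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def encode_link_token(attachment_hash: str, secret: bytes = SECRET) -> str:
    """Build a token of the form '<hash>.<signature>' to embed in a link."""
    return f"{attachment_hash}.{_sign(attachment_hash, secret)}"


def decode_link_token(token: str, secret: bytes = SECRET) -> Optional[str]:
    """Return the attachment hash if the signature checks out, else None."""
    attachment_hash, sep, signature = token.partition(".")
    if not sep:
        return None
    expected = _sign(attachment_hash, secret)
    # Constant-time comparison, done on bytes so odd input cannot raise.
    if hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return attachment_hash
    return None
```

The decoded hash would then be the lookup key for the attachment. Expiry, per-user access checks and so on are left out of the sketch.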
Another feature we offer is a client application which compresses and
reduces bandwidth when sending/receiving emails. We currently don't try
to hash bits of email, but it's an idea I have been mulling over for
IMAP users where we typically see the data sent via SMTP, then uploaded
to the IMAP "sent items", then often downloaded again when the client
polls the sent items for new messages (durr). Being able to see if we
have binary content which matches a specific hash could be extremely
interesting.
I'm not sure if with your current proposal I can do 100% of the above?
For example it's not clear if 4) is still possible? Also without a
"guaranteed" hash we can't use the hash as a lookup key in a key/value
storage system (which implies another mapping of keys to keys is
required). Can we do an (efficient) offline scan of messages looking for
duplicated hash keys (ie can the server calculate hashes for all
attachment parts ahead of time)?
Sounds extremely interesting. Look forward to seeing this develop!
Cheers
Ed W