[Dovecot] (Single instance) attachment storage
Timo Sirainen
tss at iki.fi
Tue Aug 24 18:48:34 EEST 2010
On Tue, 2010-08-24 at 15:48 +0100, Ed W wrote:
> Hi, thanks for responding
>
> > If you have a hash 351641b73feb7cf7e87e5a8c3ca9a37d7b21e525, you can see
> > if it exists with:
> >
> > ls /attachments/35/16/hashes/351641b73feb7cf7e87e5a8c3ca9a37d7b21e525
>
> This would be great for a bunch of uses, if the hash were unique and
> determinable from just some other content. eg if I could take a random
> file from my filesystem, compute a hash from it, and then checking for
> the existence of /attachments/*/*/hashes/my_hash determines whether
> there are any messages referencing that attachment?
Yeah, that would work. And the */* part can also be optimized by just
using the first 4 chars of the hash.
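
A minimal sketch of that lookup, assuming the stored hash is the SHA1 of
the attachment bytes exactly as Dovecot saved them (a file taken from
elsewhere may differ, for example if the attachment is stored in its
encoded form). The /attachments root and both function names are
illustrative, not part of Dovecot:

```python
import hashlib
import os


def attachment_hash_path(root, hex_hash):
    """Build <root>/<chars 0-1>/<chars 2-3>/hashes/<hash>.

    The first 4 hex chars of the hash replace the */* wildcard, so no
    directory scan is needed.
    """
    return os.path.join(root, hex_hash[:2], hex_hash[2:4], "hashes", hex_hash)


def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large attachments are not read into memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_attachment_stored(root, file_path):
    """True if a hashes/ entry exists for this file's SHA1 (assumed layout)."""
    return os.path.exists(attachment_hash_path(root, sha1_of_file(file_path)))
```

This is a single stat per file, so checking many candidate files stays
cheap.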
> >> 7) When something gets corrupted... It's worth pondering about how we
> >> can audit and find unreferenced "blobs" later?
> > Dovecot logs an error when it finds something unexpected. But there's
> > not a whole lot it can do then. And finding such broken attachments ..
> > well, I guess this'll already do it:
> >
>
> I was actually pondering the next step where there is some kind of
> single instance storage, say hardlinks, and you want to avoid a final
> dangling reference where there are no emails referencing that
> attachment? Such an issue depends on certain implementations of the
> single instance storage which finally develops, but it seems like a
> common problem to several of the obvious ways to do it?
The current implementation checks how many hard links are left for the
hash while deleting it. If it's deleting the last reference then the
final hashes/hash file is also deleted. It's of course possible that if
it crashes between these two deletes then there are some dangling hashes
left. I was thinking about maybe creating a tool to find and delete
those. It's easily done by just deleting every file in
the /attachments/*/*/hashes/ directories that has a link count of 1.

A bigger problem is if a user's dbox directory is deleted/corrupted so
that unused files are left in the /attachments/*/*/ directories. Finding
and deleting those pretty much requires reading through all dbox files
for all users..
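
The link-count check for dangling hashes could be sketched like this. It
is only an illustration, not an existing Dovecot tool: the function name
is made up, and it reports candidates instead of deleting them, since a
scan of a live store could race with a delivery that is still creating
its links.

```python
import glob
import os
import stat


def find_dangling_hashes(root):
    """List regular files under <root>/*/*/hashes/ with a link count of 1.

    A link count of 1 means no attachment instance is hard-linked to the
    hash file any more, e.g. after a crash between the two deletes.
    """
    pattern = os.path.join(root, "*", "*", "hashes", "*")
    dangling = []
    for path in glob.glob(pattern):
        st = os.lstat(path)
        if stat.S_ISREG(st.st_mode) and st.st_nlink == 1:
            dangling.append(path)
    return sorted(dangling)
```

Calling find_dangling_hashes("/attachments") returns the paths that such
a cleanup tool would then remove.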
> A related (but probably fairly rare) question would be (efficiently)
> finding all the messages which reference an attachment with a given
> unique hash.
Yeah. Not something I was planning on supporting.
> I can contrive a few reasons to ask this question, but
> perhaps someone else will tell me they are dying to know this stuff?
> (Find all emails which reference the company 2010 accounts pdf... Find
> all emails from employees with an attachment that matches our shadow
> password file...). Seems like we can do this just fine, only it will
> involve a lot of stats right now?
The mail -> attachment references are stored only inside dbox files in
the metadata fields.
> >> links to the files. This requires being able to give users a coded link
> >> which can later be decoded to refer to a specific attachment. If this
> >> change offered us additional ways to find attachments by hash or
> >> whatever then it would be extremely useful
> > I'm not sure if this change will help much. If the attachment changes
> > (especially in size) there will be problems..
>
> A unique hash would allow me to give out very simple URL links to
> customers, eg http://mysite/attachments/SHA_Hash
Ah. Yeah, that would work. You could just give the hash-guid reference
so that the link will no longer work after the message gets deleted.
Although the attachment hash-guid isn't available in any easy way, so
you'd have to add some extra code.
> > Related to that, I've been thinking of a transparent caching Dovecot
> > proxy.
>
> I could be interested in sponsoring such work if it got it higher up the
> ToDo list?! Please contact me with a proposal?
The related parts are:

1. Writing an IMAP client lib-storage backend (possibly supporting, but
not requiring, some Dovecot-specific extensions). But this can be
annoyingly slow, because the lib-storage API is synchronous. So it needs:

2. Changing the lib-storage API to allow backends to be asynchronous.
This is wanted also for other high-latency backends, like key-value
databases. It could also improve everyone else's performance by adding
async disk I/O support.

3. Adding a caching proxy lib-storage backend that supports transparently
caching messages and/or indexes for other lib-storage backends.

I'm interested in 2. anyway. Also 1 is probably a nice and easy way to
test that it works, and 1 is also going to be nice for using dsync to
migrate to Dovecot from other random IMAP servers :) Then 3 probably
won't be all that difficult to implement. Anyway, I'm full time employed
until the end of November, although I'm not exactly sure what I'll be
working on soon..
> >> I'm not sure if with your current proposal I can do 100% of the above?
> >> For example it's not clear if 4) is still possible? Also without a
> >> "guaranteed" hash we can't use the hash as a lookup key in a key/value
> >> storage system (which implies another mapping of keys to keys is
> >> required).
> > Yeah, attachment-instance-key -> attachment-key -> attachment data
> > lookup would be the only safe way to do this.
>
> However, if the hash were a hash of the full message then we could
> completely avoid the double indirection? Within the limits of sensible
> probability, hashes can be considered unique and so there is "no"
> possibility of collision.
The hash is already a full hash of the message. I don't really like the
idea of trusting that a hash is unique, especially because this could be
attacked. Someone could read another user's attachment if they only knew
its hash and then were able to create another file with the same hash
and send it to themselves in the same system. (Sure, this would require
someone breaking SHA1. But the attachments with their SHA1 hashes could
exist for many more years.)

I might make Dovecot trust the hash optionally anyway, but not
unconditionally.
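
To make the trade-off concrete, here is a rough sketch of deduplication
that compares the bytes before sharing a file, with an opt-in switch to
trust the hash alone. It is not Dovecot's implementation: the function,
the trust_hash flag, the guid argument and the <hash>-<guid> instance
naming are all assumptions made for illustration.

```python
import filecmp
import hashlib
import os
import shutil


def store_attachment(root, src_path, guid, trust_hash=False):
    """Store src_path as an attachment instance; return (path, shared).

    shared is True when the instance is a hard link to an existing copy.
    """
    with open(src_path, "rb") as f:
        hex_hash = hashlib.sha1(f.read()).hexdigest()
    base = os.path.join(root, hex_hash[:2], hex_hash[2:4])
    hash_path = os.path.join(base, "hashes", hex_hash)
    instance_path = os.path.join(base, hex_hash + "-" + guid)
    os.makedirs(os.path.dirname(hash_path), exist_ok=True)

    if os.path.exists(hash_path):
        # Without trust_hash, a matching hash is not enough: the bytes
        # must match too, or a crafted collision would expose the
        # existing attachment to the sender.
        if trust_hash or filecmp.cmp(hash_path, src_path, shallow=False):
            os.link(hash_path, instance_path)
            return instance_path, True
        # Same hash, different content: keep a private copy.
        shutil.copyfile(src_path, instance_path)
        return instance_path, False

    # First copy with this hash: store it and publish the hashes/ link.
    shutil.copyfile(src_path, instance_path)
    os.link(instance_path, hash_path)
    return instance_path, False
```

The byte comparison costs one extra read of the existing file per
duplicate, which is the price of not relying on SHA1 staying unbroken.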