[Dovecot] (Single instance) attachment storage

Tue Aug 24 17:48:48 EEST 2010

  Hi, thanks for responding

> If you have a hash 351641b73feb7cf7e87e5a8c3ca9a37d7b21e525, you can see
> if it exists with:
>
> ls /attachments/35/16/hashes/351641b73feb7cf7e87e5a8c3ca9a37d7b21e525

This would be great for a bunch of uses, if the hash were unique and
could be computed from the attachment content alone.  E.g. could I take
a random file from my filesystem, compute its hash, and then check for
the existence of /attachments/*/*/hashes/my_hash to determine whether
any messages reference that attachment?
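
A rough sketch of that check, in Python.  Assumptions, none of them
confirmed in this thread: the hash is SHA-1 over the raw bytes of the
file (the example hash above is 40 hex characters, which would fit),
and the two directory levels are the first two byte pairs of the hex
digest, as in the quoted "ls" path.  The function names are made up
for illustration.

```python
import hashlib
from pathlib import Path


def attachment_hash(path, algo="sha1", chunk_size=1 << 20):
    """Hash a file's raw bytes in chunks (algo is an assumption)."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hash_path(root, digest):
    """Build <root>/<d[0:2]>/<d[2:4]>/hashes/<d>, as in the quoted example."""
    return Path(root) / digest[0:2] / digest[2:4] / "hashes" / digest


def is_referenced(root, file_path):
    """True if a hashes/ entry exists for this file's digest."""
    return hash_path(root, attachment_hash(file_path)).exists()
```

If the stored hash is instead computed over some decoded or normalised
form of the attachment, a hash of an arbitrary file on disk would not
match and this lookup would not work.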

>> format with high linear read rates and most attachments in a some
>> key/value storage system?
> The attachment I/O is done via filesystem API, so this would be possible
> easily by just writing FS API backend for a key-value database.

Cool

>> 7) When something gets corrupted... It's worth pondering about how we
>> can audit and find unreferenced "blobs" later?
> Dovecot logs an error when it finds something unexpected. But there's
> not a whole lot it can do then. And finding such broken attachments ..
> well, I guess this'll already do it:
>

I was actually pondering the next step, where there is some kind of
single instance storage, say hardlinks, and you want to avoid being
left with a dangling blob that no email references any more.  The
details depend on whichever single instance storage implementation
finally develops, but it seems like a problem common to several of the
obvious ways to do it?
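
For the hardlink case, one possible audit is to look at link counts.
The sketch below (Python) assumes that every attachment instance is a
hard link to the corresponding file under a "hashes" directory, so a
hashes/ file whose link count is 1 has no remaining instances.  That
layout is an assumption for illustration, not a description of what
Dovecot actually does.

```python
import os


def find_unreferenced(root):
    """Return hashes/ files that nothing else hard-links to."""
    orphans = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Only inspect the per-shard "hashes" directories.
        if os.path.basename(dirpath) != "hashes":
            continue
        for name in filenames:
            full = os.path.join(dirpath, name)
            # st_nlink == 1 means this directory entry is the only link.
            if os.lstat(full).st_nlink == 1:
                orphans.append(full)
    return sorted(orphans)
```

This costs one stat per stored hash, and it only works on filesystems
that report hard link counts reliably.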

A related (but probably fairly rare) question would be (efficiently) 
finding all the messages which reference an attachment with a given 
unique hash.  I can contrive a few reasons to ask this question, but 
perhaps someone else will tell me they are dying to know this stuff?
(Find all emails which reference the company 2010 accounts pdf... Find 
all emails from employees with an attachment that matches our shadow 
password file...).  Seems like we can do this just fine, only it will 
involve a lot of stats right now?
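
To illustrate where the stats come from, here is a sketch (Python) that
lists the instances of one attachment.  It assumes the "hash-guid" and
"hashes/hash" naming quoted further down, that instances are hard links
to the hashes/ file, and the same two-level directory sharding as the
"ls" example.  Mapping a GUID back to a message is not covered, since
nothing in this thread says how that would be done.

```python
import os


def instances_of(root, digest):
    """Return GUID suffixes of <digest>-<guid> files sharing the hashes/ inode."""
    shard = os.path.join(root, digest[0:2], digest[2:4])
    target = os.stat(os.path.join(shard, "hashes", digest))
    guids = []
    with os.scandir(shard) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            if not entry.name.startswith(digest + "-"):
                continue
            # One stat per candidate: same device and inode means same blob.
            st = entry.stat(follow_symlinks=False)
            if (st.st_dev, st.st_ino) == (target.st_dev, target.st_ino):
                guids.append(entry.name[len(digest) + 1:])
    return sorted(guids)
```

Comparing inodes rather than trusting the file name keeps the result
correct even if two different contents ever ended up under the same
hash prefix.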

>> links to the files.  This requires being able to give users a coded link
>> which can later be decoded to refer to a specific attachment.  If this
>> change offered us additional ways to find attachments by hash or
>> whatever then it would be extremely useful
> I'm not sure if this change will help much. If the attachment changes
> (especially in size) there will be problems..

A unique hash would allow me to give out very simple URL links to 
customers, eg http://mysite/attachments/SHA_Hash

At present I have a fairly convoluted scheme which uses message ids and 
this has a whole bunch of issues...

> Related to that, I've been thinking of a transparent caching Dovecot
> proxy.

I could be interested in sponsoring such work if that got it higher up
the ToDo list?!  Please contact me with a proposal?

>> I'm not sure if with your current proposal I can do 100% of the above?
>> For example it's not clear if 4) is still possible?  Also without a
>> "guaranteed" hash we can't use the hash as a lookup key in a key/value
>> storage system (which implies another mapping of keys to keys is
>> required).
> Yeah, attachment-instance-key ->  attachment-key ->  attachment data
> lookup would be the only safe way to do this.

However, if the hash were a hash of the full message then we could 
completely avoid the double indirection?  Within the limits of sensible 
probability, hashes can be considered unique and so there is "no" 
possibility of collision.

If you were to use highly unique hashes as your keys then you can 
dispense with certain levels of indirection here and hashes would be 
"guaranteed" unique if the content is unique.  This allows you to do a 
straight compare of all hashes anywhere and two hashes the same == same 
content
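
A toy illustration of that idea (Python): a content-addressed store
where the key is simply the hash of the data, so there is one lookup
and no second mapping.  SHA-256 and the class name are choices made
for this sketch only; the thread does not say which hash is used.

```python
import hashlib


class ContentStore:
    """In-memory key/value store keyed by the hash of the value."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is stored once.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self):
        return len(self._blobs)
```

Storing the same bytes twice returns the same key and keeps a single
copy.  This relies entirely on treating hash collisions as impossible,
which is exactly the assumption being questioned here.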

(I'm caveating all "unique" claims in case someone points out that it's
theoretically possible to get collisions)

> Well .. the way it works is that you have files:
>

> hash-guid
> hash2-guid2
> hashes/hash
> hashes/hash2
>
> If two attachments have the same hash but different content, you'll end
> up with:

But if our hash were unique and based on the entire message then we 
should never get duplicated hash values? (obviously at the cost of more 
CPU and IO)

I sense I have misunderstood something, so please be gentle..?

Great feature anyway

Cheers

Ed W