Hi, thanks for responding
If you have a hash 351641b73feb7cf7e87e5a8c3ca9a37d7b21e525, you can see if it exists with:
ls /attachments/35/16/hashes/351641b73feb7cf7e87e5a8c3ca9a37d7b21e525
This would be great for a bunch of uses, if the hash were unique and determinable from the content alone. E.g. could I take a random file from my filesystem, compute a hash from it, and then check for the existence of /attachments/*/*/hashes/my_hash to determine whether there are any messages referencing that attachment?
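To make the question concrete, here is a rough sketch of the check I have in mind. It assumes the 40-hex-digit name is a plain SHA-1 of the attachment bytes as I would have them on disk (exactly the part I am unsure about), and that the two directory levels are the first four hex digits, as in the ls example. The function names are mine, not anything Dovecot provides.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_path(root, hex_hash):
    """Build <root>/<h[0:2]>/<h[2:4]>/hashes/<h>, mirroring the ls example.

    ASSUMPTION: the layout is inferred from the single example path only.
    """
    return Path(root) / hex_hash[0:2] / hex_hash[2:4] / "hashes" / hex_hash


def attachment_known(root, candidate_file):
    """True if a hashes/ entry exists for the candidate file's digest."""
    return hash_path(root, sha1_of_file(candidate_file)).exists()
```

If the stored hash is computed over something other than the raw file bytes, sha1_of_file would need to change, but the lookup itself would stay a single existence test.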
format with high linear read rates and most attachments in some key/value storage system? The attachment I/O is done via a filesystem API, so this would be possible easily by just writing an FS API backend for a key-value database.
Cool
- When something gets corrupted... It's worth pondering how we can audit and find unreferenced "blobs" later? Dovecot logs an error when it finds something unexpected. But there's not a whole lot it can do then. And finding such broken attachments .. well, I guess this'll already do it:
I was actually pondering the next step, where there is some kind of single instance storage, say hardlinks, and you want to avoid a final dangling reference where there are no emails referencing that attachment? Such an issue depends on how the single instance storage is eventually implemented, but it seems like a common problem to several of the obvious ways to do it?
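For the hardlink case, one possible audit is to look for hashes/ entries whose link count has dropped to 1, meaning no other name points at the same data any more. This is only a sketch under the assumption that every referencing attachment is a hard link to the hashes/ file; find_unreferenced is a made-up name.

```python
import os


def find_unreferenced(root):
    """Yield files under any hashes/ directory whose link count is 1.

    ASSUMPTION: each referencing attachment is a hard link to the
    hashes/ file, so st_nlink == 1 means nothing references it.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) != "hashes":
            continue
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.stat(full).st_nlink == 1:
                yield full
```

This still costs one stat per stored hash, so it is an occasional audit rather than something to run on every delivery.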
A related (but probably fairly rare) question would be (efficiently)
finding all the messages which reference an attachment with a given
unique hash. I can contrive a few reasons to ask this question, but
perhaps someone else will tell me they are dying to know this stuff?
(Find all emails which reference the company 2010 accounts pdf... Find
all emails from employees with an attachment that matches our shadow
password file...). Seems like we can do this just fine, only it will
involve a lot of stats right now?
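As a sketch of the cheap half of that lookup: if the per-instance files really are named hash-guid and sit in the same two-level directory as hashes/ (my reading of the layout, not something I have confirmed), then listing one directory finds every instance of a given hash. Mapping each guid back to a message is the part this does not cover. instances_for_hash is a hypothetical helper.

```python
from pathlib import Path


def instances_for_hash(root, hex_hash):
    """Return the sorted names of instance files called <hash>-<guid>.

    ASSUMPTION: instance files live in <root>/<h[0:2]>/<h[2:4]>/ and
    are named <hash>-<guid>.
    """
    directory = Path(root) / hex_hash[0:2] / hex_hash[2:4]
    if not directory.is_dir():
        return []
    prefix = hex_hash + "-"
    return sorted(
        p.name
        for p in directory.iterdir()
        if p.is_file() and p.name.startswith(prefix)
    )
```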
links to the files. This requires being able to give users a coded link which can later be decoded to refer to a specific attachment. If this change offered us additional ways to find attachments by hash or whatever then it would be extremely useful.
I'm not sure if this change will help much. If the attachment changes (especially in size) there will be problems.
A unique hash would allow me to give out very simple URL links to customers, eg http://mysite/attachments/SHA_Hash
At present I have a fairly convoluted scheme which uses message ids and this has a whole bunch of issues...
Related to that, I've been thinking of a transparent caching Dovecot proxy.
I could be interested in sponsoring such work if that got it higher up the ToDo list. Please contact me with a proposal.
I'm not sure if with your current proposal I can do 100% of the above? For example it's not clear if 4) is still possible? Also without a "guaranteed" hash we can't use the hash as a lookup key in a key/value storage system (which implies another mapping of keys to keys is required).
Yeah, attachment-instance-key -> attachment-key -> attachment data lookup would be the only safe way to do this.
However, if the hash were a hash of the full message then we could completely avoid the double indirection? Within the limits of sensible probability, hashes can be considered unique and so there is "no" possibility of collision.
If you were to use highly unique hashes as your keys then you can dispense with certain levels of indirection here, and hashes would be "guaranteed" unique if the content is unique. This allows you to do a straight compare of all hashes anywhere: two identical hashes mean the same content.
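To illustrate the difference, here is a toy in-memory comparison using plain dicts, not any real key/value backend. ContentAddressedStore trusts the SHA-1 as the key; IndirectStore treats the hash as a hint only and keeps the instance-key -> attachment-key -> data mapping. Both class names and the weak_hash parameter are invented for the example.

```python
import hashlib


class ContentAddressedStore:
    """Toy store keyed directly by the SHA-1 of the content."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        key = hashlib.sha1(data).hexdigest()
        self._blobs.setdefault(key, data)  # same content, same key, stored once
        return key

    def get(self, key):
        return self._blobs[key]


class IndirectStore:
    """Toy store where equal hashes do not imply equal content."""

    def __init__(self):
        self._instance_to_attachment = {}
        self._attachments = {}
        self._by_hash = {}  # hash -> attachment keys sharing that hash
        self._next_id = 0

    def put(self, instance_key, data, weak_hash):
        # Reuse an existing attachment key only if the bytes really match.
        for akey in self._by_hash.get(weak_hash, []):
            if self._attachments[akey] == data:
                break
        else:
            akey = f"{weak_hash}-{self._next_id}"
            self._next_id += 1
            self._attachments[akey] = data
            self._by_hash.setdefault(weak_hash, []).append(akey)
        self._instance_to_attachment[instance_key] = akey
        return akey

    def get(self, instance_key):
        return self._attachments[self._instance_to_attachment[instance_key]]
```

The first store needs one lookup per read; the second needs two, plus a byte comparison on every write, which is the cost I was hoping a trustworthy hash would avoid.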
(I'm caveating all "unique" claims in case someone points out that it's theoretically possible to get collisions)
Well .. the way it works is that you have files:
hash-guid
hash2-guid2
hashes/hash
hashes/hash2
If two attachments have the same hash but different content, you'll end up with:
But if our hash were unique and based on the entire message then we should never get duplicated hash values? (obviously at the cost of more CPU and IO)
I sense I have misunderstood something, so please be gentle..?
Great feature anyway
Cheers
Ed W