On Tue, 2010-08-24 at 15:48 +0100, Ed W wrote:
Hi, thanks for responding
If you have a hash 351641b73feb7cf7e87e5a8c3ca9a37d7b21e525, you can see if it exists with:
ls /attachments/35/16/hashes/351641b73feb7cf7e87e5a8c3ca9a37d7b21e525
This would be great for a bunch of uses, if the hash were unique and determinable from just some other content. E.g. could I take a random file from my filesystem, compute a hash from it, and then check for the existence of /attachments/*/*/hashes/my_hash to determine whether there are any messages referencing that attachment?
Yeah, that would work. And the */* part can also be optimized by just using the first 4 chars of the hash.
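A rough sketch of that lookup, using the first 4 chars of the hash to build the path directly instead of globbing. It assumes the hash is a plain SHA-1 over exactly the bytes being hashed here, and that the layout is the one from the `ls` example above; the function names are made up for illustration and are not part of Dovecot.

```python
import hashlib
import os


def attachment_hash_path(root, hex_hash):
    # Layout as in the ls example: <root>/<chars 0-1>/<chars 2-3>/hashes/<hash>
    return os.path.join(root, hex_hash[0:2], hex_hash[2:4], "hashes", hex_hash)


def file_sha1(path, chunk_size=65536):
    # Assumption: the stored hash is SHA-1 over the same bytes we read here.
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_attachment_stored(root, candidate_file):
    # One existence check, no directory scan needed.
    return os.path.exists(attachment_hash_path(root, file_sha1(candidate_file)))
```

For example, `is_attachment_stored("/attachments", "accounts-2010.pdf")` would answer the question with a single stat call (the file name is only an example).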
- When something gets corrupted... It's worth pondering how we can audit and find unreferenced "blobs" later?
Dovecot logs an error when it finds something unexpected. But there's not a whole lot it can do then. And finding such broken attachments .. well, I guess this'll already do it:
I was actually pondering the next step where there is some kind of single instance storage, say hardlinks, and you want to avoid a final dangling reference where there are no emails referencing that attachment? Such an issue depends on certain implementations of the single instance storage which finally develops, but it seems like a common problem to several of the obvious ways to do it?
Current implementation checks how many hard links are left for the hash while deleting it. If it's deleting the last reference then the final hashes/hash file is also deleted. It's of course possible that if it crashes between these two deletes then there are some dangling hashes left. I was thinking about maybe creating a tool to find and delete those. It's easily done by just deleting everything from /attachments/*/*/hashes/ directories that have a link count of 1.
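Such a tool could look roughly like the sketch below. It only lists candidates and deletes nothing. The directory layout is the one described above; the function name is hypothetical, and the sketch ignores the question of whether a concurrently running save or delete could be caught mid-way.

```python
import glob
import os
import stat


def find_dangling_hashes(root):
    # Look only inside <root>/*/*/hashes/ directories.
    pattern = os.path.join(root, "*", "*", "hashes", "*")
    dangling = []
    for path in glob.glob(pattern):
        st = os.lstat(path)
        # A link count of 1 means nothing else hard-links to this hash file.
        if stat.S_ISREG(st.st_mode) and st.st_nlink == 1:
            dangling.append(path)
    return sorted(dangling)
```

The returned paths could then be reviewed and removed with `os.unlink()`.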
A bigger problem is if a user's dbox directory is deleted/corrupted so that unused files are left in /attachments/*/*/ directory. Finding and deleting those pretty much requires reading through all dbox files for all users..
A related (but probably fairly rare) question would be (efficiently) finding all the messages which reference an attachment with a given unique hash.
Yeah. Not something I was planning on supporting.
I can contrive a few reasons to ask this question, but perhaps someone else will tell me they are dying to know this stuff?
(Find all emails which reference the company 2010 accounts pdf... Find all emails from employees with an attachment that matches our shadow password file...). Seems like we can do this just fine, only it will involve a lot of stats right now?
The mail -> attachment references are stored only inside dbox files in the metadata fields.
links to the files. This requires being able to give users a coded link which can later be decoded to refer to a specific attachment. If this change offered us additional ways to find attachments by hash or whatever then it would be extremely useful.
I'm not sure if this change will help much. If the attachment changes (especially in size) there will be problems..
A unique hash would allow me to give out very simple URL links to customers, eg http://mysite/attachments/SHA_Hash
Ah. Yeah, that would work. You could just give the hash-guid reference so that the link will no longer work after the message gets deleted. Although the attachment hash-guid isn't available in any easy way, so you'd have to add some extra code.
Related to that, I've been thinking of a transparent caching Dovecot proxy.
I could be interested in sponsoring such work if it got it higher up the ToDo list?! Please contact me with a proposal?
The related parts are:
1. Write an IMAP client lib-storage backend (possibly supporting, but not requiring, some Dovecot-specific extensions). But this can be annoyingly slow, because the lib-storage API is synchronous. So it needs:
2. Change the lib-storage API to allow backends to be asynchronous. This is wanted also for other high-latency backends, like key-value databases. It could also improve everyone else's performance by adding async disk I/O support.
3. Add a caching proxy lib-storage backend that supports transparently caching messages and/or indexes for other lib-storage backends.
I'm interested in 2. anyway. Also 1 is probably a nice and easy way to test that it works, and 1 is also going to be nice for using dsync to migrate to Dovecot from other random IMAP servers :) Then 3 probably won't be all that difficult to implement. Anyway, I'm full time employed until the end of November, although I'm not exactly sure what I'll be working on soon..
I'm not sure if with your current proposal I can do 100% of the above? For example it's not clear if 4) is still possible? Also without a "guaranteed" hash we can't use the hash as a lookup key in a key/value storage system (which implies another mapping of keys to keys is required).
Yeah, attachment-instance-key -> attachment-key -> attachment data lookup would be the only safe way to do this.
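To make the double indirection concrete, here is a minimal in-memory sketch. The class and method names are invented for illustration, and both kinds of key are treated as opaque strings; a real key/value backend would look different.

```python
class TwoLevelStore:
    def __init__(self):
        self.instances = {}    # attachment-instance-key -> attachment-key
        self.attachments = {}  # attachment-key -> attachment data

    def put(self, attachment_key, data):
        self.attachments[attachment_key] = data

    def link(self, instance_key, attachment_key):
        # An instance may only point at data that is actually stored.
        if attachment_key not in self.attachments:
            raise KeyError(attachment_key)
        self.instances[instance_key] = attachment_key

    def get(self, instance_key):
        # Two lookups: instance -> attachment key -> data.
        return self.attachments[self.instances[instance_key]]

    def unlink(self, instance_key):
        attachment_key = self.instances.pop(instance_key)
        # Drop the data once no instance refers to it any more.
        if attachment_key not in self.instances.values():
            del self.attachments[attachment_key]
```

Because callers only ever hold instance keys, the attachment key never has to be a hash that is trusted to be unique.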
However, if the hash were a hash of the full message then we could completely avoid the double indirection? Within the limits of sensible probability, hashes can be considered unique and so there is "no" possibility of collision.
The hash is already a full hash of the message. I don't really like the idea of trusting that a hash is unique. Especially because this could be attacked. Someone could read another user's attachment if they only knew its hash and then were able to create another file with the same hash and send it to themselves in the same system. (Sure, this would require someone breaking SHA1. But the attachments with their SHA1 hashes could exist for many more years.)
I might make Dovecot trust the hash optionally anyway, but not unconditionally.
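As an illustration of "optionally, but not unconditionally", a sketch of the decision made when a new attachment's hash matches an already stored file. The function name and the `trust_hash` switch are hypothetical, not an existing Dovecot setting; the byte-for-byte comparison is one possible way to avoid relying on the hash alone.

```python
import filecmp
import os


def can_share_existing(existing_path, new_path, trust_hash=False):
    # existing_path: stored file whose hash matched; new_path: incoming attachment.
    if not os.path.exists(existing_path):
        return False
    if trust_hash:
        # Opt-in: accept the hash match without reading either file.
        return True
    # Default: compare full contents, so a hash collision cannot leak data.
    return filecmp.cmp(existing_path, new_path, shallow=False)
```

The default path costs an extra read of both files, which is the price of not trusting the hash.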