[Dovecot] Single instance storage - testing please
http://hg.dovecot.org/dovecot-2.0-sis contains the code for it. Otherwise it's the latest (as of writing this) dovecot-2.0 hg tree. Please test if you're interested in SIS. :)
Once there's at least some testing, I'll probably add this to v2.0.x since very little of this new code is used when SIS is disabled (which is the default of course).
SIS works pretty much as explained in http://dovecot.org/list/dovecot/2010-July/050832.html and http://dovecot.org/list/dovecot/2010-July/050992.html
Two things I'm not yet entirely sure about:
- What hash algorithm to use? Currently it's hardcoded to SHA1. Besides more CPU usage, the other potential problem with larger hashes is that they also generate larger filenames. The filenames are currently hex-encoded, but to save space they could be changed to some kind of modified base64 (regular base64 uses '/' characters, so it can't be used as-is). Example filename lengths (there's a breakdown of these numbers after this list):
        hex  modified-base64
SHA1     73               50
SHA256   97               66
SHA512  161              109
Yet another possibility would be to use SHA256/SHA512 and just truncate the hash to fewer bits.
- Should I add support for trusting hash uniqueness, to avoid the disk I/O generated by the byte-by-byte comparison? It could still first check that the file sizes match.
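For reference, the lengths in the table follow from the <hash>-<guid> filename layout visible in the doveadm sis find example further down; the byte counts here are my own back-of-the-envelope arithmetic, not something spelled out in the tree:

  SHA1 digest = 20 bytes -> 40 hex chars, 27 modified-base64 chars
  GUID suffix = 16 bytes -> 32 hex chars, 22 modified-base64 chars
  hex filename:             40 + 1 ('-') + 32 = 73
  modified-base64 filename: 27 + 1 ('-') + 22 = 50
  (SHA256: 64+1+32 = 97 hex, 43+1+22 = 66 b64; SHA512: 128+1+32 = 161, 86+1+22 = 109)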
Usage
You can enable SIS for sdbox and mdbox:
mail_attachment_dir = /var/attachments
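For example, a minimal sketch of the relevant dovecot.conf bits together with an mdbox location (the mail_location path is just an example, not from the original mail):

  mail_location       = mdbox:~/mdbox
  mail_attachment_dir = /var/attachments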
Just setting the above enables "instant SIS", where the byte-by-byte comparison is done immediately while saving mails. The alternative is to postpone the comparison by setting:
mail_attachment_fs = sis-queue /var/attachments/queue:posix
This does no deduplication by itself yet. To do that you'll need a nightly (or whatever) run, which calls:
doveadm sis deduplicate /var/attachments /var/attachments/queue
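For example, a crontab entry along these lines would do it (the 03:00 schedule and the doveadm path are assumptions, not from the original mail):

  # /etc/crontab: deduplicate queued attachments every night at 03:00
  0 3 * * *   root   /usr/bin/doveadm sis deduplicate /var/attachments /var/attachments/queue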
There's also a feature to easily find all attachments based on a hash. For example:
% sha1sum foo
351641b73feb7cf7e87e5a8c3ca9a37d7b21e525  foo
% doveadm sis find /var/attachments 351641b73feb7cf7e87e5a8c3ca9a37d7b21e525
/var/attachments/35/16/351641b73feb7cf7e87e5a8c3ca9a37d7b21e525-e13a841f28ba764c123b00008c4a11c1
/var/attachments/35/16/351641b73feb7cf7e87e5a8c3ca9a37d7b21e525-1d3b940628ba764c0b3b00008c4a11c1
If you want to save attachments to separate files without SIS (e.g. because you want to use your filesystem's deduplication), set:
mail_attachment_fs = posix
By default only attachments larger than 128 kB are written to attachment storage. You can change this with:
mail_attachment_min_size = 128k
It's also possible to create a plugin that adds further restrictions on when an attachment is saved separately. This might be useful to reduce disk seeks for attachments that are typically shown inline by clients/webmail. You can do this by overriding the mailbox.save_is_attachment() method.
If you want to distribute attachments to multiple filesystems, just create /var/attachments/[0-9a-f][0-9a-f] as symlinks pointing to whatever mount paths you want.
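For example, something like this sketch would spread the 256 first-level directories over two mounts (the /mnt/disk1 and /mnt/disk2 paths and the 50/50 split are only illustrative):

  # create /var/attachments/00 .. /var/attachments/ff as symlinks, half per mount
  for i in $(seq 0 255); do
      d=$(printf '%02x' "$i")
      case "$d" in
          [0-7]?) target=/mnt/disk1/attachments/$d ;;
          *)      target=/mnt/disk2/attachments/$d ;;
      esac
      mkdir -p "$target"
      ln -s "$target" "/var/attachments/$d"
  done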
On Thu, 2010-08-26 at 20:32 +0100, Timo Sirainen wrote:
> http://hg.dovecot.org/dovecot-2.0-sis contains the code for it. Otherwise it's the latest (as of writing this) dovecot-2.0 hg tree. Please test if you're interested in SIS. :)
One more point that I have to remember to mention once I write its wiki page: The attachment handling code is NFS safe, because it never modifies existing files. So there won't be problems with using director to distribute users to different servers while all of the servers access the common attachment storage.
Another thing I just remembered: The code currently uses 0600 / 0700 permissions for everything. I guess it should take the permissions from the /attachments directory and preserve them for all the subdirectories/files.
On 08/26/2010 04:41 PM, Mike Abbott wrote:
>> - What hash algorithm to use?
>> - Should I add support for trusting hash uniqueness
> Use two hash functions and concatenate the hashes. While both hash systems may eventually be hacked it is unlikely that hacking them will result in a targeted alias.
Just make it possible to change the hash in the future. Have a utility that updates all (or a subset) of them.
If e.g. SHA256 is truly broken in the future, the utility can run overnight while I fix the million other emergencies that are about to exist in the morning.
On 27.8.2010, at 1.52, Michael Orlitzky wrote:
> On 08/26/2010 04:41 PM, Mike Abbott wrote:
>>> - What hash algorithm to use?
>>> - Should I add support for trusting hash uniqueness
>> Use two hash functions and concatenate the hashes. While both hash systems may eventually be hacked it is unlikely that hacking them will result in a targeted alias.
> Just make it possible to change the hash in the future.
I'm thinking about mail_attachment_hash setting where you can configure it pretty much any way you want.
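For illustration, such a setting could end up looking something like this (the exact syntax is only a guess at this point, not something in the tree yet):

  mail_attachment_hash = %{sha256}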
> Have a utility that updates all (or a subset) of them.
That won't be necessary. Once the hash changes, the new files are created with new hash function and it doesn't matter if the old hash is broken because you can't generate new files with it anymore anyway.
On 08/26/2010 09:00 PM, Timo Sirainen wrote:
> On 27.8.2010, at 1.52, Michael Orlitzky wrote:
>> On 08/26/2010 04:41 PM, Mike Abbott wrote:
>>>> - What hash algorithm to use?
>>>> - Should I add support for trusting hash uniqueness
>>> Use two hash functions and concatenate the hashes. While both hash systems may eventually be hacked it is unlikely that hacking them will result in a targeted alias.
>> Just make it possible to change the hash in the future.
> I'm thinking about mail_attachment_hash setting where you can configure it pretty much any way you want.
>> Have a utility that updates all (or a subset) of them.
> That won't be necessary. Once the hash changes, the new files are created with new hash function and it doesn't matter if the old hash is broken because you can't generate new files with it anymore anyway.
Won't files hashed with the old function begin to dupe though?
On 27.8.2010, at 2.24, Michael Orlitzky wrote:
>>> Have a utility that updates all (or a subset) of them.
>> That won't be necessary. Once the hash changes, the new files are created with new hash function and it doesn't matter if the old hash is broken because you can't generate new files with it anymore anyway.
> Won't files hashed with the old function begin to dupe though?
You mean new hash would become a duplicate of the old? Well ..
- It's highly unlikely to happen, especially because with the new hash function there again shouldn't be a way to create any specific hash.
- As long as byte-by-byte comparison is always done, collisions don't matter much anyway (if you can reliably reproduce them, that could lead to some kind of DoS by filling the filesystem, but again once hash function is changed this couldn't be done anymore).
- The filename can be made different, making the collision impossible. Either because of different hash length or by manually adding some specific character there.
On 08/26/2010 09:38 PM, Timo Sirainen wrote:
> On 27.8.2010, at 2.24, Michael Orlitzky wrote:
>>>> Have a utility that updates all (or a subset) of them.
>>> That won't be necessary. Once the hash changes, the new files are created with new hash function and it doesn't matter if the old hash is broken because you can't generate new files with it anymore anyway.
>> Won't files hashed with the old function begin to dupe though?
> You mean new hash would become a duplicate of the old? Well ..
> - It's highly unlikely to happen, especially because with the new hash function there again shouldn't be a way to create any specific hash.
Oh, no, that's not what I meant.
I mean, my friend sends me a video of two cats cuddling, and it gets MD5 hashed and stored somewhere (it's the first instance of that file in my SIS). Tomorrow, I read the newspaper from 2005 explaining how MD5 is broken, and decide to switch my hash function to MD4 for safety reasons. A week later, another friend sends me the same video (it's REALLY cute). Doesn't the video get stored again?
On 27.8.2010, at 2.52, Michael Orlitzky wrote:
>>> Won't files hashed with the old function begin to dupe though?
>> You mean new hash would become a duplicate of the old? Well ..
>> - It's highly unlikely to happen, especially because with the new hash function there again shouldn't be a way to create any specific hash.
> Oh, no, that's not what I meant.
> I mean, my friend sends me a video of two cats cuddling, and it gets MD5 hashed and stored somewhere (it's the first instance of that file in my SIS). Tomorrow, I read the newspaper from 2005 explaining how MD5 is broken, and decide to switch my hash function to MD4 for safety reasons. A week later, another friend sends me the same video (it's REALLY cute). Doesn't the video get stored again?
Oh. Yeah, it gets duplicated once. I don't think that's a big deal.
participants (3):
- Michael Orlitzky
- Mike Abbott
- Timo Sirainen