http://hg.dovecot.org/dovecot-2.0-sis contains the code for single instance storage (SIS) of attachments. Otherwise it's the latest (as of writing this) dovecot-2.0 hg tree. Please test if you're interested in SIS. :)
Once there's at least some testing, I'll probably add this to v2.0.x since very little of this new code is used when SIS is disabled (which is the default of course).
SIS works pretty much like explained in http://dovecot.org/list/dovecot/2010-July/050832.html and http://dovecot.org/list/dovecot/2010-July/050992.html
Two things I'm not yet entirely sure about:
- What hash algorithm to use? Currently it's hardcoded to SHA1. Besides higher CPU usage, the other potential problem with larger hashes is that they also produce longer filenames. The filenames are currently hex-encoded, but to save space they could be changed to some kind of modified base64 (regular base64 uses '/' chars, so it can't be used as-is). Example filename lengths:

            hex   modified-base64
  SHA1       73    50
  SHA256     97    66
  SHA512    161   109

  Yet another possibility would be to use SHA256/SHA512 and just truncate the hash to fewer bits.
- Should I add support for trusting hash uniqueness, to avoid the disk I/O generated by the byte-by-byte comparison? It could still first check that the file sizes match.
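The filename lengths in the table above can be reproduced with a little arithmetic. This is only a sketch: it assumes that a filename is the encoded hash, a '-' and a 128-bit GUID (which is what the 32 hex chars after the '-' in the doveadm sis find output further down look like), and that the GUID would be re-encoded the same way as the hash, without base64 padding. The helper names are made up for illustration.

```python
GUID_BYTES = 16  # assumption: the part after '-' is a 128-bit GUID


def hex_len(nbytes):
    # Hex encoding uses two characters per byte.
    return 2 * nbytes


def base64_len(nbytes):
    # Unpadded base64 uses 4 characters per 3 bytes, rounded up.
    return (4 * nbytes + 2) // 3


def filename_len(hash_bytes, encoded_len):
    # <encoded hash> + '-' + <encoded GUID>
    return encoded_len(hash_bytes) + 1 + encoded_len(GUID_BYTES)


DIGEST_BYTES = {"SHA1": 20, "SHA256": 32, "SHA512": 64}

for name, nbytes in DIGEST_BYTES.items():
    print(name, filename_len(nbytes, hex_len), filename_len(nbytes, base64_len))
```

Under the same assumptions, a SHA256 truncated to 160 bits would give the same filename lengths as SHA1.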
Usage
You can enable SIS for sdbox and mdbox:
mail_attachment_dir = /var/attachments
Just setting the above enables "instant SIS", where the byte-by-byte comparison is done immediately while saving mails. The alternative is to defer the comparison by setting:
mail_attachment_fs = sis-queue /var/attachments/queue:posix
This doesn't do any deduplication by itself. For that you'll need a nightly (or whatever) run that calls:
doveadm sis deduplicate /var/attachments /var/attachments/queue
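To make the cost of the comparison step concrete, here is a rough picture of a size check followed by a byte-by-byte comparison. This is not Dovecot's code, just an illustration of the idea. The function name and chunk size are made up, and it assumes both paths are regular files.

```python
import os


def same_content(path_a, path_b, chunk_size=64 * 1024):
    """Return True if the two regular files have identical contents."""
    # Cheap check first: files of different sizes can't be identical.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    # Sizes match, so read both files in chunks and compare them.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False
            if not a:  # both files reached EOF at the same time
                return True
```

The size check costs only a stat() per file, while the loop has to read both files completely whenever they really are duplicates. That read is the disk I/O that trusting the hash would avoid.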
There's also a feature to easily find all attachments based on a hash. For example:
% sha1sum foo
351641b73feb7cf7e87e5a8c3ca9a37d7b21e525  foo
% doveadm sis find /var/attachments 351641b73feb7cf7e87e5a8c3ca9a37d7b21e525
/var/attachments/35/16/351641b73feb7cf7e87e5a8c3ca9a37d7b21e525-e13a841f28ba764c123b00008c4a11c1
/var/attachments/35/16/351641b73feb7cf7e87e5a8c3ca9a37d7b21e525-1d3b940628ba764c0b3b00008c4a11c1
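Judging only from the output above, the files end up under two directory levels taken from the first four hex chars of the hash, and are named after the hash plus a '-' and a unique suffix. Assuming that layout holds, the same lookup can be sketched outside doveadm. The function names here are made up and this is not how doveadm itself is implemented.

```python
import glob
import os


def candidate_pattern(root, hash_hex):
    # Layout inferred from the example: <root>/<hh>/<hh>/<hash>-<suffix>
    return os.path.join(root, hash_hex[0:2], hash_hex[2:4], hash_hex + "-*")


def find_by_hash(root, hash_hex):
    # Return every stored file whose name starts with the given hash.
    return sorted(glob.glob(candidate_pattern(root, hash_hex)))
```

For example, find_by_hash("/var/attachments", "351641b73feb7cf7e87e5a8c3ca9a37d7b21e525") would glob for /var/attachments/35/16/351641b73feb7cf7e87e5a8c3ca9a37d7b21e525-*.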
If you want to save attachments to separate files without SIS (e.g. you want to use your filesystem's deduplication), set:
mail_attachment_fs = posix
By default only attachments larger than 128 kB are written to attachment storage. You can change this with:
mail_attachment_min_size = 128k
It's also possible to create a plugin that adds further restrictions on when an attachment is saved separately. This might be useful to reduce disk seeks for attachments that are typically shown inline by clients/webmail. You can do this by overriding the mailbox.save_is_attachment() method.
If you want to distribute attachments to multiple filesystems, just create /var/attachments/[0-9a-f][0-9a-f] as symlinks pointing to whatever mount paths you want.
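Creating 256 symlinks by hand gets tedious, so here's a small sketch that spreads them round-robin over a list of mount paths. The function name and the mount paths are placeholders. Pointing each symlink at a per-prefix subdirectory of the mount is my own choice to keep the prefixes apart, not something SIS requires. It's meant for a fresh attachment directory: anything that already exists under a prefix name is left alone.

```python
import os
from itertools import cycle

HEX = "0123456789abcdef"


def link_prefix_dirs(attachment_dir, mounts):
    """Create attachment_dir/[0-9a-f][0-9a-f] symlinks spread over mounts."""
    os.makedirs(attachment_dir, exist_ok=True)
    targets = cycle(mounts)  # round-robin over the given mount paths
    for hi in HEX:
        for lo in HEX:
            prefix = hi + lo
            # Each prefix gets its own directory on the chosen mount.
            target = os.path.join(next(targets), prefix)
            os.makedirs(target, exist_ok=True)
            link = os.path.join(attachment_dir, prefix)
            # Don't touch anything that is already there.
            if not os.path.lexists(link):
                os.symlink(target, link)
```

Usage would be something like link_prefix_dirs("/var/attachments", ["/mnt/disk1", "/mnt/disk2"]), where the two mount paths are only examples.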