[Dovecot] Delayed deduplication daemon design [was: Re: (Single instance) attachment storage]

Timo Sirainen tss at iki.fi
Thu Jul 22 22:23:59 EEST 2010


On Wed, 2010-07-21 at 21:19 +0100, Timo Sirainen wrote:

>  - delayed deduplication daemon?

Design:

Create a new "sis-delayed" backend for attachment fs. When creating new
attachment files, it creates "hash-guid" files just like the regular
posix backend, but it also creates zero-byte files with the same
filenames under a different directory. E.g. creating a new attachment
attachments/a8/f9/a8f91247218942-12247198278412 also creates a
zero-byte attachments/delayed/a8f91247218942-12247198278412 file.
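A rough sketch of the creation step, in Python only for brevity (the real
backend would be C inside Dovecot's fs API). The function name
create_attachment() and the rule that the first two character pairs of the
filename pick the subdirectories are assumptions taken from the example
path above, not an existing API:

```python
import os


def create_attachment(root, name, data):
    """Write <root>/<aa>/<bb>/<name> plus a zero-byte <root>/delayed/<name>.

    `root` stands for the attachments/ directory. The two-level
    subdirectory rule is inferred from the example path and is an
    assumption.
    """
    target_dir = os.path.join(root, name[0:2], name[2:4])
    delayed_dir = os.path.join(root, "delayed")
    os.makedirs(target_dir, exist_ok=True)
    os.makedirs(delayed_dir, exist_ok=True)

    # The attachment itself, written like the regular posix backend would.
    path = os.path.join(target_dir, name)
    with open(path, "wb") as f:
        f.write(data)

    # Zero-byte marker telling the nightly run that this file still
    # needs to be looked at.
    marker = os.path.join(delayed_dir, name)
    with open(marker, "wb"):
        pass
    return path, marker
```

The marker carries no data; its filename alone is enough to find both the
attachment and its hash.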

The deletion step works the same way as with fs-sis, so it deletes files
from the hashes/ directory automatically.

Everything else is a wrapper to the super fs backend, just like with
fs-sis.

A nightly run readdir()s through the delayed/ directory and processes
each file it finds, deleting the file after processing it.

 a) If the file is already gone from the attachment fs or its link count
is larger than 1, skip it.
 b) If the file's hash doesn't exist in the hashes/ directory, link() the
file to hashes/hash. Handle EMLINK the same as d).
 c) If a byte-by-byte comparison finds that the file is the same as the
one in the hashes/ directory, replace the file with the existing one:
link() the hashes/ file to a temp file, make sure that the temp file's
inode is still the same as the file we used for comparing, and then
rename() it over the attachment file.
 d) Otherwise (hash collision/too many links), replace the hashes/ file
with the new file: link() the attachment to hashes/temp and rename() it
to hashes/hash.
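
Steps a) to d) could look roughly like this for a single file. This is a
Python sketch under assumptions, not Dovecot code: dedup_one(), the
".tmp" temp names and the returned status strings are made up for
illustration, and it doesn't handle every race (e.g. the hashes/ file
disappearing between the failed link() and the comparison). The caller
would delete the delayed/ marker afterwards.

```python
import errno
import os


def _same_content(f1, f2, bufsize=65536):
    """Byte-by-byte comparison of two open binary files."""
    while True:
        a = f1.read(bufsize)
        b = f2.read(bufsize)
        if a != b:
            return False
        if not a:
            return True


def _replace_hash(att_path, hash_path):
    # d) link() the attachment to a temp name in hashes/, rename() it
    # over hashes/hash.
    tmp = hash_path + ".tmp"  # hypothetical temp name
    os.link(att_path, tmp)
    os.rename(tmp, hash_path)
    return "replaced-hash"


def dedup_one(att_path, hash_path):
    # a) Skip if the file is gone or already has more than one link.
    try:
        if os.stat(att_path).st_nlink > 1:
            return "skipped"
    except FileNotFoundError:
        return "skipped"

    # b) No hashes/ entry yet: link the attachment there.
    try:
        os.link(att_path, hash_path)
        return "linked"
    except FileExistsError:
        pass
    except OSError as e:
        if e.errno != errno.EMLINK:
            raise
        return _replace_hash(att_path, hash_path)

    # c) Compare, remembering which inode we actually compared against.
    with open(hash_path, "rb") as hf, open(att_path, "rb") as af:
        compared = os.fstat(hf.fileno())
        same = _same_content(hf, af)
    if not same:
        return _replace_hash(att_path, hash_path)  # d) hash collision

    tmp = att_path + ".tmp"  # hypothetical temp name
    try:
        os.link(hash_path, tmp)
    except OSError as e:
        if e.errno != errno.EMLINK:
            raise
        return _replace_hash(att_path, hash_path)  # d) too many links

    # hashes/hash may have been replaced since the comparison.
    st = os.stat(tmp)
    if (st.st_dev, st.st_ino) != (compared.st_dev, compared.st_ino):
        os.unlink(tmp)
        return "retry"
    os.rename(tmp, att_path)
    return "deduplicated"
```

The inode check is what keeps step c) safe: the attachment is only
replaced by a file whose contents were actually compared.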

Simple, quite efficient, NFS safe. :) Some other thoughts:

It would be possible to have each server run this at the same time
without doing duplicate work: before processing a file, rename() it
under the delayed/hostnames/<hostname>/ directory and delete it from
there afterwards. At startup, also go through any files left in that
directory in case the previous run crashed. Perhaps all hostname
directories should also be stat()ed once in a while, and if their
mtimes are too old, some other server could look inside and process
any files it finds. The <hostname>/ directories could always be
rmdir()ed when the last file in them is gone.
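
The claiming could be sketched like this (Python again, only as an
illustration). claim() and release() are hypothetical names, and the
sketch leaves out the crash recovery and the stale-mtime check described
above:

```python
import os
import socket


def claim(delayed_dir, name, hostname=None):
    """Try to claim delayed/<name> by renaming it under
    delayed/hostnames/<hostname>/. Returns the new path, or None if
    the marker was no longer there (e.g. another server took it)."""
    hostname = hostname or socket.gethostname()
    host_dir = os.path.join(delayed_dir, "hostnames", hostname)
    os.makedirs(host_dir, exist_ok=True)
    claimed = os.path.join(host_dir, name)
    try:
        # Only one rename() of the same source can succeed.
        os.rename(os.path.join(delayed_dir, name), claimed)
    except FileNotFoundError:
        return None
    return claimed


def release(claimed):
    """Delete a processed marker and rmdir the host directory if that
    was the last file in it."""
    os.unlink(claimed)
    try:
        os.rmdir(os.path.dirname(claimed))
    except OSError:
        pass  # directory not empty, or already removed
```

A server that gets None back simply moves on to the next file it sees in
delayed/.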

So what binary/process should be doing this? A "doveadm sis" command
maybe? But would there be any other useful commands besides this
deduplication? Maybe:

 - doveadm sis deduplicate
 - doveadm sis cleanup (for going through and deleting any files from
hashes/ directories that have link count=1)
 - anything else?..
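
The cleanup pass could be as simple as the following Python sketch.
cleanup_hashes() is a made-up name and it only looks at one hashes/
directory; there is also a window between the lstat() and the unlink()
where another process could link() to the file, which this doesn't
address:

```python
import os


def cleanup_hashes(hashes_dir):
    """Delete files in hashes_dir whose link count is 1, i.e. that no
    attachment refers to anymore. Returns the removed names."""
    removed = []
    for entry in os.scandir(hashes_dir):
        if not entry.is_file(follow_symlinks=False):
            continue
        # Fresh lstat() so the link count is current, not cached.
        if os.lstat(entry.path).st_nlink == 1:
            os.unlink(entry.path)
            removed.append(entry.name)
    return removed
```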