On Wed, 2010-07-21 at 21:19 +0100, Timo Sirainen wrote:
- delayed deduplication daemon?
Design:
Create a new "sis-delayed" backend for the attachment fs. When creating a new attachment file, it creates a "hash-guid" file just like the regular posix backend, but it also creates a zero-byte file with the same filename under a different directory. E.g. creating a new attachment attachments/a8/f9/a8f91247218942-12247198278412 also creates a zero-byte attachments/delayed/a8f91247218942-12247198278412 file.
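A rough sketch of the creation step, in Python only for illustration (the real backend would be C inside the fs API). The function name, the `root` parameter and the way the two-level fan-out is derived from the first four hash characters are assumptions read off the example path above, not existing code:

```python
import os

def create_delayed_attachment(root, hash_hex, guid, data):
    """Write an attachment plus its zero-byte marker under delayed/.

    Hypothetical helper: 'root' stands for the attachments/ directory.
    """
    name = f"{hash_hex}-{guid}"
    # Assumed layout: attachments/a8/f9/<hash>-<guid>, as in the example path.
    attach_dir = os.path.join(root, hash_hex[0:2], hash_hex[2:4])
    delayed_dir = os.path.join(root, "delayed")
    os.makedirs(attach_dir, exist_ok=True)
    os.makedirs(delayed_dir, exist_ok=True)

    attach_path = os.path.join(attach_dir, name)
    with open(attach_path, "xb") as f:   # "x": fail if the file already exists
        f.write(data)

    # The marker carries no data; only its name matters to the nightly run.
    marker_path = os.path.join(delayed_dir, name)
    with open(marker_path, "xb"):
        pass
    return attach_path, marker_path
```

Writing the attachment before the marker means the nightly run should never find a marker for a file that was not written yet, although a crash in between would leave an attachment that never gets deduplicated.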
The deletion step works the same way as with fs-sis, so it deletes files from the hashes/ directory automatically.
Everything else is a wrapper to the super fs backend, just like with fs-sis.
A nightly run readdir()s through the delayed/ directory and processes each file it finds, deleting the file after processing it.
a) If the file is already gone from the attachment fs or its link count is larger than 1, it's skipped.
b) If file's hash doesn't exist in hashes/ directory, link() the file to hashes/hash. Handle EMLINK the same as d)
c) If byte-by-byte comparison finds that the file is the same as in hashes/ directory, replace file with the existing one: link() it to a temp file, make sure that the temp file's inode is still the same as the file we used for comparing, and then rename() it over the attachment file.
d) Otherwise (hash collision/too many links), replace the hashes/ file with the new file: link() attachment to hashes/temp and rename() it to hashes/hash.
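Steps a) to d) could look roughly like this. Again Python just to show the order of syscalls; `dedup_one`, the return strings, the temp file suffix and the "retry" outcome (when the inode check fails) are all my own placeholders, and races other than the ones named above are not handled:

```python
import errno
import os

def _same_content(f1, f2, chunk_size=65536):
    """Byte-by-byte comparison of two open binary files."""
    while True:
        b1, b2 = f1.read(chunk_size), f2.read(chunk_size)
        if b1 != b2:
            return False
        if not b1:
            return True

def dedup_one(attach_path, hashes_dir, hash_hex):
    """Process one attachment named by a delayed/ marker (hypothetical helper)."""
    # a) already gone, or already linked somewhere: nothing to do
    try:
        if os.stat(attach_path).st_nlink > 1:
            return "skipped"
    except FileNotFoundError:
        return "skipped"

    hash_path = os.path.join(hashes_dir, hash_hex)
    suffix = f".tmp.{os.getpid()}"          # assumed temp naming

    # b) first file with this hash: it becomes hashes/<hash>
    try:
        os.link(attach_path, hash_path)
        return "linked"
    except FileExistsError:
        try_existing = True
    except OSError as e:
        if e.errno != errno.EMLINK:
            raise
        try_existing = False                # EMLINK: handled the same as d)

    # c) same content: replace the attachment with a link to hashes/<hash>
    if try_existing:
        with open(hash_path, "rb") as hf, open(attach_path, "rb") as af:
            compared = os.fstat(hf.fileno())
            identical = _same_content(hf, af)
        if identical:
            tmp = attach_path + suffix
            try:
                os.link(hash_path, tmp)
            except OSError as e:
                if e.errno != errno.EMLINK:
                    raise                   # EMLINK falls through to d)
            else:
                st = os.stat(tmp)
                if (st.st_dev, st.st_ino) == (compared.st_dev, compared.st_ino):
                    os.rename(tmp, attach_path)
                    return "deduplicated"
                os.unlink(tmp)              # hashes/ file changed under us
                return "retry"              # assumption: leave it for a later run

    # d) hash collision / too many links: the new file replaces hashes/<hash>
    tmp = hash_path + suffix
    os.link(attach_path, tmp)
    os.rename(tmp, hash_path)
    return "replaced"
```

The caller would delete the delayed/ marker after any result except "retry".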
Simple, quite efficient, NFS safe. :)

Some other thoughts:
It would be possible to have each server run this at the same time without doing duplicate work: before processing a file, rename() it under the delayed/hostnames/<hostname>/ directory and later delete it from there. Also, at startup go through any such files in the directory in case the previous run crashed. Perhaps all hostname directories should also be stat()ed once in a while, and if their mtimes are too old, someone else could look inside for any files and process them. The hostname/ directories could always be rmdir()ed when the last file in them is gone.
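The claiming part might be sketched like this. rename() is atomic, so only one server can win a given marker. The function names are made up, and the crash recovery and mtime checks mentioned above are left out:

```python
import os
import socket

def claim_marker(delayed_dir, name, hostname=None):
    """Try to claim delayed/<name> for this host. Returns the claimed path,
    or None if the marker is no longer there (hypothetical helper)."""
    hostname = hostname or socket.gethostname()
    host_dir = os.path.join(delayed_dir, "hostnames", hostname)
    os.makedirs(host_dir, exist_ok=True)
    claimed = os.path.join(host_dir, name)
    try:
        os.rename(os.path.join(delayed_dir, name), claimed)
    except FileNotFoundError:
        return None          # most likely another server claimed it first
    return claimed

def release_marker(claimed):
    """Delete a processed marker and rmdir the host directory if it is empty."""
    os.unlink(claimed)
    try:
        os.rmdir(os.path.dirname(claimed))
    except OSError:
        pass                 # still has files, or was already removed
```

A server would call `claim_marker()` for each name it gets from readdir(), process the attachment only if the result is not None, and then call `release_marker()`.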
So what binary/process should be doing this? A "doveadm sis" command maybe? But would there be any other useful commands besides this deduplication? Maybe:
- doveadm sis deduplicate
- doveadm sis cleanup (for going through and deleting any files from hashes/ directories that have link count=1)
- anything else?..
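For the cleanup idea, the core would be something like the following sketch (POSIX only; `cleanup_hashes` is a made-up name, not an existing doveadm function). Note that there is a window between the lstat() and the unlink() where another process could link() to the file, which this does not try to close:

```python
import os
import stat

def cleanup_hashes(hashes_dir):
    """Delete files in hashes_dir that no attachment links to (link count 1).
    Returns the names that were removed."""
    removed = []
    with os.scandir(hashes_dir) as entries:
        for entry in entries:
            st = os.lstat(entry.path)
            # Only regular files whose sole link is the hashes/ entry itself.
            if stat.S_ISREG(st.st_mode) and st.st_nlink == 1:
                os.unlink(entry.path)
                removed.append(entry.name)
    return removed
```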