Sven Hartge <sven@svenhartge.de> wrote:
Interesting datapoint: NetApp Deduplication recovered only about 1% of storage space with mdbox-based mail storage, while on a maildir-based mail storage, the rate was about 15%. (This was tested with a copy of real user data, so it is accurate for my workload.)
Just a guess, but I expect the difference is because NetApp de-dupes by checksumming blocks and marks whole blocks as duplicates if they have the same checksum.
The message body has the same block offset in maildir (i.e. the start of a message is at byte 0), whereas mdbox might place the message body at any offset within a block, so you might have 512 different block configurations for the same message.
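To make the guess above concrete, here is a toy sketch in Python. It is not how NetApp actually works internally: the 4096-byte block size, the SHA-256 checksum, the helper names block_hashes and dedup_ratio, and the random "message body" are all my assumptions for illustration. It only shows that fixed-block checksumming finds duplicates when identical data starts at the same offset, and finds none when the same data is shifted by a few bytes.

```python
import hashlib
import os

BLOCK = 4096  # assumed block size, for illustration only


def block_hashes(data, block=BLOCK):
    """Return the checksum of every fixed-size block in data."""
    return [hashlib.sha256(data[i:i + block]).digest()
            for i in range(0, len(data), block)]


def dedup_ratio(files, block=BLOCK):
    """Fraction of all blocks that duplicate some other block."""
    hashes = [h for f in files for h in block_hashes(f, block)]
    return 1 - len(set(hashes)) / len(hashes)


# One 64 KiB message body delivered to 8 mailboxes (random stand-in data).
body = os.urandom(16 * BLOCK)

# maildir-like: one file per message, the body always starts at byte 0.
aligned = [body for _ in range(8)]

# mdbox-like: the same body follows a different amount of earlier data
# in each container file, so it starts at a different offset in a block.
shifted = [os.urandom(100 + 37 * i) + body for i in range(8)]

print("aligned:", dedup_ratio(aligned))
print("shifted:", dedup_ratio(shifted))
```

In the aligned case every copy produces the same 16 blocks, so 7 out of 8 copies are duplicates. In the shifted case each copy straddles the block boundaries differently, so no two block checksums match even though the payload is byte-for-byte identical.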
I don't know whether message alignment would be a worthwhile optimization for mdbox.
Joseph Tam <jtam.home@gmail.com>