On Fri, Feb 5, 2010 at 4:36 PM, Timo Sirainen <tss@iki.fi> wrote:
I was wondering if I should add compression support to mdbox one mail at a time or one file (~2MB) at a time. The tradeoffs are:
- one mail at a time allows quickly seeking to the wanted mail inside the file, but it can't compress mails as well
- one file at a time compresses better, but seeking is slow because it can only be done by uncompressing all the data until the wanted offset is reached
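The two options above can be sketched roughly as follows. This is only an illustration using Python's zlib on synthetic mails; `make_mails` and `read_at` are hypothetical helper names, and nothing here reflects how mdbox actually stores or compresses data.

```python
import zlib

def make_mails(n=200):
    """Build n small synthetic mails (made-up data, for illustration only)."""
    mails = []
    for i in range(n):
        headers = (
            f"From: user{i % 7}@example.com\r\n"
            "To: list@example.com\r\n"
            f"Subject: message {i}\r\n"
            "MIME-Version: 1.0\r\n"
            "Content-Type: text/plain; charset=utf-8\r\n\r\n"
        )
        body = f"Body of message number {i}.\r\n" * 5
        mails.append((headers + body).encode())
    return mails

mails = make_mails()

# Option 1: one mail at a time. Any mail can be decompressed on its own.
per_mail = [zlib.compress(m) for m in mails]
per_mail_total = sum(len(c) for c in per_mail)

# Option 2: one file at a time. All mails share a single compressed stream.
whole_file = zlib.compress(b"".join(mails))

def read_at(blob, offset, length):
    """Read `length` bytes at uncompressed `offset` from a single stream.

    Everything before `offset` has to be decompressed first.
    """
    d = zlib.decompressobj()
    out = bytearray()
    data = blob
    target = offset + length
    while len(out) < target and data:
        out += d.decompress(data, target - len(out))
        data = d.unconsumed_tail
    return bytes(out[offset:target])

print("per-mail total:", per_mail_total, "bytes")
print("whole-file    :", len(whole_file), "bytes")
```

With per-mail compression, mail 42 is just `zlib.decompress(per_mail[42])`. With the single stream, `read_at` has to inflate every byte before the wanted offset, which is the seek cost described above.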
I did a quick test for this with 27 MB of my old INBOX mails:
(note the -b option, so it doesn't count wasted fs space)
mdbox/storage% du -sb .
15120350 .
Maildir/cur% du -sb .
16517320 .
% echo 1-15120350/16517320|bc -l
.08457606924125705623
So, compressed mdboxes take 8.5% less space. This was with regular gzip compression with default level. With bzip2 -9 compression the difference was 10%.
Any thoughts on whether 8-10% is a significant enough improvement to justify worse seeking performance? Or perhaps I should just implement both ways.. :)
Isn't the real difference even smaller?
15120350/28311552 = .534
16517320/28311552 = .583
So that's just under 5% of the original uncompressed size.
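For reference, both ways of stating the difference can be reproduced from the byte counts in the thread. The only assumption in this sketch is that "27 MB" means 27 * 1024 * 1024 bytes, which matches the 28311552 denominator used above.

```python
# Byte counts taken from the du output quoted in the thread.
uncompressed = 27 * 1024 * 1024   # assumed meaning of "27 MB" (= 28311552)
mdbox = 15120350                  # whole-file compressed mdbox storage
maildir = 16517320                # per-mail compressed Maildir/cur

# Saving relative to the per-mail compressed size (the 8.5% figure).
relative_saving = 1 - mdbox / maildir

# Each result as a fraction of the uncompressed data.
mdbox_ratio = mdbox / uncompressed
maildir_ratio = maildir / uncompressed

# Difference in percentage points of the uncompressed size (the "under 5%" figure).
point_diff = maildir_ratio - mdbox_ratio

print(f"relative saving : {relative_saving:.4f}")
print(f"mdbox ratio     : {mdbox_ratio:.3f}")
print(f"maildir ratio   : {maildir_ratio:.3f}")
print(f"point difference: {point_diff:.4f}")
```

The two figures answer different questions: 8.5% is how much smaller the mdbox result is than the Maildir result, while just under 5% is how much more of the original data is saved.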
Either way, I'd say go with compressing each mail individually for quick seeking.
Also, if you were compressing the whole file of mails as a single stream, wouldn't you have to recompress and rewrite the whole file for each new mail delivered?
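On the append question, one thing worth noting is that the gzip format itself allows several compressed members to be concatenated, and decompressors treat the result as one stream. The sketch below only demonstrates that property with Python's gzip module; whether mdbox could or would append this way is an assumption, not something stated in this thread.

```python
import gzip

first = b"From: a@example.com\r\n\r\nfirst mail\r\n"
second = b"From: b@example.com\r\n\r\nsecond mail\r\n"

# Existing compressed data for the file.
stored = gzip.compress(first)

# Deliver a new mail by appending a separate gzip member;
# the bytes already written are left untouched.
stored += gzip.compress(second)

# The concatenation decompresses as a single stream.
restored = gzip.decompress(stored)
print(restored == first + second)
```

The catch is that each appended member is compressed independently, so it gains nothing from the data before it, which gives up much of the advantage of compressing the file as a whole.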
Matt