Hi.
I recently mentioned in several posts that I tended to use mbox rather than maildir, because you don't lose as much space (maildir always allocates full blocks per file and thus per mail).
I ran some tests on my archive, which consists of some 3.4 million mails totalling 42GB. Most of these mails are probably normal-sized, but there are also some with bigger attachments.
For those who are interested here are the results:
I used a 53687091200 B image file (via loop device) and tested ext4 only. btrfs is IMHO not yet ready, I have often had issues with XFS (corruption), reiser4 is more or less dead, and reiser3 is said to have issues (see e.g. its Wikipedia article), even though its mode for small files would fit nicely.
As you will see, the number of mails increased a bit because I tested over several days, but the increase is very small, so it shouldn't change the numbers much.
- Original mbox archives (right now in Evolution)
    mbox exact space: 38122676224 (does not include meta-data)
    mbox guess space: 44625670144 (includes Evolution meta-data, which is several GB)
    mbox num mails:   3412999 (occurrences of From_ lines)
In the following:
- image file is the size of the image file; 1B-blocks, Used_begin, Used_end, Available_begin and Available_end are taken from the output of df -B 1
- mdir exact used space is the sum of du -B 1 over each regular file (i.e. each mdir file)
- mdir guess used space is du -B 1 on the root dir of the filesystem
- mdir num mails is find . -type f | wc -l on the root dir of the filesystem
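For anyone who wants to repeat the measurement without du and find, here is a rough Python sketch of the same idea. It is not what I ran; the function name maildir_usage is made up for illustration, and it assumes a Unix system where st_blocks is counted in 512-byte units.

```python
import os
import stat


def maildir_usage(root):
    """Walk root and return (num_files, apparent_bytes, allocated_bytes).

    apparent_bytes is the sum of the file lengths; allocated_bytes is the
    space the filesystem actually reserved (st_blocks is in 512-byte units
    on Linux), which is roughly what du -B 1 reports per file.
    """
    count = apparent = allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # count regular files only, like find . -type f
            count += 1
            apparent += st.st_size
            allocated += st.st_blocks * 512
    return count, apparent, allocated


if __name__ == "__main__":
    import sys

    n, app, alloc = maildir_usage(sys.argv[1] if len(sys.argv) > 1 else ".")
    print(f"mails: {n}  apparent: {app} B  allocated: {alloc} B")
```

The difference between allocated and apparent bytes is the per-file block slack; directory blocks and other filesystem meta-data are not included.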
- EXT4 with 4096 blocks:
    image file:            53687091200
    1B-blocks:             52844687360
    Used_begin:            188555264
    Used_end:              45198778368
    Available_begin:       49971777536
    Available_end:         2444972032
    mdir exact used space: 44810866688
    mdir guess used space: 45010243584
    mdir num mails:        3423296
    delta:                 6.688190464 G
    delta / mail:          1953 B
- EXT4 with 2048 blocks:
    image file:            53687091200
    1B-blocks:             50324295680
    Used_begin:            82857984
    Used_end:              41598846976
    Available_begin:       47557083136
    Available_end:         6041094144
    mdir exact used space: 41323991040
    mdir guess used space: 41516007424
    mdir num mails:        3425033
    delta:                 3.201314816 G
    delta / mail:          934 B
- EXT4 with 1024 blocks:
    image file:            53687091200
    1B-blocks:             50314834944
    Used_begin:            38287360
    Used_end:              39909360640
    Available_begin:       47592193024
    Available_end:         7721119744
    mdir exact used space: 39683908608
    mdir guess used space: 39871086592
    mdir num mails:        3425033
    delta:                 1.561232384 G
    delta / mail:          455 B
As you can see, the delta per mail is rather close to the statistically expected values of 2048B, 1024B and 512B, i.e. half a block wasted per file on average.
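The arithmetic can be checked with a few lines of Python. This is only a sketch: the figures are copied from the results above, the delta is taken as mdir exact used space minus mbox exact space (which reproduces the reported values), and the "half a block" expectation assumes the last block of each file is, on average, half full. The names MBOX_EXACT, RUNS and overhead are just for illustration.

```python
# Figures copied from the measurements in this post (all in bytes).
MBOX_EXACT = 38122676224

# block size -> (mdir exact used space, mdir num mails)
RUNS = {
    4096: (44810866688, 3423296),
    2048: (41323991040, 3425033),
    1024: (39683908608, 3425033),
}


def overhead(block_size):
    """Return (delta, delta per mail, expected waste per mail) in bytes."""
    used, mails = RUNS[block_size]
    delta = used - MBOX_EXACT      # extra space maildir needs over mbox
    per_mail = delta // mails      # rounded down, as in the results
    expected = block_size // 2     # assumption: half a block lost per file
    return delta, per_mail, expected


if __name__ == "__main__":
    for bs in sorted(RUNS, reverse=True):
        delta, per_mail, expected = overhead(bs)
        print(f"{bs:5d} B blocks: delta {delta / 1e9:.3f} G, "
              f"{per_mail} B/mail (expected about {expected} B)")
```

The measured values sit slightly below half a block in every run, so the simple model is at least in the right range for this archive.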
In the end I have probably changed my opinion. ~7GB of wasted block space for all my mails is actually quite a lot, but in days of cheap disk space it's acceptable. And with mbox one has IMHO the major disadvantage that mailservers (including dovecot) store some meta-data _in_ it (i.e. in the mails themselves), which I don't like a lot. I am still thinking about the reports that mbox is much faster for full-text search (which sounds reasonable)... but for that one probably needs a database backend anyway.
HTH, Chris.