On 8/21/2013 4:07 PM, Jan-Frode Myklebust wrote:
> I would strongly suggest using mdbox instead. AFAIK clusterfs' aren't
> very good at handling many small files. It's a worst case random I/O
> usage pattern, with high rate of metadata operations on top.

I'd recommend mdbox as well, with a healthy rotation size. The larger files won't increase IMAP performance substantially, but they can make backups significantly quicker.
Just for clarification, small files and random IO patterns at the disks are only a small fraction of the maildir problem. The majority of it is metadata--the create, move, rename, etc. operations. To keep the in-memory filesystem state consistent across all nodes, and to avoid putting extra IOPS on the storage if on-disk data structures were used for synchronization, cluster filesystems exchange all metadata updates and synchronization data over the cluster interconnect. This is inherently slow.
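To make the metadata load concrete, here's a minimal sketch (Python, function names mine) of the classic maildir delivery sequence: every message lands via a create in tmp/ plus a rename into new/, and every flag change is yet another rename--each one a metadata operation the cluster filesystem must propagate to every node:

```python
import os
import socket
import time

def maildir_deliver(maildir, body):
    """Deliver one message the maildir way: write a uniquely named
    file into tmp/, fsync it, then rename() it into new/.  That's
    one create and one rename per message before any reader ever
    touches it."""
    unique = "%d.P%d.%s" % (time.time(), os.getpid(), socket.gethostname())
    tmp_path = os.path.join(maildir, "tmp", unique)
    new_path = os.path.join(maildir, "new", unique)
    with open(tmp_path, "wb") as f:
        f.write(body)
        os.fsync(f.fileno())
    os.rename(tmp_path, new_path)
    return new_path

def maildir_mark_seen(maildir, name):
    """Marking a message read is another rename: new/ -> cur/ with
    the :2,S flag suffix appended to the filename."""
    src = os.path.join(maildir, "new", name)
    dst = os.path.join(maildir, "cur", name + ":2,S")
    os.rename(src, dst)
    return dst
```

On a local filesystem these renames are cheap; on a clusterfs each one triggers the coherence traffic described below.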
With a local filesystem and multiple processes, this coherence dance takes place at DRAM latencies--tens of nanoseconds--and scales well as load increases, because DRAM bandwidth is 25-100 GB/s. With a cluster filesystem it takes place at interconnect latency, tens to hundreds of μs, about 1000x higher. And it doesn't scale well, as bandwidth is limited to ~100 MB/s with GbE, ~1 GB/s with 10 GbE or Myrinet. Stepping up to InfiniBand 4x DDR can get you ~2 GB/s and slightly lower latency, but that's a lot of extra expense for a mail cluster, given the performance won't scale with the $$ spent. The switch and HBAs will cost more than the COTS servers.
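The gap is easy to put in numbers. A back-of-envelope sketch using the ballpark figures above (the per-delivery operation count and delivery rate are hypothetical, just to show the shape of it):

```python
# Rough cost of metadata coherence, using the ballpark figures above
# (estimates, not measurements).
DRAM_LATENCY_S = 100e-9          # ~100 ns: coherence via shared DRAM
INTERCONNECT_LATENCY_S = 100e-6  # ~100 us: coherence via a GbE round trip

ratio = INTERCONNECT_LATENCY_S / DRAM_LATENCY_S
print("cluster coherence is ~%.0fx slower per operation" % ratio)

# Hypothetical load: 5 metadata ops per maildir delivery, 50 deliveries/s.
ops_per_sec = 5 * 50
print("local:   %.6f s of coherence latency per second" %
      (ops_per_sec * DRAM_LATENCY_S))
print("cluster: %.6f s of coherence latency per second" %
      (ops_per_sec * INTERCONNECT_LATENCY_S))
```

Even at that modest rate the cluster spends measurable wall-clock time per second just keeping metadata coherent, and it climbs linearly with delivery rate.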
Selecting the right mailbox format is in essence free, and mostly solves the maildir metadata and IOPS problem.
> We use IBM GPFS for our clusterfs, and have finally completed the conversion of a 130+ million inode maildir filesystem into an 18 million inode mdbox filesystem. I have no hard performance data showing the difference between maildir and mdbox, but at a minimum mdbox is much easier to manage. Backup of 130+ million files is painful... and it also feels nice to be able to schedule batches of mailbox purges during off-hours, instead of doing them at peak hours.
130m to 18m is 'only' a 7-fold decrease. 18m inodes is still rather large for any filesystem, cluster or local. A check of an 18m inode XFS filesystem, even on fast storage, would take quite some time. I'm sure it would take quite a bit longer to check a GFS2 filesystem with 18m inodes. Any reason you didn't go a little larger with your mdbox rotation size?
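For reference, the rotation size is a one-line setting. A sketch of the relevant dovecot.conf fragment, assuming Dovecot 2.x (the 64M and 1d values are illustrative, not recommendations--tune them to your backup window):

```
# dovecot.conf -- mdbox with a larger rotation size (illustrative values)
mail_location = mdbox:~/mdbox

# Rotate to a new storage file once it reaches 64 MB; fewer, larger
# files mean fewer inodes and faster backups.
mdbox_rotate_size = 64M

# Optionally also rotate by age, so old mail can be purged in batches.
mdbox_rotate_interval = 1d
```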
-- Stan