[Dovecot] Questions about converting maildir to mdbox.

Stan Hoeppner stan at hardwarefreak.com
Mon Apr 11 08:25:09 EEST 2011

Henrique Fernandes put forth on 4/10/2011 5:29 PM:
> Thanks, but we did run tests on the disks, and the problem was finding files,
> not actually writing or reading them, so we guess it is an ocfs2 problem, that's why

Hi Henrique,

Finding a file is a filesystem metadata operation, which entails walking
the directory tree(s).  Thus, finding a file generates IO to/from the
disk array where the filesystem metadata resides.  The IO requests are
small, but there are many of them, and they typically generate more head
seeks than normal file read/write operations.  Currently you don't have
enough disk spindles in your array to keep up with the IOPS demands of
both metadata and file operations.
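To make that concrete, here's a throwaway sketch (temp directory, not your real layout) showing that locating one maildir message is a directory scan, i.e. pure metadata IO:

```shell
# Sketch only: build a fake maildir with 1000 messages in a temp dir,
# then show that "finding" a message means reading the whole directory.
d=$(mktemp -d)
mkdir -p "$d/cur" "$d/new" "$d/tmp"
i=1
while [ "$i" -le 1000 ]; do
    : > "$d/cur/msg.$i"
    i=$((i + 1))
done
# A lookup has to scan every entry in cur/ -- all metadata, no file data:
n=$(ls "$d/cur" | wc -l)
echo "directory entries scanned: $n"
rm -rf "$d"
```

Multiply that scan by 5 million messages spread across many mailboxes and you can see where the IOPS go.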

> now we are thinking of mdbox, to reduce the "find files" problem. One thing

From a total IOPS perspective, there is not a huge difference between
mdbox and maildir.  And mdbox ties your indexes to the mail files,
eliminating the flexibility of putting indexes on dedicated storage such
as local SSD.  If you really want to eliminate the massive IOPS load you
currently have, switch to mbox.  mbox storage generates almost no
metadata IOPS at all.

> that really helped the performance was giving lots of ram to the virtual
> machines where dovecot is running, with lots of ram, the kernel makes cache

Yes, a filesystem with 5 million maildir files is going to have a large
directory tree to walk.  If OCFS has a tweakable setting to force the
amount of metadata that is cached, setting this as high as you possibly
can (without destroying other things that need RAM) is a smart move.  If
not configurable, giving as much RAM as possible to each VM is also
worthwhile, as you've noticed, since more buffer space is then allocated to
filesystem metadata.
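A quick way to see how much RAM the kernel is currently spending on cached filesystem metadata (this is generic Linux, not OCFS-specific; `slabtop` gives a live per-cache view if you have it installed):

```shell
# Slab memory holds the dentry and inode caches; SReclaimable is the
# part the kernel can drop under memory pressure. Watch these grow as
# the metadata cache warms up after boot.
grep -E '^(Slab|SReclaimable):' /proc/meminfo
```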

Worth noting:  if you have multiple VM guest OSes on a single host,
which all access the same OCFS filesystem, you will be unnecessarily
caching OCFS data multiple times, wasting memory.  If your hypervisor
supports memory deduplication, as VMware ESX does, you should enable it,
if you haven't already.  If that is not an option, it's best to only run
one OCFS guest per host.  It may be possible to run OCFS in the
hypervisor, and map a virtual filesystem to each guest on that host.
Read your virtual platform documentation to find the right solution for
your environment.

> of "location" and it finds the files much faster. As a matter of fact, the space
> provided to us, we might have to give back some of it. Buying hardware is not
> an option yet. So we keep thinking about how to improve performance, tuning
> everything up.

Losing some space is fine as long as you don't lose spindles.  Losing
spindles decreases IOPS.  Be sure that OCFS supports filesystem
shrinking before the EMC ops decrease the size of your LUN.  And, make
sure you've shut down all of your OCFS hosts/guests _before_ they resize
your LUN.  If you don't, you'll possibly sustain corruption that will
cause the loss of your entire OCFS filesystem.  After they resize the
LUN, bring one OCFS machine back online but without mounting the OCFS
filesystem.  Use the appropriate tool to shrink the filesystem, again,
if OCFS can even do it.  If not, do NOT let them resize your LUN.  Make
other arrangements.

> I am thinking of changing the io scheduler also. But i have not researched
> this a lot yet, but it is in the plans.

Changing the Linux elevator has little effect in VM guests.  Use the
"noop" elevator in a VM environment, especially when using SAN storage.
The Linux elevator cannot make head positioning decisions with SAN
arrays--use noop.
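For example, something along these lines per block device (sda here is just a placeholder; substitute your SAN-backed devices), or add `elevator=noop` to the kernel command line to make it the default:

```shell
# Sketch: inspect one device's IO scheduler at runtime.
# "sda" is an assumption -- use your actual device name(s).
dev=sda
sched="/sys/block/$dev/queue/scheduler"
# Show the available schedulers (the one in brackets is active):
cat "$sched" 2>/dev/null || echo "no $dev on this machine"
# To switch at runtime (needs root):  echo noop > "$sched"
```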

> The backup problem is because on the machine that does backups ( it joins the ocfs2
> cluster just to backup files ) we are not able to make cache. Because as much
> as we give RAM, bacula just eats up all the ram we gave. I guess it is because
> of the accurate option, still finding a way to limit bacula ram use, so the kernel
> becomes able to cache some inodes.

Considering the nature of backup, I don't think the caching of metadata
or files will decrease the IOPS load on the disk array.  Backup is
fairly sequential in nature, so caching anything would simply waste
memory.  I'm guessing what you really want is something like read-ahead
caching of metadata, so after finding one file, the next few directory
table reads are from memory instead of disk.  I don't know of a way to
optimize for this.

> Thanks for your reply and i am glad you remembered those old posts.

It's an interesting problem/scenario, and a couple of my
interests/specialties are SAN storage arrays and filesystems.

> But i am still looking for some info about mdbox.

It's a hybrid between maildir and mbox, but with inbuilt indexes.
maildir is one mail per file.  mbox is one file with lots of mail.
mdbox stores multiple mails per file across many files.  Using mdbox
will drop your IOPS load.  Exactly how much I can't say, but it won't be
anything near as large a drop as converting to mbox, which again, has
almost zero metadata overhead.
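If you do go that way, the switch itself is small.  A sketch, assuming Dovecot 2.0+ and example paths (verify the setting names against your version's docs):

```
# dovecot.conf -- example paths, adjust to your layout
mail_location = mdbox:~/mdbox

# size at which mdbox starts a new storage file (example value)
mdbox_rotate_size = 2M
```

In the 2.0 era the usual conversion path was dsync per user, something like `dsync -u <user> mirror mdbox:~/mdbox` -- test it on a copy of one mailbox before committing.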

> Before, we thought of using mdbox but we did not want to stick with dovecot,
> i mean, we like the idea that we could be able to change the imap server and
> etc. But as someone said, what are the other choices we have for an opensource
> imap server ? Even with cyrus we still need to go through a big conversion,
> so it does not make much difference.

My only concern would be integrated indexes.  Once you convert to mdbox
you no longer have the flexibility of moving index files to fast
dedicated storage.

Right now your best option for a quick decisive solution to your
SAN/OCFS bottleneck is to move the maildir indexes off the SAN to fast
local disks in the cluster hosts, either an SSD or 15k SAS drive per
host.  It's cheap, it will work, and you can implement and test it
quickly at zero risk.  It will be even better if you can use kernel
2.6.36 or later and XFS w/delaylog mount option on the local disks.
Delaylog can reduce metadata write disk IOPS by an order of magnitude.
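Configuration-wise the move is a one-liner.  A sketch, assuming a maildir layout and a local mount at /var/dovecot-indexes (the path is an example):

```
# dovecot.conf -- keep mail on the OCFS2 SAN volume, indexes on local disk.
# %u expands to the username.
mail_location = maildir:~/Maildir:INDEX=/var/dovecot-indexes/%u
```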

> About the index files you said, SSD disks are not a possibility, but i
> thought of using another partition as a place for index files. Will it
> make a lot of difference ? As i said, we are using ldirectord with lbrlc that
> keeps track of ip addresses to some servers, it does not ALWAYS send to the same
> server, but it tries to do it.

It will likely make a big difference in OCFS load.  But...

The whole point of moving the index files to a local disk is to
*dedicate* the entire performance of the disk to index file IOPS and
remove that load from the SAN.  An average decent quality SATA3 SSD
today can perform about 50k random IOPS--far more than you actually
need, but the best bang for the buck, IOPS-wise.  A single 15k SAS drive
can perform about 300 random IOPS, a 10k SAS drive about 200 random
IOPS, and a 7.2k SATA drive performs about 150 random IOPS.

If you have 3 physical Dovecot hosts in your cluster, and can dedicate a
15k SAS drive on each for index file use only, you can potentially
decrease the IOPS load on the EMC by 900.  That should solve your OCFS
performance problems pretty handily.  That's a big win for less than
$500 USD outlay on 3 15k SAS drives.
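The arithmetic behind that estimate, using the per-drive numbers above:

```shell
# Back-of-envelope: IOPS offloaded from the SAN = hosts x per-drive IOPS.
hosts=3
iops_15k_sas=300
offloaded=$((hosts * iops_15k_sas))
echo "IOPS moved off the EMC: $offloaded"
```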

If you simply use leftover space on a currently installed local disk in
each host, the amount of benefit you receive is completely dependent on
the current seek load on that disk.  I would do some investigation with
iostat over a period of a few days to make sure said disk in each host
has plenty of spare IOPS capacity.  It's possible that you'll actually
decrease overall Dovecot performance substantially if the disks you
mention don't have enough spare performance.
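A rough example of the kind of check I mean (iostat is in the sysstat package; increase the interval and count to cover a representative window):

```shell
# Sample extended per-device stats; watch %util and await on the
# candidate disk. High values mean it has little spare capacity
# left for index IO.
if command -v iostat >/dev/null 2>&1; then
    iostat -x 1 2
else
    echo "install the sysstat package to get iostat"
fi
```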

Make sure you size the local disks to accommodate the current index
files, with 50% headroom for load balancer stickiness overhead
inefficiency and index growth.  If you have 300GB total of index files
currently on the EMC, a new 147GB 15k SAS drive in each of your 3
Dovecot hosts should suffice.  3.5" 15k 147GB Fujitsu Enterprise SAS
drives can be obtained for as little as $155 USD.  If your servers are
HP/IBM/Dell and require their hot swap caged drives, you'll obviously
pay quite a bit more.

