Henrique Fernandes put forth on 4/10/2011 5:29 PM:
Thanks, but we did run tests on the disks, and the problem was finding files, not actually writing or reading them, so we guess it is an OCFS2 problem, that's why
Hi Henrique,
Finding a file is a filesystem metadata operation, which entails walking the directory tree(s). Thus, finding a file generates IO to/from the disk array where the filesystem metadata resides. The IO requests are small, but there are many of them, and typically generate more head seeks than normal file read/write operations. Currently you don't have enough disk spindles in your array to keep up with the IOPS demands of both metadata and file operations.
Now we are thinking about mdbox, to reduce the "find files" problem. One thing
From a total IOPS perspective, there is not a huge difference between mdbox and maildir. And mdbox ties your indexes to the mail files, eliminating the flexibility of putting indexes on dedicated storage such as local SSD. If you really want to eliminate the massive IOPS load you currently have, switch to mbox. mbox storage generates almost no metadata IOPS at all.
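For reference, here is a minimal dovecot.conf sketch of an mbox layout (the paths are only examples; adjust them to your environment):

  # One file holds many messages; directory metadata stays tiny
  mail_location = mbox:~/mail:INBOX=/var/mail/%u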
that really helped performance was giving lots of RAM to the virtual machines where Dovecot is running; with lots of RAM, the kernel makes a cache
Yes, a filesystem with 5 million maildir files is going to have a large directory tree to walk. If OCFS has a tunable setting to force the amount of metadata that is cached, setting it as high as you possibly can (without starving other things that need RAM) is a smart move. If it is not configurable, giving as much RAM as possible to each VM is also worthwhile, as you've noticed, since more buffer space gets allocated to filesystem metadata.
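As a rough illustration, assuming reasonably current Linux guests, you can at least bias the kernel toward keeping metadata cached with the vm.vfs_cache_pressure sysctl (the value 50 below is only an example; tune and measure):

  # Lower values make the kernel less eager to reclaim dentry/inode
  # (metadata) caches relative to page cache; the default is 100.
  sysctl -w vm.vfs_cache_pressure=50
  # Persist across reboots:
  echo "vm.vfs_cache_pressure = 50" >> /etc/sysctl.conf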
Worth noting: if you have multiple VM guest OSes on a single host, which all access the same OCFS filesystem, you will be unnecessarily caching OCFS data multiple times, wasting memory. If your hypervisor supports memory deduplication, as VMware ESX does, you should enable it, if you haven't already. If that is not an option, it's best to only run one OCFS guest per host. It may be possible to run OCFS in the hypervisor, and map a virtual filesystem to each guest on that host. Read your virtual platform documentation to find the right solution for your environment.
of "location" and it finds much faster the files. As matter fact the space provide to us, we might have to give back some of it. Buying hardware is not an option yet. So we keep thinking how to imprive performance tunning everything up.
Losing some space is fine as long as you don't lose spindles. Losing spindles decreases IOPS. Be sure that OCFS supports filesystem shrinking before the EMC ops team decreases the size of your LUN. And make sure you've shut down all of your OCFS hosts/guests _before_ they resize the LUN. If you don't, you'll likely sustain corruption that will cost you the entire OCFS filesystem. After they resize the LUN, bring one OCFS machine back online, but without mounting the OCFS filesystem. Use the appropriate tool to shrink the filesystem--again, if OCFS can even do it. If not, do NOT let them resize your LUN. Make other arrangements.
I am thinking of changing the IO scheduler also, but I have not researched it much yet; it is in the plans.
Changing the Linux elevator has little effect in VM guests. Use the "noop" elevator in a VM environment, especially when using SAN storage. The Linux elevator cannot make head positioning decisions with SAN arrays--use noop.
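For example, assuming the virtual disk shows up as /dev/sda inside the guest, something like this switches the elevator at runtime:

  # Show the current elevator (the active one is in brackets)
  cat /sys/block/sda/queue/scheduler
  # Switch to noop for this device
  echo noop > /sys/block/sda/queue/scheduler

You can also add elevator=noop to the kernel command line to make it the default for all devices at boot.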
The backup problem is that on the machine that does the backups (it joins the OCFS2 cluster just to back up files) we are not able to get any caching. No matter how much RAM we give it, Bacula just eats up all of it. I guess it is because of the accurate option; I am still looking for a way to limit Bacula's RAM use so the kernel becomes able to cache some inodes.
Considering the nature of backup, I don't think the caching of metadata or files will decrease the IOPS load on the disk array. Backup is fairly sequential in nature, so caching anything would simply waste memory. I'm guessing what you really want is something like read ahead caching of metadata, so after finding one file, the next few directory table reads are from memory instead of disk. I don't know of a way to optimize for this.
Thanks for your reply, and I am glad you remembered those old posts.
It's an interesting problem/scenario, and a couple of my interests/specialties are SAN storage arrays and filesystems.
But I am still looking for some info about mdbox.
It's a hybrid between maildir and mbox, but with inbuilt indexes. maildir is one mail per file. mbox is one file with lots of mail. mdbox stores multiple mails per file across many files. Using mdbox will drop your IOPS load. Exactly how much I can't say, but it won't be anything near as large a drop as converting to mbox, which again, has almost zero metadata overhead.
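To illustrate the difference in dovecot.conf terms (paths are only examples), the format is chosen via mail_location:

  # maildir: one message per file (your current layout)
  mail_location = maildir:~/Maildir
  # mdbox: many messages per storage file, indexes kept with the mail
  mail_location = mdbox:~/mdbox
  # optional: how large each mdbox storage file grows before a new one
  # is started (check the default for your Dovecot version)
  #mdbox_rotate_size = 2M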
We thought about using mdbox before, but we did not want to be tied to Dovecot; I mean, we like the idea of being able to change the IMAP server, etc. But as someone said, what other choices do we have for an open source IMAP server? Even with Cyrus we would still face a big conversion, so it does not make much difference.
My only concern would be integrated indexes. Once you convert to mdbox you no longer have the flexibility of moving index files to fast dedicated storage.
Right now your best option for a quick decisive solution to your SAN/OCFS bottleneck is to move the maildir indexes off the SAN to fast local disks in the cluster hosts, either an SSD or 15k SAS drive per host. It's cheap, it will work, and you can implement and test it quickly at zero risk. It will be even better if you can use kernel 2.6.36 or later and XFS w/delaylog mount option on the local disks. Delaylog can reduce metadata write disk IOPS by an order of magnitude.
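A sketch of what that looks like, assuming the local index disk is /dev/sdb1 and is mounted at /var/indexes (device and paths are only examples):

  # dovecot.conf: maildirs stay on the OCFS/SAN mount, indexes move locally
  mail_location = maildir:~/Maildir:INDEX=/var/indexes/%u

  # /etc/fstab: local disk formatted XFS, mounted with delaylog (2.6.36+)
  /dev/sdb1  /var/indexes  xfs  noatime,delaylog  0 0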
About the index files you mentioned: SSD disks are not a possibility, but I thought of using another partition as the place for the index files. Will it make a lot of difference? As I said, we are using ldirectord with lbrlc, which keeps track of client IP addresses to servers; it does not ALWAYS send a client to the same server, but it tries to.
It will likely make a big difference in OCFS load. But...
The whole point of moving the index files to a local disk is to *dedicate* the entire performance of the disk to index file IOPS and remove that load from the SAN. An average decent quality SATA3 SSD today can perform about 50k random IOPS--far more than you actually need, but the best bang for the buck IOPS wise. A single 15k SAS drive can perform about 300 random IOPS, a 10k SAS drive about 200 random IOPS, and a 7.2k SATA drive performs about 150 random IOPS.
If you have 3 physical Dovecot hosts in your cluster, and can dedicate a 15k SAS drive on each for index file use only, you can potentially decrease the IOPS load on the EMC by 900. That should solve your OCFS performance problems pretty handily. That's a big win for less than $500 USD outlay on 3 15k SAS drives.
If you simply use leftover space on a currently installed local disk in each host, the amount of benefit you receive is completely dependent on the current seek load on that disk. I would do some investigation with iostat over a period of a few days to make sure said disk in each host has plenty of spare IOPS capacity. It's possible that you'll actually decrease overall Dovecot performance substantially if the disks you mention don't have enough spare performance.
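For example, with the sysstat package installed, something like this run during peak hours shows how busy each local disk really is:

  # Extended per-device statistics every 5 seconds; r/s + w/s is the IOPS
  # actually being served, and %util near 100 means no headroom is left.
  iostat -x 5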
Make sure you size the local disks to accommodate the current index files, with 50% headroom for load balancer stickiness overhead inefficiency and index growth. If you have 300GB total of index files currently on the EMC, a new 147GB 15k SAS drive in each of your 3 Dovecot hosts should suffice. 3.5" 15k 147GB Fujitsu Enterprise SAS drives can be obtained for as little as $155 USD. If your servers are HP/IBM/Dell and require their hot swap caged drives, you'll obviously pay quite a bit more.
-- Stan