Eric Rostetter put forth on 12/11/2010 9:48 AM:
Well, it is true I know nothing about VMware/ESX. I know that in my virtual machine setups I _can_ give a virtual instance access to devices which are not used by other virtual instances. This is what I would do. Yes, it is still virtualized, but it is dedicated, and should still perform pretty well -- faster than shared storage, and in the case of an SSD faster than a normal disk or iSCSI.
He's running an ESX cluster, which assumes use of HA and VMotion. For VMotion to work, each node in the cluster must have direct access to every storage device. Thus, to use an SSD, it would have to be installed in Javier's iSCSI SAN array, and many of the relatively inexpensive iSCSI arrays don't offer SSD support.
However, Javier didn't ask for ways to increase his I/O throughput; he asked for the opposite. I assume this is because they have a 1 GbE Ethernet SAN, and probably only 2 or 4 GbE ports on the SAN array controller. With only 200 to 400 MB/s of bandwidth (roughly 100 MB/s per GbE port in each direction), and many busy guests in the ESX farm (probably running many applications besides Dovecot), Javier's organization is likely bumping up against the bandwidth limits of the 1 GbE links on the SAN array controller. Thus, adding an SSD to the mix would exacerbate the I/O problem.
Thus, putting the index files on a ramdisk or using the Dovecot memory-only index parameter are the only two options I can think of that will help in the way he desires.
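For reference, the memory-only behavior is the INDEX specifier in mail_location. This is from memory, so double check the syntax against the wiki, and the maildir path here is just an example:

  mail_location = maildir:~/Maildir:INDEX=MEMORY

With INDEX=MEMORY Dovecot keeps the indexes in RAM only while a mailbox is open and never writes them to disk, so they get rebuilt for every new session.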
He was already asking about throwing memory at the problem, and I think he implied he had a lot of memory. As such, the caching is there already. Your statement is true, but it is also a "zero config" option if he really does have lots of memory in the machine.
He has physical memory available, but he isn't currently assigning it to the Dovecot guest. To do so would require changing the memory setting in ESX for this guest, then rebooting the guest (unless both ESX and his OS support hot plug memory--I don't know if ESX does). This is what Javier was referring to when stating "adding memory".
And in ext3, the flush interval. Good point, which I had forgotten about. It is set to a small value by default (5 seconds), and can be increased without too much danger (to, say, 10-30 seconds).
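For the record, the knob for that in ext3 is the commit= mount option. An fstab line along these lines stretches it to 30 seconds (the device and mount point are just placeholders, and you can lose up to commit= seconds of buffered data on a crash):

  /dev/sdb1  /var/mail  ext3  defaults,commit=30  0 2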
Just to be clear and accurate here (and it's probably a little OT for the thread): XFS delaylog isn't designed to decrease filesystem log I/O activity. It was designed to dramatically increase the rate of write operations to the journal log, i.e. metadata operations, and the I/O efficiency of those metadata ops.
The major visible benefit of this is a massive increase in delete performance for many tens of thousands (or more) of files. It also decreases journal log fragmentation, as more changes are organized and packed together in memory before the physical write. This packing decreases physical disk I/O, as fewer, larger blocks are written per I/O. XFS with delaylog is an excellent match for maildir storage. It won't help much at all with mbox, and only very slightly more with mdbox.
XFS delaylog is a _perfect_ match for the POP3 workload. Each time a user pulls, then deletes all messages, delaylog will optimize and then burst the metadata journal write operations to disk, again, with far fewer physical I/Os due to the inode optimization.
XFS with delaylog is now much faster than any version of ReiserFS, whose claim to fame was lightning-fast mass file deletion. As of 2.6.36, XFS is the fastest filesystem, and not just on Linux, for almost any workload. This assumes real storage hardware that can handle massive parallelization of reads and writes. EXT3 is still faster on a single-disk system, but EXT3 is the "everyman" filesystem, optimized more for the single-disk case. XFS was and still is designed for large parallel servers with big, fast storage.
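For anyone who wants to try it, delaylog is just a mount option on 2.6.35 and later kernels (it was marked experimental when first introduced). Assuming the mail store is an XFS filesystem on /dev/sdc1 mounted at /var/mail (both made up for the example):

  mount -o delaylog /dev/sdc1 /var/mail

or add delaylog to the options column of the fstab entry so it survives reboots.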
Assuming normal downtime stats, this would still be a huge win. Since the machine rarely goes down, it would rarely need to rebuild indexes, and hence would only run poorly a very small percentage of the time. Of course, it could run _very_ poorly for a while right after a reboot, but it would be back to normal soon enough.
I totally concur.
One way to help mitigate this if using a RAM disk is to have your shutdown script flush the RAM disk to physical disk (after stopping dovecot) and then reload it into the RAM disk at startup (before starting dovecot).
Excellent idea Eric. I'd never considered this. Truly, that's a fantastic, creative solution, and should be relatively straightforward to implement.
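Something like the following in the init scripts should do it. This is just an untested sketch; the paths are made up, and it assumes the indexes sit on a tmpfs at /var/dovecot-index and get staged to /var/lib/dovecot-index.save on real disk:

  # shutdown, after stopping dovecot:
  rsync -a --delete /var/dovecot-index/ /var/lib/dovecot-index.save/

  # startup, before starting dovecot:
  mount /var/dovecot-index     # tmpfs entry in fstab
  rsync -a /var/lib/dovecot-index.save/ /var/dovecot-index/

Worst case, if the box goes down uncleanly the saved copy is stale or missing, and Dovecot simply rebuilds the indexes, which is no worse than a plain ramdisk.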
This isn't possible if you use the dovecot index memory settings though.
Yeah, I think the ramdisk is the way to go here, at least if/until a better solution can be found. I don't really see that there is one, other than his org investing in a faster SAN architecture such as 4/8 Gbit FC or 10 GbE iSCSI.
The former can be had relatively inexpensively. The latter is still really pricey: 10 GbE switches and HBAs cost a lot, and there are only a handful of iSCSI vendors offering 10 GbE SAN arrays. One is NetApp. Their 10 GbE NICs for their filers run in the multi-thousand-dollar range per card, and their filers are the most expensive on the planet last I checked, much of that due to their flexibility. A single NetApp can support all speeds of Ethernet for iSCSI and NFS/CIFS access, as well as 2/4/8 Gbit FC. I think they offer InfiniBand connectivity as well.
If this is a POP server, then you really have no way around the disk I/O issue.
I agree. POP is very inefficient...
XFS with delaylog can cut down substantially on the metadata operations associated with POP3 mass delete. Without this FS and delaylog, yes, POP3 I/O is very inefficient.
Still some room for filesystem tuning, of course, but the above two options are the ones that will make the largest performance improvement, IMHO.
Since Javier is looking for ways to decrease I/O load on the SAN, not necessarily to increase Dovecot performance, I think putting the index files on a ramdisk is the best thing to try first. It may not be a silver bullet; if he's still got spare memory to add to this guest, doing both would be better. Using a ramdisk for the index files will instantly remove all index I/O from the SAN. More of Dovecot's IMAP I/O goes to the index files than to the mail files, doesn't it? So by moving the index files to a ramdisk he should pretty much instantly remove half of the SAN I/O load. This assumes that Javier currently stores his index files on a SAN LUN.
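In practice that's one tmpfs mount plus pointing Dovecot's INDEX at it. The size, paths, and maildir layout below are only examples, so adjust them to the real setup:

  # /etc/fstab
  tmpfs  /var/dovecot-index  tmpfs  size=1g  0 0

  # dovecot.conf
  mail_location = maildir:~/Maildir:INDEX=/var/dovecot-index/%u

Permissions on the tmpfs mount need a little thought depending on how the mail users map to system UIDs, but that's about all there is to it.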
-- Stan