On 3/27/2012 3:57 PM, Jeff Gustafson wrote:
We do have an FC system that another department is using. The company dropped quite a bit of cash on it for a specific purpose. Our department does not have access to it. People are somewhat afraid of iSCSI around here because they believe it will add too much latency to the overall IO performance. They're big believers in locally attached disks. Fewer features, but very good performance.
If you use a software iSCSI initiator with standard GbE ports, block IO latency can become a problem, but basically only in three scenarios:
1. Slow CPUs, or not enough CPUs/cores. This is unlikely to be a problem in 2012, given the throughput of today's multi-core CPUs. Low CPU throughput hasn't generally been the cause of software iSCSI initiator latency problems since pre-2007/8 with most applications. I'm sure some science/simulation apps that tax both CPU and IO may still have had issues; those would be prime candidates for iSCSI HBAs.
2. An old OS kernel that doesn't spread IP stack, SCSI encapsulation, and/or hardware interrupt processing across all cores. Recent Linux kernels do this rather well, especially with MSI-X enabled; older ones, not so well. I don't know about FreeBSD, Solaris, AIX, HP-UX, Windows, etc.
3. A system under sufficiently high CPU load that IP stack and iSCSI encapsulation processing, and/or interrupt handling, get slowed down. Again, with today's fast multi-core CPUs this probably isn't going to be an issue, especially given that POP/IMAP workloads are IO latency bound, not CPU bound. Most people running Dovecot today will have plenty of idle CPU cycles to absorb the additional iSCSI initiator and TCP stack processing without introducing undue block IO latency. (A quick way to sanity check points 2 and 3 on a Linux host is sketched just below.)
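If you want to sanity check points 2 and 3 on a Linux host, a rough Python sketch like the following shows whether a NIC's interrupts are being serviced on more than one core, and how much idle CPU headroom the box has. It only reads /proc; the interface name "eth0" is an assumption, so substitute whatever port carries your iSCSI traffic.

  # Rough check of items 2 and 3: interrupt spread and CPU headroom.
  # Linux-only; "eth0" is an assumed name for the iSCSI-facing NIC.
  NIC = "eth0"

  with open("/proc/interrupts") as f:
      ncpus = len(f.readline().split())        # header line: CPU0 CPU1 ...
      for line in f:
          if NIC not in line:
              continue
          fields = line.split()
          counts = [int(c) for c in fields[1:1 + ncpus]]
          busy = sum(1 for c in counts if c > 0)
          print("IRQ %s (%s): serviced on %d of %d cores"
                % (fields[0].rstrip(':'), fields[-1], busy, ncpus))

  with open("/proc/stat") as f:                # aggregate "cpu" line
      cpu = f.readline().split()
      total = sum(int(x) for x in cpu[1:])
      idle = int(cpu[4]) + int(cpu[5])         # idle + iowait columns
      print("CPU idle+iowait since boot: %.1f%%" % (100.0 * idle / total))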
As always, YMMV. The simple path is to acquire your iSCSI SAN array and use software initiators on the client hosts. In the unlikely event you do run into block IO latency issues, you simply drop an iSCSI HBA into each host suffering the latency. They run ~$700-900 USD each for single-port models, and they eliminate the initiator-side latency entirely, which is one reason they cost so much: they have an onboard RISC chip and memory doing the TCP and SCSI encapsulation processing. They also give you the ability to boot diskless servers from LUNs on the SAN array. This is very popular with blade server systems, and I've done it many times myself, albeit with Fibre Channel HBAs/SANs, not iSCSI.
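If you want hard numbers rather than gut feel, take a latency baseline on the local disks before the migration and repeat it against the iSCSI LUN afterwards. A crude random-read probe along these lines is enough for a before/after comparison (Python, Linux, run as root; /dev/sdX is a placeholder for a test device or LUN, never one in production use):

  import mmap, os, random, time

  DEV = "/dev/sdX"             # placeholder: point at a test device/LUN
  BLOCK = 4096                 # multiple of the logical sector size
  SAMPLES = 200

  fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)   # bypass the page cache
  size = os.lseek(fd, 0, os.SEEK_END)
  buf = mmap.mmap(-1, BLOCK)   # page-aligned buffer, required by O_DIRECT

  lat = []
  for _ in range(SAMPLES):
      off = random.randrange(0, size // BLOCK) * BLOCK
      t0 = time.perf_counter()
      os.preadv(fd, [buf], off)                  # Python 3.7+
      lat.append((time.perf_counter() - t0) * 1000.0)
  os.close(fd)

  lat.sort()
  print("median %.2f ms, 95th percentile %.2f ms"
        % (lat[len(lat) // 2], lat[int(len(lat) * 0.95)]))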
Locally attached/internal/JBOD storage typically offers the best application performance per dollar spent, until you get to things like backup scenarios, where off-node network throughput is limited and your backup software may suffer performance deficiencies, which is the issue in this thread's title. Shipping full or incremental file-level backups across Ethernet is extremely inefficient, especially with very large filesystems. This is where SAN arrays with snapshot capability come in really handy.
The snapshot takes place wholly within the array and is very fast, without the problems you see with host-based snapshots such as Linux LVM, where you must first freeze the filesystem and then wait for the snapshot to complete, which can take a long time with a 1TB filesystem (the host-side sequence is sketched below). While this occurs, your clients must wait, or time out, while trying to access their mailboxes. With a SAN array snapshot this isn't an issue, as the snap is transparent to the hosts, with little or no performance degradation while it runs. Two relatively inexpensive units that have such snapshot capability are:
http://www.equallogic.com/products/default.aspx?id=10613
http://h10010.www1.hp.com/wwpc/us/en/sm/WF04a/12169-304616-241493-241493-241...
The Equallogic units are 1/10 GbE iSCSI only, IIRC, whereas the HP can be had with 8Gb FC, 1/10Gb iSCSI, or 6Gb direct-attach SAS. Each offers four or more host/network connection ports when equipped with dual controllers. There are many other vendors with similar models/capabilities; I mention these simply because Dell/HP are very popular and many ops people are already familiar with their servers and other products.
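For contrast with an in-array snapshot, here is roughly what the host-side LVM sequence described above looks like when scripted. This is only an illustration: the volume group, logical volume, mount points, snapshot size, and backup destination are all made-up names, and it must run as root. The window between the two fsfreeze calls is exactly where POP/IMAP clients stall.

  import subprocess

  def run(*cmd):
      print("+", " ".join(cmd))
      subprocess.run(cmd, check=True)

  run("fsfreeze", "-f", "/var/mail")             # clients block from here...
  run("lvcreate", "-s", "-L", "20G", "-n", "mailsnap", "/dev/vg0/mail")
  run("fsfreeze", "-u", "/var/mail")             # ...until here

  # Mount the copy-on-write snapshot read-only and back it up at leisure.
  # If the filesystem is XFS, add "nouuid" to the mount options.
  run("mount", "-o", "ro", "/dev/vg0/mailsnap", "/mnt/snap")
  run("rsync", "-a", "/mnt/snap/", "backuphost:/backups/mail/")
  run("umount", "/mnt/snap")
  run("lvremove", "-f", "/dev/vg0/mailsnap")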
We thought ZFS would provide us with a nice snapshot and backup system (with zfs send). We never got that far once we discovered that ZFS doesn't work very well in this context. Running rsync on it gave us terrible performance.
There are three flavors of ZFS: native Oracle Solaris, native FreeBSD, and Linux FUSE. Which were you using? If the last, that would fully explain the suck.
Also, you speak of a very large maildir store: hundreds of thousands of directories, obviously many millions of files, 1TB in total size. Thus I would assume you have many thousands of users, if not tens of thousands.
It's a bit hard to believe you're not running XFS on your storage, given your level of parallelism. You'd get much better performance with XFS than with EXT4, especially with kernel 2.6.39 or later, which includes the delayed logging patch. Delayed logging increases metadata write throughput by a factor of 2-50+ and decreases the IOPS and MB/s hitting the storage by roughly the same factor, depending on thread count.
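If you want to see the thread-count effect on your own hardware before committing, a throwaway benchmark along these lines will do: it creates maildir-sized files from many threads and reports the aggregate create rate, so you can compare the same run on EXT4 and XFS. Run it against a scratch filesystem mounted for the test, never the live mail store; the path and counts below are arbitrary.

  import os, shutil, sys, time
  from concurrent.futures import ThreadPoolExecutor

  TARGET = sys.argv[1] if len(sys.argv) > 1 else "/mnt/scratch/metabench"
  THREADS = 16
  FILES_PER_THREAD = 5000

  def worker(tid):
      d = os.path.join(TARGET, "t%d" % tid)
      os.makedirs(d, exist_ok=True)
      payload = b"x" * 2048                      # maildir-ish tiny message
      for i in range(FILES_PER_THREAD):
          with open(os.path.join(d, "msg.%d" % i), "wb") as f:
              f.write(payload)

  shutil.rmtree(TARGET, ignore_errors=True)
  os.makedirs(TARGET, exist_ok=True)
  t0 = time.time()
  with ThreadPoolExecutor(max_workers=THREADS) as ex:
      list(ex.map(worker, range(THREADS)))       # propagate any errors
  os.sync()                                      # flush metadata to disk
  elapsed = time.time() - t0
  total = THREADS * FILES_PER_THREAD
  print("%d files in %.1f s (%.0f creates/s)" % (total, elapsed, total / elapsed))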
I'm relatively new here, but I'll ask around about XFS and see if anyone has tested it in the development environment.
If they'd tested it properly, and relatively recently, I would think they'd have already replaced EXT4 on your Dovecot server, unless other factors prevented such a migration, or unless I've misunderstood the size of your maildir workload.
-- Stan