[Dovecot] dsync redesign

Wed Mar 28 19:07:59 EEST 2012

On 3/27/2012 3:57 PM, Jeff Gustafson wrote:

> 	We do have a FC system that another department is using. The company
> dropped quite a bit of cash on it for a specific purpose. Our department
> does not have access to it.  People are somewhat afraid of iSCSI around
> here because they believe it will add too much latency to the overall IO
> performance. They're a big believer in locally attached disks. Less
> features, but very good performance.

If you use a software iSCSI initiator with standard GbE ports, block IO
latency can become a problem, but basically in only 3 scenarios:

1.  Slow CPUs or not enough CPUs/cores.  This is unlikely to be a
problem in 2012, given the throughput of today's multi-core CPUs.  Low
CPU throughput hasn't generally been the cause of software iSCSI
initiator latency problems since pre-2007/8 with most applications.  I'm
sure some science/sim apps that tax both CPU and IO may have still had
issues.  Those would be prime candidates for iSCSI HBAs.

2.  An old OS kernel that doesn't thread IP stack, SCSI encapsulation,
and/or hardware interrupt processing amongst all cores.  Recent Linux
kernels do this rather well, especially with MSI-X enabled, older ones
not so well.  I don't know about FreeBSD, Solaris, AIX, HP-UX, Windows, etc.

3.  System under sufficiently high CPU load to slow IP stack and iSCSI
encapsulation processing, and/or interrupt handling.  Again, with
today's multi-core fast CPUs this probably isn't going to be an issue,
especially given that POP/IMAP are IO latency bound, not CPU bound.
Most people running Dovecot today are going to have plenty of idle CPU
cycles to perform the additional iSCSI initiator and TCP stack
processing without introducing undue block IO latency effects.
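
For scenario 2, one rough way to see whether NIC interrupts are actually
being spread across cores on a Linux host is to total the per-CPU counts
in /proc/interrupts.  A minimal Python sketch, assuming the usual layout
of that file (a CPU header row followed by per-IRQ rows) and assuming
the interface's rows carry a label containing "eth0", which is only a
placeholder name:

```python
def irq_counts_per_cpu(text, name_filter):
    """Sum interrupt counts per CPU for rows whose text contains name_filter."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()              # header row: CPU0 CPU1 ...
    totals = [0] * len(cpus)
    for line in lines[1:]:
        if name_filter not in line:
            continue
        fields = line.split()
        # fields[0] is the IRQ id ("42:"); the next len(cpus) fields are counts
        counts = fields[1:1 + len(cpus)]
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue                     # skip rows that don't have one count per CPU
        for i, count in enumerate(counts):
            totals[i] += int(count)
    return dict(zip(cpus, totals))
```

Feeding it the contents of /proc/interrupts and your NIC's name returns
a dict keyed by CPU.  If nearly all of the counts land on one CPU,
interrupt handling is probably not being distributed.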

As always, YMMV.  The simple path is to acquire your iSCSI SAN array and
use software initiators on client hosts.  In the unlikely event you do
run into block IO latency issues, you simply drop an iSCSI HBA into each
host suffering the latency.  They run ~$700-900 USD each for single port
models, and they all but eliminate the block IO latency added by
software initiator processing, which is one reason they cost so much.
They have an onboard RISC chip and memory
doing the TCP and SCSI encapsulation processing.  They also give you the
ability to boot diskless servers from LUNs on the SAN array.  This is
very popular with blade server systems, and I've done this many times
myself, albeit with fibre channel HBAs/SANs, not iSCSI.
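
If you want a crude first check for block IO latency before buying
HBAs, you can time small synchronous writes on the iSCSI-backed
filesystem and compare against a local disk.  The Python sketch below
is only an illustration, not a benchmark; the function name and its
defaults are hypothetical, and a purpose-built tool such as fio will
give far better data:

```python
import os
import statistics
import tempfile
import time


def fsync_write_latencies(directory, count=50, size=4096):
    """Time `count` write+fsync cycles in `directory`; return seconds per cycle."""
    payload = b"\0" * size
    samples = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)                 # force the data down to the block device
            samples.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)                  # remove the scratch file
    return samples


def summarize_ms(samples):
    """Return (median, worst) latency in milliseconds."""
    return statistics.median(samples) * 1000, max(samples) * 1000
```

Run it once against a directory on the SAN LUN and once against local
disk, then compare the medians and worst cases.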

Locally attached/internal/JBOD storage typically offers the best
application performance per dollar spent, until you get to things like
backup scenarios, where off-node network throughput is very low, and
your backup software may suffer performance deficiencies, which is the
issue that titles this thread.  Shipping full or incremental file backups
across ethernet is extremely inefficient, especially with very large
filesystems.  This is where SAN arrays with snapshot capability come in
really handy.
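
To put a rough number on the ethernet part, here is a small Python
sketch of the wire time alone for a 1TB store over a single GbE link.
It is a lower bound only: it ignores per-file overhead, which is what
really hurts with millions of small maildir files, and the 30%
efficiency figure is purely an assumed value for illustration.

```python
def min_transfer_hours(size_bytes, link_bits_per_sec, efficiency=1.0):
    """Lower bound on hours needed to move size_bytes over a link."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / 3600


TB = 10**12      # 1 TB, decimal
GBE = 10**9      # 1 Gb/s line rate

print(round(min_transfer_hours(TB, GBE), 2))       # 2.22 hours at ideal line rate
print(round(min_transfer_hours(TB, GBE, 0.3), 2))  # 7.41 hours at an assumed 30%
```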

The snap takes place wholly within the array and is very fast, without
the problems you see with host based snapshots such as with Linux LVM,
where you must first freeze the filesystem and then wait for the
snapshot to complete, which could take a very long time with a 1TB FS.
While this
occurs your clients must wait or timeout while trying to access
mailboxes.  With a SAN array snapshot system this isn't an issue as the
snap is transparent to hosts with little or no performance degradation
during the snap.  Two relatively inexpensive units that have such
snapshot capability are:

http://www.equallogic.com/products/default.aspx?id=10613

http://h10010.www1.hp.com/wwpc/us/en/sm/WF04a/12169-304616-241493-241493-241493.html

The Equallogic units are 1/10 GbE iSCSI only IIRC, whereas the HP can be
had in 8Gb FC, 1/10Gb iSCSI, or 6Gb direct attach SAS.  Each offers 4 or
more host/network connection ports when equipped with dual controllers.
There are many other vendors with similar models/capabilities.  I
mention these simply because Dell/HP are very popular and many OPs are
already familiar with their servers and other products.

> 	We thought ZFS would provide us with a nice snapshot and backup system
> (with zfs send). We never got that far once we discovered that ZFS
> doesn't work very well in this context. Running rsync on it gave us
> terrible performance.

There are 3 flavors of ZFS:  native Oracle Solaris, native FreeBSD,
Linux FUSE.  Which were you using?  If the last, that would fully
explain the suck.

>> Also, you speak of a very large maildir store, with hundreds of
>> thousands of directories, obviously many millions of files, of 1TB total
>> size.  Thus I would assume you have many thousands of users, if not 10s
>> of thousands.
>>
>> It's a bit hard to believe you're not running XFS on your storage, given
>> your level of parallelism.  You'd get much better performance using XFS
>> vs EXT4.  Especially with kernel 2.6.39 or later which includes the
>> delayed logging patch.  This patch increases metadata write throughput
>> by a factor of 2-50+ depending on thread count, and decreases IOPS and
>> MB/s hitting the storage by about the same factor, depending on thread
>> count.
> 
> 	I'm relatively new here, but I'll ask around about XFS and see if
> anyone had tested it in the development environment.

If they'd tested it properly, and relatively recently, I would think
they'd have already replaced EXT4 on your Dovecot server.  Unless other
factors prevented such a migration.  Or unless I've misunderstood the
size of your maildir workload.
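
As a quick sanity check on the kernel side, a Python sketch like the
following compares a kernel release string against the 2.6.39 mentioned
above.  The function name is hypothetical, and a version number alone
is not conclusive, since distro kernels often backport features into
older version numbers:

```python
import re


def kernel_at_least(release, minimum=(2, 6, 39)):
    """Return True if a release string like '3.2.0-23-generic' is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    # A missing third component (e.g. "3.0") is treated as 0
    version = tuple(int(part or 0) for part in match.groups())
    return version >= minimum
```

On the host itself you would pass it the output of `uname -r`, or
platform.release() from Python's standard library.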

-- 
Stan