[Dovecot] dsync redesign

Thu Mar 29 15:24:05 EEST 2012

On 3/28/2012 3:54 PM, Jeff Gustafson wrote:
> On Wed, 2012-03-28 at 11:07 -0500, Stan Hoeppner wrote:
> 
>> Locally attached/internal/JBOD storage typically offers the best
>> application performance per dollar spent, until you get to things like
>> backup scenarios, where off node network throughput is very low, and
>> your backup software may suffer performance deficiencies, as is the
>> issue titling this thread.  Shipping full or incremental file backups
>> across ethernet is extremely inefficient, especially with very large
>> filesystems.  This is where SAN arrays with snapshot capability come in
>> really handy.
> 
> 	I'm a new employee at the company. I was a bit surprised they were not
> using iSCSI. They claim they just can't risk the extra latency. I

The tiny amount of extra latency from a software initiator is a
non-argument for a mail server workload, unless the server is undersized
for the workload, i.e. constantly high CPU load and low free memory.  As
I said, in that case you drop in an iSCSI HBA, which takes the initiator
processing off the host and removes that source of block latency.

> believe that you are right. It seems to me that offloading snapshots and
> backups to an iSCSI SAN would improve things. 

If you get the right unit you won't understand how you ever lived
without it.  The snaps complete transparently, and the data is on the
snap LUN within a few minutes, depending on the priority you give to
internal operations (snaps, rebuilds, etc.) vs. external IO requests.
Depending on model

> The problem is that this
> company has been burned on storage solutions more than once and they are
> a little skeptical that a product can scale to what they need. There are

More than once?  More than once??  Hmm...

> some SAN vendor names that are a four letter word here. So far, their
> newest FC SAN is performing well.

Interesting.  Care to name them (off list)?

> 	I think having more, small, iSCSI boxes would be a good solution. One
> problem I've seen with smaller iSCSI products is that feature sets like
> snapshotting are not the best implementation. It works, but doing any
> sort of automation can be painful.

As is most often the case, you get what you pay for.

>> The snap takes place wholly within the array and is very fast, without
>> the problems you see with host based snapshots such as with Linux LVM,
>> where you must first freeze the filesystem, wait for the snapshot to
>> complete, which could be a very long time with a 1TB FS.  While this
>> occurs your clients must wait or timeout while trying to access
>> mailboxes.  With a SAN array snapshot system this isn't an issue as the
>> snap is transparent to hosts with little or no performance degradation
>> during the snap.  Two relatively inexpensive units that have such
>> snapshot capability are:
> 
> 	How does this work? I've always had Linux create a snapshot. Would the
> SAN doing a snapshot without any OS buy-in cause the filesystem to be
> saved in an inconsistent state? I know that ext4 is pretty good at
> logging, but still, wouldn't this be a problem?

Instead of using "SAN" as a generic term for a "box", which it is not,
please use "SAN" for "storage area network"; "SAN array" or "SAN
controller" for a box, with or without disks, that performs the block
IO shipping and other storage functions; and "SAN switch" for a Fibre
Channel switch, or an Ethernet switch dedicated to the SAN
infrastructure.  The acronym "SAN" is an umbrella covering many
different types of hardware and network topologies.  It drives me nuts
when people call a Fibre Channel or iSCSI disk array a "SAN".  These can
be part of a SAN, but are not themselves a SAN.  If they are directly
connected to a single host they are simple disk arrays, and the word
"SAN" isn't relevant.  Only uneducated people, or those who simply don't
care to be technically correct, call a single intelligent disk box a
"SAN".  Ok, end rant on "SAN".

Read this primer from Dell:
http://files.accord.com.au/EQL/Docs/CB109_Snapshot_Basic.pdf

The snapshots occur entirely at the controller/disk level inside the
box.  This is true of all SAN arrays that offer snap ability.  There is
no host OS involvement at all in the snap.  As I previously said, it's
transparent.  Snaps are filesystem independent, and are point-in-time
(PIT) copies of one LUN to another.  Read up on "LUN" if you're not
familiar with the term.  Everything in SAN storage is based on LUNs.
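
To make the point-in-time idea concrete, here is a toy Python model of a
block-level snapshot.  It is only a sketch: it assumes a copy-on-write
scheme, which is one common approach but not necessarily what any given
array does internally, and the ToyLun/ToySnapshot names are made up for
illustration.  The point is that the snapshot works purely on blocks and
knows nothing about the filesystem or host OS sitting on top.

```python
class ToySnapshot:
    """Hypothetical point-in-time view of a ToyLun (copy-on-write)."""

    def __init__(self, lun):
        self.lun = lun
        self.saved = {}  # block index -> contents at snapshot time

    def preserve(self, index, old_data):
        # Keep only the first (oldest) version seen after the snapshot.
        self.saved.setdefault(index, old_data)

    def read(self, index):
        # Changed blocks come from the saved copy, unchanged blocks
        # are read straight from the live LUN.
        if index in self.saved:
            return self.saved[index]
        return self.lun.read(index)


class ToyLun:
    """Hypothetical LUN: a fixed-size list of opaque blocks."""

    def __init__(self, num_blocks):
        self.blocks = [b"\x00"] * num_blocks
        self.snapshots = []

    def snapshot(self):
        # Taking the snapshot copies no data, so it is near-instant.
        snap = ToySnapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Save the old block for every snapshot before overwriting it.
        for snap in self.snapshots:
            snap.preserve(index, self.blocks[index])
        self.blocks[index] = data

    def read(self, index):
        return self.blocks[index]


lun = ToyLun(4)
lun.write(0, b"mail-v1")
snap = lun.snapshot()
lun.write(0, b"mail-v2")         # host keeps writing after the snap
print(lun.read(0), snap.read(0))  # b'mail-v2' b'mail-v1'
```

The host sees only ToyLun.write() and ToyLun.read(); everything related
to the snapshot happens on the "array" side of that interface.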

Now, as the document above will tell you, array based snapshots may or
may not be a total backup solution for your environment.  You need to
educate yourself and see if this technology is a feature that fits your
file backup and disaster avoidance and recovery needs.

>> http://www.equallogic.com/products/default.aspx?id=10613
>>
>> http://h10010.www1.hp.com/wwpc/us/en/sm/WF04a/12169-304616-241493-241493-241493.html
>>
>> The Equallogic units are 1/10 GbE iSCSI only IIRC, whereas the HP can be
>> had in 8Gb FC, 1/10Gb iSCSI, or 6Gb direct attach SAS.  Each offers 4 or
>> more host/network connection ports when equipped with dual controllers.
>>  There are many other vendors with similar models/capabilities.  I
>> mention these simply because Dell/HP are very popular and many OPs are
>> already familiar with their servers and other products.
> 
> 	I will take a look. I might have some convincing to do. 

SAN array features/performance are an easy sell.  Price, not so much.
Each fully loaded ~24 drive SAN array is going to run you between $15k
and $30k USD, depending on the vendor, how many spindles you need for
IOPS, the disk size needed for total storage, the snap/replication
features you need, expandability, etc.
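
A back-of-the-envelope sketch of that sizing arithmetic, in Python.  The
~24 drives per array and the $15k-$30k per array figures are the ones
above; the target IOPS and the per-drive IOPS in the example call are
purely assumed numbers for illustration, and the estimate ignores RAID
write penalty, cache, and capacity requirements.

```python
import math


def estimate(target_iops, iops_per_drive, drives_per_array=24,
             price_range=(15_000, 30_000)):
    """Rough spindle/array/cost estimate; all inputs are assumptions."""
    # Spindles needed to reach the target IOPS, rounded up.
    drives = math.ceil(target_iops / iops_per_drive)
    # Whole arrays needed to house that many drives.
    arrays = math.ceil(drives / drives_per_array)
    low, high = (arrays * price for price in price_range)
    return drives, arrays, low, high


# Assumed workload: 5000 IOPS target, 150 IOPS per spindle.
drives, arrays, low, high = estimate(5000, 150)
print(f"{drives} drives, {arrays} arrays, ${low:,}-${high:,} USD")
```

Plug in measured numbers from your own servers before taking any such
estimate to management.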

>> There are 3 flavors of ZFS:  native Oracle Solaris, native FreeBSD,
>> Linux FUSE.  Which were you using?  If the last, that would fully
>> explain the suck.
> 
> 	There is one more that I had never used before coming on board here:
> ZFSonLinux. ZFSonLinux is a real kernel level fs plugin. My

It's a "roll your own" patch set not in mainline and not supported by
any Linux distro/vendor, AFAIK.  Which is why I didn't include it.

> understanding is that they were using it on the backup machines with the
> front end dovecot machines using ext4. I'm told the metadata issue is a
> ZFS thing and they have the same problem on Solaris/Nexenta. 

I've never used ZFS, and don't plan to, so I can't really comment on
this.  That and I have no technical details of the problem.

>>> 	I'm relatively new here, but I'll ask around about XFS and see if
>>> anyone has tested it in the development environment.
>>
>> If they'd tested it properly, and relatively recently, I would think
>> they'd have already replaced EXT4 on your Dovecot server.  Unless other
>> factors prevented such a migration.  Or unless I've misunderstood the
>> size of your maildir workload.
> 
> 	I don't know the entire history of things. I think they really wanted
> to use ZFS for everything and then fell back to ext4 because it
> performed well enough in the cluster. Performance becomes an issue with
> backups using rsync. Rsync is faster than Dovecot's native dsync by a
> very large margin. I know that dsync is doing more than rsync, but
> still, seconds compared to over five minutes? That is a significant
> difference. The problem is that rsync can't get a perfect backup.

This happens with a lot of "fan boys".  There was so much hype
surrounding ZFS that even many logically thinking people were frothing
at the mouth waiting to get their hands on it.  Then, as with many/most
things in the tech world, the goods didn't live up to the hype.

XFS has been around since 1994, has never had hype surrounding it, and
has simply been steadily and substantially improved over time.  Since
day 1 it has been the highest performance filesystem for parallel
workloads, and it finally overcame the last barrier preventing it from
being suitable for just about any workload:  metadata write performance.
That makes it faster than any other FS with the maildir workload when
sufficient parallelism/concurrency is present, meaning servers with a
few thousand active users will benefit.  Those with 7 users won't.

-- 
Stan