[Dovecot] High Performance and Availability

Stan Hoeppner stan at hardwarefreak.com
Tue Feb 16 15:29:24 EET 2010


Wayne Thursby put forth on 2/15/2010 11:42 PM:
> Hello everyone,

Note the domain in my email addy, Wayne. ;)

> I have been looking at the Dell EqualLogic stuff and it seems to provide
> what we need. I can get most of the information I need from the rep, but
> I wonder if anyone has any experience with high performance requirements
> on these kinds of storage.

EqualLogic makes nice iSCSI SAN storage arrays with built-in multiple-snapshot
capability and whatnot, but IMHO they're way overpriced for what you get.

> I'd like to continue running my current hardware as the primary mail
> server, but provide some kind of failover using the SAN. The primary
> usage of the SAN will be to make our 2TB document store highly
> available. I'm wondering what kind of options I might have in the way of
> piggybacking some email failover on this kind of hardware without
> sacrificing the performance I'm currently enjoying.

Give me the specs on your current SAN setup and I'll give you some good options.

1.  What/how many FC switches do you have, Brocade, Qlogic, etc?
2.  What make/model is your current SAN array controller(s), what disk config?

> Is it possible to go with a virtual machine mounted on iSCSI acting as a
> backup mail server? How would I sync the two, NBD+MD? Any experience
> doing this with maildirs? I wonder about the performance.

This isn't the way to go about it.  You already have an FC SAN and VMware ESX.
ESX+SAN is _THE_ way to do HA/failover with Vmotion.  I haven't used it since
ESX3, but I must say, there is no better solution available on the planet.  It's
nearly perfect.

> Can it be as simple as attaching my MD-1000's second controller to the
> SAN magic box via SAS and pressing the Easy button?

No.  The MD-1000 is direct attached storage, i.e. dumb storage, and you have it
configured with a hardware RAID controller in a single host.  You can't share it
with another host.  To share a storage array among multiple hosts you need an
intelligent controller in the array chassis doing the RAID, multiple host port
connections (FC, SCSI, iSCSI), and a cluster filesystem on the hosts
to coordinate shared fs access and file locking.  This is exactly what ESX does
with multiple ESX hosts and a SAN array.

> Is it as expensive as running my primary mailserver mounted from the SAN
> via Fiber Channel? Will that get me under 30ms latency?

I'm not sure what you mean by "expensive" in this context.  The latency will
depend on the SAN storage array(s) and the FC network.  From experience, it is
typically extremely low, adding only an extra few milliseconds to disk access
time--less than 1ms at low load to maybe 3-5ms for a loaded, good quality SAN
array.  It is also somewhat dependent on the number of FC switch hops between
the ESX hosts and the SAN array box--the more switch hops in the chain, the
greater the FC network latency.  That said, the greatest latency is going to be
introduced by the SAN storage controllers (the smart circuit boards inside the
SAN disk boxen that perform the RAID and FC input/output functions).
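
To make that arithmetic concrete, here's a rough back-of-the-envelope sketch in
Python.  Every number in it is an illustrative assumption on my part, not a
measurement from any particular array or switch:

    # Rough FC SAN latency budget -- all figures are illustrative assumptions.
    def san_latency_ms(disk_ms, controller_ms, hops, per_hop_ms=0.02):
        """Native disk access time plus SAN controller overhead plus a small
        amount of latency per FC switch hop in the path."""
        return disk_ms + controller_ms + hops * per_hop_ms

    # Lightly loaded, good quality array, one switch hop between host and array
    print(san_latency_ms(disk_ms=5.0, controller_ms=0.5, hops=1))   # ~5.5 ms

    # Heavily loaded array, three switch hops in the chain
    print(san_latency_ms(disk_ms=5.0, controller_ms=4.0, hops=3))   # ~9.1 ms

The point of the sketch is simply that the switch hops contribute microseconds
while the controllers and disks contribute milliseconds.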

To give you an idea of the performance you can get from an FC SAN and a couple
of decent storage arrays, I architected and implemented a small FC SAN for a 500
user private school.  I had 7 blade servers, 2 ESX, 4 Citrix, and one Exchange
server, none with local disk.  Everything booted and ran from SAN storage, the
VMs and all their data, the Exchange server and its store, the Citrix blades,
everything.

We had about 20 VMs running across the two ESX blades and I could vmotion just
about any VM guest server, IN REAL TIME, from one ESX blade server to the other,
in less than 5 seconds.  Client network requests were never interrupted.
Vmotion is freak'n amazing technology.  Anyway, our total CPU and SAN load over
the entire infrastructure averaged about 20% utilization.  The VMs included two
AD DCs, an MS SQL server, Windows file/print servers, myriad SuSE VMs, one
running a 400GB iFolder datastore (think network file shares on steroids: fully
synchronized roaming laptop filesystems sync'd in real time over the network or
internet to the iFolder data store), a Novell ZEN server for SuSE Linux
workstation push/pull updates and laptop imaging, a Moodle PHP/MySQL based
course management system with a 50GB db, a Debian syslog collector, etc, etc.

Most of the VMs' boot disk images resided on an IBM FAStT600 array with 14 x
73GB 15Krpm disks in two RAID5 arrays, one with 6 disks, one with 7, and one
hot spare, with only 128MB of write cache.  The 4 Citrix blades used FAStT LUNs
for their local disks, and the Exchange server booted from a FAStT LUN and had
its DB stored on a FAStT LUN.  All other data storage, including that of all
the VMs, resided on a Nexsan Satablade SAN storage array consisting of 8 x
500GB 7.2Krpm disks configured in a RAID5 set, no spares, 512MB write cache.

The Bladecenter had an inbuilt 2 port FC switch.  I uplinked these two ports
via ISL to an 8 port Qlogic 2Gb FC switch.  I had one 2Gb FC link from the
FAStT into the Qlogic switch and two 2Gb links from the Satablade into the
switch.  For 500 users and every disk access going to these two SAN arrays, the
hardware was actually overkill for current needs.  But it had plenty of
headroom for spikes and future growth in terms of throughput, latency, and
storage capacity.

I ran an entire 500 user environment, all systems, all applications, on two
relatively low end FC SAN boxen, and you're concerned about the performance of
a single SMTP/IMAP mail server over a SAN?  I don't think you need to worry
about performance, as long as it's all set up correctly.  ;)

To do this properly, you'll need a second Dell server with an FC HBA, an FC HBA
for the existing server, and the ESX vmotion and HA options, which I'm not sure
are available for ESXi.  You may have to upgrade to ESX, which as you know has
some pricey licensing.  But it's worth the cost just for vmotion/HA.

You'll export a SAN LUN of sufficient size (500GB-1TB) to cover the IMAP store
needs from one of the SAN storage arrays to the WWNs of the HBAs in both of
your two ESX hosts, and you'll add that LUN to the ESX storage pool as a raw
LUN.  Do NOT make it a VMFS volume.  It's going to be huge, and you're only
storing data on it, not virtual machines.  VMFS volumes are for virtual machine
storage, not data storage.  Performance will suffer if you put large data in
VMFS volumes.  I cannot stress this enough.

For HA and vmotion to work, you'll also need to export a small SAN LUN (20GB)
to both ESX hosts' FC WWNs, format it as a VMFS volume, and move the
Postfix/Dovecot virtual machine to that ESX VMFS volume.  I'm assuming you have
the Postfix spool in the same VMFS volume as the boot and root filesystems.
This allows both ESX hosts to boot and run the VM and enables vmotion and HA.
(I sincerely hope you don't currently have the VM files and the data store for
your current IMAP store all in a single VMFS volume.  That's a horrible ESX
implementation and will make this migration a bear due to all the data
shuffling you'll have to do between partitions/filesystems, and the fact you'll
probably have to shut down the server during the file moves.)
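
If it helps to put numbers behind the 500GB-1TB figure, here's a trivial sizing
sketch in Python.  The user count, average maildir size, and growth rate are
hypothetical placeholders--plug in your own:

    # Trivial IMAP store LUN sizing sketch -- the mailbox count, average size
    # and growth figures below are hypothetical placeholders, not your numbers.
    users             = 300      # mailboxes on the server (assumed)
    avg_mailbox_gb    = 1.5      # average maildir size per user (assumed)
    annual_growth     = 0.30     # 30% growth per year (assumed)
    years_of_headroom = 3

    needed_gb = users * avg_mailbox_gb * (1 + annual_growth) ** years_of_headroom
    print(f"Raw store today: {users * avg_mailbox_gb:.0f} GB")
    print(f"LUN size with {years_of_headroom} years headroom: {needed_gb:.0f} GB")
    # With these assumptions you land around 990 GB, i.e. at the top of the
    # 500GB-1TB range mentioned above.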

You may need to add a soft zone to your FC switches containing the WWNs of the
ESX host HBAs and the WWN(s) of the SAN storage array ports you're exporting
the LUNs through before you'll see the exposed LUNs on the arrays.  Once you
have ESX, vmotion, and HA running on both ESX machines, all you have to do is
enable HA failover for the Postfix/Dovecot VM.  If the ESX host on which it's
running, or the VM guest, dies for any reason, the guest will be auto-restarted
within seconds on the other ESX host.  This happens near instantaneously and
transparently, because both hosts have "local disk" access to the same .vmdk
files and raw data LUN on the SAN arrays.  Clients probably won't even see an
error during the failover, as IMAP clients reconnect and login automatically.
The name and IP address of the server stay the same; the underlying server
itself, all its config and spool files, metadata files, everything is identical
to before the crash.  It's just running on a different physical ESX machine.
This capability, more than anything else, is what makes VMware ESX worth the
licensing costs.  Absolutely seamless fault recovery.  If an organization can
afford it (can any not?), it's the only way to go for x86 based systems.
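
To illustrate why the failover looks like nothing more than a brief disconnect
from the client side, here's a minimal reconnect loop of the kind most IMAP
clients implement internally.  It's just a sketch using Python's imaplib; the
host name and credentials are placeholders, not anything from your setup:

    # Minimal IMAP reconnect loop of the sort mail clients do internally.
    # Host name and credentials are placeholders for illustration only.
    import imaplib
    import time

    def connect(host, user, password, retries=10, delay=5):
        """Keep retrying until the server (or its HA replacement) answers."""
        for attempt in range(retries):
            try:
                conn = imaplib.IMAP4_SSL(host)
                conn.login(user, password)
                conn.select("INBOX")
                return conn
            except (imaplib.IMAP4.error, OSError):
                # Server unreachable, e.g. mid-failover: wait and retry.
                time.sleep(delay)
        raise RuntimeError("IMAP server did not come back within retry window")

    # conn = connect("imap.example.com", "wayne", "secret")

Since the restarted VM comes back with the same name, IP, and mail store, a
retry loop like this is all the client ever needs.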

> I welcome any suggestions the group may have.

Unfortunately for this ESX HA architecture, your current MD-1000 isn't
reusable.  Direct attached storage will never work for any workable/functional
HA setup.  If I were you, after you migrate your VMs to the SAN as I describe
above, and obviously after you're comfortable all went as planned, I'd direct
attach the MD-1000 to another server and use it as a near-line network backup
server or for some other meaningful purpose.

If your current FC SAN storage array doesn't have enough spare capacity
(performance/space), you can get a suitable unit from Nexsan and other
manufacturers for $10-15K in a single-controller version.  I personally
recommend this:

http://www.nexsan.com/sataboy.php
http://www.sandirect.com/product_info.php?cPath=171_208_363&products_id=1434

Get a unit with the FC+iSCSI controller, 14 x 500GB drives, and 1GB cache.
This is a standard product configuration at Sandirect.  Configure a 13 drive
RAID5 array with one spare.  You may be balking at 7.2Krpm SATA drives and a
RAID5 setup.  I can tell you from experience with Nexsan's 8 drive Satablade
unit (now discontinued), with "only" 512MB cache and an 8 drive RAID5, you
won't come close to hitting any performance limits with your current load.
This setup would likely carry 4-6 times your current load before introducing
latency.  Nexsan uses a PowerPC 64 chip on this controller and a very efficient
RAID parity algorithm.  The performance hit due to parity calculations going
from RAID10 to RAID5 is about 10%, but you gain that back because your stripe
width is 13 instead of 7 (assuming you use all 14 drives for the RAID10 with no
spares).  If you use spares with RAID10 you must have two, since RAID10
requires an even number of disks, making your stripe width 6.  Thus, with
spares for each, the RAID5 stripe width is _double_ the RAID10 width.
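
To put that stripe width and capacity arithmetic in one place, here's a quick
restating of the numbers above in Python (raw drive counts only; it ignores
formatting overhead and the like):

    # Stripe width and usable capacity for the 14 x 500GB configurations above.
    drive_gb     = 500
    total_drives = 14

    # RAID5: 13 drives in the array plus 1 hot spare
    raid5_drives = total_drives - 1                 # 13
    raid5_stripe = raid5_drives                     # 13 spindles in the stripe
    raid5_usable = (raid5_drives - 1) * drive_gb    # one drive's worth of parity

    # RAID10 with spares: spares come in pairs, so 12 drives in the array
    raid10_drives = total_drives - 2                # 12
    raid10_stripe = raid10_drives // 2              # 6 mirrored pairs
    raid10_usable = raid10_stripe * drive_gb
    # (with no spares at all, 14 drives make 7 pairs, i.e. stripe width 7)

    print(f"RAID5 : stripe width {raid5_stripe}, usable ~{raid5_usable} GB")   # 13, ~6000
    print(f"RAID10: stripe width {raid10_stripe}, usable ~{raid10_usable} GB") # 6, ~3000

You also end up with roughly twice the usable capacity out of the same 14
drives, which is a nice side effect.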

The web GUI admin interface is fantastic and simple.  Configure the RAID5
array, then create your initial volumes that you'll export as LUNs to your two
ESX hosts' WWNs.  Only connect one FC port to the FC switch, and expose the
volumes to both ESX hosts out the same port with the same LUN.  This is
critical.  Both ESX hosts need to see the same info or you will break things.
If you balk again thinking a single 2Gb/4Gb FC link won't be fast enough, you'd
be wrong (see the quick bandwidth math below).  In addition, if you want to use
both ports, you must have dual FC adapters in each ESX host, and you must
expose all LUNs out BOTH Sataboy ports to both FC WWNs on each ESX host.  You
then have to set up ESX FC multipathing, which IIRC is another additional
licensing fee, although I'm not positive on that.  To add insult to injury,
AFAIK Nexsan isn't an ESX certified SAN vendor, so if you run into problems
getting the multipathing to work, VMware techs probably won't help you.  As of
2006 they weren't certified; they might be today, not sure.  All their gear is
fully FC compliant, but I guess they never felt like paying the "VMware tax".
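
Here's the bandwidth math I'm referring to, as a rough Python sketch.  The mail
workload numbers are an assumption on my part, not anything measured on your
server:

    # Quick sanity check on a single FC link vs. a mail server workload.
    def fc_payload_mb_per_s(gbit):
        """Rough usable payload of an FC link: line rate less 8b/10b encoding
        overhead, converted from bits to bytes."""
        return gbit * 1000 * 0.8 / 8

    # Hypothetical mail workload: a few thousand small random I/Os per second
    assumed_iops, io_kb = 3000, 8
    workload_mb_per_s = assumed_iops * io_kb / 1024

    print(f"2Gb FC payload : ~{fc_payload_mb_per_s(2):.0f} MB/s")   # ~200 MB/s
    print(f"4Gb FC payload : ~{fc_payload_mb_per_s(4):.0f} MB/s")   # ~400 MB/s
    print(f"Mail workload  : ~{workload_mb_per_s:.0f} MB/s")        # ~23 MB/s

A single mail server's random I/O is tens of MB/s at most; the link has an
order of magnitude of headroom, and latency, not bandwidth, is what matters.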

This is probably way too much, less than optimally organized information for
the list, and probably a shade OT.  I'd be more than glad to continue this off
list with anyone interested in FC SAN stuff.  I've got some overly aggressive
spam filters, so if I block a direct email, hit postmaster@ my domain and I'll
see it.

-- 
Stan
