[Dovecot] Best Cluster Storage

Stan Hoeppner stan at hardwarefreak.com
Fri Jan 21 17:09:37 EET 2011


Jan-Frode Myklebust put forth on 1/21/2011 5:49 AM:
> On Thu, Jan 20, 2011 at 10:14:42PM -0600, Stan Hoeppner wrote:
>>
>> Have you considered SGI CXFS?  It's the fastest cluster FS on the planet by an
>> order of magnitude.  It uses dedicated metadata servers instead of a DLM, which
>> is why it's so fast.  Directory traversal operations would be orders of
>> magnitude faster than what you have now.
> 
> That sounds quite impressive. Order-of-magnitude improvements would be
> very welcome. Do you have any data to back up that statement? Are you
> talking streaming performance, IOPS, or both?

Both.

> I've read that CXFS has bad metadata performance, and that the
> metadata server can become a bottleneck. Can the metadata-server
> function only run on one node (with a passive standby node for
> availability)?

Where did you read this?  I'd like to take a look.  CXFS is faster than other
cluster filesystems _because of_ the metadata broker: it is much faster than
distributed lock manager schemes at high loads, and equally fast at low loads.
There is one active metadata broker server _per filesystem_, with as many
standby backup servers per filesystem as you want.  So for a filesystem seeing
heavy IOPS you'd want a dedicated metadata broker, while for filesystems
storing large amounts of data but generating few metadata IOPS you would run
one broker server for several such filesystems.
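
To make that layout concrete, here's a toy sketch in Python.  This is not
CXFS tooling; the hostnames, IOPS figures, and the cutoff are all invented
for illustration:

IOPS_THRESHOLD = 50_000  # assumed cutoff for "heavy" metadata load

filesystems = {            # filesystem -> estimated metadata IOPS
    "mailstore": 120_000,  # maildir traffic: metadata-heavy
    "archive":     2_000,  # bulk data, few metadata ops
    "logs":        1_500,
}

dedicated = [fs for fs, iops in filesystems.items() if iops >= IOPS_THRESHOLD]
shared    = [fs for fs, iops in filesystems.items() if iops <  IOPS_THRESHOLD]

for fs in dedicated:  # heavy filesystems: one active broker each, plus standbys
    print(f"{fs}: active=mds-{fs}, standbys=[mds-{fs}-b1, mds-{fs}-b2]")
if shared:            # light filesystems: pack onto one shared broker
    print(f"{', '.join(shared)}: active=mds-shared, standbys=[mds-shared-b1]")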

Using GbE for the metadata network yields excellent performance.  Using
InfiniBand is even better, especially with large CXFS client node counts under
high loads, due to the dramatically lower packet latency through the switches
and the typical 20 or 40 Gbit signaling rate of 4x DDR/QDR.  Using InfiniBand
for the metadata network actually helps DLM cluster filesystems more than those
with metadata brokers.
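
A quick back-of-envelope in Python shows why the latency matters.  The
round-trip figures below are rough assumptions, not benchmarks:

round_trip_us = {"GbE": 50.0, "InfiniBand 4x DDR/QDR": 2.0}

for fabric, rtt in round_trip_us.items():
    # A client that serializes one metadata RPC per operation (e.g. a
    # stat() per file during a directory traversal) is capped at:
    print(f"{fabric}: ~{1_000_000 / rtt:,.0f} serialized metadata ops/sec")

With those assumed numbers a single traversing client tops out around 20,000
metadata ops/sec over GbE versus roughly 500,000 over InfiniBand.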

> Do you know anything about the pricing of CXFS? I'm quite satisfied with
> GPFS, but I know I might be a bit biased since I work for IBM :-) If CXFS
> really is that good for maildir-type storage, I probably should have
> another look...

Given the financial situation SGI has found itself in over the last few years,
I have no idea how they're pricing CXFS or the SAN arrays.  One acquisition
downside to CXFS is that the metadata brokers must be deployed on SGI hardware
only, and their servers are more expensive than most nearly identical competing
products.

Typically, they only sell CXFS as an add-on to their fiber channel SAN
products, so it's not an inexpensive solution.  It's extremely high
performance, but you pay for it.  Honestly, for most organizations doing mail
clusters, unless you have a _huge_ user base and a big budget, you may not be
able to afford an SGI solution for mail cluster data storage.  It never hurts
to ask, though, and sales people's time is free to potential customers, so if
your current cluster filesystem+SAN isn't cutting it, talk to an SGI
salesperson.

At minimum you're probably looking at the cost of an Altix UV10 for the
metadata broker server, an SGI InfiniteStorage 4100 array, and a CXFS license
for each cluster node you connect.  Obviously you'll need other things such as
a fiber channel switch, HBAs, etc., but that's the same for any other fiber
channel cluster setup.

Even though you may pay a small price premium, SGI's fiber channel arrays are
truly some of the best available.  The specs on their lowest-end model, the
4100, are pretty darn impressive for the _bottom_ of the line card:
http://www.sgi.com/pdfs/4180.pdf

If/when deploying such a solution, it really pays to use a few fat Dovecot
nodes instead of many thin ones.  A few big-core-count boxes with lots of
memory and a single FC HBA each cost less in the long run than many
low-core-count boxes with little memory and an HBA each.  The cost of a
single-port FC HBA is typically more than that of a white-box 1U single-socket
quad-core server with 4GB of RAM.  Add the FC HBA and CXFS license to each node
and you can see why fewer, larger nodes are better; the sketch below puts rough
numbers on it.
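
Here's that arithmetic as a quick Python sketch.  Every price below is a
placeholder assumption, not a quote; the point is only that the fixed per-node
costs (HBA + license) dominate when the nodes themselves are cheap:

HBA = 800        # assumed single-port FC HBA price, USD
LICENSE = 2_500  # assumed per-node CXFS client license, USD

def cluster_cost(nodes, server_price):
    # total = server price plus the fixed HBA + license, paid per node
    return nodes * (server_price + HBA + LICENSE)

thin = cluster_cost(16, 700)    # 16 white-box 1U quad-core, 4GB boxes
fat  = cluster_cost(4, 6_000)   # 4 fatter boxes, ~4x the cores and RAM

print(f"16 thin nodes: ${thin:,}")  # HBA + license paid 16 times
print(f" 4 fat nodes:  ${fat:,}")   # HBA + license paid only 4 times

With these made-up prices the 16-node cluster runs $64,000 against $37,200 for
the 4 fat nodes, even though each fat server costs far more.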

-- 
Stan

