[Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

Stan Hoeppner stan at hardwarefreak.com
Wed Apr 11 10:18:49 EEST 2012


On 4/10/2012 1:09 AM, Emmanuel Noobadmin wrote:
> On 4/10/12, Stan Hoeppner <stan at hardwarefreak.com> wrote:

>> SuperMicro H8SGL G34 mobo w/dual Intel GbE, 2GHz 8-core Opteron
>> 32GB Kingston REG ECC DDR3, LSI 9280-4i4e, Intel 24 port SAS expander
>> 20 x 1TB WD RE4 Enterprise 7.2K SATA2 drives
>> NORCO RPC-4220 4U 20 Hot-Swap Bays, SuperMicro 865W PSU
>> All other required parts are in the Wish List.  I've not written
>> assembly instructions.  I figure anyone who would build this knows what
>> s/he is doing.
>>
>> Price today:  $5,376.62
> 
> This price looks like something I might be able to push through

It's pretty phenomenally low considering what all you get, especially 20
enterprise class drives.

> although I'll probably have to go SATA instead of SAS due to cost of
> keeping spares.

The 10K drives I mentioned are SATA not SAS.  WD's 7.2k RE and 10k
Raptor series drives are both SATA but have RAID specific firmware,
better reliability, longer warranties, etc.  The RAID specific firmware
is why both are tested and certified by LSI with their RAID cards.

>> Configuring all 20 drives as a RAID10 LUN in the MegaRAID HBA would give
>> you a 10TB net Linux device and 10 stripe spindles of IOPS and
>> bandwidth.  Using RAID6 would yield 18TB net and 18 spindles of read
>> throughput, however parallel write throughput will be at least 3-6x
>> slower than RAID10, which is why nobody uses RAID6 for transactional
>> workloads.
> 
> Not likely to go with RAID 5 or 6 due to concerns about the
> uncorrectable read errors risks on rebuild with large arrays. Is the

Not to mention rebuild times for large width RAID5/6.
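
Rough back of the envelope, just to put numbers on that concern (the URE
spec and sustained rate below are assumptions for illustration, not
figures off the RE4 datasheet -- enterprise drives are usually rated an
order of magnitude better than desktop drives):

  # raid_rebuild_risk.py - rough rebuild time / URE math for a wide array
  DRIVES    = 20
  DRIVE_TB  = 1.0      # per drive capacity
  URE_RATE  = 1e-14    # unrecoverable read errors per bit read (assumed;
                       # typical desktop SATA spec, enterprise ~1e-15)
  SUSTAINED = 100e6    # bytes/sec sustained per drive (assumed)

  # A RAID5/6 rebuild has to read every surviving drive end to end.
  surviving_bits = (DRIVES - 1) * DRIVE_TB * 1e12 * 8
  expected_ures  = surviving_bits * URE_RATE
  rebuild_hours  = (DRIVE_TB * 1e12 / SUSTAINED) / 3600  # best case, idle array

  print("expected UREs during rebuild: %.2f" % expected_ures)
  print("minimum rebuild time: %.1f hours" % rebuild_hours)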

> MegaRAID being used as the actual RAID controller or just as a HBA?

It's a top shelf RAID controller, 512MB cache, up to 240 drives, SSD
support, the works.  It's an LSI "Feature Line" card:
http://www.lsi.com/products/storagecomponents/Pages/6GBSATA_SASRAIDCards.aspx

The specs:
http://www.lsi.com/products/storagecomponents/Pages/MegaRAIDSAS9280-4i4e.aspx

You'll need the cache battery module for safe write caching, which I
forgot in the wish list (now added), $160:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816118163&Tpk=LSIiBBU08

With your workload and RAID10 you should run with all 512MB configured
as write cache.  Linux caches all reads so using any controller cache
for reads is a waste.  Using all 512MB for write cache will increase
random write IOPS.
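
If it helps, here's a minimal sketch of pushing that cache policy from
Linux with LSI's MegaCli utility.  The install path and the exact option
spellings vary between MegaCli versions, so treat the invocations below
as assumptions to verify against your card's docs:

  # set_megaraid_cache.py - sketch: write-back, no read-ahead, direct reads
  import subprocess

  MEGACLI = "/opt/MegaRAID/MegaCli/MegaCli64"   # assumed install path

  def megacli(*args):
      # Run one MegaCli command, fail loudly if it errors out.
      subprocess.check_call([MEGACLI] + list(args))

  megacli("-LDSetProp", "WB",     "-Lall", "-aALL")  # write-back (needs the BBU)
  megacli("-LDSetProp", "NORA",   "-Lall", "-aALL")  # no controller read-ahead
  megacli("-LDSetProp", "Direct", "-Lall", "-aALL")  # don't spend cache on reads
  megacli("-LDGetProp", "-Cache", "-Lall", "-aALL")  # verify what's now set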

Note the 9280 allows up to 64 LUNs, so you can do tiered storage within
this 20 bay chassis.  For spares management you'd probably not want to
bother with two different drive sizes.

I didn't mention the 300GB 10K Raptors previously due to their limited
capacity.  Note they're only $15 more apiece than the 1TB RE4 drives in
the original parts list.  For a total of $300 more you get the same 40%
IOPS increase as the 600GB model, but you'll only have 3TB net space
after RAID10.  If 3TB is sufficient space for your needs, that extra 40%
IOPS makes this config a no brainer.  The decreased latency of the 10K
drives will give a nice boost to VM read performance, especially when
using NFS.  Write performance probably won't be much different due to
the generous 512MB write cache on the controller.  I also forgot to
mention that with BBWC enabled you can turn off XFS barriers, which will
dramatically speed up Exim queues and Dovecot writes, all writes actually.
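
To put the trade-off in plain numbers (prices and capacities as quoted
above, nothing else assumed):

  # drive_option_math.py - cost/capacity math for the two 20-drive options
  DRIVES      = 20
  PRICE_DELTA = 15            # $ more per 300GB 10K Raptor vs 1TB RE4
  RE4_TB      = 1.0
  RAPTOR_TB   = 0.3

  def raid10_net_tb(per_drive_tb, drives=DRIVES):
      # RAID10 keeps half the raw capacity as usable space
      return drives * per_drive_tb / 2.0

  print("extra cost for 20 Raptors: $%d" % (DRIVES * PRICE_DELTA))   # $300
  print("RE4 RAID10 net:    %.1f TB" % raid10_net_tb(RE4_TB))        # 10.0 TB
  print("Raptor RAID10 net: %.1f TB" % raid10_net_tb(RAPTOR_TB))     #  3.0 TB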

Again, you probably don't want the spares management overhead of two
different disk types on the shelf, but you could stick these 10K 300s in
the first 16 slots, and put 2TB RE4 drives in the last 4 slots,
RAID10 on the 10K drives, RAID5 on the 2TB drives.  This yields an 8
spindle high IOPS RAID10 of 2.4TB and a lower performance RAID5 of 6TB
for near line storage such as your Dovecot alt storage, VM templates,
etc, 8.4TB net, 1.6TB less than the original 10TB setup.  Total
additional cost is $920 for this setup.  You'd have two XFS filesystems
(with quite different mkfs parameters).
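
Extending the same arithmetic to this mixed layout (again only the
numbers already quoted above):

  # tiered_layout_math.py - net capacity of the 16 + 4 drive split
  RAPTOR_TB = 0.3
  RE4_2TB   = 2.0

  raid10_net = 16 * RAPTOR_TB / 2.0   # 8 mirrored pairs striped -> 2.4 TB
  raid5_net  = (4 - 1) * RE4_2TB      # one drive's worth lost to parity -> 6.0 TB

  print("fast tier (RAID10): %.1f TB" % raid10_net)
  print("bulk tier (RAID5):  %.1f TB" % raid5_net)
  print("total:              %.1f TB" % (raid10_net + raid5_net))  # 8.4 TB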

> I have been avoiding hardware RAID because of a really bad experience
> with RAID 5 on an obsolete controller that eventually died without
> replacement and couldn't be recovered. Since then, it's always been
> RAID 1 and, after I discovered mdraid, using them as purely HBA with
> mdraid for the flexibility of being able to just pull the drives into
> a new system if necessary without having to worry about the
> controller.

Assuming you have the right connector configuration for your
drive/enclosure on the replacement card, you can usually swap out one
LSI RAID card with any other LSI RAID card in the same, or newer,
generation.  It'll read the configuration metadata from the disks and be
up and running in minutes.  This feature has been around all the way back
to the AMI/Mylex cards of the late 1990s.  LSI acquired both companies,
who were #1 and #2 in RAID, which is why LSI is so successful today.
Back in those days LSI simply supplied the ASICs to AMI and Mylex.  I
have an AMI MegaRAID 428, top of the line in 1998, lying around
somewhere.  Still working when I retired it many years ago.

FYI, LSI is the OEM provider of RAID and SAS/SATA HBA ASIC silicon for
the tier 1 HBA and mobo vendors on down.  Dell, HP, IBM, Intel, Oracle
(Sun), and Siemens/Fujitsu all use LSI silicon and firmware.  Some simply
rebadge OEM LSI cards with their own model and part numbers.  IBM and
Dell specifically have been doing this rebadging for well over a decade,
long before LSI acquired Mylex and AMI.  The Dell PERC/2 is a rebadged
AMI MegaRAID 428.

Software and hardware RAID each have their pros and cons.  I prefer
hardware RAID for write cache performance and many administrative
reasons, including SAF-TE enclosure management (fault LEDs, alarms, etc)
so you know at a glance which drive has failed and needs replacing,
email and SNMP notification of events, automatic rebuild, configurable
rebuild priority, etc, etc, and good performance with striping and
mirroring.  Parity RAID performance often lags behind md with heavy
workloads but not with light/medium.  FWIW I rarely use parity RAID, due
to the myriad performance downsides.

For ultra high random IOPS workloads, or when I need a single filesystem
space larger than the drive limit or practical limit for one RAID HBA,
I'll stitch hardware RAID1 or small stripe width RAID 10 arrays (4-8
drives, 2-4 spindles) together with md RAID 0 or 1.
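
The stitching itself is one mdadm command.  A sketch, assuming each RAID
HBA exports its RAID10 as a single LUN and they show up as /dev/sdb and
/dev/sdc (device names and chunk size are assumptions, not a
recommendation for your exact hardware):

  # stitch_hw_luns.py - sketch: md RAID0 across two hardware RAID10 LUNs
  LUNS = ["/dev/sdb", "/dev/sdc"]          # one hardware RAID10 LUN per HBA

  cmd = ["mdadm", "--create", "/dev/md0",
         "--level=0",                       # stripe the hardware arrays together
         "--chunk=256",                     # KiB; tune to taste
         "--raid-devices=%d" % len(LUNS)] + LUNS

  # Print it for review; run it by hand once you've sanity checked it.
  print(" ".join(cmd))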

>> Both of the drives I've mentioned here are enterprise class drives,
>> feature TLER, and are on the LSI MegaRAID SAS hardware compatibility
>> list.  The price of the 600GB Raptor has come down considerably since I
>> designed this system, or I'd have used them instead.
>>
>> Anyway, lots of options out there.  But $6,500 is pretty damn cheap for a
>> quality box with 32GB RAM, enterprise RAID card, and 20x10K RPM 600GB
>> drives.
>>
>> The MegaRAID 9280-4i4e has an external SFF8088 port.  For an additional
>> $6,410 you could add an external Norco SAS expander JBOD chassis and 24
>> more 600GB 10K RPM Raptors, for 13.2TB of total net RAID10 space, and 22
>> 10k spindles of IOPS performance from 44 total drives.  That's $13K for
>> a 5K random IOPS, 13TB, 44 drive NFS RAID COTS server solution,
>> $1000/TB, $2.60/IOPS.  Significantly cheaper than an HP, Dell, IBM
>> solution of similar specs, each of which will set you back at least 20
>> large.
> 
> Would this setup work well too for serving up VM images? I've been
> trying to find a solution for the virtualized app servers images as
> well but the distributed FSes currently are all bad with random
> reads/writes it seems. XFS seem to be good with large files like db
> and vm images with random internal write/read so given my time
> constraints, it would be nice to have a single configuration that
> works generally well for all the needs I have to oversee.

Absolutely.  If you set up these 20 drives as a single RAID10, soft/hard
or hybrid, with the LSI cache set to 100% write-back, with a single XFS
filesystem with 10 allocation groups and proper stripe alignment, you'll
get maximum performance for pretty much any conceivable workload.
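
Here's roughly what that mkfs looks like.  The 64KiB strip size, the
/dev/sdb device name, and the mount point are assumptions -- plug in
whatever segment size you actually configure on the 9280:

  # xfs_align.py - sketch: stripe-aligned mkfs.xfs for the 20-drive RAID10 LUN
  DATA_SPINDLES = 10      # 20 drives in RAID10 -> 10 striped mirror pairs
  STRIP_KIB     = 64      # per-drive strip (segment) size set on the controller

  mkfs  = ("mkfs.xfs -d su=%dk,sw=%d,agcount=%d /dev/sdb"
           % (STRIP_KIB, DATA_SPINDLES, DATA_SPINDLES))
  # With the BBU fitted and write-back cache enabled, barriers can go away:
  mount = "mount -o nobarrier /dev/sdb /srv/storage"

  print(mkfs)
  print(mount)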

Your only limitations will be possible NFS or TCP tuning issues, and
maybe having only two GbE ports.  For small random IOPS such as Exim
queues, Dovecot store, VM image IO, etc, the two GbE ports are plenty.
But if you add any large NFS file copies into the mix, such as copying
new VM templates or ISO images over, etc, or do backups over NFS instead
of directly on the host machine at the XFS level, then two bonded GbE
ports might prove a bottleneck.

The mobo has 2 PCIe x8 slots and one x4 slot.  One of the x8 slots is an
x16 physical connector.  You'll put the LSI card in the x16 slot.  If
you mount the Intel SAS expander to the chassis as I do instead of in a
slot, you have one free x8 and one free x4 slot.  Given the $250 price,
I'd simply add an Intel quad port GbE NIC to the order.  Link aggregate
all 4 ports on day one and use one IP address for the NFS traffic.  Use
the two on board ports for management etc.  This should give you a
theoretical 400MB/s of peak NFS throughput, which should be plenty no
matter what workload you throw at it.
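
The 400MB/s figure is just arithmetic, but it's worth seeing next to the
2-port case (the ~100MB/s per-port payload figure is an assumption that
accounts for Ethernet/TCP/NFS overhead):

  # nic_throughput.py - rough streaming ceiling per bonded GbE group
  PER_PORT_MBS = 100.0      # usable MB/s per GbE port after protocol overhead

  for ports in (2, 4):
      print("%d bonded GbE ports: ~%.0f MB/s peak NFS streaming"
            % (ports, ports * PER_PORT_MBS))

  # The mail/VM random IO is nowhere near these numbers, but one big NFS
  # copy or an NFS-level backup will flatten a 2-port bond, hence the
  # quad-port card.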

>> Note the chassis I've spec'd have single PSUs, not the dual or triple
>> redundant supplies you'll see on branded hardware.  With a relatively
>> stable climate controlled environment and a good UPS with filtering,
>> quality single supplies are fine.  In fact, in the 4U form factor single
>> supplies are usually more reliable due to superior IC packaging and
>> airflow through the heatsinks, not to mention much quieter.
> 
> Same reason I do my best to avoid 1U servers, the space/heat issues
> worries me. Yes, I'm guilty of worrying too much but that had saved me
> on several occasions.

Just about every 1U server I've seen that's been racked for 3 or more
years has warped under its own weight.  I even saw an HPQ 2U that was
warped this way, badly warped.  In this instance the slide rail bolts
had never been tightened down to the rack--could spin them by hand.
Since the chassis side panels weren't secured, and there was lateral
play, the weight of the 6 drives caused the side walls of the case to
fold into a mild trapezoid, which allowed the bottom and top panels to
bow.  Let this be a lesson boys and girls:  always tighten your rack
bolts. :)

-- 
Stan

