On 4/10/2012 1:09 AM, Emmanuel Noobadmin wrote:
On 4/10/12, Stan Hoeppner <stan@hardwarefreak.com> wrote:
SuperMicro H8SGL G34 mobo w/dual Intel GbE, 2GHz 8-core Opteron
32GB Kingston REG ECC DDR3, LSI 9280-4i4e, Intel 24 port SAS expander
20 x 1TB WD RE4 Enterprise 7.2K SATA2 drives
NORCO RPC-4220 4U 20 Hot-Swap Bays, SuperMicro 865W PSU

All other required parts are in the Wish List. I've not written assembly instructions. I figure anyone who would build this knows what s/he is doing.
Price today: $5,376.62
This price looks like something I might be able to push through
It's pretty phenomenally low considering what all you get, especially 20 enterprise class drives.
although I'll probably have to go SATA instead of SAS due to cost of keeping spares.
The 10K drives I mentioned are SATA not SAS. WD's 7.2k RE and 10k Raptor series drives are both SATA but have RAID specific firmware, better reliability, longer warranties, etc. The RAID specific firmware is why both are tested and certified by LSI with their RAID cards.
Configuring all 20 drives as a RAID10 LUN in the MegaRAID HBA would give you a 10TB net Linux device and 10 stripe spindles of IOPS and bandwidth. Using RAID6 would yield 18TB net and 18 spindles of read throughput. However, parallel write throughput will be at least 3-6x slower than RAID10, which is why nobody uses RAID6 for transactional workloads.
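The capacity and spindle counts above can be checked with a few lines of Python. This is only a rough sketch of the arithmetic: the function name raid_net is made up for illustration, and the model ignores hot spares, filesystem overhead, and TB vs TiB differences.

```python
def raid_net(level, drives, size_tb):
    """Return (net capacity in TB, data spindles) for a simple array.

    Simplified model: no hot spares, no filesystem overhead.
    """
    if level == "raid10":
        if drives % 2:
            raise ValueError("RAID10 needs an even number of drives")
        spindles = drives // 2      # half the drives hold mirror copies
    elif level == "raid5":
        spindles = drives - 1       # one drive's worth of parity
    elif level == "raid6":
        spindles = drives - 2       # two drives' worth of parity
    else:
        raise ValueError(f"unsupported level: {level}")
    return spindles * size_tb, spindles


for level in ("raid10", "raid6"):
    net, spindles = raid_net(level, drives=20, size_tb=1.0)
    print(f"{level}: {net:.0f} TB net, {spindles} data spindles")
```

The same function covers other layouts, e.g. raid_net("raid10", 16, 0.3) for sixteen 300GB drives.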
Not likely to go with RAID 5 or 6 due to concerns about the uncorrectable read errors risks on rebuild with large arrays. Is the
Not to mention rebuild times for large width RAID5/6.
MegaRAID being used as the actual RAID controller or just as a HBA?
It's a top shelf RAID controller, 512MB cache, up to 240 drives, SSD support, the works. It's an LSI "Feature Line" card: http://www.lsi.com/products/storagecomponents/Pages/6GBSATA_SASRAIDCards.asp...
The specs: http://www.lsi.com/products/storagecomponents/Pages/MegaRAIDSAS9280-4i4e.asp...
You'll need the cache battery module for safe write caching, which I forgot in the wish list (now added), $160: http://www.newegg.com/Product/Product.aspx?Item=N82E16816118163&Tpk=LSIiBBU08
With your workload and RAID10 you should run with all 512MB configured as write cache. Linux caches all reads so using any controller cache for reads is a waste. Using all 512MB for write cache will increase random write IOPS.
Note the 9280 allows up to 64 LUNs, so you can do tiered storage within this 20 bay chassis. For spares management you'd probably not want to bother with two different sized drives.
I didn't mention the 300GB 10K Raptors previously due to their limited capacity. Note they're only $15 more apiece than the 1TB RE4 drives in the original parts list. For a total of $300 more you get the same 40% increase in IOPS as the 600GB model, but you'll only have 3TB net space after RAID10. If 3TB is sufficient space for your needs, that extra 40% IOPS makes this config a no-brainer. The decreased latency of the 10K drives will give a nice boost to VM read performance, especially when using NFS. Write performance probably won't be much different due to the generous 512MB write cache on the controller. I also forgot to mention that with BBWC enabled you can turn off XFS barriers, which will dramatically speed up Exim queues and Dovecot writes, all writes actually.
Again, you probably don't want the spares management overhead of two different disk types on the shelf, but you could stick these 10K 300s in the first 16 slots, and put 2TB RE4 drives in the last 4 slots, RAID10 on the 10K drives, RAID5 on the 2TB drives. This yields an 8 spindle high IOPS RAID10 of 2.4TB and a lower performance RAID5 of 6TB for near line storage such as your Dovecot alt storage, VM templates, etc. That's 8.4TB net, 1.6TB less than the original 10TB setup. Total additional cost is $920 for this setup. You'd have two XFS filesystems (with quite different mkfs parameters).
I have been avoiding hardware RAID because of a really bad experience with RAID 5 on an obsolete controller that eventually died without replacement and couldn't be recovered. Since then, it's always been RAID 1 and, after I discovered mdraid, using them as purely HBA with mdraid for the flexibility of being able to just pull the drives into a new system if necessary without having to worry about the controller.
Assuming you have the right connector configuration for your drive/enclosure on the replacement card, you can usually swap out one LSI RAID card with any other LSI RAID card in the same, or newer, generation. It'll read the configuration metadata from the disks and be up and running in minutes. This feature has been around all the way back to the AMI/Mylex cards of the late 1990s. LSI acquired both companies, who were #1 and #2 in RAID, which is why LSI is so successful today. Back in those days LSI simply supplied the ASICs to AMI and Mylex. I have an AMI MegaRAID 428, top of the line in 1998, lying around somewhere. Still working when I retired it many years ago.
FYI, LSI is the OEM provider of RAID and SAS/SATA HBA ASIC silicon for the tier 1 HBA and mobo down markets. Dell, HP, IBM, Intel, Oracle (Sun), Siemens/Fujitsu, all use LSI silicon and firmware. Some simply rebadge OEM LSI cards with their own model and part numbers. IBM and Dell specifically have been doing this rebadging for well over a decade, long before LSI acquired Mylex and AMI. The Dell PERC/2 is a rebadged AMI MegaRAID 428.
Software and hardware RAID each have their pros and cons. I prefer hardware RAID for write cache performance and many administrative reasons, including SAF-TE enclosure management (fault LEDs, alarms, etc) so you know at a glance which drive has failed and needs replacing, email and SNMP notification of events, automatic rebuild, configurable rebuild priority, etc, etc, and good performance with striping and mirroring. Parity RAID performance often lags behind md with heavy workloads but not with light/medium. FWIW I rarely use parity RAID, due to the myriad performance downsides.
For ultra high random IOPS workloads, or when I need a single filesystem space larger than the drive limit or practical limit for one RAID HBA, I'll stitch hardware RAID1 or small stripe width RAID 10 arrays (4-8 drives, 2-4 spindles) together with md RAID 0 or 1.
Both of the drives I've mentioned here are enterprise class drives, feature TLER, and are on the LSI MegaRAID SAS hardware compatibility list. The price of the 600GB Raptor has come down considerably since I designed this system, or I'd have used them instead.
Anyway, lots of options out there. But $6,500 is pretty damn cheap for a quality box with 32GB RAM, enterprise RAID card, and 20x10K RPM 600GB drives.
The MegaRAID 9280-4i4e has an external SFF8088 port. For an additional $6,410 you could add an external Norco SAS expander JBOD chassis and 24 more 600GB 10K RPM Raptors, for 13.2TB of total net RAID10 space, and 22 10k spindles of IOPS performance from 44 total drives. That's $13K for a 5K random IOPS, 13TB, 44 drive NFS RAID COTS server solution, $1000/TB, $2.60/IOPS. Significantly cheaper than an HP, Dell, IBM solution of similar specs, each of which will set you back at least 20 large.
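For anyone who wants to rerun the cost arithmetic with their own prices, here is a small Python sketch. The dollar amounts, drive count, and 5K IOPS estimate are the figures quoted in this thread. The function name cost_metrics is just an illustrative helper, and the IOPS number is a rough estimate, not a measurement.

```python
def cost_metrics(total_cost, drives, drive_tb, est_iops):
    """Return (net TB, dollars per net TB, dollars per IOPS) for RAID10."""
    net_tb = (drives // 2) * drive_tb   # RAID10 keeps half the raw space
    return net_tb, total_cost / net_tb, total_cost / est_iops


# Figures from the post: $6,500 base box plus $6,410 for the JBOD expansion.
total = 6500 + 6410
net_tb, per_tb, per_iops = cost_metrics(
    total, drives=44, drive_tb=0.6, est_iops=5000
)
print(f"${total:,} total, {net_tb:.1f} TB net")
print(f"${per_tb:,.0f}/TB, ${per_iops:.2f}/IOPS")
```

The unrounded results land a little below the $1000/TB and $2.60/IOPS quoted above, since those were worked from the rounded $13K and 13TB.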
Would this setup work well too for serving up VM images? I've been trying to find a solution for the virtualized app server images as well, but the distributed FSes currently are all bad with random reads/writes, it seems. XFS seems to be good with large files like db and vm images with random internal write/read, so given my time constraints, it would be nice to have a single configuration that works generally well for all the needs I have to oversee.
Absolutely. If you set up these 20 drives as a single RAID10, soft/hard or hybrid, with the LSI cache set to 100% write-back, with a single XFS filesystem with 10 allocation groups and proper stripe alignment, you'll get maximum performance for pretty much any conceivable workload.
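As a rough illustration of what "proper stripe alignment" translates to, the Python sketch below builds a mkfs.xfs command line using the su (stripe unit), sw (stripe width in data spindles), and agcount suboptions of -d. The 64KiB strip size and the /dev/sdX device name are assumptions for the example only. The real strip size must match whatever is configured on the controller, and the helper name mkfs_xfs_command is hypothetical. The script only prints the command, it does not run it.

```python
def mkfs_xfs_command(device, drives, strip_kib, agcount):
    """Build an aligned mkfs.xfs command for a RAID10 array.

    strip_kib is the per-drive strip size set on the RAID controller
    (assumed here, not taken from the post).
    """
    data_spindles = drives // 2     # RAID10: one data spindle per mirror pair
    opts = f"su={strip_kib}k,sw={data_spindles},agcount={agcount}"
    return ["mkfs.xfs", "-d", opts, device]


# 20 drives and 10 allocation groups come from the post;
# 64 KiB and /dev/sdX are placeholders.
cmd = mkfs_xfs_command("/dev/sdX", drives=20, strip_kib=64, agcount=10)
print(" ".join(cmd))
```

For the two-tier layout discussed earlier, each array would get its own command with its own sw value, which is why the mkfs parameters differ between the two filesystems.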
Your only limitations will be possible NFS or TCP tuning issues, and maybe having only two GbE ports. For small random IOPS such as Exim queues, Dovecot store, VM image IO, etc, the two GbE ports are plenty. But if you add any large NFS file copies into the mix, such as copying new VM templates or ISO images over, etc, or do backups over NFS instead of directly on the host machine at the XFS level, then two bonded GbE ports might prove a bottleneck.
The mobo has 2 PCIe x8 slots and one x4 slot. One of the x8 slots is an x16 physical connector. You'll put the LSI card in the x16 slot. If you mount the Intel SAS expander to the chassis as I do instead of in a slot, you have one free x8 and one free x4 slot. Given the $250 price, I'd simply add an Intel quad port GbE NIC to the order. Link aggregate all 4 ports on day one and use one IP address for the NFS traffic. Use the two on board ports for management etc. This should give you a theoretical 400MB/s of peak NFS throughput, which should be plenty no matter what workload you throw at it.
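The 400MB/s figure can be reproduced with a back-of-envelope calculation, sketched here in Python. The 0.8 efficiency factor (roughly 100MB/s usable per GbE port after protocol overhead) is an assumption chosen to match that figure, not something measured, and the function name is illustrative.

```python
def aggregate_mb_per_s(ports, gbit_per_port=1.0, efficiency=0.8):
    """Estimate usable aggregate throughput in MB/s for bonded ports.

    efficiency is an assumed fudge factor for protocol overhead.
    """
    raw_mb_per_s = ports * gbit_per_port * 1000 / 8   # line rate, bits to bytes
    return raw_mb_per_s * efficiency


for ports in (2, 4):
    print(f"{ports} x GbE: ~{aggregate_mb_per_s(ports):.0f} MB/s")
```

Keep in mind this is an aggregate across many clients. With typical link aggregation hashing, a single client connection usually stays on one physical link and so tops out at about one port's worth.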
Note the chassis I've spec'd have single PSUs, not the dual or triple redundant supplies you'll see on branded hardware. With a relatively stable climate controlled environment and a good UPS with filtering, quality single supplies are fine. In fact, in the 4U form factor single supplies are usually more reliable due to superior IC packaging and airflow through the heatsinks, not to mention much quieter.
Same reason I do my best to avoid 1U servers, the space/heat issues worry me. Yes, I'm guilty of worrying too much, but that has saved me on several occasions.
Just about every 1U server I've seen that's been racked for 3 or more years has warped under its own weight. I even saw an HPQ 2U that was warped this way, badly warped. In this instance the slide rail bolts had never been tightened down to the rack--could spin them by hand. Since the chassis side panels weren't secured, and there was lateral play, the weight of the 6 drives caused the side walls of the case to fold into a mild trapezoid, which allowed the bottom and top panels to bow. Let this be a lesson boys and girls: always tighten your rack bolts. :)
-- Stan