On 4/19/2012 4:40 AM, Stan Hoeppner wrote:
On 4/17/2012 8:01 AM, Frank Bonnet wrote:
have 4000/6000 concurrent IMAP connections during working hours,
for approx 50K "intensive" users.
The only mandatory thing will be I must use HP proliant servers
The operating system will be FreeBSD or Linux
I just made the wishlist public so it should be available tomorrow or Friday. I'll provide the link when it's available.
And here it is: http://secure.newegg.com/WishList/PublicWishDetail.aspx?WishListNumber=16797...
Since your requirement is for an HP solution, following is an HP server and storage system solution of roughly identical performance and redundancy to the SuperMicro based system I detailed. The HP solution comes to $44,263, roughly $20,000 more than the SuperMicro system, nearly double the cost. Due to the stupidity of Newegg requiring all wish lists to be reviewed before going live, I'll simply provide the links to all the products.
Yes boys and girls, Newegg isn't just consumer products. They carry nearly the entire line of HP Proliant servers and storage, including the 4-way 48-core Opteron DL585 G7 w/64GB, the P2000 fiber channel array, and much more. In this case they sell every product needed to assemble this complete mail server solution:
1x http://www.newegg.com/Product/Product.aspx?Item=N82E16859105807
8x http://www.newegg.com/Product/Product.aspx?Item=N82E16820326150
3x http://www.newegg.com/Product/Product.aspx?Item=N82E16816401143
80x http://www.newegg.com/Product/Product.aspx?Item=N82E16822332061
3x http://www.newegg.com/Product/Product.aspx?Item=N82E16816118109
3x http://www.newegg.com/Product/Product.aspx?Item=N82E16816118163
2x http://www.newegg.com/Product/Product.aspx?Item=N82E16816133048
2x http://www.newegg.com/Product/Product.aspx?Item=N82E16833106050
The 9280-8e RAID controllers are identical to the 9261-8i boards but have 2 external rather than internal x4 6Gb/s SAS ports. I spec'd them instead of the Smart Array boards as they're far cheaper, easier to work with, and offer equal or superior performance. Thus everything written below is valid for this system as well, with the exception that you would configure 1 global hot spare in each chassis, since these units have 25 drive bays instead of 24. The D2700 units come with 20" 8088 cables. I spec'd two additional 3ft cables to make sure we can reach all 3 disk chassis from the server, thinking the server would sit on top with the 3 disk chassis below.
I hope this and my previous post are helpful in one aspect or another to Frank and anyone else. I spent more than a few minutes on these designs. ;) Days in fact on the SuperMicro design, only a couple of hours on the HP. It wouldn't have taken quite so long if all PCIe slots were created equal (x8), which they're not, or if modern servers didn't require 4 different types of DIMMs depending on how many slots you want to fill and how much expansion capacity you need without having to throw out all the previous memory, which many folks end up doing out of ignorance. Memory configuration is simply too darn complicated with high cap servers containing 8 channels and 24 slots.
The key to performance, and yielding a single file tree, is once again using XFS to take advantage of this large spindle count across 3 RAID controllers. Unlike previous configurations where I recommended using a straight md concatenation of hardware RAID1 pairs, in this case we're going to use a concatenation of 6 hardware RAID10 arrays. There are a couple of reasons for doing so in this case:
Using 36 device names in a single md command line is less than intuitive and error-prone. Using 6 is far more manageable.
We have 3 BBWC RAID controllers w/24 drives each. This is a high performance server and will see a high IO load in production. In many cases one would use an external filesystem journal, which we could easily do, getting great performance with our mirrored SSDs. However, the SSDs are not backed by BBWC, so a UPS failure or system crash could hose the journal. So we'll go with the default internal journal, which will be backed by the BBWC.
Going internal with the log in this mail scenario can cause a serious amount of extra IOPS in the filesystem data section, specifically Allocation Group 0 (AG0). If we did the "normal" RAID1 concat, all the log IO would hit the first RAID1 pair. On this system, the load could hit that spindle pretty hard, making access to mailboxes in AG0 slower than the others. With 6 RAID10 arrays in a concat, the internal log writes will instead be striped across the 6 spindles of the first array. With 512MB of BBWC backing that array and optimizing writeout, and with delaylog, this will yield optimal log write performance without slowing down mailbox file access in AG0. To create such a setup we'd do something like this, assuming the mobo LSI controller yields sd[ab] and the 6 array devices on the PCIe LSI cards yield sd[cdefgh]:
- Create two RAID10 arrays, each of 12 drives, in the WebBIOS GUI of each LSI card, using a strip size of 32KB, which should yield good random r/w performance for any mailbox format. Use the following policies for each array: RW, Normal, Wback, Direct, Disable, No, and use the full size.
- Create the concatenated md device: $ mdadm -C /dev/md0 -l linear -n 6 /dev/sd[cdefgh]
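As a quick sanity check, and to make sure the linear array assembles automatically at boot, you can record it in mdadm.conf. A sketch, not from the original post; the config path varies by distro (/etc/mdadm/mdadm.conf on Debian-family systems):

```shell
# Verify the linear array looks as expected: 6 members, linear level
mdadm --detail /dev/md0

# Persist the array definition so it assembles at boot
# (on Debian/Ubuntu append to /etc/mdadm/mdadm.conf instead)
mdadm --detail --scan >> /etc/mdadm.conf
```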
- Then format it with XFS, optimizing the AG layout for our mailbox workload and aligning write stripes to each hardware array (note the -d sub-options are comma-separated): $ mkfs.xfs -d agcount=24,su=32k,sw=6 /dev/md0
This yields 4 AGs per RAID10 array which will minimize the traditional inode64 head seeking overhead on striped arrays, while still yielding fantastic allocation parallelism with 24 AGs.
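Once formatted, it's worth confirming the geometry took. xfs_info reports AG count and stripe alignment; with 4KB filesystem blocks, su=32k should show up as sunit=8 blks and sw=6 as swidth=48 blks (8 blocks x 6 stripe members):

```shell
# Confirm agcount=24 and the stripe alignment from mkfs:
# expect sunit=8 blks, swidth=48 blks with a 4KB block size
xfs_info /dev/md0
```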
Optimal fstab for MTA queue/mailbox workload, assuming kernel 2.6.39+: /dev/md0 /mail xfs defaults,inode64,nobarrier 0 0
We disable write barriers since we have BBWC. And that 1.5GB of aggregate BBWC will yield extremely low Dovecot write latency and high throughput.
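After adding the fstab entry, mount it and confirm the options actually took effect. A simple sanity check, using the /mail mount point from the fstab line above:

```shell
# Create the mount point and mount via the fstab entry
mkdir -p /mail
mount /mail

# inode64 and nobarrier should appear in the active mount options
grep ' /mail ' /proc/mounts
```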
Given the throughput available, if you're running Postfix on this box you'll want to put the Postfix spool in a directory on this filesystem. Postfix spreads its spool files across many dozens, even hundreds, of subdirectories, so you'll get 100% parallelism across all AGs, and thus all disks.
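One way to do that, sketched here rather than taken from the post: point queue_directory at the new location with postconf. The path /mail/spool/postfix is my assumption, and you may also need to run `postfix set-permissions` afterward so the hashed queue subdirectories get the ownership Postfix expects:

```shell
# Create a spool directory on the XFS filesystem
# (/mail/spool/postfix is an assumed path, pick your own)
mkdir -p /mail/spool/postfix
chown postfix:root /mail/spool/postfix

# Point Postfix at it and restart so the queue manager re-scans
postconf -e 'queue_directory = /mail/spool/postfix'
postfix stop && postfix start
```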
It's very likely none of you will decide to build this system. My hope is that some of the design concepts and components used, along with the low cost but high performance of this machine, may be educational or simply give people new ideas, steer them in directions they may not have previously considered.
-- Stan