On 4/17/2012 8:01 AM, Frank Bonnet wrote:
> I need some feedback and advice from experienced admins. In a few months
> I will have to set up an email system for approx 50K "intensive" users.
> We'll have 4000-6000 concurrent IMAP connections during working hours.
> POP3 users will be very few. The only mandatory thing is that I must use
> HP ProLiant servers. The operating system will be FreeBSD or Linux.

How much disk space do you plan to offer per user mail directory? Will you be using quotas?
Quite a coincidence, Frank. It's a shame it has to be an HP solution. I just finished designing a high quality, high performance 4U 72-drive server yesterday that will easily handle 15K concurrent IMAP users, for only ~$24K USD in parts at Newegg, or $0.48/user at 50K users. So it may not be of interest to you, but maybe to others.

It is capable of ~7K random 4KB read/write IOPS sustained, and has 10TB of net space, an average of ~200MB per user mail directory assuming 50K users. I just made the wishlist public, so it should be available tomorrow or Friday; I'll provide the link when it's live.

All components used are top quality, the best available in the channel, and the reliability of a properly assembled server will rival that of any HP/Dell/IBM machine. For those not familiar with SuperMicro: they have manufactured many of Intel's retail boards for over a decade. The majority of the COTS systems used in large academic HPC clusters are built with SuperMicro chassis and motherboards, as are some 1000+ node US DOE clusters.

Here are the basics:
- 72x 2.5" bay 4U chassis, 3x SAS backplanes each w/ redundant expanders:
  http://www.newegg.com/Product/Product.aspx?Item=N82E16811152212
- 78x Seagate 10K SAS 300GB drives (includes 6 spares)
- Triple LSI 9261-8i dual-port 512MB BBWC RAID controllers, each with 2
  redundant load-balanced connections to a backplane; 24 drives per
  controller for lowest latency and maximum throughput, 1.5GB total write
  cache, a rebuild affects only one controller, etc.
- SuperMicro mainboard, 2x 6-core 3.3GHz AMD Interlagos Opteron CPUs
- 64GB Reg ECC DDR3-1066 (8x 8GB DIMMs), 34GB/s aggregate bandwidth
- Dual Intel quad-port GbE NICs, 10 total Intel GbE ports:
  - Use the 2 mobo ports for redundant management links
  - Aggregate 4 ports, 2 on each quad NIC, for mail traffic
  - Aggregate the remaining 4 for remote backup, future connection to an
    iSCSI SAN array, etc.
  - Or however works best; having 8 GbEs gives flexibility, and these two
    cards are only $500 of the total
- 2x Intel 20GB SSD internal fixed drives, hardware mirrored by the
  onboard LSI SAS chip, for boot/OS
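As a sanity check on the capacity figures above, the arithmetic works out as follows (a sketch in shell; all numbers are taken from the parts list, and the ~200MB/user figure is this result rounded down):

```shell
#!/bin/sh
# Capacity arithmetic for the 72-bay RAID10 build (figures from the parts list).
DRIVES=72        # drives in the bays; 6 of the 78 purchased are cold spares
DRIVE_GB=300     # Seagate 10K SAS 300GB
USERS=50000

NET_GB=$((DRIVES / 2 * DRIVE_GB))       # RAID10 mirroring halves raw capacity
PER_USER_MB=$((NET_GB * 1024 / USERS))  # average mail directory size

echo "net space: ${NET_GB} GB (~$((NET_GB / 1000)) TB), ~${PER_USER_MB} MB/user"
```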
The key to performance, while still yielding a single file tree, is once again XFS, used to take advantage of this large spindle count across 3 RAID controllers. Unlike previous configurations, where I recommended a straight md concatenation of hardware RAID1 pairs, in this case we're going to use a concatenation of 6 hardware RAID10 arrays. There are a couple of reasons for doing so in this case:
- Using 36 device names in a single md command line is less than intuitive and possibly error prone. Using 6 is more manageable.
- We have 3 BBWC RAID controllers with 24 drives each. This is a high performance server and will see a high IO load in production. In many cases one would use an external filesystem journal, which we could easily do, with great performance, on our mirrored SSDs. However, the SSDs are not backed by BBWC, so a UPS failure or system crash could hose the journal. So we'll go with the default internal journal, which will be backed by the BBWC.
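For comparison only, the external-journal layout rejected above would look something like this (a sketch; /dev/sdb1 is a hypothetical partition on the mirrored SSD pair, and again, without BBWC behind the SSDs this is not safe here):

```shell
# NOT used in this build -- shown only to illustrate the rejected option.
# External XFS journal on a hypothetical SSD partition /dev/sdb1:
mkfs.xfs -l logdev=/dev/sdb1,size=128m /dev/md0

# An external log device must then be named at every mount:
mount -o logdev=/dev/sdb1,inode64,nobarrier /dev/md0 /mail
```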
Going internal with the log in this mail scenario can cause a serious amount of extra IOPS in the filesystem data section, specifically Allocation Group 0. If we did the "normal" RAID1 concat, all of that log IO would hit the first RAID1 pair; on this system the load could hammer that spindle pretty hard, making access to mailboxes in AG0 slower than the others. With 6 RAID10 arrays in a concat, the internal log writes are instead striped across the 6 spindles of the first array. With 512MB of BBWC backing that array and optimizing writeout, and with delaylog, this yields optimal log write performance without slowing down mailbox file access in AG0.

To create such a setup we'd do something like the following, assuming the mobo LSI controller yields sd[ab] and the 6 array devices on the PCIe LSI cards yield sd[cdefgh]:
- Create two RAID10 arrays, each of 12 drives, in the WebBIOS GUI of each LSI card, using a strip size of 32KB which should yield good random r/w performance for any mailbox format. Use the following policies for each array: RW, Normal, Wback, Direct, Disable, No, and use the full size.
- Create the concatenated md device:

$ mdadm -C /dev/md0 -l linear -n 6 /dev/sd[cdefgh]
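To have the concat assemble automatically at boot, the array definition can be recorded afterward (a sketch; the conf path and initramfs command below assume a Debian-style system and will differ elsewhere):

```shell
# Record the linear array so it assembles at boot
# (path assumes Debian/Ubuntu; RHEL-family uses /etc/mdadm.conf):
mdadm --detail --scan >> /etc/mdadm/mdadm.conf

# Rebuild the initramfs so /dev/md0 exists before fstab is processed
# (Debian-specific; use dracut on RHEL-family systems):
update-initramfs -u
```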
- Then we format it with XFS, optimizing the AG layout for our mailbox workload and aligning write allocation to each hardware array's stripe (note that -d suboptions are comma separated):

$ mkfs.xfs -d agcount=24,su=32k,sw=6 /dev/md0
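The su/sw arithmetic behind those flags, and the sunit/swidth values xfs_info should later report (it shows them in filesystem blocks), works out as follows, assuming the default 4KB block size:

```shell
#!/bin/sh
# Stripe geometry behind mkfs.xfs -d su=32k,sw=6 (default 4KB blocks assumed).
SU_KB=32   # per-disk strip size chosen in WebBIOS
SW=6       # data spindles per 12-drive RAID10 array (half are mirrors)
BLK_KB=4   # default XFS block size

FULL_STRIPE_KB=$((SU_KB * SW))    # one full stripe per RAID10 array
SUNIT_BLKS=$((SU_KB / BLK_KB))    # what xfs_info reports as sunit
SWIDTH_BLKS=$((SUNIT_BLKS * SW))  # what xfs_info reports as swidth

echo "full stripe: ${FULL_STRIPE_KB} KB, sunit=${SUNIT_BLKS} blks, swidth=${SWIDTH_BLKS} blks"
```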
This yields 4 AGs per RAID10 array which will minimize the traditional inode64 head seeking overhead on striped arrays, while still yielding fantastic allocation parallelism with 24 AGs.
Optimal fstab entry for an MTA queue/mailbox workload, assuming kernel 2.6.39+:

/dev/md0    /mail    xfs    defaults,inode64,nobarrier    0 0
We disable write barriers since we have BBWC. And that 1.5GB of BBWC will yield extremely low Dovecot write latency and high throughput.
Given the throughput available, if you're running Postfix on this box you will want to create a directory on this filesystem for the Postfix spool as well. Postfix hashes its spool files across many dozens to hundreds of subdirectories, so you'll get 100% parallelism across all AGs, and thus all disks.
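A minimal way to point Postfix at such a directory is a main.cf fragment like the following (a sketch; /mail/postfix-spool is a hypothetical path, and Postfix must be stopped and the old queue contents moved over before the change takes effect):

```shell
# /etc/postfix/main.cf fragment -- relocate the queue onto the XFS filesystem
# (hypothetical path; stop Postfix and migrate the existing queue first):
queue_directory = /mail/postfix-spool
```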
It's very likely none of you will decide to build this exact system. My hope is that the design concepts and components used, along with the low cost and high performance of this machine, may be educational, give people new ideas, or steer them in directions they may not have previously considered.
-- Stan