On 4/17/2012 8:01 AM, Frank Bonnet wrote:
> I need some feedback and advice from experienced admins. In a few months
> I will have to set up an email system for approx 50K "intensive" users.
> We'll have 4000-6000 concurrent IMAP connections during working hours.
> POP3 users will be very few. The only mandatory thing is that I must use
> HP ProLiant servers. The operating system will be FreeBSD or Linux.

How much disk space do you plan to offer per user mail directory? Will you be using quotas?
Quite a coincidence, Frank. It's a shame it has to be an HP solution. I just finished designing a high quality, high performance 4U 72-drive server yesterday that will easily handle 15K concurrent IMAP users, for only ~$24K USD in parts at Newegg, or $0.48/user at 50K users. So it may not be of interest to you, but maybe to others.

It is capable of ~7K random 4KB read/write IOPS sustained, and has 10TB of net space, an average of ~200MB per user mail directory assuming 50K users. I just made the wishlist public, so it should be available tomorrow or Friday; I'll provide the link when it's live.

All components used are top quality, the best available in the channel, and the reliability of a properly assembled server will rival that of any HP/Dell/IBM machine. For those not familiar with SuperMicro: they have manufactured many of Intel's retail boards for over a decade. The majority of the COTS systems used in large academic HPC clusters are built with SuperMicro chassis and motherboards, as are some 1000+ node US DOE clusters.

Here are the basics:
- 72x 2.5" bay 4U chassis, 3x SAS backplanes each w/ redundant expanders:
  http://www.newegg.com/Product/Product.aspx?Item=N82E16811152212
- 78x Seagate 10K SAS 300GB drives (includes 6 spares)
- Triple LSI 9261-8i dual-port 512MB BBWC RAID controllers, each with 2
  redundant load-balanced connections to a backplane; 24 drives per
  controller for lowest latency and maximum throughput, 1.5GB total write
  cache, a rebuild affects only one controller, etc.
- SuperMicro mainboard, 2x 6-core 3.3GHz AMD Interlagos Opteron CPUs
- 64GB Reg ECC DDR3-1066 (8x 8GB DIMMs), 34GB/s aggregate bandwidth
- Dual Intel quad-port GbE NICs, 10 total Intel GbE ports:
  - Use the 2 mobo ports for redundant management links
  - Aggregate 4 ports, 2 on each quad NIC, for mail traffic
  - Aggregate the remaining 4 for remote backup, future connection to an
    iSCSI SAN array, etc.
  - Or however works best; having 8 GbEs gives flexibility, and these two
    cards are only $500 of the total
- 2x Intel 20GB SSD internal fixed drives, hardware mirrored by the
  onboard LSI SAS chip, for boot/OS
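As a sanity check on the capacity figures above, the arithmetic works out as follows (a sketch in shell; all numbers are taken from the parts list, and the ~200MB/user figure is this result rounded down):

```shell
#!/bin/sh
# Capacity arithmetic for the 72-bay RAID10 build (figures from the parts list).
DRIVES=72        # drives in the bays; 6 of the 78 purchased are cold spares
DRIVE_GB=300     # Seagate 10K SAS 300GB
USERS=50000

NET_GB=$((DRIVES / 2 * DRIVE_GB))       # RAID10 mirroring halves raw capacity
PER_USER_MB=$((NET_GB * 1024 / USERS))  # average mail directory size

echo "net space: ${NET_GB} GB (~$((NET_GB / 1000)) TB), ~${PER_USER_MB} MB/user"
```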
The key to performance, while still yielding a single file tree, is once again XFS, used to take advantage of this large spindle count across 3 RAID controllers. Unlike previous configurations, where I recommended a straight md concatenation of hardware RAID1 pairs, in this case we're going to use a concatenation of 6 hardware RAID10 arrays. There are a couple of reasons for doing so in this case:
- Using 36 device names in a single md command line is less than intuitive and possibly error prone. Using 6 is more manageable.
- We have 3 BBWC RAID controllers with 24 drives each. This is a high performance server and will see a high IO load in production. In many cases one would use an external filesystem journal, which we could easily do, with great performance, on our mirrored SSDs. However, the SSDs are not backed by BBWC, so a UPS failure or system crash could hose the journal. So we'll go with the default internal journal, which will be backed by the BBWC.
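For comparison only, the external-journal layout rejected above would look something like this (a sketch; /dev/sdb1 is a hypothetical partition on the mirrored SSD pair, and again, without BBWC behind the SSDs this is not safe here):

```shell
# NOT used in this build -- shown only to illustrate the rejected option.
# External XFS journal on a hypothetical SSD partition /dev/sdb1:
mkfs.xfs -l logdev=/dev/sdb1,size=128m /dev/md0

# An external log device must then be named at every mount:
mount -o logdev=/dev/sdb1,inode64,nobarrier /dev/md0 /mail
```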
Going internal with the log in this mail scenario can cause a serious amount of extra IOPS in the filesystem data section, specifically Allocation Group 0. If we did the "normal" RAID1 concat, all of that log IO would hit the first RAID1 pair; on this system the load could hammer that spindle pretty hard, making access to mailboxes in AG0 slower than the others. With 6 RAID10 arrays in a concat, the internal log writes are instead striped across the 6 spindles of the first array. With 512MB of BBWC backing that array and optimizing writeout, and with delaylog, this yields optimal log write performance without slowing down mailbox file access in AG0.

To create such a setup we'd do something like the following, assuming the mobo LSI controller yields sd[ab] and the 6 array devices on the PCIe LSI cards yield sd[cdefgh]:
- Create two RAID10 arrays, each of 12 drives, in the WebBIOS GUI of each LSI card, using a strip size of 32KB which should yield good random r/w performance for any mailbox format. Use the following policies for each array: RW, Normal, Wback, Direct, Disable, No, and use the full size.
- Create the concatenated md device:

$ mdadm -C /dev/md0 -l linear -n 6 /dev/sd[cdefgh]
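To have the concat assemble automatically at boot, the array definition can be recorded afterward (a sketch; the conf path and initramfs command below assume a Debian-style system and will differ elsewhere):

```shell
# Record the linear array so it assembles at boot
# (path assumes Debian/Ubuntu; RHEL-family uses /etc/mdadm.conf):
mdadm --detail --scan >> /etc/mdadm/mdadm.conf

# Rebuild the initramfs so /dev/md0 exists before fstab is processed
# (Debian-specific; use dracut on RHEL-family systems):
update-initramfs -u
```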
- Then we format it with XFS, optimizing the AG layout for our mailbox workload and aligning write allocation to each hardware array's stripe (note that -d suboptions are comma separated):

$ mkfs.xfs -d agcount=24,su=32k,sw=6 /dev/md0
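The su/sw arithmetic behind those flags, and the sunit/swidth values xfs_info should later report (it shows them in filesystem blocks), works out as follows, assuming the default 4KB block size:

```shell
#!/bin/sh
# Stripe geometry behind mkfs.xfs -d su=32k,sw=6 (default 4KB blocks assumed).
SU_KB=32   # per-disk strip size chosen in WebBIOS
SW=6       # data spindles per 12-drive RAID10 array (half are mirrors)
BLK_KB=4   # default XFS block size

FULL_STRIPE_KB=$((SU_KB * SW))    # one full stripe per RAID10 array
SUNIT_BLKS=$((SU_KB / BLK_KB))    # what xfs_info reports as sunit
SWIDTH_BLKS=$((SUNIT_BLKS * SW))  # what xfs_info reports as swidth

echo "full stripe: ${FULL_STRIPE_KB} KB, sunit=${SUNIT_BLKS} blks, swidth=${SWIDTH_BLKS} blks"
```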
This yields 4 AGs per RAID10 array which will minimize the traditional inode64 head seeking overhead on striped arrays, while still yielding fantastic allocation parallelism with 24 AGs.
Optimal fstab entry for an MTA queue/mailbox workload, assuming kernel 2.6.39+:

/dev/md0    /mail    xfs    defaults,inode64,nobarrier    0 0
We disable write barriers since we have BBWC. And that 1.5GB of BBWC will yield extremely low Dovecot write latency and high throughput.
Given the throughput available, if you're running Postfix on this box you will want to create a directory on this filesystem for the Postfix spool as well. Postfix hashes its spool files across many dozens to hundreds of subdirectories, so you'll get 100% parallelism across all AGs, and thus all disks.
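A minimal way to point Postfix at such a directory is a main.cf fragment like the following (a sketch; /mail/postfix-spool is a hypothetical path, and Postfix must be stopped and the old queue contents moved over before the change takes effect):

```shell
# /etc/postfix/main.cf fragment -- relocate the queue onto the XFS filesystem
# (hypothetical path; stop Postfix and migrate the existing queue first):
queue_directory = /mail/postfix-spool
```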
It's very likely none of you will decide to build this exact system. My hope is that the design concepts and components used, along with the low cost and high performance of this machine, may be educational, give people new ideas, or steer them in directions they may not have previously considered.
-- Stan