Sven, why didn't you chime in? Your setup is at a similar scale and I think your insights would be valuable here. Or maybe you could repost your last message on this topic. Or was that discussion off-list? I can't recall.
Anyway, I missed this post, Murray. Thanks, Ed, for dredging this up. Maybe this will give you some insight, or possibly confuse you. :)
On 1/5/2014 7:06 AM, Murray Trainer wrote:
Hi All,
I am trying to determine whether a mail server cluster based on Dovecot will be capable of supporting 500,000+ mailboxes with about 50,000 IMAP and 5000 active POP3 connections. I have looked at the Dovecot clustering suggestions here:
http://blog.dovecot.org/2012/02/dovecot-clustering-with-dsync-based.html
and some other Dovecot mailing list threads but I am not sure how many users such a setup will handle. I have a concern about the I/O performance of NFS in the suggested architecture above. One possible option available to us is to split up the mailboxes over multiple clusters with subsets of domains. Is there anyone out there currently running this many users on a Dovecot based mail cluster? Some suggestions or advice on the best way to go would be greatly appreciated.
As with MTAs, Dovecot requires minuscule CPU power for most tasks. Body searches are the only operations that eat meaningful CPU, and only when indexes aren't up to date.
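If body searches do become a CPU (and IO) problem, a full-text-search backend keeps those indexes current so searches stay cheap. A minimal sketch, assuming Dovecot 2.x with the Solr FTS backend (the Solr URL is a placeholder, not a recommendation):

  # conf.d/90-plugin.conf (illustrative)
  mail_plugins = $mail_plugins fts fts_solr

  plugin {
    fts = solr
    fts_solr = url=http://solr.example.com:8983/solr/
  }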
As with MTAs, mailbox server performance is limited by disk IO, but it is also limited by memory capacity, since IMAP connections are long-lived, unlike MTA connections, which last only a few seconds.
Thus, very similar to the advice I gave you WRT MTAs, you can do this with as few as two hosts in the cluster, or as many as you want. You simply need sufficient memory for concurrent user connections, and sufficient disk IO.
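To put rough numbers on the memory side, the per-service limits are what you'd actually turn up. A sketch assuming one imap process per connection and the 50K/5K load split evenly across two Dovecot guests (every figure here is an assumption, not a measurement):

  # conf.d/10-master.conf (illustrative, ~25K IMAP + ~2.5K POP3 per guest)
  service imap {
    process_limit = 25600
  }
  service pop3 {
    process_limit = 2560
  }
  service imap-login {
    service_count = 0        # high-performance mode: long-lived login processes
    client_limit = 10000     # each login process handles many connections
    process_min_avail = 4
  }

At an assumed few MB of resident memory per idle imap process, 25K sessions per guest works out to tens of GB of RAM, which is what the RAM figures further down are sized around.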
The architecture of the IO subsystem depends greatly on which mailbox format you plan to use. Maildir is extremely metadata heavy and thus does not perform all that well with cluster filesystems such as OCFS or GFS, no matter how fast the SAN array controller and disks may be. It can work well with NFS. Mdbox isn't metadata heavy and works much better with cluster filesystems.
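If you go mdbox, the change on the Dovecot side is just the mail location (the path and rotate size below are illustrative assumptions):

  # conf.d/10-mail.conf (illustrative)
  mail_location = mdbox:~/mdbox

  # optional: let the m.* storage files grow larger before rotating
  mdbox_rotate_size = 16M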
Neither NFS nor a cluster filesystem setup can match the performance of a standalone filesystem on direct-attached disk or a SAN LUN. But standalone filesystems make less efficient use of total storage capacity, and if using DAS, failover, resiliency, etc. are far less than optimal.
With correct mail routing from your MTAs to your Dovecot servers, and with Dovecot director (see the sketch after this list), you can use any of these architectures. Which one you choose boils down to:
- Ease of management
- Budget
- Storage efficiency
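Here's a minimal director sketch for that kind of layout, assuming two director hosts and two Dovecot backends (every address below is a placeholder, and the passdb on the directors must also return a proxy field, which isn't shown):

  # dovecot.conf on the director hosts (illustrative)
  director_servers = 10.0.0.1 10.0.0.2
  director_mail_servers = 10.0.1.1 10.0.1.2

  service director {
    unix_listener login/director {
      mode = 0666
    }
    fifo_listener login/proxy-notify {
      mode = 0666
    }
    inet_listener {
      port = 9090          # ring traffic between the director hosts
    }
  }

  service imap-login {
    executable = imap-login director
  }
  service pop3-login {
    executable = pop3-login director
  }

The point of director is that a given user always lands on the same backend, so index and cache locality is preserved no matter which storage architecture you pick.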
The NFS and cluster filesystem solutions are generally significantly more expensive than filesystem on DAS, because the NFS server and SAN array required for 500,000 mailboxes are costly. If you go NFS, you'd better get a NetApp filer. Not just for the hardware, snapshots, etc., but for the engineering support expertise. They know NFS better than the Pope knows Jesus and can get you tuned for max performance.
Standalone servers/filesystems with local disk give you dramatically more bang for the buck. You can handle the same load with fewer servers and with quicker response times. You can use SAN storage instead of direct attach, but at a cost equivalent to the cluster filesystem architecture. You'll then benefit from storage efficiency, PIT snapshots, etc.
Again, random disk IOPS is the most important factor with mailbox storage. With 50K logged-in IMAP users and 5K POP3 users, we simply have to guesstimate IOPS if you don't already have this data. I assume you don't, as you didn't provide it. It is the KEY information required to size your architecture properly, and in the most cost effective manner.
Let's assume for argument's sake that your 50K concurrent IMAP users and your 5K POP users generate 8,000 IOPS, which is probably a high guess. 10K SAS drives do ~225 IOPS.
8000 / 225 ≈ 36 disks, x2 for RAID10 mirroring = 72 drives
So as a wild-ass guesstimate you'd need approximately 72 SAS drives at 10K spindle speed for this workload. If you need to use high-capacity 7.2K SATA or SAS drives to meet your offered mailbox capacity, you'll need 144 drives.
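(The 144 figure falls out of the same arithmetic if you assume roughly 110 IOPS per 7.2K nearline drive: 8000 / 110 ≈ 72 disks, x2 for RAID10 = 144.)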
Whether you go NFS, cluster on SAN, or standalone filesystems on SAN, VMware with HA, vMotion, etc., is a must, as it gives you instant host failover and far easier management than KVM, Xen, etc.
One possible hardware solution consists of:
- Qty 1: HP 4730 SAN controller with 25x 600GB 10K SAS drives
- Qty 3: expansion chassis, for 75 drives, 45TB raw capacity, 21.6TB net after one spare per chassis and RAID10, 8100 IOPS
- Qty 2: Dell PowerEdge 320, 4 core Xeon and 96GB RAM, running Dovecot
- Qty 1: HP ProLiant DL320e with 8GB RAM, running Dovecot Director
You'd run ESX on each Dell with one Linux guest per physical box. Each guest would be allocated 46GB of RAM to facilitate failover. This much RAM is rather costly, but VMware licenses are far more, so it saves money to use a beefy 2-box cluster vs a 3- or 4-box cluster of weaker machines. You'd create multiple RAID10 arrays of equal numbers of disks on the 4730, using a 32KB strip size, and span the RAID sets into 2 volumes. You'd export each volume as a LUN to both ESX hosts, create an RDM of each LUN, and assign one RDM to each of your guests. Each guest would format its RDM with
~# mkfs.xfs -d agcount=24 /dev/[device]
giving you 24 allocation groups for parallelism. Do -not- align XFS (sunit/swidth) with a small-file random IO workload. It will murder performance. You get two 10TB filesystems, each for 250,000 mailboxes, or ~44MB average per mailbox. If that's not enough storage, buy the 900GB drives for 66MB/mailbox. If that's still not enough, use more expansion chassis and more RAID sets per volume, or switch to a large-capacity SAS/SATA model. With 50K concurrent users, don't even think about using RAID5/6. The read-modify-write penalty will murder performance and then urinate on its grave.
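For completeness, the matching mount entry would look something like the sketch below (device and mount point are placeholders; the options are typical for a small-file XFS mail workload, not a tested recommendation):

  # /etc/fstab (illustrative)
  /dev/[device]   /var/vmail   xfs   noatime,inode64   0 0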
With HA configured, if one box or one guest dies, the guest will automatically be restarted on the remaining host. Since both hosts see both LUNs and RDMs, the guest boots up and has its filesystem. This is an infinitely better solution than a single shared cluster filesystem. The dual XFS filesystems will be much faster. If the CFS gets corrupted, all your users are down--with two local filesystems, only half the users are down. Check/repair of a 20TB GFS2/OCFS2 filesystem will take -much- longer than xfs_repair on a 10TB FS, possibly hours once you have all 500K mailboxes on it. Etc, etc.
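And if you ever do have to check one of those XFS filesystems, do a report-only pass first (device is a placeholder):

  ~# umount /dev/[device]
  ~# xfs_repair -n /dev/[device]    # -n = no-modify, report problems only; drop it to actually repair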
-- Stan