[Dovecot] Configuration advice for a 50,000-mailbox server (or servers)
Hello
I need some feedback and advice from experienced admins: in a few months I will have to set up an email system for approximately 50K "intensive" users.
The only mandatory requirement is that I must use HP ProLiant servers.
The operating system will be FreeBSD or Linux
Thank you for any advice.
Frank,
Here we have approx. 200K users with 4,000 concurrent connections (90% POP3 users). All servers run in a VMware virtual environment on Supermicro hardware, with a NetApp MetroCluster storage solution (NFS over a 10G Ethernet network). POP3 sessions take between 40 and 300 milliseconds for connect, auth, and list. All accounts live in LDAP; the OS is CentOS 5, with Exim as the MTA relay.
Regards
On 17-04-2012 4:54, Frank Bonnet wrote:
Hello
I need some feedback and advice from experienced admins: in a few months I will have to set up an email system for approximately 50K "intensive" users.
The only mandatory requirement is that I must use HP ProLiant servers.
The operating system will be FreeBSD or Linux
Thank you for any advice.
-- Mauricio López Riffo, Red Hat Certified Engineer 804006455319519, Internet Services Administrator, Engineering Area, Gtd Internet S.A., http://www.grupogtd.com/, Moneda 920, Oficina 602 - Phone: +562 4139742
On Tue, Apr 17, 2012 at 08:54:15AM -0300, Mauricio López Riffo wrote:
Here we have approx. 200K users with 4,000 concurrent connections (90% POP3 users)
How do you measure "concurrent" POP3 users?
All servers run in a VMware virtual environment on Supermicro hardware, with a NetApp MetroCluster storage solution (NFS over a 10G Ethernet network). POP3 sessions take between 40 and 300 milliseconds for connect, auth, and list. All accounts live in LDAP; the OS is CentOS 5, with Exim as the MTA relay.
Very interesting config. We're close to 1M accounts, GPFS cluster filesystem, LDAP, RHEL5/6, and Postfix + Dovecot director for POP/IMAP/LMTP, and we're moving from maildir to mdbox.
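(For the curious, the per-user conversion with dsync is roughly the following; the path, username, and mail_location here are illustrative, not our exact setup.)

    # mail_location already switched to the new format, e.g.:
    #   mail_location = mdbox:~/mdbox
    $ dsync -u someuser@example.com mirror maildir:~/Maildir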
What mailbox format are you using? Do you have a director, or are accounts sticky to a server some other way?
How's the NFS performance? I've always been wary that NFS works terribly with many small files (i.e. maildir).
What does the MetroCluster give you? Is it for disaster recovery at a second location, or do you have two active locations working against the same filesystem?
-jf
Jan,
How do you measure "concurrent" POP3 users?
We use Cacti for metrics such as concurrent connections and POP3 delay, and Zabbix for alarms.
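(For what it's worth, the data source behind a concurrent-connections graph can be as simple as counting established sessions on the POP3/POP3S ports; a rough sketch assuming standard ports, not our exact Cacti template:)

    $ netstat -tn | awk '$6 == "ESTABLISHED" && $4 ~ /:(110|995)$/' | wc -l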
What mailbox format are you using? Do you have a director, or are accounts sticky to a server some other way?
Maildir as the mailbox format, and currently without a director, but we are testing a new environment with a director to reduce the number of servers (7 virtual servers with 4 vCPUs and 6 GB RAM each). In the meantime an LVS/Piranha setup handles sticky connections, but it's not enough, which is why we will use a director.
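(A minimal sketch of the director side, with purely illustrative addresses; see the Dovecot director documentation for the full picture:)

    # conf.d/10-director.conf
    director_servers      = 10.0.0.1 10.0.0.2
    director_mail_servers = 10.0.1.1-10.0.1.7
    service imap-login {
      executable = imap-login director
    }
    service pop3-login {
      executable = pop3-login director
    }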
How's the NFS performance? I've always been wary that NFS works terribly with many small files (i.e. maildir).
At peak hours our storage handles about 10 thousand total ops/s (reads and writes combined, roughly 80/20), and CPU usage is around 70%.
What does the MetroCluster give you? Is it for disaster recovery at a second location, or do you have two active locations working against the same filesystem?
Our storage has two missions: first, high availability, because we have two datacenters (an N+1 environment); and second, the ability to run the two storage controllers in a cluster serving the same filesystems across those two datacenters (both VMware and mail storage reside on NFS filesystems).
Very interesting config. We're close to 1M accounts, GPFS cluster filesystem, LDAP, RHEL5/6, and Postfix + Dovecot director for POP/IMAP/LMTP, and we're moving from maildir to mdbox.
1M = 1 million? How many servers do you have? What hardware?
Any help or contribution is welcome :)
Regards
-- Mauricio López Riffo, Red Hat Certified Engineer 804006455319519, Internet Services Administrator, Engineering Area, Gtd Internet S.A., http://www.grupogtd.com/, Moneda 920, Oficina 602 - Phone: +562 4139742
On Tue, Apr 17, 2012 at 10:10:02AM -0300, Mauricio López Riffo wrote:
1M = 1 million?
976508 to be exact :-) but it's very much a useless number; lots and lots of these are inactive. A better number is probably that we're seeing about 80 logins/second over the last hour (just checked now; not sure whether this is the busiest hour or not).
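(Nothing fancy behind that number; roughly something like the following against syslog, divided by 3600 -- the log path and timestamp format are assumptions, adjust for your setup:)

    $ grep -c ' 09:.*-login: Login: ' /var/log/maillog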
How many servers do you have? What hardware?
7 backend Dovecot servers (two IBM x336, three x346, and two x3550, with 8 GB of memory for the x336/x346 and 16 GB for the x3550s), and 2 frontend Dovecot directors (IBM x3550).
None of these are really very busy, so we could probably reduce the number of backends a bit if we wanted. Our struggle is the number of iops we're able to get from the backend storage (IBM DS4800), mostly a problem when we have storms of incoming marketing messages in addition to the pop/imap traffic.
-jf
On 4/17/2012 3:08 PM, Jan-Frode Myklebust wrote:
Our struggle is the number of iops we're able to get from the backend storage (IBM DS4800), mostly a problem when we have storms of incoming marketing messages in addition to the pop/imap traffic.
This issue has come up twice on the Postfix list in less than a month. You can fix this specific problem very easily. Only marketing servers and busy/misconfigured list servers make many parallel connections to your MX hosts. Allowing them to blast all those messages over parallel connections is what bogs down your spool storage. The fix is simple: limit all SMTP clients to a small number of parallel connections. This will slow down marketing and list server blasts without affecting normal sending MTAs. To do so, add this to /etc/postfix/main.cf:
smtpd_client_connection_count_limit = 4
The default Postfix process limit is 100. The concurrent connection limit is 1/2 the process limit, so 50 parallel connections per client IP are allowed by default. If remote hosts also do connection caching, they can force feed your MTA many hundreds of messages/sec. Limiting concurrent connections will decrease their mail rate to a small fraction of what you're seeing now, reducing IOPS load on your spool storage significantly.
This is a good starting point; you may need to tweak it up a little bit. Some list servers (such as the XFS list's) will unsubscribe members if their multiple connections keep getting refused, so tweak this value until you find your sweet spot.
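A hedged starting point in main.cf might look like this (the anvil rate-limit lines are optional extras; values are illustrative):

    smtpd_client_connection_count_limit = 4
    # optionally also cap connections per client per time unit (anvil)
    smtpd_client_connection_rate_limit  = 60
    anvil_rate_time_unit                = 60s
    # your own networks stay exempt from these limits
    smtpd_client_event_limit_exceptions = $mynetworks

Then "postfix reload".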
-- Stan
On Thu, Apr 19, 2012 at 07:31:13PM -0500, Stan Hoeppner wrote:
This issue has come up twice on the Postfix list in less than a month.
Oh, thanks! I'll look into those list posts. I had mostly given up on solving this with rate limits and had decided to throw hardware at the problem, when I saw the log entries for sender *.anpdm.com. It seems to be a newsletter sender, which showed up as 203 different mail server IP addresses in our incoming mail server logs, from 53 different class B networks and 8 different class A networks.
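(Roughly how one can count them from the Postfix logs -- the pattern is approximate, not the exact command I used:)

    $ grep -o 'anpdm\.com\[[0-9.]*\]' /var/log/maillog | sort -u | wc -l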
Will give smtpd_client_connection_count_limit a try..
-jf
On 4/21/2012 4:52 AM, Jan-Frode Myklebust wrote:
On Thu, Apr 19, 2012 at 07:31:13PM -0500, Stan Hoeppner wrote:
This issue has come up twice on the Postfix list in less than a month.
Oh, thanks! I'll look into those list posts. I had mostly given up on solving this with rate limits and had decided to throw hardware at the problem, when I saw the log entries for sender *.anpdm.com. It seems to be a newsletter sender, which showed up as 203 different mail server IP addresses in our incoming mail server logs, from 53 different class B networks and 8 different class A networks.
Yeah, they're a newsletter service provider.
Will give smtpd_client_connection_count_limit a try..
Setting this to 1 or 2 should severely slow their delivery rate. You can also do rate limiting at a much more fine-grained level with a Postfix policy daemon such as postfwd (Postfix firewall daemon), though the setup is a bit more complicated.
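For example, a postfwd rule in the spirit of the following (syntax from memory, so verify against the postfwd manual; values are purely illustrative), hooked into Postfix with check_policy_service:

    # /etc/postfwd.cf
    id=RATE01; client_address=0.0.0.0/0; \
        action=rate(client_address/100/600/450 4.7.1 too many messages, slow down)

    # main.cf (default postfwd port assumed)
    smtpd_recipient_restrictions =
        permit_mynetworks,
        reject_unauth_destination,
        check_policy_service inet:127.0.0.1:10040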
-- Stan
Hello
Thanks for your answer. My problem will be IMAPS connections; I don't know how many I will have, but it's possible we'll have 4,000-6,000 concurrent IMAPS connections during working hours.
POP3 users will be very few.
On 17/04/2012 13:54, Mauricio López Riffo wrote:
Frank,
Here we have approx. 200K users with 4,000 concurrent connections (90% POP3 users). All servers run in a VMware virtual environment on Supermicro hardware, with a NetApp MetroCluster storage solution (NFS over a 10G Ethernet network). POP3 sessions take between 40 and 300 milliseconds for connect, auth, and list. All accounts live in LDAP; the OS is CentOS 5, with Exim as the MTA relay.
Regards
On 4/17/2012 8:01 AM, Frank Bonnet wrote:
have 4,000-6,000 concurrent IMAPS connections during working hours
POP3 users will be very few
How much disk space do you plan to offer per user mail directory? Will you be using quotas?
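(If you do use quotas, Dovecot's side is just the quota plugin; a minimal illustrative sketch, not tuned advice:)

    # conf.d/90-quota.conf
    mail_plugins = $mail_plugins quota
    protocol imap {
      mail_plugins = $mail_plugins imap_quota
    }
    plugin {
      quota      = maildir:User quota
      quota_rule = *:storage=200M
    }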
I need some feedback and advice from experienced admins: in a few months I will have to set up an email system for approximately 50K "intensive" users.
The only mandatory requirement is that I must use HP ProLiant servers.
The operating system will be FreeBSD or Linux
Quite a coincidence, Frank. It's a shame it has to be an HP solution. I just finished designing a high-quality, high-performance 4U 72-drive server yesterday that will easily handle 15K concurrent IMAP users, for only ~$24K USD, i.e. $0.48/user at 50K users. So it may not be of interest to you, but maybe to others. It is capable of ~7K random 4KB r/w IOPS sustained and has 10TB of net space, an average of ~200MB per user mail directory assuming 50K users. The parts for this machine run ~$24K USD at Newegg. I just made the wishlist public so it should be available tomorrow or Friday. I'll provide the link when it's available.

All components used are top quality, the best available in the channel. The reliability of the properly assembled server will rival that of any HP/Dell/IBM machine. For those not familiar with SuperMicro, they manufacture many of Intel's retail boards and have for over a decade. The majority of the COTS systems used in large academic HPC clusters are built with SuperMicro chassis and motherboards, as are some 1000+ node US DOE clusters.

Here are the basics:
72x 2.5" bay 4U chassis, 3x SAS backplanes each w/redundant expanders: http://www.newegg.com/Product/Product.aspx?Item=N82E16811152212 78x Seagate 10K SAS 300GB drives--includes 6 spares Triple LSI 9261-8i dual port 512MB BBWC RAID controllers each with 2 redundant load balanced connections to a backplane 24 drives per controller for lowest latency, maximum throughput, 1.5GB total write cache, a rebuild affects only one controller, etc SuperMicro mainboard, 2x 6-core 3.3GHz AMD Interlagos Opteron CPUs 64GB Reg ECC DDR3-1066, 8x8GB DIMMs, 34GB/s aggregate bandwidth Dual Intel Quad port GbE NICs, 10 total Intel GbE ports Use the 2 mobo ports for redundant management links Aggregate 4 ports, 2 on each quad NIC, for mail traffic Aggregate the remaining 4 for remote backup, future connection to an iSCSI SAN array, etc Or however works best--having 8 GbEs gives flexibility and these two cards are only $500 of the total 2x Intel 20GB SSD internal fixed drives, hardware mirrored by the onboard LSI SAS chip, for boot/OS
The key to performance, and yielding a single file tree, is once again using XFS to take advantage of this large spindle count across 3 RAID controllers. Unlike previous configurations where I recommended using a straight md concatenation of hardware RAID1 pairs, in this case we're going to use a concatenation of 6 hardware RAID10 arrays. There are a couple of reasons for doing so in this case:
- Using 36 device names in a single md command line is less than intuitive and possibly error prone. Using 6 is more manageable.
- We have 3 BBWC RAID controllers w/24 drives each. This is a high-performance server and will see a high IO load in production. In many cases one would use an external filesystem journal, which we could easily do and get great performance with our mirrored SSDs. However, the SSDs are not backed by BBWC, so a UPS failure or system crash could hose the journal. So we'll go with the default internal journal, which will be backed by the BBWC.
Going internal with the log in this mail scenario can cause a serious amount of extra IOPS on the filesystem data section, this being Allocation Group 0. If we did the "normal" RAID1 concat, all the log IO would hit the first RAID1 pair. On this system the load may hit that spindle pretty hard, making access to mailboxes in AG0 slower than others. With 6 RAID10 arrays in a concat, the internal log writes will be striped across 6 spindles in the first array. With 512MB of BBWC backing that array and optimizing writeout, and with delaylog, this will yield optimal log write performance without slowing down mailbox file access in AG0. To create such a setup we'd do something like this, assuming the mobo LSI controller yields sd[ab] and the 6 array devices on the PCIe LSI cards yield sd[cdefgh]:
- Create two RAID10 arrays, each of 12 drives, in the WebBIOS GUI of each LSI card, using a strip size of 32KB which should yield good random r/w performance for any mailbox format. Use the following policies for each array: RW, Normal, Wback, Direct, Disable, No, and use the full size.
- Create the concatenated md device:
  $ mdadm -C /dev/md0 -l linear -n 6 /dev/sd[cdefgh]
- Then format it with XFS, optimizing the AG layout for our mailbox workload and aligning allocation and writes to each hardware array's stripe:
  $ mkfs.xfs -d agcount=24,su=32k,sw=6 /dev/md0
This yields 4 AGs per RAID10 array which will minimize the traditional inode64 head seeking overhead on striped arrays, while still yielding fantastic allocation parallelism with 24 AGs.
Optimal fstab entry for an MTA queue/mailbox workload, assuming kernel 2.6.39+:
  /dev/md0   /mail   xfs   defaults,inode64,nobarrier   0 0
We disable write barriers as we have BBWC, and that 1.5GB of BBWC will yield extremely low Dovecot write latency and excellent throughput.
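Once it's mounted, it's worth sanity-checking the geometry (illustrative; assumes the fstab entry above):

  $ mkdir -p /mail && mount /mail
  $ xfs_info /mail     # confirm agcount=24 and the sunit/swidth stripe alignment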
Given the throughput available, if you're running Postfix on this box you will want to create a directory on this filesystem for the Postfix spool. Postfix puts the spool files in dozens to hundreds of subdirectories, so you'll get 100% parallelism across all AGs, and thus all disks.
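For instance (illustrative path only; stop Postfix and move the existing queue before changing it):

  # main.cf
  queue_directory = /mail/postfix-spool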
It's very likely none of you will decide to build this system. My hope is that some of the design concepts and components used, along with the low cost but high performance of this machine, may be educational, or simply give people new ideas and steer them in directions they may not have previously considered.
-- Stan
On 4/19/2012 4:40 AM, Stan Hoeppner wrote:
On 4/17/2012 8:01 AM, Frank Bonnet wrote:
have 4,000-6,000 concurrent IMAPS connections during working hours
for approximately 50K "intensive" users
The only mandatory requirement is that I must use HP ProLiant servers
The operating system will be FreeBSD or Linux
I just made the wishlist public so it should be available tomorrow or Friday. I'll provide the link when it's available.
And here it is: http://secure.newegg.com/WishList/PublicWishDetail.aspx?WishListNumber=16797...
Since your requirement is for an HP solution, following is an HP server and storage system solution of roughly identical performance and redundancy to the SuperMicro based system I detailed. The HP system solution is $44,263, almost double the cost at $20,000 more. Due to the stupidity of Newegg requiring all wish lists to be reviewed before going live, I'll simply provide the links to all the products.
Yes boys and girls, Newegg isn't just consumer products. They carry nearly the entire line of HP Proliant servers and storage, including the 4-way 48-core Opteron DL585 G7 w/64GB, the P2000 fiber channel array, and much more. In this case they sell every product needed to assemble this complete mail server solution:
1x  http://www.newegg.com/Product/Product.aspx?Item=N82E16859105807
8x  http://www.newegg.com/Product/Product.aspx?Item=N82E16820326150
3x  http://www.newegg.com/Product/Product.aspx?Item=N82E16816401143
80x http://www.newegg.com/Product/Product.aspx?Item=N82E16822332061
3x  http://www.newegg.com/Product/Product.aspx?Item=N82E16816118109
3x  http://www.newegg.com/Product/Product.aspx?Item=N82E16816118163
2x  http://www.newegg.com/Product/Product.aspx?Item=N82E16816133048
2x  http://www.newegg.com/Product/Product.aspx?Item=N82E16833106050
The 9280-8e RAID controllers are identical to the 9261-8i boards but have 2 external instead of internal x4 6Gb SAS ports. I spec them instead of the Smart Array boards as they're far cheaper, easier to work with, and offer equal or superior performance. Thus everything I wrote above about the RAID/md/XFS setup is valid for this system as well, with the exception that you would configure 1 global hot spare in each chassis, since these units have 25 drive bays instead of 24. The D2700 units come with 20" 8088 cables. I additionally spec'd two 3 ft cables to make sure we reach all 3 disk chassis from the server, thinking the server would be on top with the 3 disk chassis below.
I hope this and my previous post are helpful in one aspect or another to Frank and anyone else. I spent more than a few minutes on these designs. ;) Days in fact on the SuperMicro design, only a couple of hours on the HP. It wouldn't have taken quite so long if all PCIe slots were created equal (x8), which they're not, or if modern servers didn't require 4 different types of DIMMs depending on how many slots you want to fill and how much expansion capacity you need without having to throw out all the previous memory, which many folks end up doing out of ignorance. Memory configuration is simply too darn complicated with high cap servers containing 8 channels and 24 slots.
-- Stan
Here we have approx. 200K users with 4,000 concurrent connections (90% POP3 users). All servers run in a VMware virtual environment on Supermicro hardware, with a NetApp MetroCluster storage solution (NFS over a 10G Ethernet network). POP3 sessions take between 40 and 300 milliseconds for connect, auth, and list. All accounts live in LDAP; the OS is CentOS 5, with Exim as the MTA relay.
Similar setup here. Maybe 15-20K concurrent connections, IMAP only (POP is not handled by Dovecot yet), and about 800K mailboxes. We run all bare-metal Linux servers, currently 35 of them handling the load easily; we could probably handle it with a third of the servers.
In front of the 35 servers are 3 directors, handling IMAP only, although I'm in the process of adding LMTP/Sieve to the mix.
Backend storage is NetApp Metrocluster over 2 datacenters.
Cor
participants (5)
- Cor Bosman
- Frank Bonnet
- Jan-Frode Myklebust
- Mauricio López Riffo
- Stan Hoeppner