[Dovecot] Please help to make a decision
Hi, we are going to implement Dovecot for 1 million users and will use more than 100 TB of storage space. We are currently evaluating two solutions: NFS or GFS2 (over Fibre Channel storage). Can someone help us make a decision? What kind of storage solution can we use to achieve good performance and scalability?
On 24.3.2013, at 18.12, Tigran Petrosyan tpetrosy@gmail.com wrote:
We are going to implement Dovecot for 1 million users and will use more than 100 TB of storage space. We are currently evaluating two solutions: NFS or GFS2 (over Fibre Channel storage). Can someone help us make a decision? What kind of storage solution can we use to achieve good performance and scalability?
I remember people complaining about GFS2 (and other cluster filesystems) having bad performance. But in any case, whatever you use, be sure to also use http://wiki2.dovecot.org/Director. Even if it's not strictly needed, it improves performance with GFS2.
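Roughly, a director ring is configured along these lines (the addresses, ranges and listener details below are only an illustrative sketch based on that wiki page, not a drop-in config):

director_servers = 10.1.0.2 10.1.0.3            # the director/proxy hosts themselves
director_mail_servers = 10.2.0.10-10.2.0.29     # the backend Dovecot servers
service director {
  unix_listener login/director {
    mode = 0666
  }
  fifo_listener login/proxy-notify {
    mode = 0666
  }
  inet_listener {
    port = 9090    # director-to-director ring traffic
  }
}
service imap-login {
  executable = imap-login director
}
service pop3-login {
  executable = pop3-login director
}
# plus a passdb on the directors that returns proxy=y, so logins get forwarded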
Object storages also scale nicely (e.g. Scality). For best performance with them you'd need the Dovecot object storage plugin (not open source).
On 3/24/2013 1:45 PM, Timo Sirainen wrote:
On 24.3.2013, at 18.12, Tigran Petrosyan tpetrosy@gmail.com wrote:
We are going to implement Dovecot for 1 million users and will use more than 100 TB of storage space. We are currently evaluating two solutions: NFS or GFS2 (over Fibre Channel storage). Can someone help us make a decision? What kind of storage solution can we use to achieve good performance and scalability?
This greatly depends upon whose cluster NFS storage product we're talking about.
I remember people complaining about GFS2 (and other cluster filesystems) having bad performance. But in any case, whatever you use, be sure to also use http://wiki2.dovecot.org/Director. Even if it's not strictly needed, it improves performance with GFS2.
GFS2 and OCFS2 performance suffers when using maildir because the filesystem metadata is broadcast among all nodes, creating high latency and low metadata IOPS. The more nodes, the worse this problem becomes. If using old-fashioned UNIX mbox, or Dovecot mdbox with a good number of emails per file, this isn't as much of an issue, since metadata changes are few. If using maildir with a cluster filesystem, a small number of fat nodes is mandatory to minimize metadata traffic.
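If you go the mdbox route, the relevant knobs are roughly the following (the size is purely illustrative, tune it for your message mix and backup strategy):

mail_location = mdbox:~/mdbox
# pack many messages into each storage file so cluster-wide metadata
# updates stay rare; the default rotate size is much smaller
mdbox_rotate_size = 32M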
If using Fibre Channel SAN storage with your cluster filesystem, keep in mind that one single-port 8 Gb HBA and its SFP transceiver costs significantly more than a 1U server. Given that a single 8 Gb FC port can carry (800 MB/s / 32 KB) = 25,600 average-sized emails per second, or about 2.2 billion emails/day, fat nodes make more sense from a financial standpoint as well. Mail workloads don't require much CPU, but they need low-latency disk and network IO and lots of memory. Four dual-socket servers with 8-core Opterons (16 cores per server), 128 GB RAM, two single-port 8 Gb FC HBAs with SCSI multipath, dual GbE ports for user traffic and dual GbE for GFS2 metadata traffic should fit the bill nicely.
Any quality high-performance SAN head with 2-4 ports per dual controller, or multiple SAN heads, that can expand to 480 or more drives is suitable. If the head has only 4 ports total you will need an FC switch with at least 8 ports, preferably two switches with a minimum of 4 ports each (8 is the smallest typically available); this gives maximum redundancy, as you survive a switch failure.

For transactional workloads you never want to use parity RAID, as the read-modify-write cycles caused by smaller-than-stripe-width writes degrade write throughput by a factor of 5:1 or more compared to non-parity RAID. So RAID10 is the only game in town, and thus you need lots of spindles. With 480x 600 GB 15K SAS drives (8x 60-bay 4U chassis) and 16 spares, you have 464 drives configured as 29 RAID10 arrays of 16 drives each, 4.8 TB net per array, yielding an optimal stripe width of 8x 32 KB = 256 KB. You would format each 4.8 TB exported LUN with GFS2, yielding 29 cluster filesystems with ~35K user mail directories on each. If you have a filesystem problem and must run a check/repair, or even worse restore from tape or D2D, you're only affecting up to 1/29th of your users, i.e. ~35K of the 1M. If you feel this is too many filesystems to manage, you can span arrays with the controller firmware or with mdraid/lvm2. And of course you will need a box dedicated to Director, which will spread connections across your 4 server nodes.
This is not a complete "how-to" obviously, but should give you some pointers/ideas on overall architecture options and best practices.
-- Stan
On Sun, 2013-03-24 at 20:12 +0400, Tigran Petrosyan wrote:
Hi, we are going to implement Dovecot for 1 million users and will use more than 100 TB of storage space. We are currently evaluating two solutions: NFS or GFS2 (over Fibre Channel storage). Can someone help us make a decision? What kind of storage solution can we use to achieve good performance and scalability?
I'd recommend NFS: very easy to scale and excellent performance. We limit to 8K simultaneous connections per server; they could do a lot more, but we've never seen anywhere near that anyway, usually around 5K per box at peak. However, that's POP3; IMAP is only used for webmail.
If you are only doing POP3, use INDEX=MEMORY as well, e.g.: mail_location = maildir:/var/vmail/%d/%1n/%1.1n/%2.1n/%n/Maildir:INDEX=MEMORY
But if using IMAP, then I understand Dovecot's director (we don't use it) is the better solution.
On 3/24/13 11:12 AM, Tigran Petrosyan wrote:
Hi, we are going to implement Dovecot for 1 million users and will use more than 100 TB of storage space. We are currently evaluating two solutions: NFS or GFS2 (over Fibre Channel storage). Can someone help us make a decision? What kind of storage solution can we use to achieve good performance and scalability?
NFS has worked well for us on a 65,000-user Dovecot cluster. We use a dual-controller NetApp in cluster mode, which gives great performance. You might also consider looking at the commercial version of Dovecot, which has the Object Storage plugin; that might suit your scalability needs much better (size- and especially budget-wise). I would also recommend testing with actual workloads similar to what you plan on implementing. Our team developed a mail-generating botnet with which we ran SMTP/IMAP/POP tests where we could control the levels of each.
Hi Tigran,
Managing a mail system for 1M-odd users, we ran for a few years on high-range SAN systems (NetApp, then EMC), but were not happy with the performance; whatever we tried (dual heads, fibre, and so on), it just couldn't handle the IOs. I must add that at that time we were not using Dovecot.
Then we moved to a completely different structure: 24 storage machines (plain CentOS as NFS servers), 7 front ends (webmail through IMAP plus POP3 servers) and 5 MXs, with all front-end machines running Dovecot. That was a major improvement in system performance, but we still weren't happy with the 50 TB of total storage we had. There was huge traffic between the front-end machines and storage, and at that time I was not sure the switches were handling the load properly, not to mention the load on the front-end machines, which sometimes needed a hard reboot to recover from NFS timeouts, even after trying some heavy optimizations all around, particularly on NFS.
Then we did look at the Dovecot director, but, not sure how it would handle 1M users, we moved to a proxy solution instead: we now run Dovecot on the 24 storage machines, our webmail system connects with IMAP directly to the final storage machine, as do the MXs with LMTP, and we only use the Dovecot proxy for POP3 access on the 7 front-end machines. And I must say, what a change. Since then the system has been running smoothly, with no more worries about NFS timeouts; the load average on all machines is down to almost nothing, as is the internal traffic on the switches (and our stress). Most importantly, the feedback from our users told us that we did the right thing.
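For reference, the POP3 proxying on the front ends is essentially just a passdb that returns the proxy and host extra fields pointing at the right storage box; a sketch (with an invented SQL schema, not our actual one) looks like:

# on the 7 front ends (POP3 proxy only)
passdb {
  driver = sql
  args = /etc/dovecot/dovecot-sql.conf.ext
}
# and in dovecot-sql.conf.ext (table and column names invented for the example):
#   password_query = SELECT NULL AS password, 'Y' AS nopassword, storage_host AS host, 'Y' AS proxy FROM users WHERE username = '%u'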
The only trouble: now and then we have to move users around, since if a machine gets full the only solution is to move data to one that has more space. But this is achieved easily with the dsync tool.
This is just my experience, and it might not be the best, but with the (limited) budget we had, we finally came up with a solution that can handle the load and got us away from SAN systems, which could never handle the IOs for mail access. For what it's worth, our storage machines each have only 4 x 1 TB SATA drives in RAID 10 and 16 GB of memory, which I've been told would never do the job, but it just works. Thanks Timo.
Hoping this will help in your decision,
Regards,
Thierry
On 24 Mar 2013, at 18:12, Tigran Petrosyan tpetrosy@gmail.com wrote:
Hi, we are going to implement Dovecot for 1 million users and will use more than 100 TB of storage space. We are currently evaluating two solutions: NFS or GFS2 (over Fibre Channel storage). Can someone help us make a decision? What kind of storage solution can we use to achieve good performance and scalability?
I believe a variation on that theme is to "double" each machine using DRBD so that machines are arranged in pairs: one can fail and the other will take over the load, i.e. each pair of machines mirrors the storage for the other. With this arrangement only warm failover is usually required, hence DRBD can run in async mode and the performance impact is low.
Note that I don't use any of the above; it was a setup described by Timo some years back.
Good luck
Ed W
On 3/28/2013 3:34 PM, Ed W wrote:
I believe a variation on that theme is to "double" each machine using DRBD so that machines are arranged in pairs: one can fail and the other will take over the load, i.e. each pair of machines mirrors the storage for the other. With this arrangement only warm failover is usually required, hence DRBD can run in async mode and the performance impact is low.
This is an active/passive setup, and doubles your hardware costs across the board, with no parallel performance gain. This is not financially feasible for 1M users. Going active/active would be better as you can cut in half the number of server nodes required. But here you must use a cluster filesystem, and you're still buying double the quantity of disks that are needed.
At this scale it is much more cost effective to acquire 4 midrange FC/iSCSI SAN heads with 120x 15K 600 GB SAS drives each, 480 total. With RAID10 you get 144 TB net capacity. An active/active DRBD solution would require 960 drives instead of 480 for the same net storage and IOPS. These drives run about $400 USD in such a bulk purchase, depending on vendor, so that's an extra ~$192,000 wasted on drives, not to mention all the extra JBOD chassis required and, more importantly, the extra power/cooling cost. You can obtain 4 low-frills, high-performance midrange SAN heads for quite a bit less than that $192,000; the Nexsan E60 comes to mind. Four FC SAN heads, each with dual active/active controllers and four 8 Gb FC ports, plus four expansion chassis give you 480x 600 GB 15K drives in 32U, leaving 8U at the bottom of the rack for the 10 kVA UPS needed to power them.
-- Stan
On 25/03/2013 18:47, Thierry de Montaudry wrote:
This is just my experience, and it might not be the best, but with the (limited) budget we had, we finally came up with a solution that can handle the load and got us away from SAN systems, which could never handle the IOs for mail access. For what it's worth, our storage machines each have only 4 x 1 TB SATA drives in RAID 10 and 16 GB of memory, which I've been told would never do the job, but it just works. Thanks Timo.
With only 48 effective 7.2k data spindles and 1M users, this would tend to suggest that only a tiny fraction of your users are logged in and performing IOs at any point in time.
The number of active sessions dictates IOPS requirements, not the total number of mailboxes, and the former may be drastically different between these two 1M-user sites. If 500K of your 1M users were logged in concurrently via webmail, I'd guess the heads of those 96 drives would hit their peak seek rate instantly and remain there, and iowait would go through the roof.
My previous posts in this thread assume that the worst-case scenario to architect for is 500K active, logged-in IMAP users at any given point in time.
-- Stan
Hi,
We have a similar setup to Thierry's, but not as big: only 40k users and 1.2 TB of used space, with only 300 concurrent POP3 and 1600 IMAP sessions. IMAP usage is increasing continuously.
Because we have a low budget, we implemented the following small solution:
- 2 static IMAP/POP3 proxies (no director), load balanced with the well-known CLUSTERIP module from iptables (poor man's load balancing, and it only works in layer 2 environments, but it works great for our needs and would be scalable too)
- 2 static SMTP relay servers, load balanced the same way as above.
- 4 storage machines in an active/passive setup with DRBD on top of LVM2. Each active node runs 4-5 virtual containers (based on http://linux-vserver.org), and all 40k accounts are spread across these 8 containers. This has the advantage that we can quickly move a whole container from one storage machine to another without dsync if there is not enough space on some node.
- 2 MySQL master/master containers to store user information, which is then cached by Dovecot itself (see the auth cache sketch below); this greatly reduces database load.
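The caching on the Dovecot side is just the built-in auth cache, something like this (values are illustrative only, not our exact settings):

auth_cache_size = 10M
auth_cache_ttl = 1 hour
auth_cache_negative_ttl = 1 hour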
All servers (proxies, relay servers, Dovecot, MySQL) are containers, so we can move them around to different hardware without changing any configuration. But this happens rarely.
Dovecot uses the mdbox storage format with compression enabled; no problems yet. Index and mdbox files are stored on different mount points, which lets us move them easily to different spindles if we need to. In the future we plan to store indexes on SSDs and mdbox files on SATA drives, since in fact most of the IO happens on the index files and disk space usage keeps growing.
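In config terms that boils down to something like the following (paths and compression level are placeholders rather than our exact settings):

mail_location = mdbox:/var/vmail/%d/%n:INDEX=/var/indexes/%d/%n
mail_plugins = $mail_plugins zlib
plugin {
  zlib_save = gz         # compress newly saved mails with gzip
  zlib_save_level = 6
}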
As mentioned above, this is not a big setup, but for our needs it works very well and is stable. It saves us money and the problems that come with NFS, SANs, etc., and it can be scaled out very easily.
Regards Urban
participants (8): Ed W, list@airstreamcomm.net, Noel Butler, Stan Hoeppner, Thierry de Montaudry, Tigran Petrosyan, Timo Sirainen, Urban Loesch