[Dovecot] So, what about clustering and load balancing?
Hello world,
So, with beta2 of Dovecot 2.0 now available, what is the preferred way to achieve load balancing and fault tolerance? As far as I can see, there are basically two options:
Use HA shared storage, export either a cluster filesystem or NFS, and have your dovecot servers mount that filesystem. Load balance these servers (Cisco ACE, ldirectord, ...) and there you go. Overall performance is limited by the speed of the shared storage and may decrease further due to locking issues, poorly performing cluster filesystems and so on.
Set up every dovecot server with local storage and trigger a dsync after each change (pyinotify, incron, parsing LMTP delivery logs, ...). Load balance these servers as before. Depending on the number of syncs that have to be done, mailbox replication may lag behind. The need to constantly spawn dsync processes may further decrease performance if you don't do it right (stickiness of established connections, replicating at most every xyz seconds). A rough sketch of this approach follows at the end of this message.
What are your opinions on that matter?
Stefan
P.S.: I've set up option two in a test setup, though the incron/inotify part is still giving me a headache.
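For illustration, here is a minimal sketch of that inotify-triggered approach using pyinotify. The mail root layout, the replica host, and the exact dsync command line are placeholders that would need adapting to a real setup, and it naively spawns one dsync per event, which is exactly the "don't do it right" problem mentioned above:

#!/usr/bin/env python
# Sketch only: watch local maildirs and run dsync when a user's mail
# changes. Paths, the replica host and the dsync invocation are
# placeholders, not a tested configuration.
import subprocess
import pyinotify

MAIL_ROOT = "/srv/mail"  # assumed layout: /srv/mail/<user>/Maildir/...

class MailChanged(pyinotify.ProcessEvent):
    def process_default(self, event):
        # First path component under MAIL_ROOT is taken to be the user.
        rel = event.pathname[len(MAIL_ROOT):].lstrip("/")
        user = rel.split("/", 1)[0]
        if not user:
            return
        # Naive: one dsync per event. A real setup would debounce this.
        subprocess.call(["dsync", "-u", user, "mirror",
                         "ssh", "replica.example.com",
                         "dsync", "-u", user])

wm = pyinotify.WatchManager()
mask = pyinotify.IN_CLOSE_WRITE | pyinotify.IN_MOVED_TO | pyinotify.IN_DELETE
notifier = pyinotify.Notifier(wm, MailChanged())
wm.add_watch(MAIL_ROOT, mask, rec=True, auto_add=True)
notifier.loop()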
On 13.2.2010, at 17.48, Stefan Foerster wrote:
- Use HA shared storage, export either a cluster filesystem or NFS, and have your dovecot servers mount that filesystem. Load balance these servers (Cisco ACE, ldirectord, ...) and there you go.
With NFS, if you want to avoid random errors, you still need to do the load balancing in a way that a user's mails are never accessed from two servers at the same time. See the recent "quick question" thread about this. Cluster filesystems should be able to handle this better, although for performance it's probably still a good idea to do this.
- Set up every dovecot server with local storage and trigger a dsync after each change (pyinotify, incron, parsing LMTP delivery logs, ...). Load balance these servers as before. Depending on the number of syncs that have to be done, mailbox replication may lag behind. The need to constantly spawn dsync processes may further decrease performance if you don't do it right (stickiness of established connections, replicating at most every xyz seconds).
The above solution would require some new development: a daemon that keeps track of changes and dsyncs users when necessary. The fastest way would probably be to create a plugin that uses the notify plugin's events and sends them to the daemon.
dsync process creation could be avoided by making dsync another service that listens on a unix socket. Wouldn't be a big change. v2.0 makes this really easy :)
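To make that concrete, a rough sketch of such a daemon; the socket path, the line protocol (one username per line), the debounce interval and the dsync target are all assumptions rather than anything Dovecot provides:

#!/usr/bin/env python
# Sketch of a change-tracking daemon: a notify plugin (or anything else)
# writes a username per line to a unix socket; the daemon coalesces the
# events and dsyncs each user at most once per MIN_INTERVAL seconds.
import os, socket, subprocess, threading, time

SOCK_PATH = "/var/run/dsync-notify.sock"  # assumed path
MIN_INTERVAL = 30  # seconds between syncs for the same user
pending = set()
last_sync = {}
lock = threading.Lock()

def sync_loop():
    while True:
        time.sleep(1)
        with lock:
            now = time.time()
            due = [u for u in pending
                   if now - last_sync.get(u, 0) >= MIN_INTERVAL]
            for u in due:
                pending.discard(u)
                last_sync[u] = now
        for user in due:
            # Hypothetical replica target; adapt to your peer server.
            subprocess.call(["dsync", "-u", user, "mirror",
                             "ssh", "replica.example.com",
                             "dsync", "-u", user])

def main():
    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(SOCK_PATH)
    srv.listen(16)
    t = threading.Thread(target=sync_loop)
    t.daemon = True
    t.start()
    while True:
        conn, _ = srv.accept()
        user = conn.makefile().readline().strip()
        conn.close()
        if user:
            with lock:
                pending.add(user)

if __name__ == "__main__":
    main()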
Also 3) Use DRBD server pairs.
Timo Sirainen put forth on 2/13/2010 11:40 AM:
On 13.2.2010, at 17.48, Stefan Foerster wrote:
- Use HA shared storage, export either a cluster filesystem or NFS, and have your dovecot servers mount that filesystem. Load balance these servers (Cisco ACE, ldirectord, ...) and there you go.
With NFS, if you want to avoid random errors, you still need to do the load balancing in a way that a user's mails are never accessed from two servers at the same time. See the recent "quick question" thread about this. Cluster filesystems should be able to handle this better, although for performance it's probably still a good idea to do this.
Timo, are you saying cluster filesystems aren't suitable for clustered dovecot storage for performance reasons?
I've managed a 500 user environment where *EVERYTHING* was hitting two 2Gb/s FC SAN storage arrays, one IBM and one Nexsan, through a single Qlogic FC switch, and these were fairly low end FC arrays. We had 4 Citrix blades, 3 ESX blades, and an Exchange blade. On the ESX blades we were running two AD DCs plus Windows CIFS serving, a LAMP Moodle database server, a dedicated Linux syslog server, an MS SQL server for Citrix licensing and the organization's accounting application, a Citrix load balancing director server, a Novell ZENworks server, a Novell iFolder server, a Linux print server, etc, etc.
All the ESX VM guest storage resided on the SAN, and thus they all booted from SAN. All bare metal blades booted from SAN as well. The only blades with onboard disks were the ESX blades, which booted the VMware kernel from mirrored local disks. Now, granted, none of these systems were sharing the same filesystem on exported LUNs or sharing metadata over ethernet. But the overall load on the SAN controllers was fairly high.
I can't see how the metadata sharing of, say, GFS2 is going to create any serious performance impact on a cluster of dovecot servers using GFS2 and a shared SAN array, especially if using maildir. If the load balancing is implemented correctly and a given user is only hitting one dovecot server at any one point in time, there should be few, if any, shared file locks. Thus, no negative impact due to shared locking.
A single one of these: http://www.nexsan.com/sataboy/tech.php
configured with 2GB of cache and 14 x 300GB 10K rpm SATA drives set up as a single RAID 10 would yield ~2.1TB of fast redundant mail storage and ~370MB/s of (controller limited) read throughput using a single controller. Double that figure if you go with active-active dual controllers. Sustained IOPS with 14 x 10K rpm drives is about 30K, which is not bad at all considering this is the lowest end array Nexsan sells. Double that to 60K IOPS for dual controllers.
I'd bet I could take this single low end Nexsan array, a low end 8 port Qlogic 4Gb FC switch, and 6 average dual socket servers with 8GB of RAM each and a single port 4Gb FC adapter, a decent gigabit ethernet switch, using Linux with GFS2 and Dovecot, and build a cluster that could easily handle a few hundred to a few thousand concurrent IMAP users and 30K+ total mailboxen, assuming a 500MB mailbox limit per user. This hardware configuration should be attainable for well less than $50K USD. That doesn't include the front end load balancer.
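(For the curious, the capacity and throughput figures above are straightforward arithmetic from the drive count and the per-controller limits claimed in this post; a quick check:)

# Quick sanity check of the sizing arithmetic above; the per-controller
# throughput and IOPS limits are the figures claimed in this post.
drives, drive_gb = 14, 300
usable_tb = (drives // 2) * drive_gb / 1000.0  # RAID 10: half the spindles hold data
print("usable: %.1f TB" % usable_tb)           # 2.1 TB

per_ctrl_mb_s, per_ctrl_iops = 370, 30000
print("dual controller: ~%d MB/s, ~%dK IOPS"
      % (per_ctrl_mb_s * 2, per_ctrl_iops * 2 // 1000))  # ~740 MB/s, ~60K IOPS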
-- Stan
On 14.2.2010, at 3.31, Stan Hoeppner wrote:
With NFS, if you want to avoid random errors, you still need to do the load balancing in a way that a user's mails are never accessed from two servers at the same time. See the recent "quick question" thread about this. Cluster filesystems should be able to handle this better, although for performance it's probably still a good idea to do this.
I can't see how the metadata sharing of, say, GFS2 is going to create any serious performance impact on a cluster of dovecot servers using GFS2 and a shared SAN array, especially if using maildir. If the load balancing is implemented correctly and a given user is only hitting one dovecot server at any one point in time, there should be few, if any, shared file locks. Thus, no negative impact due to shared locking.
I think that's what I said above, or at least tried to... Well, looks like I missed a "not" there: "good idea not to do this".
On 14.2.2010, at 3.31, Stan Hoeppner wrote:
I can't see how the metadata sharing of, say, GFS2 is going to create any serious performance impact on a cluster of dovecot servers using GFS2 and a shared SAN array, especially if using maildir.
While I use shared GFS on a cluster with dovecot for mbox and think the performance rocks, this really depends on the setup (mbox may be worse than maildir, how much hardware you have for it, how you load balance, etc.) and what you consider good/bad performance (what is fast to me might be slow to you), as well as, of course, scale (it might work for 2K users, but what about 200K or 1.5M?).
If the load balancing is implemented correctly and a given user is only hitting one dovecot server at any one point in time, there should be few, if any, shared file locks. Thus, no negative impact due to shared locking.
This ignores the delivery of mail to the user (again, not so bad for maildir but a killer for mbox). If the delivery happens on a separate box from dovecot, you can have lock contention...
Also there may be other things to cause lock contention like backups, admin cron jobs to check things, etc.
Anyway, I run dovecot in a cluster without any issue, but that is because of my client base and performance expectations (and some real nice hardware).
-- Eric Rostetter, The Department of Physics, The University of Texas at Austin
Go Longhorns!
Eric Rostetter put forth on 2/13/2010 8:39 PM:
This ignores the delivery of mail to the user (again, not so bad for maildir but a killer for mbox). If the delivery happens on a separate box from dovecot, you can have lock contention...
You attach the inbound MTA to the FC switch, export the LUN with the GFS2 filesystem, and drop new mail to the appropriate folder(s). The dovecot cluster machines pick it up just as if it were on a local filesystem. This can be done very easily with mbox or maildir, and there's no more potential for lock contention than with the imap files.
Also there may be other things to cause lock contention like backups, admin cron jobs to check things, etc.
You have all these things with a non-clustered filesystem and have to deal with them there anyway, so there's really no difference, is there? Is this much different from an IBM P595 with 64 Power6 5 GHz cores, 1TB of RAM, and 100TB of FAStT FC disk arrays, running a local inbound MTA (postfix) and tens of thousands of imap processes handling concurrent users for a few million imap mailboxen? The only difference is that lock data travels the wire outside the box in a cluster setup, instead of through shared memory on the big SMP. You'll still have locking contention during backup etc., and probably more of it given the scale of this example. GigE is plenty fast enough to carry the small extra load of the cluster fs metadata. In fact, if the load balancing is implemented well, there will be very little locking contention at all, even during backup.
Anyway, I run dovecot in a cluster without any issue, but that is because of my client base and performance expectations (and some real nice hardware).
Tell us more about your hardware setup if you don't mind. General info, a brief few lines would be fine: load balancer, number/config of server nodes, SAN switch, storage arrays, software setup. Also, when do you plan to move to GFS2?
Thanks.
-- Stan
Quoting Stan Hoeppner <stan@hardwarefreak.com>:
Eric Rostetter put forth on 2/13/2010 8:39 PM:
This ignores the delivery of mail to the user (again, not so bad for maildir but a killer for mbox). If the delivery happens on a separate box from dovecot, you can have lock contention...
You attach the inbound MTA to the FC switch, export the LUN with the GFS2 filesystem, and drop new mail to the appropriate folder(s).
Which is the same folder/file the imap reads. So you have the MTA delivering to the folder/file, and the dovecot server accessing the same, and hence you have lock contention.
The dovecot cluster machines pick it up just as if it were on a local filesystem. This can be done very easily with mbox or maildir, and there's no more potential for lock contention than with the imap files.
If the MTA is delivering to an mbox file, and dovecot is reading/writing to the same mbox file, and the MTA and dovecot are on different machines, then you DO have lock contention. For maildir, less so (to the point it is negligible and not worth talking about really).
Also there may be other things to cause lock contention like backups, admin cron jobs to check things, etc.
You have all these things with a non-clustered filesystem and have to deal with them there anyway, so there's really no difference, is there?
No, not really...
Anyway, I run dovecot in a cluster without any issue, but that is because of my client base and performance expectations (and some real nice hardware).
Tell us more about your hardware setup if you don't mind.
See the archives for details... Basically a 3 node cluster doing GFS over DRBD on 2 nodes in an active-active setup... Those two nodes do MTA and dovecot (with DRBD+GFS), the 3rd does webmail only (without DRBD/GFS). They hold only the INBOX mbox files, not the other folders, which are in the user's home directory. Home directories are stored on a separate 2-node cluster running DRBD in an active-passive setup using ext3. The mail servers are front-ended by a 2-node active-passive cluster (shared nothing) which directs all MTA/dovecot/httpd/etc traffic. I use perdition to do dovecot routing via LDAP, the MTA can also do routing via LDAP, and I use pound to do httpd routing.
Right now it is just a few nodes, but it can grow if needed to really any number of nodes (using GNBD to scale the GFS where needed, and the MTA/Dovecot/httpd routing already in place). Right now, it rocks, and the only thing we plan to scale out is the webmail part (which is easy as it doesn't involve any drbd/gfs/mta/dovecot, just httpd and sql).
Also, when do you plan to move to GFS2?
Sorry, it is GFS2... Has been since I set it up... In fact, we delayed the project until GFS2 was available (I think it was like 6 months or something).
Thanks.
-- Stan
-- Eric Rostetter, The Department of Physics, The University of Texas at Austin
Go Longhorns!
Eric Rostetter put forth on 2/13/2010 11:02 PM:
Which is the same folder/file the imap reads. So you have the MTA delivering to the folder/file, and the dovecot server accessing the same, and hence you have lock contention.
No, that's not really correct. Creation, reads, and writes are at the file level, not the directory level. TTBOMK none of the MTAs, nor Dovecot, nor any application for that matter, locks an entire directory just to write a file. If locking an entire directory is even possible, I've never heard of such a thing.
If you use mbox files and local delivery via your MTA, yes, you can get write lock contention on the /var/mail/%user file when new mail arrives at the same time dovecot is removing a message from the file that the user has moved to an imap folder. I've never used dovecot deliver so I can't say if it is any different in this regard.
Dovecot usually only deletes messages from, or compacts, the /var/mail/%user file when a user logs off, so the probability of this write lock contention scenario is very low. More common would be read contention, when imap attempts to read /var/mail/%user while the MTA has the file locked for write access due to new mail delivery. In this case there is absolutely no difference between a standalone host and clustered hosts, because only two processes are trying to lock a single file, and thus the lock contention only affects one user. The user will see no negative impact anyway: all s/he will see is a new email in the inbox. S/he has no idea it was delayed a few milliseconds by a file lock. If the MUA is set to check for new mail every x seconds, the potential for /var/mail/%user lock contention is once per check interval, unless a user has multiple MUAs or webmail hitting the same account.
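To make the contention concrete, a minimal illustration of the exclusive lock both sides take before touching the same mbox file. The path is a placeholder, and real MTAs and Dovecot typically combine fcntl locks with dotlocks; this only sketches the principle:

# Both the delivering MTA and the imap process must hold an exclusive
# lock on the same /var/mail/<user> file before writing; whichever gets
# here second simply blocks until the first releases it.
import fcntl

def append_locked(mbox_path, data):
    with open(mbox_path, "ab") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)  # blocks while the other side holds it
        try:
            f.write(data)
            f.flush()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)

append_locked("/var/mail/jane",
              b"From sender@example.com Mon Feb 15 08:00:00 2010\n"
              b"Subject: hi\n\nbody\n\n")

On GFS2 such POSIX locks are enforced cluster-wide through the DLM, which is why the picture looks the same whether the two processes share a host or not.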
If you use maildir format mailboxen, you physically can't run into a write lock contention between the MTA and the imap process because the MTA writes every new email to a new file name. The imap server doesn't know it is there until it is created. The MTA never writes the same file name twice, so the potential for lock contention is 0.
See: http://www.qmail.org/qmail-manual-html/man5/maildir.html
This is one of the reasons the maildir format is popular, especially for clusters using shared NFS storage.
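The delivery sequence the maildir spec (linked above) prescribes, sketched with a simplified uniqueness recipe; real deliveries add more entropy to the file name, but the shape is the same:

# Lock-free maildir delivery: write under a unique name in tmp/, then
# atomically link() it into new/. Readers only ever see complete files,
# so there is nothing for the MTA and imap server to contend over.
import os, socket, time

def maildir_deliver(maildir, message_bytes):
    # Simplified unique name: time.pid.host (the spec adds more entropy).
    unique = "%d.%d.%s" % (time.time(), os.getpid(), socket.gethostname())
    tmp_path = os.path.join(maildir, "tmp", unique)
    new_path = os.path.join(maildir, "new", unique)
    with open(tmp_path, "wb") as f:
        f.write(message_bytes)
        f.flush()
        os.fsync(f.fileno())
    os.link(tmp_path, new_path)  # atomic: message appears fully written
    os.unlink(tmp_path)

maildir_deliver("/srv/mail/jane/Maildir",
                b"From: sender@example.com\r\n\r\nhello")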
If the MTA is delivering to an mbox file, and dovecot is reading/writing to the same mbox file, and the MTA and dovecot are on different machines, then you DO have lock contention. For maildir, less so (to the point it is negligible and not worth talking about really).
You have the same lock contention if both the MTA and dovecot are on the same host. The only difference is that in the clustered case, notification of the lock takes longer, as this communication has to take place over ethernet instead of shared memory. One could use InfiniBand or Myrinet instead, but that would be overkill (speed and cost) for clustered mail/imap systems.
See my above comment regarding how infrequently dovecot takes write locks on the /var/mail/%user new mail file. Dovecot typically performs no writes to this file until the user logs off. If you were to put a counter on locks to a single user's /var/mail/%user mailbox file, you'd be surprised how few lock contentions actually occur. Remember, these locks are per user, per process, per file.
Agreed WRT maildir. See my comments above.
Also there may be other things to cause lock contention like backups, admin cron jobs to check things, etc.
You have all these things with a non clustered filesystem and have to deal with them there anyway, so there's really no difference is there?
No, not really...
Exactly. The only difference here is that the processes that create lock contention run on multiple hosts in a cluster setup. You could run the same workload on one fat SMP and the number of locks would be the same. The fact that one is using a cluster doesn't in itself generate more lock contention on files. It's the workload that generates the file locks. Imap with maildir is extremely amenable to a clustered filesystem because locks are pretty much nonexistent.
See the archives for details... Basically a 3 node cluster doing GFS over DRBD on 2 nodes in an active-active setup... Those two nodes do MTA and
Aha! OK, now I see why we're looking at this from slightly different perspectives. DRBD has extremely high overhead compared to a real clustered filesystem solution with SAN storage.
dovecot (with DRBD+GFS), the 3rd does webmail only (without DRBD/GFS). They hold only the INBOX mbox files, not the other folders which are in the user's home directory. Home directories are stored on a separate 2-node cluster running DRBD in an active-passive setup using ext3. The mail servers are front-ended by a 2-node active-passive cluster (shared nothing) which directs all MTA/dovecot/httpd/etc traffic. I use perdition to do dovecot routing via LDAP, the MTA can also do routing via LDAP, and I use pound to do httpd routing.
Right now it is just a few nodes, but it can grow if needed to really any number of nodes (using GNBD to scale the GFS where needed, and the MTA/Dovecot/httpd routing already in place). Right now, it rocks, and the only thing we plan to scale out is the webmail part (which is easy as it doesn't involve any drbd/gfs/mta/dovecot, just httpd and sql).
My $deity, that is an unnecessarily complicated HA setup. You could have an FC switch, a 14 x 10K rpm 300GB SAN array and FC HBAs for all your hosts for about $20K or a little less. Make it less than $15K for an equivalent iSCSI setup. With this setup any cluster host writes to disk only once and all machines have it "locally". No need to replicate disk blocks. Add 6 more nodes, connect them to the shared filesystem LUN on the array, and there's still no need to replicate data. Just mount the LUN filesystem at the appropriate path on each new server, and "it's just there".
Your setup sounds OK for 2 host active/active failover, but how much will your "dedicated I/O block duplication network" traffic increase if you add 2 more nodes? Hint: it's not additive, it's multiplicative. Your 2 DRBD duplication streams turn into 12 streams going from 2 cluster hosts to 4. Add in a pair of GNBD servers and you're up to 20 replication streams, if my math is correct, and that's assuming the 4 cluster servers hit one GNBD server and not the other. Assuming the two GNBD servers replicate to one another, it would be redundant to replicate the 4 cluster hosts to the 2nd GNBD server.
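(A quick way to see the multiplicative growth, assuming the full-mesh replication described above: every host replicates to every other, so the stream count is n(n-1).)

# Full-mesh replication streams grow as n*(n-1), not linearly with n.
def mesh_streams(n):
    return n * (n - 1)

for n in (2, 3, 4, 6):
    print(n, "hosts ->", mesh_streams(n), "streams")  # 2->2, 4->12, 6->30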
An inexpensive SAN will outperform this setup by leaps and bounds, and eliminate a boat load of complexity. You might want to look into it.
-- Stan
On Sun, 2010-02-14 at 01:16 -0600, Stan Hoeppner wrote:
If you use maildir format mailboxen, you physically can't run into a write lock contention between the MTA and the imap process because the MTA writes every new email to a new file name.
You're ignoring dovecot-uidlist and index files.
Timo Sirainen put forth on 2/14/2010 1:31 PM:
On Sun, 2010-02-14 at 01:16 -0600, Stan Hoeppner wrote:
If you use maildir format mailboxen, you physically can't run into a write lock contention between the MTA and the imap process because the MTA writes every new email to a new file name.
You're ignoring dovecot-uidlist and index files.
Apparently I'm missing something. If the MTA is creating files per Maildir specs, I fail to see how dovecot imap is going to be reading or writing one of these new files before it's actually fully written. I can't see a scenario where there could be lock contention. The MTA writes the new message files and never touches them again. Does Dovecot try to read them before they're actually committed?
Is dovecot trying to open a new maildir file after the MTA calls stat() but before link()? If dovecot doesn't know of the existence of the file before the link() operation commits the file to the filesystem, then I can't see how there can ever be lock contention with maildir files.
Can you please explain Timo, so I understand my error here?
Thanks.
-- Stan
On 15.2.2010, at 8.13, Stan Hoeppner wrote:
Timo Sirainen put forth on 2/14/2010 1:31 PM:
On Sun, 2010-02-14 at 01:16 -0600, Stan Hoeppner wrote:
If you use maildir format mailboxen, you physically can't run into a write lock contention between the MTA and the imap process because the MTA writes every new email to a new file name.
You're ignoring dovecot-uidlist and index files.
Apparently I'm missing something. If the MTA is creating files per Maildir specs,
If the mails are delivered by the MTA or something else other than Dovecot's delivery agent, then there shouldn't be any locking contention. But normally, using Dovecot deliver should give better performance, and that (reads and) writes the dovecot-uidlist and dovecot.index* files, which IMAP/POP3 also read/write.
Also, I guess it depends on internals, but I'd think creating new files requires some kind of locking/synchronization for the directory, which is similar to locking contention (the filesystem can't report success for a file creation until it has verified that another server hasn't already created it).
Quoting Stan Hoeppner <stan@hardwarefreak.com>:
Eric Rostetter put forth on 2/13/2010 11:02 PM:
I'm bowing out of this discussion, as I was using words in an imprecise way, and it is clear that Stan is using them in a very precise way, and hence we're not really discussing the same thing...
My fault for not thinking/writing in precise, technical terms... I was basically introducing things like I/O contention and bandwidth issues into his thread, which was solely about actual lock contention...
Suffice it to say: yes, there can be lock contention, but no, it isn't really any worse than on an SMP machine... What I tend to think of informally as lock contention issues are really I/O contention and bandwidth issues.
You have the same lock contention if both the MTA and dovecot are on the same host. The only difference is that for the clustered case,
notification of the lock takes longer [...]
Yeah, which is what I was thinking of as lock contention, but really it is about I/O or bandwidth, and not lock contention... I was just being real sloppy with my language...
My $deity that is an unnecessarily complicated HA setup. You could
have an FC switch, a 14 x 10K rpm 300GB SAN array and FC HBAs for all your
hosts for about $20K or a little less. Make it less than $15K for an equivalent iSCSI setup.
My setup, since it reused some existing machines and needed only 4 new ones, was cheaper. And it supports iSCSI; I just chose not to use that for the email, as my opinion (right or wrong) was that DRBD+GFS local to the mail server would perform better than iSCSI back to my SAN cluster.
An inexpensive SAN will outperform this setup by leaps and bounds,
and eliminate a boat load of complexity. You might want to look
into it.
Well, essentially I do have an inexpensive SAN (see the description); I just chose not to use it for the e-mail, which may or may not have been a good idea, but it was what I decided to do.
I could always switch the email to an iSCSI connection in the future without any problem other than some very limited downtime to copy the files off the local disks to the SAN machines... (Very limited as I can move the mbox files one by one when they are not in use, using the routing mentioned to send dovecot to the proper location... That's actually how I migrated to this system, and I only had an issue with 13 users who ran non-stop imap connections and I had to actually disconnect them... All the other users had times they were not connected and I could migrate them without them ever noticing... Took a while, but it went completely unnoticed by all but the last 13 users...)
-- Stan
-- Eric Rostetter, The Department of Physics, The University of Texas at Austin
This message is provided "AS IS" without warranty of any kind, either expressed or implied. Use this message at your own risk.
I would suggest AFS as shared storage, with user-specific DNS-based load balancing (the user connects to 'username.imap.example.com'). This probably falls over with huge shared mailboxes, though.
If you give the IMAP servers sufficient AFS cache space, any data that needs to be read is sitting hot in the AFS cache, and all writes are duplicated back to the AFS server. The ability of AFS to take snapshots of live volumes is quite nice for backups: http://docs.openafs.org/AdminGuide/ch05s06.html
On Sat, Feb 13, 2010 at 04:48:07PM +0100, Stefan Foerster wrote:
Hello world,
so, with beta2 of Dovecot 2.0 being available, what is the preferred way to achieve load balancing and fault tolerance? As far as I can see it, there are basically two options:
Use HA shared storage, export either a cluster filesystem or NFS, and have your dovecot servers mount that filesystem. Load balance these servers (Cisco ACE, ldirectord, ...) and there you go. Overall performance is limited by the speed of the shared storage and may decrease further due to locking issues, poorly performing cluster filesystems and so on.
Set up every dovecot server with local storage and trigger a dsync after each change (pyinotify, incron, parsing LMTP delivery logs, ...). Load balance these servers as before. Depending on the number of syncs that have to be done, mailbox replication may lag behind. The need to constantly spawn dsync processes may further decrease performance if you don't do it right (stickiness of established connections, replicating at most every xyz seconds).
What are your opinions on that matter?
Stefan
P.S.: I've set up option two in a test setup, though the incron/inotify part is still giving me a headache.
--
Troy Benjegerdes 'da hozer' hozer@hozed.org
Unless hours were cups of sack, and minutes capons, and clocks the tongues of bawds, and dials the signs of leaping houses, and the blessed sun himself a fair, hot wench in flame-colored taffeta, I see no reason why thou shouldst be so superfluous to demand the time of the day. I wasted time and now doth time waste me. -- William Shakespeare
participants (6): Eric Jon Rostetter, Eric Rostetter, Stan Hoeppner, Stefan Foerster, Timo Sirainen, Troy Benjegerdes