[Dovecot] So, what about clustering and load balancing?
stan at hardwarefreak.com
Sun Feb 14 09:16:34 EET 2010
Eric Rostetter put forth on 2/13/2010 11:02 PM:
> Which is the same folder/file the imap reads. So you have the MTA writing
> to the folder/file, and the dovecot server accessing the same, and hence
> you have lock contention.
No, that's not really correct. Creation, reads, and writes are at the file
level, not directory level. TTBOMK none of the MTAs, nor Dovecot, nor any
application for that matter, locks an entire directory just to write a file. If
locking an entire directory is even possible, I've never heard of such a thing.
If you use mbox files and local delivery via your MTA, yes, you can get write
lock contention on the /var/mail/%uname file when new mail arrives at the same
time dovecot is removing a message from the file that the user has moved to an
imap folder. I've never used dovecot deliver so I can't say if it is any
different in this regard.
Dovecot usually only deletes messages from (and compacts) the /var/mail/%uname
file when a user logs off, so the probability is very low for this write lock
contention scenario. More common would be a read contention when imap attempts
to read /var/mail/%uname and the MTA has this file locked for write access due
to new mail delivery. In this case, there is absolutely no difference between a
standalone host or clustered hosts, because only two processes are trying to
lock a single file and thus the lock contention only affects one user. The user
will see no negative impact anyway. All s/he will see is a new email in the
inbox. S/he has no idea it was delayed a few milliseconds due to a file lock.
If the MUA is set to check for new mail every x seconds, the potential for
/var/mail/%user lock contention is once every x seconds, unless a user has
multiple MUAs or webmail hitting the same account.
If you use maildir format mailboxen, you physically can't run into a write lock
contention between the MTA and the imap process because the MTA writes every new
email to a new file name. The imap server doesn't know it is there until it is
created. The MTA never writes the same file name twice, so the potential for
lock contention is 0.
This is one of the reasons maildir format is popular, especially for clusters
using shared NFS storage.
> If the MTA is delivering to an mbox file, and dovecot is reading/writing
> to the same mbox file, and the MTA and dovecot are on different machines,
> then you DO have lock contention. For maildir, less so (to the point
> it is negligible and not worth talking about really).
You have the same lock contention if both the MTA and dovecot are on the same
host. The only difference in the clustered case is that notification of the
lock takes longer, as the communication happens over ethernet instead of shared
memory. One could use infiniband or myrinet instead, but that would be overkill
(in speed and cost) for clustered mail/imap systems.
See my above comment regarding how infrequently dovecot takes write locks on the
/var/mail/%user new mail file. Dovecot typically performs no writes to this
file until the user logs off. If you were to put a counter on locks to a single
user's /var/mail/%user mailbox file you'd be surprised how few lock contentions
actually occur. Remember these locks are per user/process/per file.
Agreed WRT maildir. See my comments above.
>>> Also there may be other things to cause lock contention like backups,
>>> admin cron jobs to check things, etc.
>> You have all these things with a non clustered filesystem and have to
>> deal with
>> them there anyway, so there's really no difference is there?
> No, not really...
Exactly. The only difference here is that the processes that create lock
contention run on multiple hosts in a cluster setup. You could run the same
workload on one fat SMP box and the number of locks would be the same. The fact
that one is using a cluster doesn't in itself generate (more) lock contention on
files. It's the workload that generates the file locks. Imap with maildir is
extremely amenable to a clustered filesystem because locks are pretty much
non-existent.
> See the archives for details... Basically a 3 node cluster doing GFS
> over DRBD on 2 nodes in an active-active setup.. Those two nodes do MTA
Aha! Ok, now I see why we're looking at this from slightly different
perspectives. DRBD has extremely high overhead compared to a real clustered
filesystem solution with SAN storage.
> dovecot (with DRBD+GFS), the 3rd does webmail only (without DRBD/GFS).
> They hold only the INBOX mbox files, not the other folders which are
> in the user's home directory. Home directories are stored on a separate
> 2-node cluster running DRBD in an active-passive setup using ext3.
> The mail servers are front-ended by a 2-node active-passive cluster
> (shared nothing) which directs all MTA/dovecot/httpd/etc traffic. I
> use perdition to do dovecot routing via LDAP, the MTA can also do routing
> via LDAP, and I use pound to do httpd routing.
> Right now it is just a few nodes, but it can grow if needed to really any
> number of nodes (using GNBD to scale the GFS where needed, and the
> MTA/Dovecot/httpd routing already in place). Right now, it rocks, and
> the only thing we plan to scale out is the webmail part (which is easy
> as it doesn't involve any drbd/gfs/mta/dovecot, just httpd and sql).
My $deity that is an unnecessarily complicated HA setup. You could have an FC
switch, a 14 x 10K rpm 300GB SAN array and FC HBAs for all your hosts for about
$20K or a little less. Make it less than $15K for an equivalent iSCSI setup.
With this setup any cluster host would only write to disk once and all machines
have it "locally". No need to replicate disk blocks. Add 6 more nodes, connect
them to the shared filesystem LUN on the array, no need to replicate data. Just
mount the LUN filesystem in the appropriate path location on each new server,
and "it's just there".
Your setup sounds ok for 2 host active/active failover, but how much will your
"dedicated I/O block duplication network" traffic increase if you add 2 more
nodes? Hint: it's not additive, it's multiplicative. Your 2 DRBD duplication
streams turn into 12 streams going from 2 cluster hosts to 4. Add in a pair of
GNBD servers and you're up to 20 replication streams, if my math is correct, and
that's assuming the 4 cluster servers hit one GNBD server and not the other.
Assuming the two GNBD servers would replicate to one another, it would be
redundant to replicate the 4 cluster hosts to the 2nd GNBD server.
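My arithmetic above, spelled out. This assumes full-mesh replication between
the cluster nodes plus one stream from each cluster node to each GNBD server
(with the GNBD servers replicating to each other separately), which is my
reading of the scenario, not a statement about how DRBD must be configured:

```python
def replication_streams(cluster_nodes, gnbd_servers=0):
    # Full mesh between cluster nodes: each node pushes its writes to
    # every other node, so n nodes generate n * (n - 1) streams.
    streams = cluster_nodes * (cluster_nodes - 1)
    # Each cluster node also replicates to each GNBD server.
    streams += cluster_nodes * gnbd_servers
    return streams

replication_streams(2)     # the original 2-node setup: 2 streams
replication_streams(4)     # grow to 4 nodes: 12 streams
replication_streams(4, 2)  # add a pair of GNBD servers: 20 streams
```

That n * (n - 1) term is the "multiplicative, not additive" growth: doubling
the node count from 2 to 4 sextuples the replication traffic.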
An inexpensive SAN will outperform this setup by leaps and bounds, and eliminate
a boat load of complexity. You might want to look into it.