Eric Rostetter put forth on 2/13/2010 11:02 PM:
Which is the same folder/file the imap reads. So you have the MTA delivering to the folder/file, and the dovecot server accessing the same, and hence you have lock contention.
No, that's not really correct. Creation, reads, and writes are locked at the file level, not the directory level. TTBOMK none of the MTAs, nor Dovecot, nor any application for that matter, locks an entire directory just to write a file. If locking an entire directory is even possible, I've never heard of such a thing.
If you use mbox files and local delivery via your MTA, yes, you can get write lock contention on the /var/mail/%user file when new mail arrives at the same time Dovecot is removing a message from the file because the user has moved it to an imap folder. I've never used dovecot deliver, so I can't say whether it is any different in this regard.
Dovecot usually only deletes messages from / compacts the /var/mail/%user file when a user logs off, so the probability of this write lock contention scenario is very low. More common would be read contention, when imap attempts to read /var/mail/%user while the MTA has the file locked for write access due to new mail delivery. In this case there is absolutely no difference between a standalone host and clustered hosts, because only two processes are trying to lock a single file, so the lock contention affects only one user. The user will see no negative impact anyway. All s/he will see is a new email in the inbox. S/he has no idea it was delayed a few milliseconds by a file lock. If the MUA is set to check for new mail every 60 seconds, the potential for /var/mail/%user lock contention is once every 60 seconds, unless the user has multiple MUAs or webmail hitting the same account.
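To make that concrete, here's a rough Python sketch, not Dovecot's or any MTA's actual code: the path is made up and the flock-style locking is just illustrative (real mbox handling usually combines dot-locks and fcntl locks). It shows why two processes touching one mbox file have to wait on each other:

    import fcntl
    import os

    MBOX = "/var/mail/someuser"  # hypothetical spool file, for illustration only

    def deliver(raw_message: bytes) -> None:
        """MTA-style delivery sketch: exclusive lock, append, unlock."""
        with open(MBOX, "ab") as f:
            fcntl.flock(f, fcntl.LOCK_EX)   # blocks while the imap side holds a lock
            try:
                f.write(raw_message)
                f.flush()
                os.fsync(f.fileno())
            finally:
                fcntl.flock(f, fcntl.LOCK_UN)

    def read_inbox() -> bytes:
        """imap-side read sketch: shared lock, read, unlock."""
        with open(MBOX, "rb") as f:
            fcntl.flock(f, fcntl.LOCK_SH)   # blocks during an in-progress delivery
            try:
                return f.read()
            finally:
                fcntl.flock(f, fcntl.LOCK_UN)

Whichever process gets there second simply sleeps for the few milliseconds the other one needs; that's the "delay" the user never notices.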
If you use maildir format mailboxen, you physically can't run into write lock contention between the MTA and the imap process because the MTA writes every new email to a new file name. The imap server doesn't know it is there until it is created. The MTA never writes the same file name twice, so the potential for lock contention is zero.
See: http://www.qmail.org/qmail-manual-html/man5/maildir.html
This is one of the reasons maildir format is popular, especially for clusters using shared NFS storage.
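A stripped-down sketch of that delivery scheme (the unique-name recipe here is a simplified version of what the maildir man page describes, and the function name is mine, not any MTA's):

    import os
    import socket
    import time

    def deliver_to_maildir(maildir: str, raw_message: bytes) -> str:
        """Write the message to tmp/ under a name no other delivery will
        ever pick, then rename it into new/.  No locks needed: two
        deliveries never share a file name, and the imap server only
        sees the message once the rename makes it appear in new/."""
        # Simplified unique name: timestamp + pid + hostname.
        unique = "%d.%d.%s" % (time.time(), os.getpid(), socket.gethostname())
        tmp_path = os.path.join(maildir, "tmp", unique)
        new_path = os.path.join(maildir, "new", unique)

        with open(tmp_path, "wb") as f:
            f.write(raw_message)
            f.flush()
            os.fsync(f.fileno())

        os.rename(tmp_path, new_path)  # atomic within one filesystem
        return new_path

Real deliveries add more entropy to the unique name (microseconds, a per-process counter), but the no-shared-filename property is the whole point.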
If the MTA is delivering to an mbox file, and dovecot is reading/writing to the same mbox file, and the MTA and dovecot are on different machines, then you DO have lock contention. For maildir, less so (to the point it is negligible and not worth talking about really).
You have the same lock contention if both the MTA and dovecot are on the same host. The only difference in the clustered case is that notification of the lock takes longer, since the communication happens over ethernet instead of shared memory. One could use infiniband or myrinet instead, but that would be overkill (in both speed and cost) for clustered mail/imap systems.
See my comment above regarding how infrequently dovecot takes write locks on the /var/mail/%user new mail file. Dovecot typically performs no writes to this file until the user logs off. If you were to put a counter on locks to a single user's /var/mail/%user mailbox file you'd be surprised how few lock contentions actually occur. Remember, these locks are per user, per process, per file.
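If anyone actually wants that counter, here's a quick and dirty sketch. It's Linux-only, the path is hypothetical, and it just polls /proc/locks and counts entries that point at one mbox file's inode; blocked waiters, i.e. real contention, show up there with a "->" marker:

    import os

    SPOOL = "/var/mail/someuser"   # hypothetical mbox file to watch

    def locks_on(path: str) -> tuple:
        """Return (held, waiting) lock counts for this file's inode, read
        from /proc/locks (device is shown as hex major:minor, inode in
        decimal, e.g. '08:01:137180')."""
        st = os.stat(path)
        target = "%02x:%02x:%d" % (os.major(st.st_dev), os.minor(st.st_dev), st.st_ino)
        held = waiting = 0
        with open("/proc/locks") as f:
            for line in f:
                fields = line.split()
                if target not in fields:
                    continue
                if "->" in fields:       # a blocked waiter: actual contention
                    waiting += 1
                else:
                    held += 1
        return held, waiting

    if __name__ == "__main__":
        print("held/waiting locks on %s: %s" % (SPOOL, locks_on(SPOOL)))

Run it in a loop for a day and the waiting count will stay very close to zero for a typical single user's inbox.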
Agreed WRT maildir. See my comments above.
Also there may be other things to cause lock contention like backups, admin cron jobs to check things, etc.
You have all these things with a non-clustered filesystem and have to deal with them there anyway, so there's really no difference, is there?
No, not really...
Exactly. The only difference here is that the processes that create lock contention run on multiple hosts in a cluster setup. You could run the same workload on one fat SMP box and the number of locks would be the same. The fact that one is using a cluster doesn't in itself generate (more) lock contention on files. It's the workload that generates the file locks. Imap with maildir is extremely amenable to a clustered filesystem because locks are pretty much nonexistent.
See the archives for details... Basically a 3 node cluster doing GFS over DRBD on 2 nodes in an active-active setup. Those two nodes do MTA and
Aha! Ok, now I see why we're looking at this from slightly different perspectives. DRBD has extremely high overhead compared to a real clustered filesystem solution with SAN storage.
dovecot (with DRBD+GFS), the 3rd does webmail only (without DRBD/GFS). They hold only the INBOX mbox files, not the other folders which are in the user's home directory. Home directories are stored on a separate 2-node cluster running DRBD in an active-passive setup using ext3. The mail servers are front-ended by a 2-node active-passive cluster (shared nothing) which directs all MTA/dovecot/httpd/etc traffic. I use perdition to do dovecot routing via LDAP, the MTA can also do routing via LDAP, and I use pound to do httpd routing.
Right now it is just a few nodes, but it can grow if needed to really any number of nodes (using GNBD to scale the GFS where needed, and the MTA/Dovecot/httpd routing already in place). Right now, it rocks, and the only thing we plan to scale out is the webmail part (which is easy as it doesn't involve any drbd/gfs/mta/dovecot, just httpd and sql).
My $deity, that is an unnecessarily complicated HA setup. You could have an FC switch, a 14 x 10K rpm 300GB drive SAN array, and FC HBAs for all your hosts for about $20K or a little less; make it less than $15K for an equivalent iSCSI setup. With that setup any cluster host writes each block to disk only once and all machines see it "locally". No need to replicate disk blocks. Add 6 more nodes, connect them to the shared filesystem LUN on the array: no need to replicate data. Just mount the LUN filesystem at the appropriate path on each new server, and "it's just there".
Your setup sounds ok for 2-host active/active failover, but how much will your "dedicated I/O block duplication network" traffic increase if you add 2 more nodes? Hint: it's not additive, it's multiplicative. Your 2 DRBD duplication streams turn into 12 streams going from 2 cluster hosts to 4. Add in a pair of GNBD servers and you're up to 20 replication streams, if my math is correct, and that's assuming the 4 cluster servers hit one GNBD server and not the other. Assuming the two GNBD servers would replicate to one another, it would be redundant to replicate the 4 cluster hosts to the 2nd GNBD server.
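To spell out the arithmetic behind the 2 -> 12 jump (assuming, as I did above, that every replicating node has to push its writes to every other replicating node, i.e. a full mesh; the extra GNBD streams depend on the exact topology you pick):

    def full_mesh_streams(nodes: int) -> int:
        """One-way replication streams when each node pushes its
        writes to every other node: n * (n - 1)."""
        return nodes * (nodes - 1)

    for n in (2, 3, 4, 6):
        print("%d nodes -> %2d streams" % (n, full_mesh_streams(n)))
    # 2 nodes ->  2 streams
    # 3 nodes ->  6 streams
    # 4 nodes -> 12 streams
    # 6 nodes -> 30 streams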
An inexpensive SAN will outperform this setup by leaps and bounds, and eliminate a boat load of complexity. You might want to look into it.
-- Stan