David,
-----Original Message-----
From: dovecot-bounces+brandond=uoregon.edu@dovecot.org [mailto:dovecot-

Our physical setup is 10 CentOS 5.4 x86_64 IMAP/POP servers, all sharing the same NFS backend where the users' index, control, and Maildir files reside. Clients connect to these directly, alongside multiple SquirrelMail webservers and Pine users, all at the same time, with connections load balanced by a layer-4 switch.
Each server has an average of about 400 connections, for a total of around 4000 concurrent connections during a normal business day. This is out of a possible user population of about 15,000.
All our dovecot servers syslog to one machine, and on average I see about 50-75 instances of file corruption per day. I'm not counting each line, since some instances of corruption generate a log message for each uid that's wrong. This is just me counting "user A was corrupted once at 10:00, user B was corrupted at 10:25" for example.
We have a very similar setup: 8 POP/IMAP servers running RHEL 5.4, Dovecot 1.2.9 (+ patches), an F5 BigIP load balancer cluster (active/standby) in an L4 profile distributing connections round-robin, maildirs on two NetApp filers (clustered 3070s with 54k RPM SATA disks), and 10k peak concurrent connections for 45k total accounts. We used to run with the noac mount option, but performance was abysmal, and we were approaching 80% CPU utilization on the filers at peak load. After removing noac, our CPU is down around 30%, and our NFS ops/sec rate is maybe 1/10th of what it used to be.
The downside to this is that we've started seeing significantly more crashing and mailbox corruption. Timo's latest patch seems to have fixed the crashing, but the corruption just seems to be the cost of distributing users at random across our backend servers.
We've thought about enabling IP-based session affinity on the load balancer, but this would concentrate the load of our webmail clients, as well as not really solving the problem for users that leave clients open on multiple systems. I've done a small bit of looking at nginx's imap proxy support, but it's not really set up to do what we want, and would require moving the IMAP virtual server off our load balancers and on to something significantly less supportable. Having the dovecot processes 'talk amongst themselves' to synchronize things, or go into proxy mode automatically, would be fantastic.
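To make the 'proxy mode' idea a bit more concrete, here is a minimal Python sketch of routing by username instead of by client IP. This is not something Dovecot or the load balancer does today; the backend names and the backend_for function are made up for illustration, and the hashing scheme (rendezvous hashing) is just one reasonable choice. The point is only that every session for a given user would land on the same backend, no matter which client or webmail server it comes from.

```python
import hashlib

# Hypothetical backend names standing in for the 8 POP/IMAP servers.
BACKENDS = [f"imap{n}.example.org" for n in range(1, 9)]


def backend_for(username, backends=BACKENDS):
    """Pick one backend per user with rendezvous (highest-random-weight) hashing.

    Each (backend, user) pair gets a pseudo-random weight, and the user is
    sent to the backend with the highest weight. Removing a backend only
    moves the users that were assigned to that backend.
    """
    user = username.lower()  # assumption: logins are case-insensitive

    def weight(backend):
        digest = hashlib.sha256(f"{backend}|{user}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(backends, key=weight)


if __name__ == "__main__":
    # The same user maps to the same server from any client.
    for login in ("alice", "ALICE", "bob"):
        print(login, "->", backend_for(login))
```

Unlike IP-based affinity, this spreads the webmail servers' connections across the backends by user, and it also covers users who leave clients open on several machines. It does require something in front of the backends that sees the login name, which an L4 profile does not.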
Anyway, that's where we're at with the issue. As a data point for your discussion with your boss:
- With 'noac', we would see maybe one or two 'Corrupt' errors a day. Most of these were related to users going over quota.
- After removing 'noac', we saw 5-10 'Corrupt' errors and 20-30 crashes a day. The crashes were highly visible to the users, as their mailbox would appear to be empty until the rebuild completed.
- Since applying the latest patch, we've seen no crashes, and 60-70 'Corrupt' errors a day. We have not had any new user complaints.
Hope that helps,
-Brad