[Dovecot] Dovecot, LVS and the issues I have with it.
neil
neil.townsend at switchmedia.co.uk
Mon Apr 6 14:42:55 EEST 2009
We run around five Dovecot (Debian Etch, 1.0.rc15) POP/IMAP 'nodes' behind an
LVS load balancer, with mail stored on an NFS-based SAN. It works pretty well,
and I love the robustness of load-balancing POP/IMAP. We push a reasonable
amount of throughput through these, especially at peak times, driving the SAN
to around 1.5k IOPS.
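For context, the NFS-relevant bits of dovecot.conf on nodes like these are
along the lines of the following (a rough sketch rather than my exact config;
the mail path is a placeholder):

    # NFS-relevant dovecot.conf settings on a 1.0.x node (illustrative)
    mail_location = maildir:/mnt/san/%d/%n/Maildir

    # mmap is not safe when several NFS clients touch the same files
    mmap_disable = yes

    # fcntl locking works across NFS clients via lockd
    lock_method = fcntl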
We currently have two issues with this setup. The first is NFS index
corruption caused by NFS/Dovecot locking: basically the uidlist or a .index
file gets corrupted, which means either a full re-indexing of the mailbox or a
broken mailbox until I delete the indexes. In the uidlist case the symptom
tends to affect users who use POP rather than IMAP and insist on keeping
messages on the server: because the list is corrupt it gets rebuilt one way or
another, and the user's mail client then re-downloads the entire mailbox until
it re-marks the messages as already retrieved. This tends to annoy the user a
lot. After a bit of testing we do expect this to be fixed by version 1.1, but
if anyone has any comments on this I would certainly be interested.
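For anyone else testing the same thing, the 1.1 options we intend to try are
along these lines (a sketch of planned settings; I haven't run this in
production yet):

    # Dovecot 1.1 NFS cache-flushing options (illustrative)
    mmap_disable = yes       # still required over NFS
    mail_nfs_storage = yes   # flush NFS caches when accessing mail files
    mail_nfs_index = yes     # flush NFS caches when accessing index files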
The other issue is a little tricky to describe, or even to log effectively.
Occasionally a node receives more connections than it can handle and the
connection count goes through the roof: well beyond 150 Dovecot authentication
processes and 100 or so active POP/IMAP processes, with I/O wait, CPU and
memory usage all spiking at the same time. The server reaches a tipping point
where it can no longer serve POP/IMAP requests as fast as new connections
arrive. I could live with that, but it creates some less than desirable
symptoms:
1. We eventually hit the auth process cap, so any new auth requests are
effectively refused by the server. To the user this appears as an Outlook/mail
client re-authentication pop-up, which annoys them. Ideally, if the server
stops accepting auth requests it should fall off our load balancer until it
can consistently accept them again. Since LVS only detects a node failure by
whether the TCP port is still open, and Dovecot keeps the port open, this
doesn't happen. This is probably more an LVS issue than one for this list,
unless anyone has any config-tweaking tips here? (There's a rough sketch of
the sort of health check I mean after this list.)
2. Now here's my real gripe: Dovecot does not handle running out of resources
very gracefully at all in our setup. It does start killing processes after a
while, and I get multiple "dovecot: child 17989 (login) killed with signal 9"
messages. I'm not exactly sure what's going on here, because after this all I
can see is the machine completely out of memory while the kernel starts
killing absolutely everything. All services get killed (including ssh, etc.),
and when I plug a monitor into the server the last few lines on the console
show init and other rather important things having just been killed. At that
point it's a case of power-cycling the server and all is back to normal again.
(The second sketch below shows the sort of resource caps I'm wondering about.)
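On the LVS side, the direction I'm considering for point 1 is an
ldirectord-style negotiate check that performs a real POP3 login rather than
just probing the port, so a node that stops authenticating gets pulled from
the pool. A rough sketch, assuming ldirectord is driving LVS (addresses and
the health-check account are placeholders):

    # ldirectord.cf fragment - negotiate check doing a real POP3 login
    virtual = 192.0.2.10:110
            real = 10.0.0.11:110 masq
            real = 10.0.0.12:110 masq
            service = pop
            checktype = negotiate
            login = "healthcheck"
            passwd = "secret"
            protocol = tcp
            scheduler = wlc

On the Dovecot side, for point 2 the sort of thing I'm wondering about is
capping process counts and per-process sizes so the node refuses connections
instead of being OOM-killed; roughly (numbers are illustrative and would need
tuning against each node's RAM):

    # dovecot.conf (1.0.x) - illustrative resource caps
    login_max_processes_count = 128   # cap on login/auth processes
    max_mail_processes = 256          # cap on pop3/imap processes
    mail_process_size = 256           # per-process address space limit, in MB

The catch, of course, is that capping auth processes just brings back symptom
1 unless the load balancer check above also takes the node out of rotation.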
I imagine there aren't a huge number of people using Dovecot in this way, but
has anyone got any recommendations here? I really like using Dovecot in this
setup; it handles it pretty well, and the redundancy and functionality options
it provides have been invaluable.
Neil