[Dovecot] director monitoring?
I'm working the kinks out of a new director-based setup for our eventual migration away from Courier. Now that everything is basically working, I'm trying to ensure that things are properly monitored, and I've run into an issue. Apart from the logs, there doesn't appear to be a way to get Dovecot to tell you whether it is connected and properly synced with the other director servers in the ring. This seems like an important piece of information: without it, it isn't apparent how you would tell that your director servers have lost track of each other.
I'm also curious what people are doing to health check their director servers when they run load balancing upstream of them. It doesn't seem like a good idea to let the load balancers check all the way through to the real servers, since a failure on the target real server could lead to a director being dropped from the pool (and if so, the other directors would most likely be dropped as well). Otherwise, the health check failure tolerance at the load balancer must be greater than the director's tolerance for failure of the real servers, which means a dead director could stay in the pool for longer than desired, or at least long enough to be sure that it isn't a transient failure on the real server behind it.
A better method would seem to be for the load balancers to query the director for the number of active back-end servers and, so long as it is over a given threshold, assume that the director is otherwise able to do its job, relying on external monitoring to pick up internal failures where Dovecot isn't able to successfully proxy the connection to one of the real servers.
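As a rough illustration of the threshold idea, here is a minimal Python sketch. It assumes the load balancer polls a small wrapper that feeds it the text of a backend status listing (for example from `doveadm director status`), and it assumes a simple column layout of IP, vhosts, users after one header row. That layout is an assumption and differs between Dovecot versions, so the field index would need adjusting. The function names and the sample data are hypothetical.

```python
def count_active_backends(status_output: str) -> int:
    """Count backends whose vhost count is greater than zero.

    ASSUMPTION: one header row, then whitespace-separated rows of
    <ip> <vhosts> <users>. Adjust the field index for your version.
    """
    active = 0
    for line in status_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        try:
            vhosts = int(fields[1])
        except ValueError:
            continue  # not a data row
        if vhosts > 0:
            active += 1
    return active


def director_healthy(status_output: str, min_backends: int) -> bool:
    """Report healthy while at least min_backends backends are active."""
    return count_active_backends(status_output) >= min_backends


# Hypothetical sample listing: two active backends, one disabled (vhosts 0).
SAMPLE = """mail server ip   vhosts  users
10.0.0.11           100    512
10.0.0.12           100    498
10.0.0.13             0      0
"""
```

With this sample, `director_healthy(SAMPLE, 2)` holds and `director_healthy(SAMPLE, 3)` does not. Note that a check like this only tells the load balancer that the director still knows about enough backends; it says nothing about whether the directors are in sync with each other.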
So, how are people doing this in the real world?
-- Kelsey Cummings - kgc@corp.sonic.net sonic.net, inc. System Architect 2260 Apollo Way 707.522.1000 Santa Rosa, CA 95407
We use a setup as seen on http://grab.by/agCb for about 30.000 simultaneous(!) IMAP connections.
We have 2 Foundry load balancers. They check the health of the directors. We have 3 directors, and each one runs Brandon's poolmon script (https://github.com/brandond/poolmon). This script removes real servers from the director pool when they fail its check. The Dovecot IMAP servers are monitored with Nagios just to tell us when they're down.
This setup has been absolutely rock solid for us. I have not touched the whole system since November and we have not seen any more corruption of metadata, which is the whole reason for the directors. Kudos to Timo for fixing this difficult problem.
Cor
On Thu, Jun 02, 2011 at 10:37:23AM +0200, Cor Bosman wrote:
> We use a setup as seen on http://grab.by/agCb for about 30.000 simultaneous(!) IMAP connections.
This might as well be a diagram of my network, although, if I remember correctly, you're running quite a few more NetApp clusters than I am. ;)
> We have 2 Foundry load balancers. They check the health of the directors. We have 3 directors, and each one runs Brandon's poolmon script (https://github.com/brandond/poolmon). This script removes real servers from the director pool when they fail its check. The Dovecot IMAP servers are monitored with Nagios just to tell us when they're down.
I'm using a hacked-up version of poolmon. The only important changes are that it actually logs into the real server rather than just making a connection to it, that it has heuristics to prevent the real servers from flapping, and that scan_host now has a timeout, so if a real server blocks after the connection is established it won't hang indefinitely.
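For readers wondering what an anti-flapping heuristic might look like, here is a minimal Python sketch of one common approach: require several consecutive contradicting check results before changing a backend's state. This is an illustration under stated assumptions, not the logic of the modified poolmon (which is Perl); the class name and thresholds are hypothetical.

```python
class FlapDamper:
    """Change a backend's up/down state only after a run of
    consecutive check results that contradict the current state."""

    def __init__(self, fail_threshold: int = 3, ok_threshold: int = 2):
        self.fail_threshold = fail_threshold  # failures needed to mark down
        self.ok_threshold = ok_threshold      # successes needed to mark up
        self.up = True                        # backends start out as up
        self._streak = 0                      # consecutive contradicting results

    def record(self, check_ok: bool) -> bool:
        """Feed in one check result and return the damped state."""
        if check_ok == self.up:
            # Result agrees with the current state: reset the streak.
            self._streak = 0
        else:
            self._streak += 1
            needed = self.ok_threshold if check_ok else self.fail_threshold
            if self._streak >= needed:
                self.up = check_ok
                self._streak = 0
        return self.up
```

A monitor would keep one `FlapDamper` per real server, call `record()` after every check, and only add or remove the server from the director pool when the returned state changes. A single slow login then no longer pulls a server out of the pool.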
> This setup has been absolutely rock solid for us. I have not touched the whole system since November and we have not seen any more corruption of metadata, which is the whole reason for the directors. Kudos to Timo for fixing this difficult problem.
That is always good to hear!
I'd be a lot happier if I were able to monitor the directors and make sure that they are connected and correctly synced with each other, even if only as protection from human error rather than anticipated software failure.
-- Kelsey Cummings - kgc@corp.sonic.net sonic.net, inc. System Architect 2260 Apollo Way 707.522.1000 Santa Rosa, CA 95407
On Thu, Jun 02, 2011 at 12:29:10PM -0700, Kelsey Cummings wrote:
> I'm using a hacked-up version of poolmon. The only important changes are that it actually logs into the real server rather than just making a connection to it, that it has heuristics to prevent the real servers from flapping, and that scan_host now has a timeout, so if a real server blocks after the connection is established it won't hang indefinitely.
Could you share your hacks? :-)
We often see poolmon fail to notice when our backend servers are hanging on a busy filesystem. They're probably too busy to complete a login, but not busy enough to fail a connect, so a poolmon that does a full login sounds interesting.
-jf
On Fri, Aug 05, 2011 at 11:12:03AM +0200, Jan-Frode Myklebust wrote:
> On Thu, Jun 02, 2011 at 12:29:10PM -0700, Kelsey Cummings wrote:
>> I'm using a hacked-up version of poolmon. The only important changes are that it actually logs into the real server rather than just making a connection to it, that it has heuristics to prevent the real servers from flapping, and that scan_host now has a timeout, so if a real server blocks after the connection is established it won't hang indefinitely.
> Could you share your hacks? :-)
Sure. You'll probably want to change the regex at line 194 to match whatever your server says after the login is complete. My postlogin script puts out some extra info that I'm looking for instead of the default. Otherwise, YMMV; it works for me so far.
http://kgc.users.sonic.net/imapdmon
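For anyone who wants the general shape of a full-login check without reading the script, here is a minimal Python sketch. It is not the linked script. It assumes plaintext IMAP LOGIN is permitted on port 143 from the monitoring host, that the test account's credentials contain no spaces or quotes, and that a tagged `OK` response marks a completed login; `LOGIN_OK` is the pattern you would change to match your own post-login output. All names here are hypothetical.

```python
import re
import socket

# Tagged response that counts as a completed login. ASSUMPTION: the
# default "a1 OK ..." reply; change this if your postlogin script
# prints something else that you would rather match.
LOGIN_OK = re.compile(rb"^a1 OK\b", re.IGNORECASE)


def login_completed(line: bytes) -> bool:
    """True if this response line shows that the login finished."""
    return LOGIN_OK.match(line) is not None


def check_login(host: str, user: str, password: str,
                port: int = 143, timeout: float = 10.0) -> bool:
    """Attempt a full IMAP login. The timeout applies to the connect
    and to every read, so a backend that accepts the connection and
    then stalls is reported as failed instead of hanging the monitor."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            stream = sock.makefile("rwb")
            stream.readline()  # discard the server greeting
            # Simplification: credentials are sent as bare atoms.
            stream.write(b"a1 LOGIN %s %s\r\n"
                         % (user.encode(), password.encode()))
            stream.flush()
            for line in stream:
                if line.startswith(b"a1 "):  # tagged reply to our LOGIN
                    return login_completed(line)
    except OSError:  # covers refused connections and timeouts
        return False
    return False  # connection closed before a tagged reply arrived
```

The point of reading up to the tagged reply is the case described earlier in the thread: a server on a busy filesystem that still accepts connections but cannot finish a login.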
-- Kelsey Cummings - kgc@corp.sonic.net sonic.net, inc. System Architect 2260 Apollo Way 707.522.1000 Santa Rosa, CA 95407
On Thu, Jun 02, 2011 at 10:37:23AM +0200, Cor Bosman wrote:
> We use a setup as seen on http://grab.by/agCb for about 30.000 simultaneous(!) IMAP connections.
Are you doing NFS against the NetApp(s)? I've always assumed that maildir wouldn't work on NFS (fstat() calls being too slow), but would be interested to learn otherwise. Could you say something about how many email accounts and how many files you have in your maildirs?
-jf