I'm working the kinks out of a new director-based setup for the eventual migration away from Courier. At this point, with everything basically working, I'm trying to ensure that things are properly monitored, and I've run into an issue. There doesn't appear to be a way, apart from the logs, to get Dovecot to tell you whether it is connected to and properly synced with the other director servers in the ring. This seems like an important piece of information -- without it, it isn't apparent how you would tell that your director servers have lost track of each other.
I'm also curious what people are doing to health check their director servers when they are running load balancing upstream of them. It doesn't seem like a good idea to let the load balancers check all the way through to the real servers, since a failure on the target real server could lead to a director being dropped from the pool (and if so, the other directors would most likely be dropped as well). Otherwise, the health check failure tolerance at the load balancer must be greater than the director's tolerance for failure of the real servers, which means a dead director could stay in the pool for longer than desired, or at any rate long enough to be sure that the problem isn't a transient failure on the real server behind it.
A better method would seem to be for the load balancers to query the director for the number of active back-end servers and, so long as it is over a given threshold, assume that the director is otherwise able to do its job, relying on external monitoring to pick up internal failures where Dovecot isn't able to successfully proxy the connection to one of the real servers.
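To make the threshold idea concrete, here is a rough sketch of the check I have in mind. It only parses text that has already been captured; it assumes that text looks like a header row followed by one row per back-end server with the address in the first column and the vhost count in the second, which I believe is roughly what `doveadm director status` prints, but the exact columns vary between Dovecot versions. The function names, sample addresses, and threshold are all made up for illustration.

```python
def count_active_backends(status_output: str) -> int:
    """Count back-end rows whose vhost count is greater than zero.

    Assumed layout (hypothetical, adjust to your Dovecot version):
    column 1 is the server address, column 2 is the vhost count.
    """
    active = 0
    for line in status_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        try:
            vhosts = int(fields[1])
        except ValueError:
            continue  # header row or anything else non-numeric
        if vhosts > 0:
            active += 1
    return active


def director_is_healthy(status_output: str, min_backends: int) -> bool:
    """True when at least min_backends servers have a non-zero vhost count."""
    return count_active_backends(status_output) >= min_backends


# Made-up sample in the assumed layout; the third server has vhosts set to 0.
SAMPLE = """mail server ip vhosts users
10.0.0.1          100    42
10.0.0.2          100    40
10.0.0.3            0     0
"""

if __name__ == "__main__":
    print(director_is_healthy(SAMPLE, min_backends=2))
```

The load balancer's health check would then call something like this through whatever hook it supports (a small HTTP wrapper or an external check script), leaving end-to-end proxy failures to separate monitoring.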
So, how are people doing this in the real world?
--
Kelsey Cummings - kgc@corp.sonic.net
System Architect
sonic.net, inc.
2260 Apollo Way
Santa Rosa, CA 95407
707.522.1000