poolmon improvements

11 Aug 2014

I've been planning to improve poolmon's failure checking for a long time, but I still haven't gotten to it. Maybe somebody else has more time, so here's a feature request for anyone to implement:
poolmon currently gives up immediately if the first check to any service fails. It really should try multiple times over several seconds before giving up. I think ideally it would work like this:

The individual check timeout could still be the default 5 seconds.
Add a full check time setting, which could be e.g. 15 seconds. If all checks fail during this time, disable the host.
If a check fails because the connection gets rejected, retry quickly, e.g. after 0.1 seconds.
If a check fails because of protocol errors, wait longer before retrying, e.g. 1 second.
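
The retry policy in the list above could be sketched roughly as follows. This is not poolmon code (poolmon itself is a Perl script), just a minimal Python illustration of the idea. The names host_is_alive, ProtocolError, and the check callable are all hypothetical, and the delays are the example values from the list. The check callable is assumed to perform one check with its own 5 second timeout and to raise an exception on failure.

```python
import time


class ProtocolError(Exception):
    """Hypothetical: the service answered, but not as expected."""


def host_is_alive(check, full_timeout=15.0, refused_delay=0.1,
                  protocol_delay=1.0, clock=time.monotonic, sleep=time.sleep):
    """Retry `check` until it succeeds or `full_timeout` seconds pass.

    `check` is a hypothetical callable that runs one check (with its own
    per-check timeout) and returns normally on success.
    """
    deadline = clock() + full_timeout
    while True:
        try:
            check()
            return True  # one success is enough to keep the host enabled
        except ConnectionRefusedError:
            # Connection rejected: service is probably restarting, retry soon.
            delay = refused_delay
        except (ProtocolError, OSError):
            # Protocol error or timeout: service is up but unhealthy, back off.
            delay = protocol_delay
        if clock() + delay >= deadline:
            return False  # every check in the window failed: disable the host
        sleep(delay)
```

The clock and sleep parameters exist only so the loop can be exercised without real waiting. A caller would disable the backend only when host_is_alive returns False.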

This would avoid a backend being removed in situations where it really shouldn't be:

Dovecot restarts
Dovecot reloads
Load spikes and other random issues that cause temporary problems

Load spikes in particular are an annoying issue that my plan doesn't fully solve. The fix for a heavily overloaded cluster isn't really to start removing all of the backends that are busy working.


Timo Sirainen