poolmon improvements

Timo Sirainen tss at iki.fi
Mon Aug 11 16:59:19 UTC 2014


I've been planning to improve poolmon failure checking for a long time already, but I still haven't managed to get to it. Maybe somebody else has more time, so here's a feature request for anyone to implement:

poolmon currently gives up immediately if the first check to any service fails. It really should be trying multiple times over multiple seconds before giving up. I think ideally it would be:

 - Individual check timeout could still be the default 5 seconds
 - Add a full check time setting, which could be e.g. 15 seconds. If all checks fail during this time, then disable the host.
 - If a check fails because the connection gets rejected, retry quickly, e.g. after 0.1 seconds
 - If a check fails because of protocol errors, wait longer before retrying, e.g. 1 second
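
The retry behaviour in the list above could be sketched roughly as follows. This is only an illustration in Python, not poolmon's actual code: every name here (CheckResult, host_is_up, the check callback and the constants) is hypothetical, and the numbers are just the example values from the list.

```python
import time
from enum import Enum


class CheckResult(Enum):
    """Hypothetical outcome of one health check against a backend."""
    OK = "ok"
    REJECTED = "rejected"              # connection was refused outright
    PROTOCOL_ERROR = "protocol_error"  # connected, but the reply was wrong or late


# Example values taken from the proposal; none of these are real poolmon settings.
CHECK_TIMEOUT = 5.0          # timeout for one individual check, in seconds
FULL_CHECK_TIME = 15.0       # total time to keep retrying before giving up
REJECT_RETRY_DELAY = 0.1     # quick retry after a rejected connection
PROTOCOL_RETRY_DELAY = 1.0   # slower retry after a protocol error


def host_is_up(check, full_check_time=FULL_CHECK_TIME,
               clock=time.monotonic, sleep=time.sleep):
    """Return True as soon as one check succeeds, or False if every
    check fails until the full check time has been used up.

    `check` is a hypothetical callable that takes a timeout in seconds
    and returns a CheckResult.
    """
    deadline = clock() + full_check_time
    while True:
        result = check(CHECK_TIMEOUT)
        if result is CheckResult.OK:
            return True
        # Pick the retry delay based on how the check failed.
        if result is CheckResult.REJECTED:
            delay = REJECT_RETRY_DELAY
        else:
            delay = PROTOCOL_RETRY_DELAY
        # Give up if there is no time left for another attempt.
        if clock() + delay >= deadline:
            return False
        sleep(delay)
```

The caller would disable the host only when host_is_up() returns False. The clock and sleep parameters are there only so the loop can be exercised without real waiting.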

This would avoid a backend being removed in situations where it really shouldn't be removed:

 - Dovecot restarts
 - Dovecot reloads
 - load spikes and other random issues that cause temporary problems

The load spike case is especially annoying, and my plan doesn't even fully solve it. The fix for a heavily overloaded cluster isn't really to start removing all of its backends that are busy working.


