poolmon improvements
Timo Sirainen
tss at iki.fi
Mon Aug 11 16:59:19 UTC 2014
I've been planning to improve poolmon failure checking for a long time already, but I still haven't managed to get to it. Maybe somebody else has more time, so here's a feature request for anyone to implement:
poolmon currently gives up immediately if the first check of any service fails. It really should retry several times over several seconds before giving up. I think ideally it would be:
- Individual check timeout could still be the default 5 seconds
- Add a total check time setting, which could be e.g. 15 seconds. If all checks fail during this time, disable the host.
- If a check fails because the connection gets rejected, retry quickly, e.g. after 0.1 seconds
- If a check fails because of protocol errors, wait longer before retrying, e.g. 1 second
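
To make the list above concrete, here's a rough sketch of the retry loop in Python (not poolmon's own code, just an illustration). The names host_is_up, ProtocolError and the check callable are all made up for this example; check is assumed to enforce its own 5 second timeout and to raise on failure. Treating timeouts and other socket errors like protocol errors is my assumption, the list above doesn't cover that case.

```python
import time


class ProtocolError(Exception):
    """Hypothetical: the service answered, but not with a valid reply."""


def host_is_up(check, total_time=15.0, reject_delay=0.1, protocol_delay=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Retry check() until it succeeds or total_time seconds have passed.

    check is a hypothetical callable that performs one service check with
    its own per-check timeout (e.g. 5 seconds) and raises on failure.
    Returns True if any attempt succeeded, False if the host should be
    disabled.
    """
    deadline = clock() + total_time
    while True:
        try:
            check()
            return True
        except ConnectionRefusedError:
            # Connection rejected (e.g. mid-restart): retry quickly.
            delay = reject_delay
        except (ProtocolError, OSError):
            # Protocol error: back off longer. Timeouts and other socket
            # errors also land here, which is an assumption of this sketch.
            delay = protocol_delay
        # Give up once the next attempt could not start before the deadline.
        if clock() + delay >= deadline:
            return False
        sleep(delay)
```

The clock and sleep parameters are only there so the loop can be tested without actually waiting. With the defaults, a backend that keeps returning protocol errors gets about 15 attempts before being disabled, and one that rejects connections during a restart gets retried every 0.1 seconds until it comes back or the 15 seconds run out.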
Retrying like this would avoid a backend being removed in situations where it really shouldn't be removed:
- Dovecot restarts
- Dovecot reloads
- load spikes and other random issues that cause temporary problems
The load spike case is especially annoying, and my plan doesn't even fully solve it. The way to fix a heavily overloaded cluster isn't really to start removing all of its backends that are busy working.