11 Aug 2014, 6:59 p.m.

I've been planning to improve poolmon's failure checking for a long time, but I still haven't gotten to it. Maybe somebody else has more time, so here's a feature request for anyone to implement:

poolmon currently gives up immediately if the first check of any service fails. It really should try multiple times over several seconds before giving up. I think ideally it would be:
- The individual check timeout could stay at the default 5 seconds.
- Add a full check time setting, which could be e.g. 15 seconds. If all checks fail during this time, disable the host.
- If a check fails because the connection gets rejected, retry quickly, e.g. after 0.1 seconds.
- If a check fails because of protocol errors, wait longer before retrying, e.g. 1 second.
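
The retry logic above could look roughly like the following. This is only a sketch in Python, not poolmon code: the names `CheckResult`, `RETRY_DELAY` and `host_is_up` are made up for illustration, the `check` callback stands in for whatever actually probes the service, and retrying immediately after a timed-out check is my own assumption, since the list above doesn't cover that case.

```python
import time
from enum import Enum


class CheckResult(Enum):
    OK = "ok"
    REFUSED = "refused"                # connection was rejected
    PROTOCOL_ERROR = "protocol_error"  # connected, but got a bad reply
    TIMEOUT = "timeout"                # individual check timed out


# Seconds to wait before the next attempt, per failure type.
RETRY_DELAY = {
    CheckResult.REFUSED: 0.1,         # retry quickly
    CheckResult.PROTOCOL_ERROR: 1.0,  # wait longer
    CheckResult.TIMEOUT: 0.0,         # assumption: the check already waited
}


def host_is_up(check, check_timeout=5.0, full_check_time=15.0,
               clock=time.monotonic, sleep=time.sleep):
    """Return True as soon as one check succeeds, False if every
    check fails until full_check_time has run out.

    check(timeout) is a hypothetical callback returning a CheckResult.
    """
    deadline = clock() + full_check_time
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Never let a single check run past the overall deadline.
        result = check(min(check_timeout, remaining))
        if result is CheckResult.OK:
            return True
        delay = RETRY_DELAY[result]
        if deadline - clock() <= delay:
            # No time left for another attempt after waiting.
            return False
        sleep(delay)
```

The host would be disabled only when `host_is_up()` returns False, so a backend that refuses connections for a second or two during a restart would be retried instead of removed.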
 
This would avoid a backend being removed in situations where it really shouldn't be:
- Dovecot restarts
- Dovecot reloads
- load spikes and other random issues that cause temporary problems
 
The load spike case is especially annoying, and my plan doesn't even fully solve it. The way to fix a heavily overloaded cluster isn't really to start removing all of its backends that are busy working.