On Wed, Sep 26, 2012 at 08:57:58PM +0300, Timo Sirainen wrote:
On 26.9.2012, at 20.34, Kelsey Cummings wrote:
The following errors started on the directors after this and went unnoticed until this AM.
director: User bb host lookup failed: Timeout - queued for 30 secs (Ring synced for 36 secs)
director: User cc host lookup failed: Timeout - queued for 48 secs (Ring synced for 66 secs, user refreshed 12 secs ago)
director: User dd host lookup failed: Timeout - queued for 124 secs (Ring synced for 119 secs, weak user, user refreshed 155 secs ago)
director: User ee host lookup failed: Timeout - queued for 79 secs (Ring synced for 119 secs, weak user, user refreshed 113 secs ago)
... User ff host lookup failed: Timeout - queued for 30 secs (Ring synced for 7427 secs, weak user, user refreshed 620 secs ago)
This continued, combined with occasional login timeouts (as reported by some internal IMAP clients). The login delays/timeouts got bad enough that our load balancers dropped both servers while I was investigating. They seem to be okay after being restarted.
After the first few minutes, did all the rest of the error messages contain the "weak user" string? Did this happen to a lot of different users (few/some/most)? Is the director_user_expire setting at the default 15 minutes?
No, there continued to be a mix of both. The pattern seems to look like this. I'll run some stats later, but it looks like a pretty significant number of users were affected.
09:25:21 .. User X host lookup failed: Timeout - queued for 30 secs (Ring synced for 5032 secs)
09:25:55 .. User X host lookup failed: Timeout - queued for 30 secs (Ring synced for 5066 secs, weak user, user refreshed 64 secs ago)
09:26:28 .. User X host lookup failed: Timeout - queued for 30 secs (Ring synced for 5099 secs, weak user, user refreshed 97 secs ago)
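
For the stats, something along these lines might do. It is only a rough sketch that assumes the timeout messages keep the format quoted in this thread. The names parse_line and summarize and the regular expression are illustrative and not part of Dovecot.

```python
import re
from collections import Counter

# Matches the director timeout lines as quoted in this thread.
# The optional tail inside the parentheses carries "weak user" and
# "user refreshed N secs ago" when they are present.
LINE_RE = re.compile(
    r"User (?P<user>\S+) host lookup failed: Timeout - "
    r"queued for (?P<queued>\d+) secs "
    r"\(Ring synced for (?P<synced>\d+) secs(?P<rest>[^)]*)\)"
)


def parse_line(line):
    """Return a dict for one timeout line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    rest = m.group("rest")
    refreshed = re.search(r"user refreshed (\d+) secs ago", rest)
    return {
        "user": m.group("user"),
        "queued": int(m.group("queued")),
        "synced": int(m.group("synced")),
        "weak": "weak user" in rest,
        "refreshed": int(refreshed.group(1)) if refreshed else None,
    }


def summarize(lines):
    """Count distinct users and weak vs. non-weak timeout messages."""
    users = set()
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip unrelated log lines
        users.add(rec["user"])
        counts["weak" if rec["weak"] else "not_weak"] += 1
    return {
        "users": len(users),
        "weak": counts["weak"],
        "not_weak": counts["not_weak"],
    }
```

Passing an open log file (or any iterable of lines) to summarize() gives the number of distinct users hit and the split between "weak user" and plain timeouts, which should answer the few/some/most question.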
-- 
Kelsey Cummings - kgc@corp.sonic.net      sonic.net, inc.
System Architect                          2260 Apollo Way
707.522.1000                              Santa Rosa, CA 95407