[Dovecot] (new) director issues in 2.1.10
Timo - I upgraded to 2.1.10 on our director servers two nights ago and, apart from errors associated with the director processes restarting, everything looked great for ~24 hours, until I failed over the real servers last night to update the NFS mount options for the spools.
I followed the suggested procedure for each backend server, running the commands on just one of the directors, which seemed to work as expected.
doveadm director add x.x.x.x 0
doveadm director flush x.x.x.x
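For anyone repeating this across several backends, the two commands above could be wrapped in a small shell function. This is only a sketch: the `drain_backend` name and the `DOVEADM` override variable are hypothetical and do not come from this thread; only the two `doveadm director` invocations do.

```sh
#!/bin/sh
# Sketch: take one backend out of rotation, then flush its user mappings.
# DOVEADM is a hypothetical override so the function can be dry-run with
# DOVEADM=echo; by default the real doveadm binary is used.
drain_backend() {
    ip="$1"
    # Re-add the backend with a vhost count of 0, as in the procedure above.
    "${DOVEADM:-doveadm}" director add "$ip" 0 || return 1
    # Flush the existing user-to-backend mappings for this host.
    "${DOVEADM:-doveadm}" director flush "$ip"
}
```

Running `DOVEADM=echo drain_backend x.x.x.x` prints the two commands' arguments instead of executing them, which is a cheap way to check the sequence before touching a live ring.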
The following errors on the directors that started after this went unnoticed until this AM.
director: User bb host lookup failed: Timeout - queued for 30 secs (Ring synced for 36 secs)
director: User cc host lookup failed: Timeout - queued for 48 secs (Ring synced for 66 secs, user refreshed 12 secs ago)
director: User dd host lookup failed: Timeout - queued for 124 secs (Ring synced for 119 secs, weak user, user refreshed 155 secs ago)
director: User ee host lookup failed: Timeout - queued for 79 secs (Ring synced for 119 secs, weak user, user refreshed 113 secs ago)
... User ff host lookup failed: Timeout - queued for 30 secs (Ring synced for 7427 secs, weak user, user refreshed 620 secs ago)
This continued, combined with occasional login timeouts (as reported by some internal imap clients.) The login delays/timeouts got bad enough that our load balancers dropped both the servers while I was investigating. They seem to be okay after being restarted.
-K
On 26.9.2012, at 20.34, Kelsey Cummings wrote:
The following errors on the directors that started after this went unnoticed until this AM.
director: User bb host lookup failed: Timeout - queued for 30 secs (Ring synced for 36 secs)
director: User cc host lookup failed: Timeout - queued for 48 secs (Ring synced for 66 secs, user refreshed 12 secs ago)
director: User dd host lookup failed: Timeout - queued for 124 secs (Ring synced for 119 secs, weak user, user refreshed 155 secs ago)
director: User ee host lookup failed: Timeout - queued for 79 secs (Ring synced for 119 secs, weak user, user refreshed 113 secs ago)
... User ff host lookup failed: Timeout - queued for 30 secs (Ring synced for 7427 secs, weak user, user refreshed 620 secs ago)
This continued, combined with occasional login timeouts (as reported by some internal imap clients.) The login delays/timeouts got bad enough that our load balancers dropped both the servers while I was investigating. They seem to be okay after being restarted.
After the first few minutes, did all the rest of the error messages contain the "weak user" string? Did this happen to a lot of different users (few/some/most)? Is the director_user_expire setting at the default 15 minutes?
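One way to answer the last question is to ask the configuration directly. The sketch below assumes `doveconf` is installed on the director and that its `-h` option prints only the setting's value; the `show_user_expire` wrapper and the `DOVECONF` override variable are hypothetical names used for illustration.

```sh
#!/bin/sh
# Sketch: print the effective director_user_expire value.
# DOVECONF is a hypothetical override for testing; by default the real
# doveconf binary is used.
show_user_expire() {
    # -h hides the setting name and prints only its value.
    value=$("${DOVECONF:-doveconf}" -h director_user_expire) || return 1
    echo "director_user_expire = $value"
}
```

Calling `show_user_expire` on each director also makes it easy to spot a ring whose members disagree on the value.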
On Wed, Sep 26, 2012 at 08:57:58PM +0300, Timo Sirainen wrote:
After the first few minutes, did all the rest of the error messages contain the "weak user" string? Did this happen to a lot of different users (few/some/most)? Is the director_user_expire setting at the default 15 minutes?
No, there continued to be a mix of both. The pattern seems to look like this. I'll run some stats later, but it looks like a pretty significant number of users were affected.
09:25:21 .. User X host lookup failed: Timeout - queued for 30 secs (Ring synced for 5032 secs)
09:25:55 .. User X host lookup failed: Timeout - queued for 30 secs (Ring synced for 5066 secs, weak user, user refreshed 64 secs ago)
09:26:28 .. User X host lookup failed: Timeout - queued for 30 secs (Ring synced for 5099 secs, weak user, user refreshed 97 secs ago)
-- Kelsey Cummings - kgc@corp.sonic.net sonic.net, inc. System Architect 2260 Apollo Way 707.522.1000 Santa Rosa, CA 95407
On 09/26/12 11:06, Kelsey Cummings wrote:
No, there continued to be a mix of both. The pattern seems to look like this. I'll run some stats later, but it looks like a pretty significant number of users were affected.
Timo, it looks like the total number of affected users was only about 250 and that most of their erred connections were surrounded by successful sessions.
-K
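Counts like the one above can be pulled from the director log with a short filter. This is a sketch only, not the method actually used in this thread: it assumes one log entry per line in the format shown earlier, and the `summarize_lookup_failures` name and the output format are made up for illustration.

```sh
#!/bin/sh
# Sketch: summarise "host lookup failed" timeouts from log lines on stdin.
# Prints distinct affected users, total failures, and how many failures
# carried the "weak user" flag.
summarize_lookup_failures() {
    awk '
        /host lookup failed: Timeout/ {
            # The user name is the field right after the literal word "User".
            for (i = 1; i < NF; i++) {
                if ($i == "User") { users[$(i + 1)] = 1; break }
            }
            total++
            if ($0 ~ /weak user/) weak++
        }
        END {
            n = 0
            for (u in users) n++
            printf "users=%d failures=%d weak=%d\n", n, total, weak
        }
    '
}
```

Feed it the director log on stdin, for example `grep director: your-log-file | summarize_lookup_failures`, where the log file location depends on the local syslog setup.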
On 26.9.2012, at 21.06, Kelsey Cummings wrote:
09:25:21 .. User X host lookup failed: Timeout - queued for 30 secs (Ring synced for 5032 secs)
09:25:55 .. User X host lookup failed: Timeout - queued for 30 secs (Ring synced for 5066 secs, weak user, user refreshed 64 secs ago)
09:26:28 .. User X host lookup failed: Timeout - queued for 30 secs (Ring synced for 5099 secs, weak user, user refreshed 97 secs ago)
Looks like I had broken this in v2.1.8. http://hg.dovecot.org/dovecot-2.1/rev/e4c337f38ed6 fixes this. I also added a bunch of other things to give better error messages and to try to fix any unexpected problems.
On Mon, Oct 22, 2012 at 03:39:34PM +0300, Timo Sirainen wrote:
Looks like I had broken this in v2.1.8. http://hg.dovecot.org/dovecot-2.1/rev/e4c337f38ed6 fixes this. I also added a bunch of other things to give better error messages and to try to fix any unexpected problems.
Thanks Timo!