[Dovecot] TIMO HELP! director ring wont stay connected
I've had 2x director ring up and running with production load on 2.1.8 with around 10,000 active connections for two weeks and everything has been working great - until this morning.
There isn't anything obvious in the logs beyond the fact that the director connections started bouncing. It was not resolved by reloads or restarts or an upgrade to 2.1.9 (only the directors.)
I've dropped one of the servers out of the ring to prevent suffering but this is a less than ideal situation.
Any idea what is going on? Load today is consistent with low weekend load since it is a major US holiday so this wouldn't appear to be a load related issue.
auth_master_user_separator = * auth_username_format = %Ln auth_verbose = yes default_client_limit = 10000 director_mail_servers = 1.1.1.1 1.1.1.2 1.1.1.3 1.1.1.4 director_servers = a.director.foo b.director.foo disable_plaintext_auth = no doveadm_proxy_port = 1842 login_trusted_networks = 10.0.0.1 mbox_write_locks = fcntl passdb { args = /etc/dovecot/master-users driver = passwd-file master = yes pass = yes } passdb { args = proxy=y nopassword=y driver = static } service anvil { client_limit = 20103 } service auth { client_limit = 41704 } service director { fifo_listener login/proxy-notify { mode = 0666 } inet_listener { port = 9321 } unix_listener login/director { mode = 0666 } } service imap-login { executable = imap-login director process_limit = 20000 process_min_avail = 32 } service imap { process_limit = 20480 } service pop3-login { executable = pop3-login director process_limit = 20000 process_min_avail = 32 } ssl_ca = </etc/dovecot/ssl/gd_bundle.crt ssl_cert = </etc/dovecot/ssl/imap.sonic.net.crt ssl_key = </etc/dovecot/ssl/imap.sonic.net.key ssl_parameters_regenerate = 1 days syslog_facility = local0 protocol imap { ssl_ca = </etc/dovecot/ssl/gd_bundle.crt ssl_cert = </etc/dovecot/ssl/imap.sonic.net.crt ssl_key = </etc/dovecot/ssl/imap.sonic.net.key } protocol pop3 { ssl_ca = </etc/dovecot/ssl/pop.sonic.net.ca-bundle ssl_cert = </etc/dovecot/ssl/pop.sonic.net.crt ssl_key = </etc/dovecot/ssl/pop.sonic.net.key }
Sep 3 09:22:42 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/right disconnected Sep 3 09:22:45 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/left disconnected Sep 3 09:22:49 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/left disconnected Sep 3 09:22:53 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/left disconnected Sep 3 09:22:54 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/left disconnected Sep 3 09:22:59 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/left disconnected Sep 3 09:23:02 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/right disconnected Sep 3 09:23:02 a.director. a dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left Sep 3 09:23:02 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/right disconnected Sep 3 09:23:02 b.director. b dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left Sep 3 09:23:32 a.director. a dovecot: director: Error: director: User foo host lookup failed: Timeout - queued for 47 secs (Ring synced for 30 secs, weak user, user refreshed 47 secs ago) Sep 3 09:23:32 a.director. a dovecot: director: Error: director: User bar host lookup failed: Timeout - queued for 38 secs (Ring synced for 30 secs, weak user, user refreshed 38 secs ago) Sep 3 09:23:32 a.director. a dovecot: director: Error: director: User bla host lookup failed: Timeout - queued for 30 secs (Ring synced for 30 secs) Sep 3 09:23:32 a.director. a dovecot: director: Error: director: User bla2 host lookup failed: Timeout - queued for 30 secs (Ring synced for 30 secs) Sep 3 09:23:32 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/right disconnected Sep 3 09:23:32 a.director. a dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left Sep 3 09:23:32 a.director. a dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left Sep 3 09:23:32 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/out disconnected before handshake finished Sep 3 09:23:32 b.director. b dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left Sep 3 09:24:02 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/out disconnected before handshake finished Sep 3 09:24:02 b.director. b dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left Sep 3 09:24:05 a.director. a dovecot: director: Warning: Delaying new user requests until ring is synced Sep 3 09:24:32 a.director. a dovecot: director: Warning: Ring is synced, continuing delayed requests Sep 3 09:24:41 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/right disconnected Sep 3 09:24:41 b.director. b dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left Sep 3 09:24:41 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/right disconnected Sep 3 09:24:41 a.director. a dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left Sep 3 09:25:11 b.director. b dovecot: director: Error: User hash 2285697953 is being redirected to two hosts: 10.10.10.39 and 10.10.10.76 (old_ts=1346689481,handshaking,recv_ts=1346689467) Sep 3 09:25:12 b.director. b dovecot: director: Error: User hash 623192092 is being redirected to two hosts: 10.10.10.76 and 10.10.10.39 (old_ts=1346689481,handshaking,recv_ts=1346689468) Sep 3 09:25:12 b.director. b dovecot: director: Error: User hash 1683990717 is being redirected to two hosts: 10.10.10.43 and 10.10.10.76 (old_ts=1346689481,handshaking,recv_ts=1346689468) Sep 3 09:25:12 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/right disconnected Sep 3 09:25:12 a.director. a dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director lef
-- Kelsey Cummings - kgc@corp.sonic.net sonic.net, inc. System Architect 2260 Apollo Way 707.522.1000 Santa Rosa, CA 95407
Am 03.09.2012 20:26, schrieb Kelsey Cummings:
I've had 2x director ring up and running with production load on 2.1.8 with around 10,000 active connections for two weeks and everything has been working great - until this morning.
meanwhile try upgrade to 2.1.9
i cant see big changes there for director which should brake something http://hg.dovecot.org/dovecot-2.1/rev/bc86680293d2
however its better debug with latest patchlevel
Best Regards MfG Robert Schetterer
On 9/3/2012 11:32 AM, Robert Schetterer wrote:
Am 03.09.2012 20:26, schrieb Kelsey Cummings:
I've had 2x director ring up and running with production load on 2.1.8 with around 10,000 active connections for two weeks and everything has been working great - until this morning.
meanwhile try upgrade to 2.1.9
I did that in the second sentence. ;)
One of the servers had a single segfault but this wasn't the original cause and didn't show up until after I was poking.
director[30709]: segfault at 40 ip 0000000000405d37 sp 00007fff42af80e0 error 6 in director[400000+f000]
-- Kelsey Cummings - kgc@corp.sonic.net sonic.net, inc. System Architect 2260 Apollo Way 707.522.1000 Santa Rosa, CA 95407
On 3.9.2012, at 21.26, Kelsey Cummings wrote:
I've had 2x director ring up and running with production load on 2.1.8 with around 10,000 active connections for two weeks and everything has been working great - until this morning.
There isn't anything obvious in the logs beyond the fact that the director connections started bouncing. It was not resolved by reloads or restarts or an upgrade to 2.1.9 (only the directors.)
Did you try stopping both and then starting them again? That clears up all the state they have.
Any idea what is going on? Load today is consistent with low weekend load since it is a major US holiday so this wouldn't appear to be a load related issue.
Directors themselves think they're having trouble connecting to each others.. Annoyingly it doesn't give specific error messages about what happened. I should improve the logging..
If the state clearing doesn't help, maybe this has something to do with the OS or the network is really having some issues.
On 9/3/2012 12:06 PM, Timo Sirainen wrote:
Did you try stopping both and then starting them again? That clears up all the state they have.
I'm not sure that they were both down when restarting them and will try this tonight.
If the state clearing doesn't help, maybe this has something to do with the OS or the network is really having some issues.
I can't rule that out but there are not any signs that there are any hardware, OS or network related issues.
Thanks for gettting the ring status into doveadm by the way. At least our monitoring caught this quickly.
-- Kelsey Cummings - kgc@corp.sonic.net sonic.net, inc. System Architect 2260 Apollo Way 707.522.1000 Santa Rosa, CA 95407
On 09/03/12 12:06, Timo Sirainen wrote:
On 3.9.2012, at 21.26, Kelsey Cummings wrote:
I've had 2x director ring up and running with production load on 2.1.8 with around 10,000 active connections for two weeks and everything has been working great - until this morning.
There isn't anything obvious in the logs beyond the fact that the director connections started bouncing. It was not resolved by reloads or restarts or an upgrade to 2.1.9 (only the directors.)
Did you try stopping both and then starting them again? That clears up all the state they have.
I stopped both directors last night and they were able to stay in sync after they were restarted. Could corruption of the in memory state lead to the connections being dropped?
If this happens again I'll try to get a tcpdump and an strace so the bug can get squashed.
-K
On 5.9.2012, at 3.58, Timo Sirainen wrote:
On 3.9.2012, at 21.26, Kelsey Cummings wrote:
passdb { args = proxy=y nopassword=y driver = static }
I wonder if someone was doing a ton of logins for different usernames? This kind of setup where director doesn't verify the username can be attacked that way.
Although the extra users should be freed from the memory after 15 minutes.
Hmm. Once Dovecot supports moving existing connections from one backend server to another without the client noticing anything, the director could be simplified by using consistent hashing and when the number of backends changes, the director could start moving connections to their proper backends. During this move new connections would be handled by 1) if old backend = new backend just forward the connection there or 2) if they're different, request immediate move for that user's existing connections and wait for it to be finished before letting new connections finish. Or alternatively if the user isn't just being moved at that time, forward the connection to the old server and let it be part of the later move.
The main difference here is that directors wouldn't need to keep any track of user -> backend associations. The moving period could still be a bit tricky to handle well, especially since the situation can change again while a previous move is still going on.
On 9/4/2012 5:58 PM, Timo Sirainen wrote:
On 3.9.2012, at 21.26, Kelsey Cummings wrote:
passdb { args = proxy=y nopassword=y driver = static }
I wonder if someone was doing a ton of logins for different usernames? This kind of setup where director doesn't verify the username can be attacked that way.
It doesn't look like there was a higher than normal number of failed logins leading up to the connection issues. I'm going to write some more stats collection tools to track state on the directors and see what comes of it.
Can the director proxy validate the username via a unix pw lookup but not check the password?
-- Kelsey Cummings - kgc@corp.sonic.net sonic.net, inc. System Architect 2260 Apollo Way 707.522.1000 Santa Rosa, CA 95407
On 3.9.2012, at 21.26, Kelsey Cummings wrote:
Sep 3 09:22:42 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/right disconnected Sep 3 09:22:45 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/left disconnected Sep 3 09:22:49 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/left disconnected Sep 3 09:22:53 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/left disconnected Sep 3 09:22:54 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/left disconnected Sep 3 09:22:59 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/left disconnected Sep 3 09:23:02 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/right disconnected
All of these connections had finished handshaking. They simply just seemed to disconnect the connection for no reason. I found one reason for why that could happen, which could explain all of this:
http://hg.dovecot.org/dovecot-2.1/rev/24e791bbcf69
The user weakness is remembered until all directors are shut down or all directors have acknowledged the weakness.
Sep 3 09:23:02 a.director. a dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left
The user weakness should have been removed at this point, but looks like there's code missing for it for 1-director-only setup. I'll fix that soon..
Sep 3 09:23:32 a.director. a dovecot: director: Error: director: User foo host lookup failed: Timeout - queued for 47 secs (Ring synced for 30 secs, weak user, user refreshed 47 secs ago) Sep 3 09:23:32 a.director. a dovecot: director: Error: director: User bar host lookup failed: Timeout - queued for 38 secs (Ring synced for 30 secs, weak user, user refreshed 38 secs ago)
These are the weak users causing the trouble.
This improves logging: http://hg.dovecot.org/dovecot-2.1/rev/27d3289e1f5c
Timo, I'm glad to hear that you found a plausible cause and have got a patch in to fix it.
-K
participants (3)
-
Kelsey Cummings
-
Robert Schetterer
-
Timo Sirainen