I've had 2x director ring up and running with production load on 2.1.8 with around 10,000 active connections for two weeks and everything has been working great - until this morning.
There isn't anything obvious in the logs beyond the fact that the director connections started bouncing. It was not resolved by reloads or restarts or an upgrade to 2.1.9 (only the directors.)
I've dropped one of the servers out of the ring to prevent suffering but this is a less than ideal situation.
Any idea what is going on? Load today is consistent with low weekend load since it is a major US holiday so this wouldn't appear to be a load related issue.
auth_master_user_separator = * auth_username_format = %Ln auth_verbose = yes default_client_limit = 10000 director_mail_servers = 1.1.1.1 1.1.1.2 1.1.1.3 1.1.1.4 director_servers = a.director.foo b.director.foo disable_plaintext_auth = no doveadm_proxy_port = 1842 login_trusted_networks = 10.0.0.1 mbox_write_locks = fcntl passdb { args = /etc/dovecot/master-users driver = passwd-file master = yes pass = yes } passdb { args = proxy=y nopassword=y driver = static } service anvil { client_limit = 20103 } service auth { client_limit = 41704 } service director { fifo_listener login/proxy-notify { mode = 0666 } inet_listener { port = 9321 } unix_listener login/director { mode = 0666 } } service imap-login { executable = imap-login director process_limit = 20000 process_min_avail = 32 } service imap { process_limit = 20480 } service pop3-login { executable = pop3-login director process_limit = 20000 process_min_avail = 32 } ssl_ca =
Sep 3 09:22:42 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/right disconnected Sep 3 09:22:45 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/left disconnected Sep 3 09:22:49 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/left disconnected Sep 3 09:22:53 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/left disconnected Sep 3 09:22:54 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/left disconnected Sep 3 09:22:59 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/left disconnected Sep 3 09:23:02 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/right disconnected Sep 3 09:23:02 a.director. a dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left Sep 3 09:23:02 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/right disconnected Sep 3 09:23:02 b.director. b dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left Sep 3 09:23:32 a.director. a dovecot: director: Error: director: User foo host lookup failed: Timeout - queued for 47 secs (Ring synced for 30 secs, weak user, user refreshed 47 secs ago) Sep 3 09:23:32 a.director. a dovecot: director: Error: director: User bar host lookup failed: Timeout - queued for 38 secs (Ring synced for 30 secs, weak user, user refreshed 38 secs ago) Sep 3 09:23:32 a.director. a dovecot: director: Error: director: User bla host lookup failed: Timeout - queued for 30 secs (Ring synced for 30 secs) Sep 3 09:23:32 a.director. a dovecot: director: Error: director: User bla2 host lookup failed: Timeout - queued for 30 secs (Ring synced for 30 secs) Sep 3 09:23:32 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/right disconnected Sep 3 09:23:32 a.director. a dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left Sep 3 09:23:32 a.director. a dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left Sep 3 09:23:32 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/out disconnected before handshake finished Sep 3 09:23:32 b.director. b dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left Sep 3 09:24:02 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/out disconnected before handshake finished Sep 3 09:24:02 b.director. b dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left Sep 3 09:24:05 a.director. a dovecot: director: Warning: Delaying new user requests until ring is synced Sep 3 09:24:32 a.director. a dovecot: director: Warning: Ring is synced, continuing delayed requests Sep 3 09:24:41 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/right disconnected Sep 3 09:24:41 b.director. b dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left Sep 3 09:24:41 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/right disconnected Sep 3 09:24:41 a.director. a dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left Sep 3 09:25:11 b.director. b dovecot: director: Error: User hash 2285697953 is being redirected to two hosts: 10.10.10.39 and 10.10.10.76 (old_ts=1346689481,handshaking,recv_ts=1346689467) Sep 3 09:25:12 b.director. b dovecot: director: Error: User hash 623192092 is being redirected to two hosts: 10.10.10.76 and 10.10.10.39 (old_ts=1346689481,handshaking,recv_ts=1346689468) Sep 3 09:25:12 b.director. b dovecot: director: Error: User hash 1683990717 is being redirected to two hosts: 10.10.10.43 and 10.10.10.76 (old_ts=1346689481,handshaking,recv_ts=1346689468) Sep 3 09:25:12 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/right disconnected Sep 3 09:25:12 a.director. a dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director lef
-- Kelsey Cummings - kgc@corp.sonic.net sonic.net, inc. System Architect 2260 Apollo Way 707.522.1000 Santa Rosa, CA 95407