Hello List
I have a strange problem here which I am trying to analyse, but I'm stuck. Maybe someone has a hint?
What happened: a few weeks ago one of the LDAPS servers, which are not maintained by us, crashed. From that moment on, users could still log in to check their email, but they were not able to send any email through postfix (which uses smtpd_sasl_type = dovecot).
What I do not understand is why dovecot did not switch to the second configured LDAPS server. It looks like it kept retrying forever to reconnect to the crashed LDAP server.
From the moment of the crash we see a lot of errors like these in our logfiles:
Nov 30 16:51:53 servername dovecot: [ID 583609 mail.error] auth: Error: ldap(userone,USERS_IP1,<WKiTeBUJQQBUSAnE>): Connection appears to be hanging, reconnecting
AND
Nov 30 16:51:59 servername dovecot: [ID 583609 mail.error] auth: Error: plain(usertwo,USERS_IP2,<QgJvcBUJqABTTWrJ>): Request 1982.83548 timed out after 151 secs, state=1
The dovecot version in use is 2.2.13, running on a Solaris 10 system, and the configuration for passdb and userdb is:
passdb {
  args = /etc/dovecot-ldap.conf
  default_fields =
  deny = no
  driver = ldap
  master = no
  name =
  override_fields =
  pass = no
  result_failure = continue
  result_internalfail = continue
  result_success = return-ok
  skip = never
}
userdb {
  args = /etc/dovecot-ldap.conf
  default_fields =
  driver = ldap
  name =
  override_fields =
  result_failure = continue
  result_internalfail = continue
  result_success = return-ok
  skip = never
}
And the dovecot-ldap.conf contains (obfuscated):
uris = ldaps://server2.tld ldaps://server1.tld ldaps://server4.tld ldaps://server3.tld
dn = ...
dnpass = ...
ldap_version = 3
auth_bind = yes
base = ...
scope = onelevel
user_attrs = homeDirectory=home,uidNumber=uid,gidNumber=gid
user_filter = ...
pass_attrs = uid=user
pass_filter = ...
The strange thing is that with the very same binaries and configuration (okay, some minimal modifications have been made to bind to the correct interfaces...) a test on our test system works as it should.
When we shut down slapd, dovecot notices it and connects to the alternate LDAPS server. When we shut down slapd and start a netcat in its place (just to have something listening without responding)... you guessed it: dovecot notices that too and switches over to the alternate test system.
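For anyone who wants to repeat the "listening but not responding" test without netcat, here is a minimal sketch in Python. It is only an illustration, not what we actually ran: the function names `start_silent_listener` and `probe` are made up for this example, and it assumes that a plain TCP listener which accepts connections and then stays silent is close enough to a hung slapd. It does not speak TLS or LDAP at all.

```python
import socket
import threading


def start_silent_listener(host="127.0.0.1", port=0):
    """Accept TCP connections and never answer them.

    Returns (port, stop_event). Set stop_event to shut the listener down.
    Hypothetical helper, standing in for 'netcat -l' with no input.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))          # port=0 lets the OS pick a free port
    srv.listen(16)
    srv.settimeout(0.2)             # wake up regularly to check stop_event
    bound_port = srv.getsockname()[1]
    stop_event = threading.Event()

    def loop():
        held = []                   # keep connections open, never read or write
        while not stop_event.is_set():
            try:
                conn, _addr = srv.accept()
            except socket.timeout:
                continue
            except OSError:
                break
            held.append(conn)
        for conn in held:
            conn.close()
        srv.close()

    threading.Thread(target=loop, daemon=True).start()
    return bound_port, stop_event


def probe(host, port, timeout=1.0):
    """Classify a TCP endpoint as 'unreachable', 'silent', 'closed' or 'data'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.settimeout(timeout)
            try:
                data = s.recv(1)
            except socket.timeout:
                return "silent"     # connected, but nothing ever came back
            return "data" if data else "closed"
    except OSError:
        return "unreachable"        # refused, timed out, or no route
```

Pointing the test dovecot at the port returned by `start_silent_listener()` should reproduce the hanging-server case; `probe()` is just a quick way to confirm that the listener behaves as intended before involving dovecot.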
So on our test system everything worked as it should, but on the production system it did not. And since the LDAPS servers are not maintained by us, it is somewhat hard to try to reproduce anything.
At least I got the logfiles from server2.tld and server1.tld, but they only show what I already knew: our server connected to server2.tld until the crash happened, and server1.tld never got any connection.
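Since server1.tld never saw a connection, one thing that could be checked from the production host is whether each server in `uris` is reachable at the TCP level at all. The following is a rough sketch under stated assumptions: the helper names `parse_ldap_uris`, `tcp_reachable` and `check_all` are invented for this example, the default ports (389 for ldap, 636 for ldaps) are the standard ones, and a successful TCP connect says nothing about the TLS handshake or the LDAP bind.

```python
import socket
from urllib.parse import urlsplit

# Standard default ports; used when a URI does not name a port explicitly.
DEFAULT_PORTS = {"ldap": 389, "ldaps": 636}


def parse_ldap_uris(value):
    """Split a space-separated 'uris' value into (scheme, host, port) tuples.

    The order of the input is preserved.
    """
    targets = []
    for uri in value.split():
        parts = urlsplit(uri)
        port = parts.port or DEFAULT_PORTS[parts.scheme]
        targets.append((parts.scheme, parts.hostname, port))
    return targets


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_all(value, timeout=5.0):
    """Return (host, port, reachable) for every URI in the 'uris' value."""
    return [
        (host, port, tcp_reachable(host, port, timeout))
        for _scheme, host, port in parse_ldap_uris(value)
    ]
```

Calling `check_all("ldaps://server2.tld ldaps://server1.tld ldaps://server4.tld ldaps://server3.tld")` on the production machine would show whether a firewall or routing difference, rather than dovecot itself, keeps it away from server1.tld.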
Does anyone have an idea what I could try in order to find out why dovecot did not switch to server1.tld?
Best regards
Matthias Egger

Matthias Egger
ETH Zurich
Department of Information Technology      maegger@ee.ethz.ch
and Electrical Engineering
IT Support Group (ISG.EE), ETL/F/24.1     Phone +41 (0)44 632 03 90
Physikstrasse 3, CH-8092 Zurich           Fax   +41 (0)44 632 11 95