LDAP: Connection appears to be hanging, reconnecting
Hello List
I have a strange problem here that I am trying to analyse, but I'm stuck. Maybe someone has a hint?
What happened: a few weeks ago, one of the LDAPS servers, which is not maintained by us, crashed. From that moment on, users could still log in to check their email, but they could not send any email through Postfix (which uses smtpd_sasl_type = dovecot).
What I do not understand is why Dovecot did not switch to the second configured LDAPS server. It looks like it kept retrying forever to reconnect to the crashed LDAP server.
Since the crash, we have seen a lot of errors like these in our log files:
Nov 30 16:51:53 servername dovecot: [ID 583609 mail.error] auth: Error: ldap(userone,USERS_IP1,<WKiTeBUJQQBUSAnE>): Connection appears to be hanging, reconnecting
AND
Nov 30 16:51:59 servername dovecot: [ID 583609 mail.error] auth: Error: plain(usertwo,USERS_IP2,<QgJvcBUJqABTTWrJ>): Request 1982.83548 timed out after 151 secs, state=1
We run Dovecot 2.2.13 on a Solaris 10 system. The configuration for passdb and userdb is:
passdb {
  args = /etc/dovecot-ldap.conf
  default_fields =
  deny = no
  driver = ldap
  master = no
  name =
  override_fields =
  pass = no
  result_failure = continue
  result_internalfail = continue
  result_success = return-ok
  skip = never
}
userdb {
  args = /etc/dovecot-ldap.conf
  default_fields =
  driver = ldap
  name =
  override_fields =
  result_failure = continue
  result_internalfail = continue
  result_success = return-ok
  skip = never
}
And the dovecot-ldap.conf contains (obfuscated):
uris = ldaps://server2.tld ldaps://server1.tld ldaps://server4.tld ldaps://server3.tld
dn = ...
dnpass = ...
ldap_version = 3
auth_bind = yes
base = ...
scope = onelevel
user_attrs = homeDirectory=home,uidNumber=uid,gidNumber=gid
user_filter = ...
pass_attrs = uid=user
pass_filter = ...
The strange thing is that with the very same binaries and configuration (okay, with some minimal modifications to bind to the correct interfaces...), a test on our test system works as it should.
When we shut down slapd, Dovecot recognizes it and connects to the alternate LDAPS server. When we shut down slapd and start a netcat (just to have something listening without responding)... you guessed it: Dovecot recognizes that too and switches over to the alternate test system.
So on our test system everything worked as it should, but the production system did not. And since the LDAPS servers are not maintained by us, it is somewhat hard to reproduce anything.
At least I got the log files from server2.tld and server1.tld. But they only show what I already knew: our server was connected to server2.tld until the crash happened, and server1.tld never received any connection.
Does anyone have an idea what I could try in order to find out why Dovecot did not switch to server1.tld?
Best regards Matthias Egger
Matthias Egger
ETH Zurich, Department of Information Technology and Electrical Engineering
IT Support Group (ISG.EE), ETL/F/24.1
Physikstrasse 3, CH-8092 Zurich
maegger@ee.ethz.ch
Phone +41 (0)44 632 03 90
Fax +41 (0)44 632 11 95
On 16/12/14 16:30, Matthias Egger wrote:
> What happened: a few weeks ago, one of the LDAPS servers, which is not maintained by us, crashed. From that moment on, users could still log in to check their email, but they could not send any email through Postfix (which uses smtpd_sasl_type = dovecot).
> What I do not understand is why Dovecot did not switch to the second configured LDAPS server. It looks like it kept retrying forever to reconnect to the crashed LDAP server.
This is speculation, but what has happened to us in the past is that the LDAP server stopped responding to queries, but the TCP socket was still open for connections. A new TCP connection would be established, but the daemon would not be notified of it.
So, depending on precisely how the first LDAP server crashed, it may not be the same test as killing the process, but closer to sending it 'kill -STOP' (and then 'kill -CONT' afterwards, obviously)
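The difference between a killed slapd and a stopped one can be reproduced without an LDAP server at all. The following is a minimal Python 3 sketch under stated assumptions: a loopback listener that never calls accept() stands in for the stopped daemon, and the function name probe_hung_listener is made up for illustration. The kernel still completes the TCP handshake, so connect() succeeds, but no reply ever arrives.

```python
import socket


def probe_hung_listener(timeout=1.0):
    """Connect to a listener that never accept()s, like a stopped daemon.

    Returns (connected, answered). Hypothetical helper for illustration only.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(5)               # kernel queues connections; nobody accepts
    port = listener.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(timeout)
    try:
        client.connect(("127.0.0.1", port))  # completed by the kernel alone
        connected = True
        client.sendall(b"ping")              # buffered, never read by anyone
        try:
            client.recv(1)
            answered = True
        except socket.timeout:
            answered = False                 # the "hanging" case
    finally:
        client.close()
        listener.close()
    return connected, answered


if __name__ == "__main__":
    print(probe_hung_listener())
```

On a typical system this should report a successful connect with no answer. A client that only reacts to connection failures would therefore see no reason to move on to the next server.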
Simon.
Hello Simon
On 12/16/2014 05:38 PM, Simon Fraser wrote:
> This is speculation, but what has happened to us in the past is that the LDAP server stopped responding to queries, but the TCP socket was still open for connections. A new TCP connection would be established, but the daemon would not be notified of it.
> So, depending on precisely how the first LDAP server crashed, it may not be the same test as killing the process, but closer to sending it 'kill -STOP' (and then 'kill -CONT' afterwards, obviously)
Thank you very much for that hint. You were right: when I send SIGSTOP to slapd, Dovecot shows behaviour similar to what we saw a few weeks ago.
So do you (or anyone else) have a hint on how I could work around such a situation?
I found a statement from Timo Sirainen from June 2011:
http://www.dovecot.org/pipermail/dovecot/2011-June/059905.html
"...Fallbacking to another LDAP server is done by OpenLDAP internally..."
So I thought there should be a way to "tweak" ldap.conf.
I then found a German post:
https://listen.jpberlin.de/pipermail/dovecot/2014-June/000506.html
where someone mentioned some ldap.conf settings:
BIND_POLICY soft
TIMELIMIT 5
NETWORK_TIMEOUT 5
TIMEOUT 8
and a link to:
http://www.linuxquestions.org/questions/linux-enterprise-47/ldap-failover-ti...
which also uses these two settings:
BIND_TIMELIMIT 10
IDLE_TIMELIMIT 10
I gave them a try, but the result was still the same: Dovecot, or rather OpenLDAP, does not switch to another LDAP server.
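One conceivable way to at least detect this state from the outside is a health check that requires more than a successful TCP connect. The following Python 3 sketch is an assumption-laden illustration, not something Dovecot or OpenLDAP provides: the names tls_handshake_ok and first_responsive are hypothetical, port 636 is assumed as the LDAPS port, and a server whose certificate fails verification is treated as unresponsive. A stopped slapd accepts the connection at kernel level but never answers the TLS ClientHello, so the handshake runs into the timeout.

```python
import socket
import ssl


def tls_handshake_ok(host, port=636, timeout=5.0):
    """Return True only if TCP connect AND TLS handshake finish within timeout."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            # The handshake happens inside wrap_socket and inherits the timeout.
            with context.wrap_socket(raw, server_hostname=host):
                return True
    except OSError:
        # Covers timeouts, refused connections and ssl.SSLError
        # (including certificate verification failures).
        return False


def first_responsive(hosts, probe=tls_handshake_ok):
    """Return the first host for which probe(host) is true, else None."""
    for host in hosts:
        if probe(host):
            return host
    return None


if __name__ == "__main__":
    # Host names taken from the obfuscated uris line in the configuration.
    print(first_responsive(["server2.tld", "server1.tld"]))
```

What to do with the result (alerting, or reordering the uris line and reloading Dovecot) is left open here; the sketch only shows how a hanging server can be told apart from a healthy one.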
Best regards Matthias