LDAP: Connection appears to be hanging, reconnecting

Tue Dec 16 16:30:53 UTC 2014

Hello List

I have a strange problem here that I am trying to analyse, but I'm
stuck. Maybe someone has a hint?

What happened:
A few weeks ago one of the LDAPS servers, which is not maintained by us,
crashed. From that moment on, users could still log in to check their
email, but they were not able to send any email through Postfix (which
uses smtpd_sasl_type = dovecot).

What I do not understand is why Dovecot did not switch to the second
configured LDAPS server. It looks like it kept retrying forever to
reconnect to the crashed LDAP server.

Since the moment of the crash we have seen a lot of errors like these in
our log files:

Nov 30 16:51:53 servername dovecot: [ID 583609 mail.error] auth: Error:
ldap(userone,USERS_IP1,<WKiTeBUJQQBUSAnE>): Connection appears to be
hanging, reconnecting

and

Nov 30 16:51:59 servername dovecot: [ID 583609 mail.error] auth: Error:
plain(usertwo,USERS_IP2,<QgJvcBUJqABTTWrJ>): Request 1982.83548 timed
out after 151 secs, state=1
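
To get a feeling for how often each of the two messages occurs, the log
can be tallied with a small script. The following is only a sketch: the
regular expressions are written against the two sample lines above (with
the mail-client line wrapping undone), and the names SAMPLE_LOG, HANGING,
TIMED_OUT and summarize are made up for this illustration.

```python
import re
from collections import Counter

# The two sample lines from above, joined back into single log lines.
SAMPLE_LOG = [
    "Nov 30 16:51:53 servername dovecot: [ID 583609 mail.error] auth: Error: "
    "ldap(userone,USERS_IP1,<WKiTeBUJQQBUSAnE>): Connection appears to be "
    "hanging, reconnecting",
    "Nov 30 16:51:59 servername dovecot: [ID 583609 mail.error] auth: Error: "
    "plain(usertwo,USERS_IP2,<QgJvcBUJqABTTWrJ>): Request 1982.83548 timed "
    "out after 151 secs, state=1",
]

# Assumed patterns, derived only from the sample lines above.
HANGING = re.compile(
    r"auth: Error: ldap\(([^,]*),.*Connection appears to be hanging"
)
TIMED_OUT = re.compile(
    r"auth: Error: \w+\(([^,]*),.*timed out after (\d+) secs"
)


def summarize(lines):
    """Count both error types and remember the longest reported timeout."""
    counts = Counter()
    longest_wait = 0
    for line in lines:
        if HANGING.search(line):
            counts["hanging"] += 1
        match = TIMED_OUT.search(line)
        if match:
            counts["timed_out"] += 1
            longest_wait = max(longest_wait, int(match.group(2)))
    return counts, longest_wait
```

For real use, the list would be replaced by the lines of the actual log
file, for example by passing an open file object to summarize().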

The Dovecot version in use is 2.2.13, running on a Solaris 10 system, and
the configuration for passdb and userdb is:

passdb {
  args = /etc/dovecot-ldap.conf
  default_fields =
  deny = no
  driver = ldap
  master = no
  name =
  override_fields =
  pass = no
  result_failure = continue
  result_internalfail = continue
  result_success = return-ok
  skip = never
}

userdb {
  args = /etc/dovecot-ldap.conf
  default_fields =
  driver = ldap
  name =
  override_fields =
  result_failure = continue
  result_internalfail = continue
  result_success = return-ok
  skip = never
}

And dovecot-ldap.conf contains (obfuscated):

uris             = ldaps://server2.tld ldaps://server1.tld ldaps://server4.tld ldaps://server3.tld
dn               = ...
dnpass           = ...
ldap_version     = 3
auth_bind        = yes
base             = ...
scope            = onelevel
user_attrs       = homeDirectory=home,uidNumber=uid,gidNumber=gid
user_filter      = ...
pass_attrs       = uid=user
pass_filter      = ...

The strange thing is that with the very same binaries and configuration
(okay, some minimal modifications were made to bind to the correct
interfaces...) a test on our test system works as it should.

When we shut down slapd, Dovecot recognizes it and connects to the
alternate LDAPS server. When we shut down slapd and start a netcat (just
to have something listening without responding)... you guessed it:
Dovecot recognizes that as well and switches over to the alternate test
system.
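
The netcat test can also be written as a small script, which makes it
easier to repeat. This is a minimal sketch and not what was actually run
on the test system: it accepts TCP connections and never sends a byte
back. The function names are hypothetical, and port=0 (any free port) is
only a default for trying it out; to stand in for slapd it would have to
bind the LDAPS port instead.

```python
import socket
import threading


def start_silent_listener(host="127.0.0.1", port=0):
    """Accept TCP connections and never answer them.

    Returns (server_socket, bound_port). port=0 lets the OS pick a free
    port; a real test would bind the port slapd normally listens on.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(5)
    held = []  # keep accepted sockets referenced so they stay open

    def accept_loop():
        while True:
            try:
                conn, _addr = server.accept()
            except OSError:
                return  # the server socket was closed
            held.append(conn)

    threading.Thread(target=accept_loop, daemon=True).start()
    return server, server.getsockname()[1]


def client_sees_silence(host, port, timeout=1.0):
    """Return True if the peer accepts the connection but sends nothing."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        try:
            sock.recv(1)
        except socket.timeout:
            return True
    return False
```

client_sees_silence() is only there to confirm that the listener behaves
like the silent netcat: the connection is accepted, but a read runs into
the timeout.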

So on our test system everything worked as it should, but the production
system did not. And since the LDAPS servers are not maintained by us, it
is somewhat hard to try to reproduce anything.

At least I got the log files from server2.tld and server1.tld, but they
only show what I already knew: our server connected to server2.tld until
the crash happened, but server1.tld never got any connection.

Does anyone have an idea what I could try in order to find out why
Dovecot did not switch to server1.tld?
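
One thing that could be checked from the production mail host is how each
server in the uris list behaves at the TCP and TLS level, independent of
Dovecot. The script below is a rough sketch under several assumptions: it
uses the default ports 636 for ldaps and 389 for ldap, it disables
certificate verification because it only tests whether the server
answers, and the names parse_uris and probe are invented for this
example. It does not reproduce how Dovecot or the LDAP library selects a
server.

```python
import socket
import ssl
from urllib.parse import urlparse


def parse_uris(value):
    """Split a space-separated URI list into (host, port) pairs, in order."""
    targets = []
    for uri in value.split():
        parsed = urlparse(uri)
        default_port = 636 if parsed.scheme == "ldaps" else 389
        targets.append((parsed.hostname, parsed.port or default_port))
    return targets


def probe(host, port, timeout=5.0):
    """Classify one server as ok, unreachable, or hanging."""
    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "hanging (TCP connect timed out)"
    except OSError as exc:
        return f"unreachable ({exc})"
    try:
        # Verification is switched off: this checks liveness, not trust.
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        with ctx.wrap_socket(raw, server_hostname=host):
            return "ok (TLS handshake completed)"
    except socket.timeout:
        return "hanging (TCP accepted, TLS handshake timed out)"
    except (ssl.SSLError, OSError) as exc:
        return f"TLS failed ({exc})"
    finally:
        raw.close()
```

Calling probe() for each pair returned by parse_uris() on the value of
the uris setting would show whether a server refuses connections, which
is what a stopped slapd looks like, or accepts them and then stays
silent, which is what the netcat test imitates.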

Best regards
Matthias Egger
-- 
Matthias Egger
ETH Zurich
Department of Information Technology          maegger at ee.ethz.ch
and Electrical Engineering
IT Support Group (ISG.EE), ETL/F/24.1         Phone +41 (0)44 632 03 90
Physikstrasse 3, CH-8092 Zurich               Fax   +41 (0)44 632 11 95
