[Dovecot] LDAP authentication stops working...
Hi all,
I'm running dovecot-1.0.rc17 on OpenBSD 3.9, using userdb and passdb methods of "ldap" (SSL on 636/tcp) in addition to "passwd".
Occasionally (generally after a few hours of operation, but not always), LDAP-based logins stop working (e.g. hang/timeout after POP3 PASS command). Accounts with local passwords (as opposed to accounts with a password field of "x") still work fine at this point.
'tcpdump' shows the persisient 636/tcp connection still with bidirectional traffic (PUSH/ACKs and ACKs like normal).
After restarting Dovecot, things are fine until next time.
I guess I could try and get login_ldap working with bsdauth, but if there's an issue in the Dovecot LDAP support I'd like to see it fixed.
Any ideas?
Thanks,
Adrian Close
P.S. I note the LDAP timeout logging patch but it won't apply easily to rc17 and I'm hoping the issue has progressed since then anyway...
<quote who="Adrian Close"> > Hi all, > > I'm running dovecot-1.0.rc17 on OpenBSD 3.9, using userdb and passdb > methods of "ldap" (SSL on 636/tcp) in addition to "passwd". > > Occasionally (generally after a few hours of operation, but not always), > LDAP-based logins stop working (e.g. hang/timeout after POP3 PASS > command). Accounts with local passwords (as opposed to accounts with a > password field of "x") still work fine at this point.
We also get this. Twice a day we have to restart dovecot, using userdb and passdb via LDAP, with userdb_prefetch.
Also, if you shutdown dovecot and try and restart OpenLDAP, the dovecot ldap bind/thread is still open and OpenLDAP sits there waiting for it.
I've discussed this with the OpenLDAP dev team, and they suggested attaching gdb to the slapd pid and seeing what's going on.
Any ideas from the dovecot side?
'tcpdump' shows the persisient 636/tcp connection still with bidirectional traffic (PUSH/ACKs and ACKs like normal).
After restarting Dovecot, things are fine until next time.
Likewise.
I guess I could try and get login_ldap working with bsdauth, but if there's an issue in the Dovecot LDAP support I'd like to see it fixed.
I agree. We are the only ones using dovecot this way, all our other clients use nssldap and pamldap, and there are no probs with dovecot that way.
P.S. I note the LDAP timeout logging patch but it won't apply easily to rc17 and I'm hoping the issue has progressed since then anyway...
Oh, never tried that.
-- Kind Regards,
Gavin Henry. Managing Director.
T +44 (0) 1224 279484 M +44 (0) 7930 323266 F +44 (0) 1224 824887 E ghenry@suretecsystems.com
Open Source. Open Solutions(tm).
On Tue, 2007-01-09 at 09:54 +0000, Gavin Henry wrote:
<quote who="Adrian Close"> > Hi all, > > I'm running dovecot-1.0.rc17 on OpenBSD 3.9, using userdb and passdb > methods of "ldap" (SSL on 636/tcp) in addition to "passwd". > > Occasionally (generally after a few hours of operation, but not always), > LDAP-based logins stop working (e.g. hang/timeout after POP3 PASS > command). Accounts with local passwords (as opposed to accounts with a > password field of "x") still work fine at this point.
We also get this. Twice a day we have to restart dovecot, using userdb and passdb via LDAP, with userdb_prefetch.
Just to add: we moved from rc
The result handler for the initial ldap_search to find the dn to bind to is never called. I assume Timo fscked something up recently in my auth_bind code ;-P
Anyway, restarting ldap every hour or so with cron does the job :-|
I'd debug this if I had the time, but I won't have before next week.
Cheers, Marc
On Tue, 2007-01-09 at 09:54 +0000, Gavin Henry wrote:
<quote who="Adrian Close">
Hi all,
I'm running dovecot-1.0.rc17 on OpenBSD 3.9, using userdb and passdb methods of "ldap" (SSL on 636/tcp) in addition to "passwd".
Occasionally (generally after a few hours of operation, but not always), LDAP-based logins stop working (e.g. hang/timeout after POP3 PASS command). Accounts with local passwords (as opposed to accounts with a password field of "x") still work fine at this point.
We also get this. Twice a day we have to restart dovecot, using userdb and passdb via LDAP, with userdb_prefetch.
Just to add: we moved from rc
to rc15 recently, and we now also see a lot of hangs with ldap_authbind. The result handler for the initial ldap_search to find the dn to bind to is never called. I assume Timo fscked something up recently in my auth_bind code ;-P
Anyway, restarting ldap every hour or so with cron does the job :-|
I'd debug this if I had the time, but I won't have before next week.
Cheers, Marc
*RT IMSS Scanned* I get this problem as well, with dovecot running on a server running Fedora5 I first noticed this problem after a yum update that moved the server upto
On Wednesday 10 January 2007 15:07, J.M. Maurer wrote: the rpm dovecot-1.0.0.beta8.3.fc5.i386 The server original ran OK after it was originaly upgraded to Fedora5, which shiped with the rpm dovecot-1.0.0.beta2.7.i386
The LDAP server is openldap-2.0.27-8 running redhat 9.0
The minimum to fix the problem seems to be kill the dovecot auth processes.
-- J. David Rye http://www.roadrunner.uk.com http://www.roadtech.co.uk mailto://d.rye@roadtech.co.uk
This e-mail is confidential and may be legally privileged. It is intended solely for the use of the individual(s) to whom it is addressed. Any content in this message is not necessarily a view or statement from Road Tech Computer Systems Limited but is that of the individual sender. If you are not the intended recipient, be advised that you have received this e-mail in error and that any use, dissemination, forwarding, printing, or copying of this e-mail is strictly prohibited. We use reasonable endeavours to virus scan all e-mails leaving the company but no warranty is given that this e-mail and any attachments are virus free. You should undertake your own virus checking. The right to monitor e-mail communications through our networks is reserved by us
I get this problem as well, with dovecot running on a server running Fedora5 I first noticed this problem after a yum update that moved the server upto the rpm dovecot-1.0.0.beta8.3.fc5.i386 The server original ran OK after it was originaly upgraded to Fedora5, which shiped with the rpm dovecot-1.0.0.beta2.7.i386
The LDAP server is openldap-2.0.27-8 running redhat 9.0
The minimum to fix the problem seems to be kill the dovecot auth processes.
Well, since all of these versions are extremely outdated, the first thing to do is update dovecot - latest is rc17.
--
Best regards,
Charles
I get this problem as well, with dovecot running on a server running Fedora5 I first noticed this problem after a yum update that moved the server upto the rpm dovecot-1.0.0.beta8.3.fc5.i386 The server original ran OK after it was originaly upgraded to Fedora5, which shiped with the rpm dovecot-1.0.0.beta2.7.i386
The LDAP server is openldap-2.0.27-8 running redhat 9.0
The minimum to fix the problem seems to be kill the dovecot auth processes.
Well, since all of these versions are extremely outdated, the first thing to do is update dovecot - latest is rc17.
--
Best regards,
Charles
*RT IMSS Scanned*
Usualy I would be minded to agree, but this problem seems to be affecting versions between rc8 and rc17
Adrian Hall reported problem with rc17 J.M. Maurer with rc15 Me with the current Fedora 5 RPM beta 8
In this case I suspect it may be easier to work backwards until the problem goes away then see what was changed to cause the problem than try and debug an intermittent problem in the current code.
This e-mail is confidential and may be legally privileged. It is intended solely for the use of the individual(s) to whom it is addressed. Any content in this message is not necessarily a view or statement from Road Tech Computer Systems Limited but is that of the individual sender. If you are not the intended recipient, be advised that you have received this e-mail in error and that any use, dissemination, forwarding, printing, or copying of this e-mail is strictly prohibited. We use reasonable endeavours to virus scan all e-mails leaving the company but no warranty is given that this e-mail and any attachments are virus free. You should undertake your own virus checking. The right to monitor e-mail communications through our networks is reserved by us
I get this problem as well, with dovecot running on a server running Fedora5 I first noticed this problem after a yum update that moved the server upto the rpm dovecot-1.0.0.beta8.3.fc5.i386 The server original ran OK after it was originaly upgraded to Fedora5, which shiped with the rpm dovecot-1.0.0.beta2.7.i386
The LDAP server is openldap-2.0.27-8 running redhat 9.0
The minimum to fix the problem seems to be kill the dovecot auth processes.
Well, since all of these versions are extremely outdated, the first thing to do is update dovecot - latest is rc17.
--
Best regards,
Charles
*RT IMSS Scanned*
Usualy I would be minded to agree, but this problem seems to be affecting versions between rc8 and rc17
Adrian Hall reported problem with rc17 J.M. Maurer with rc15 Me with the current Fedora 5 RPM beta 8
In this case I suspect it may be easier to work backwards until the problem goes away then see what was changed to cause the problem than try and debug an intermittent problem in the current code.
This e-mail is confidential and may be legally privileged. It is intended solely for the use of the individual(s) to whom it is addressed. Any content in this message is not necessarily a view or statement from Road Tech Computer Systems Limited but is that of the individual sender. If you are not the intended recipient, be advised that you have received this e-mail in error and that any use, dissemination, forwarding, printing, or copying of this e-mail is strictly prohibited. We use reasonable endeavours to virus scan all e-mails leaving the company but no warranty is given that this e-mail and any attachments are virus free. You should undertake your own virus checking. The right to monitor e-mail communications through our networks is reserved by us
On Thu, 11 Jan 2007, J. David Rye of Roadtech wrote:
In this case I suspect it may be easier to work backwards until the problem goes away then see what was changed to cause the problem than try and debug an intermittent problem in the current code.
Plausible, I guess. Except that for me it takes a day to run a test (the problem mostly only happens when our NY users come online, while I'm asleep), so it'd be great to narrow it down a little. ;)
Thanks all for the replies, BTW - at least I know it's not just me!
Adrian.
On Tue, 2007-01-09 at 09:54 +0000, Gavin Henry wrote:
<quote who="Adrian Close">
Hi all,
I'm running dovecot-1.0.rc17 on OpenBSD 3.9, using userdb and passdb methods of "ldap" (SSL on 636/tcp) in addition to "passwd".
Occasionally (generally after a few hours of operation, but not always), LDAP-based logins stop working (e.g. hang/timeout after POP3 PASS command). Accounts with local passwords (as opposed to accounts with a password field of "x") still work fine at this point.
We also get this. Twice a day we have to restart dovecot, using userdb and passdb via LDAP, with userdb_prefetch.
Just to add: we moved from rc
to rc15 recently, and we now also see a lot of hangs with ldap_authbind. The result handler for the initial ldap_search to find the dn to bind to is never called. I assume Timo fscked something up recently in my auth_bind code ;-P
Anyway, restarting ldap every hour or so with cron does the job :-|
I'd debug this if I had the time, but I won't have before next week.
Cheers, Marc
*RT IMSS Scanned* I get this problem as well, with dovecot running on a server running Fedora5 I first noticed this problem after a yum update that moved the server upto
On Wednesday 10 January 2007 15:07, J.M. Maurer wrote: the rpm dovecot-1.0.0.beta8.3.fc5.i386 The server original ran OK after it was originaly upgraded to Fedora5, which shiped with the rpm dovecot-1.0.0.beta2.7.i386
The LDAP server is openldap-2.0.27-8 running redhat 9.0
The minimum to fix the problem seems to be kill the dovecot auth processes.
Its happened again.
Jan 11 08:16:21 viruswall-1 dovecot: auth(default): client in: AUTH 1
PLAIN service=POP3 lip=195.245.100.152 rip=172.
16.20.72 resp=<hidden>
Jan 11 08:16:21 viruswall-1 dovecot: auth(default): client out: CONT 1
Jan 11 08:16:21 viruswall-1 dovecot: auth(default): client in: CONT<hidden>
Jan 11 08:16:21 viruswall-1 dovecot: auth(default): client out: OK 1
user=mikey
Jan 11 08:16:21 viruswall-1 dovecot: auth(default): master in: REQUEST
2354 1718 1
Jan 11 08:16:21 viruswall-1 dovecot: auth(default): master out: USER
2354 mikey system_user=mikey uid=1011 gid=513
home=/home/mikey
Jan 11 08:16:21 viruswall-1 dovecot: pop3-login: Login: user=<mikey>,
method=PLAIN, rip=172.16.20.72, lip=195.245.100.152
Jan 11 08:16:21 viruswall-1 dovecot: POP3(mikey): Disconnected: Logged out
top=0/0, retr=0/0, del=0/772, size=8769550
Jan 11 08:16:40 viruswall-1 dovecot: pop3-login: Disconnected:
rip=172.16.20.108, lip=195.245.100.152
Jan 11 08:16:54 viruswall-1 dovecot: auth(default): client in: AUTH 1
PLAIN service=POP3 lip=195.245.100.152 rip=172.
16.20.31 resp=<hidden>
Jan 11 08:16:54 viruswall-1 dovecot: auth(default): client in: AUTH 1
PLAIN service=POP3 lip=195.245.100.152 rip=172.
16.24.161 resp=<hidden>
Jan 11 08:16:54 viruswall-1 dovecot: auth(default): client out: CONT 1
Jan 11 08:16:54 viruswall-1 dovecot: auth(default): client in: CONT<hidden>
Jan 11 08:16:54 viruswall-1 dovecot: auth(default): client out: OK 1
user=wrosen
Jan 11 08:16:54 viruswall-1 dovecot: auth(default): master in: REQUEST
2355 1867 1
Jan 11 08:17:47 viruswall-1 dovecot: pop3-login: Disconnected:
rip=172.16.20.108, lip=195.245.100.152
Jan 11 08:17:54 viruswall-1 dovecot: pop3-login: Disconnected: Inactivity:
method=PLAIN, rip=172.16.24.161, lip=195.245.100.152
Summary :- The login from 172.16.20.72 @ Jan 11 08:16:21 was successfull
Problem appears to start with the near concurrent logins at Jan 11 08:16:54 All of the LDAP lookup for the first of these run over the same tcp connection as the preceding calls to auth master,
dovecot-auth seems run mostly with the same tcp tcp socket for anything from 6 minutes to an hour then close the socket and open a new one. Very occasionally dovecot-auth exits logging the event BROKEN NSS IMPLEMENTATION It is then immediately restarted by dovecot.
Dovecot is configured to use pam for passdb and userdb with pam inturn configured to use LDAP.
At the time it hung there were two tcp connections with some of the queries for the second of the overlapping logins runs over a second tcp connection to the LDAP server.
ps/lsof show 3 dovecot-auth processes.
1 zombie 1 the parent with a single tcp connection to the LDAP server 1 a child with 2 connections to the LDAP server
This e-mail is confidential and may be legally privileged. It is intended solely for the use of the individual(s) to whom it is addressed. Any content in this message is not necessarily a view or statement from Road Tech Computer Systems Limited but is that of the individual sender. If you are not the intended recipient, be advised that you have received this e-mail in error and that any use, dissemination, forwarding, printing, or copying of this e-mail is strictly prohibited. We use reasonable endeavours to virus scan all e-mails leaving the company but no warranty is given that this e-mail and any attachments are virus free. You should undertake your own virus checking. The right to monitor e-mail communications through our networks is reserved by us
On Thu, 2007-01-11 at 16:20 +0000, J. David Rye of Roadtech wrote:
dovecot-auth seems run mostly with the same tcp tcp socket for anything from 6 minutes to an hour then close the socket and open a new one. Very occasionally dovecot-auth exits logging the event BROKEN NSS IMPLEMENTATION It is then immediately restarted by dovecot.
Dovecot is configured to use pam for passdb and userdb with pam inturn configured to use LDAP.
So, you're not really using LDAP with Dovecot, you're using PAM and NSS. Most likely the reason for your problems is nss_ldap, and as you can see Dovecot even once in a while notices that it's broken. Try switching at least your userdb passwd to userdb ldap.
On Tue, 2007-01-09 at 13:59 +1100, Adrian Close wrote:
Occasionally (generally after a few hours of operation, but not always), LDAP-based logins stop working (e.g. hang/timeout after POP3 PASS command). Accounts with local passwords (as opposed to accounts with a password field of "x") still work fine at this point.
Are you using auth_bind=yes? That's been a bit broken. I finally today did several fixes to it so that it won't hang at least with my stress tests. The code is now in CVS and I built a new nightly snapshot if you want to try before rc18 release.
Looks like ldap_bind() can't really be used while anything else is going on. Now the code waits until all the existing LDAP searches are finished, then it does the bind, and then nothing is done until the bind is finished.
On Fri, 19 Jan 2007, Timo Sirainen wrote:
Are you using auth_bind=yes? That's been a bit broken. I finally today did several fixes to it so that it won't hang at least with my stress tests. The code is now in CVS and I built a new nightly snapshot if you want to try before rc18 release.
Ah. Yes, I went for auth_bind=yes because I prefer not to give privilleged access to anything where I can avoid it.
For the moment (so I could get some sleep) I'm now using "bsdauth" with login_ldap on OpenBSD, which doesn't fail but does put a lot more load on the LDAP infrastructure.
I'll try out rc18 when it's ready.
Thanks,
Adrian.
Are you using auth_bind=yes? That's been a bit broken. I finally today did several fixes to it so that it won't hang at least with my stress tests. The code is now in CVS and I built a new nightly snapshot if you want to try before rc18 release.
Looks like ldap_bind() can't really be used while anything else is going on. Now the code waits until all the existing LDAP searches are finished, then it does the bind, and then nothing is done until the bind is finished.
This fix seems to make it behave a lot better!
I think I can still trigger this issue though, by opening Outlook 2003 (sorry :X), and holding down F9 for a bit (send/receive shortcut). When the send/receive query contains about 5+ actions, outlook stops responding.
But still a LOT better then before. At least when the problem occurs, other users can still login, which wasn't the case before.
Cheers, Marc
On Mon, 2007-01-22 at 14:45 +0100, J.M. Maurer wrote:
Are you using auth_bind=yes? That's been a bit broken. I finally today did several fixes to it so that it won't hang at least with my stress tests. The code is now in CVS and I built a new nightly snapshot if you want to try before rc18 release.
Looks like ldap_bind() can't really be used while anything else is going on. Now the code waits until all the existing LDAP searches are finished, then it does the bind, and then nothing is done until the bind is finished.
This fix seems to make it behave a lot better!
I think I can still trigger this issue though, by opening Outlook 2003 (sorry :X), and holding down F9 for a bit (send/receive shortcut). When the send/receive query contains about 5+ actions, outlook stops responding.
But still a LOT better then before. At least when the problem occurs, other users can still login, which wasn't the case before.
Could it be because of Outlook? ;) I tested it with 100 simultaneous virtual clients doing login+logout all the time and didn't notice any hangs. You could set auth_debug=yes and look at the logs what happens when the F9 is being hold. Does it log an ldap_bind() that just doesn't seem to get a reply?
Hmm. Actually now that I think of it, the ldap_bind() must give a reply before the connection is used for anything else. So if the other users can still log in, I think there's a good chance that the bug isn't in Dovecot's side.
Hi Timo,
I think I can still trigger this issue though, by opening Outlook 2003 (sorry :X), and holding down F9 for a bit (send/receive shortcut). When the send/receive query contains about 5+ actions, outlook stops responding.
But still a LOT better then before. At least when the problem occurs, other users can still login, which wasn't the case before.
Could it be because of Outlook? ;)
agreed++ :)
I tested it with 100 simultaneous virtual clients doing login+logout all the time and didn't notice any hangs. You could set auth_debug=yes and look at the logs what happens when the F9 is being hold. Does it log an ldap_bind() that just doesn't seem to get a reply?
Actually, it binds just 1 time, which succeeds just fine. If you now press F9 900 times, it will work just fine. It sends/reveives over one connection, no new bind()'s, no hangs.
Now if you also start browsing some folders (for which Outlook opens a new connection) while the 900 send/receive's are being handled, it is quite easy to make dovecot stop responding.
Hmm. Actually now that I think of it, the ldap_bind() must give a reply before the connection is used for anything else. So if the other users can still log in, I think there's a good chance that the bug isn't in Dovecot's side.
I see no indication of a ldap_bind or ldap_search (dn or user search) _not_ returning. All ldap code seems fine to me atm (yay).
I've attached the rawlogs from a co-working of mine, who just triggered it. I hope this is the relevant portion of the log.
Cheers, Marc
participants (7)
-
Adrian Close
-
Charles Marcus
-
Gavin Henry
-
J. David Rye of Roadtech
-
J. David Rye of Roadtech
-
J.M. Maurer
-
Timo Sirainen