Hello,
as mentioned before, we are migrating our mailboxes from a 0.99 cluster to a 1.0.0 one. With 0.99, dovecot-auth (with LDAP as backend) was leaking quite happily, and the dovecot-auth processes frequently hit their size limit and thus were killed and restarted. In 0.99, at least, this led to authentication failures on a busy server, as the dovecot master process simply killed off the auth process without first disabling new connections to it and letting any in-flight authentications finish.
The activity per cluster node on the 0.99 boxes was about 700k logins (mostly pop3) per day and 5 dovecot-auth processes with a 64MB limit each.
The migrated users so far are of the less active kind and on one node of the 1.0.0 cluster we see about 70k logins/day and currently the dovecot-auth process there is hovering at 123/84MB (VIRT/RES in top). It was at 121/82MB when I checked 4 hours ago and no new users have been added to that node for 2 days. Authentication backend is still LDAP, no caching, 256MB process size limit. There are just 16k users on that node at this point in time, with only 1200 distinct users generating those 70k logins/day.
We will start migrating the rest of the users tomorrow and now I'm wondering:
- How and why would the memory footprint of dovecot-auth grow when there is no change in the amount of users in the DB?
- What will happen when the single dovecot-auth process reaches 256MB in the end? Internal housekeeping attempts of that process? A whack to the head from the master process like in 0.99 and thus more erroneous authentication failures, potentially aggravated by the fact that there is just a single dovecot-auth process?
I guess at the current rate of leakage we shall know the answers to #2 soon one way or another. ;)
Regards,
Christian
Christian Balzer            Network/Systems Engineer
NOC                         chibi@gol.com
Global OnLine Japan/Fusion Network Services
http://www.gol.com/
On Tue, 2007-06-19 at 13:40 +0900, Christian Balzer wrote:
Hello,
as mentioned before, we are migrating our mailboxes from a 0.99 cluster to a 1.0.0 one. With 0.99, dovecot-auth (with LDAP as backend) was leaking quite happily, and the dovecot-auth processes frequently hit their size limit and thus were killed and restarted. In 0.99, at least, this led to authentication failures on a busy server, as the dovecot master process simply killed off the auth process without first disabling new connections to it and letting any in-flight authentications finish.
It's actually your kernel that kills the process.
- How and why would the memory footprint of dovecot-auth grow when there is no change in the amount of users in the DB?
The only thing that's changing the size of dovecot-auth is how many requests it's simultaneously handling. For example if you try to login with invalid user/pass 1000 times within a second, dovecot-auth keeps those 1000 requests in memory for a couple of seconds until it returns with failure. But this happens also with normal requests, just not for so long.
You could try http://dovecot.org/patches/debug/mempool-accounting.diff and send USR1 signal to dovecot-auth after a while. It logs how much memory is used by all existing memory pools. Each auth request has its own pool, so if it's really leaking them it's probably logging a lot of lines. If not, then the leak is elsewhere.
- What will happen when the single dovecot-auth process reaches 256MB in the end? Internal housekeeping attempts of that process? A whack to the head from the master process like in 0.99 and thus more erroneous authentication failures, potentially aggravated by the fact that there is just single dovecot-auth process?
The same as 0.99. You could also kill -HUP dovecot when dovecot-auth is nearing the limit. That makes it a bit nicer, although not perfectly safe either (should fix this some day..).
On Tue, 19 Jun 2007 14:39:59 +0300 Timo Sirainen <tss@iki.fi> wrote:
On Tue, 2007-06-19 at 13:40 +0900, Christian Balzer wrote:
- How and why would the memory footprint of dovecot-auth grow when there is no change in the amount of users in the DB?
The only thing that's changing the size of dovecot-auth is how many requests it's simultaneously handling. For example if you try to login with invalid user/pass 1000 times within a second, dovecot-auth keeps those 1000 requests in memory for a couple of seconds until it returns with failure. But this happens also with normal requests, just not for so long.
We see nowhere near that kind of concurrency, and memory should be returned after a while, but that is clearly not happening. dovecot-auth is now at 200/160MB and thus likely to blow up over the weekend, I guess.
You could try http://dovecot.org/patches/debug/mempool-accounting.diff and send USR1 signal to dovecot-auth after a while. It logs how much memory is used by all existing memory pools. Each auth request has its own pool, so if it's really leaking them it's probably logging a lot of lines. If not, then the leak is elsewhere.
I grabbed the Debian package source on a test machine (not gonna chance anything on the production servers), applied the patch, added --enable-debug to the debian/rules file (and got the #define DEBUG in config.h), created the binary packages, installed, configured and started them, tested a few logins and... nothing gets logged in mail.* when I send a USR1 to dovecot-auth. Anything I'm missing?
But no matter, it is clearly leaking just as badly as 0.99, and I venture that this is the largest installation with LDAP as the authentication backend. I wonder if this leak would be avoided by having LDAP lookups performed by worker processes, as with SQL.
- What will happen when the single dovecot-auth process reaches 256MB in the end? Internal housekeeping attempts of that process? A whack to the head from the master process like in 0.99 and thus more erroneous authentication failures, potentially aggravated by the fact that there is just single dovecot-auth process?
The same as 0.99. You could also kill -HUP dovecot when dovecot-auth is nearing the limit. That makes it a bit nicer, although not perfectly safe either (should fix this some day..).
If that leak can't be found I would very much appreciate a solution that at least avoids failed and/or delayed logins.
Regards,
Christian
On Thu, 2007-06-21 at 16:49 +0900, Christian Balzer wrote:
You could try http://dovecot.org/patches/debug/mempool-accounting.diff and send USR1 signal to dovecot-auth after a while. It logs how much memory is used by all existing memory pools. Each auth request has its own pool, so if it's really leaking them it's probably logging a lot of lines. If not, then the leak is elsewhere.
I grabbed the Debian package source on a test machine (not gonna chance anything on the production servers), applied the patch, added --enable-debug to the debian/rules file (and got the #define DEBUG in config.h), created the binary packages, installed, configured and started them, tested a few logins and... nothing gets logged in mail.* when I send a USR1 to dovecot-auth. Anything I'm missing?
Bug, fixed: http://hg.dovecot.org/dovecot-1.0/rev/a098e94cd318
But no matter, it is clearly leaking just as badly as 0.99, and I venture that this is the largest installation with LDAP as the authentication backend. I wonder if this leak would be avoided by having LDAP lookups performed by worker processes, as with SQL.
Then you'd only have multiple leaking worker processes.
The same as 0.99. You could also kill -HUP dovecot when dovecot-auth is nearing the limit. That makes it a bit nicer, although not perfectly safe either (should fix this some day..).
If that leak can't be found I would very much appreciate a solution that at least avoids failed and/or delayed logins.
That would require that login processes don't fail logins if connection to dovecot-auth drops, but instead wait until they can connect back to it and try again. And maybe another alternative would be to just disconnect the client instead of giving login failure.
On Wed, 27 Jun 2007 23:15:32 +0300 Timo Sirainen <tss@iki.fi> wrote:
On Thu, 2007-06-21 at 16:49 +0900, Christian Balzer wrote:
You could try http://dovecot.org/patches/debug/mempool-accounting.diff and send USR1 signal to dovecot-auth after a while. It logs how much memory is used by all existing memory pools. Each auth request has its own pool, so if it's really leaking them it's probably logging a lot of lines. If not, then the leak is elsewhere.
I grabbed the Debian package source on a test machine (not gonna chance anything on the production servers), applied the patch, added --enable-debug to the debian/rules file (and got the #define DEBUG in config.h), created the binary packages, installed, configured and started them, tested a few logins and... nothing gets logged in mail.* when I send a USR1 to dovecot-auth. Anything I'm missing?
Bug, fixed: http://hg.dovecot.org/dovecot-1.0/rev/a098e94cd318
Thanks, that fixed the silence of the auth-sheep.
This is the output after start-up:
Jul 2 13:59:54 engtest03 dovecot: auth(default): pool auth request handler: 104 / 4080 bytes
Jul 2 13:59:54 engtest03 last message repeated 19 times
Jul 2 13:59:54 engtest03 dovecot: auth(default): pool passwd_file: 56 / 10224 bytes
Jul 2 13:59:54 engtest03 dovecot: auth(default): pool Environment: 224 / 2032 bytes
Jul 2 13:59:54 engtest03 dovecot: auth(default): pool ldap_connection: 576 / 1008 bytes
Jul 2 13:59:54 engtest03 dovecot: auth(default): pool auth: 1520 / 2032 bytes
Used memory of dovecot-auth after 1 login was 3148KB(RSS).
This is after a good thrashing with rabid (from the postal package), with just 2 users though, using POP3 logins:
Jul 2 14:12:30 engtest03 dovecot: auth(default): pool auth request handler: 104 / 4080 bytes
Jul 2 14:12:30 engtest03 last message repeated 128 times
Jul 2 14:12:30 engtest03 dovecot: auth(default): pool passwd_file: 56 / 10224 bytes
Jul 2 14:12:30 engtest03 dovecot: auth(default): pool Environment: 224 / 2032 bytes
Jul 2 14:12:30 engtest03 dovecot: auth(default): pool ldap_connection: 576 / 1008 bytes
Jul 2 14:12:30 engtest03 dovecot: auth(default): pool auth: 1520 / 2032 bytes
Note that the number of auth request handler pools has grown to 128. After another short round of rabid, the handler pools grew to 137 and the size of dovecot-auth to 5100KB. The number of handler pools never fell, nor did the memory footprint, obviously. :-p
At about 800k logins/day/node here, it's obvious now why dovecot-auth explodes in less than a week with a max size of 512MB.
But no matter, it is clearly leaking just as badly as 0.99, and I venture that this is the largest installation with LDAP as the authentication backend. I wonder if this leak would be avoided by having LDAP lookups performed by worker processes, as with SQL.
Then you'd only have multiple leaking worker processes.
Yes, I realize that. But I presume that since these are designed to die off and be recreated on the fly, the repercussions would be much less severe. ;) Of course now it looks like this is not LDAP related after all.
The same as 0.99. You could also kill -HUP dovecot when dovecot-auth is nearing the limit. That makes it a bit nicer, although not perfectly safe either (should fix this some day..).
If that leak can't be found I would very much appreciate a solution that at least avoids failed and/or delayed logins.
That would require that login processes don't fail logins if connection to dovecot-auth drops, but instead wait until they can connect back to it and try again. And maybe another alternative would be to just disconnect the client instead of giving login failure.
Anything that fixes this one way or the other would be nice. ^_^
Oh, and HUP'ing the master is not an option here; I guess the system load triggers a race condition in dovecot, because several times when doing so I got this:
Jun 22 15:08:58 mb11 dovecot: listen(143) failed: Interrupted system call
Which results in a killed off dovecot, including all active sessions.
The self-terminating dovecot-auth is not nice, but at least it is more predictable and recovers by itself:
Jun 30 19:03:27 mb12 dovecot: auth(default): pool_system_malloc(): Out of memory
Jun 30 19:03:27 mb12 dovecot: child 11110 (auth) returned error 83 (Out of memory)
Jun 30 19:03:28 mb12 dovecot: pop3-login: Can't connect to auth server at default: Resource temporarily unavailable
Jun 30 19:03:28 mb12 last message repeated 11 times
Of course the 12 users that tried to log in at this time are probably not amused or at least confused.
Regards,
Christian
On Mon, 2 Jul 2007 15:20:14 +0900 Christian Balzer <chibi@gol.com> wrote:
Of course now it looks like this is not LDAP related after all.
Well, lemme rephrase that. It looks like the LDAP connection pool is not the culprit.
I just re-ran the test with local users (shadow as passdb and passwd as userdb), and while the dovecot-auth process does NOT grow, the pool output is identical to the LDAP runs after the same time:
Jul 2 15:52:15 engtest03 dovecot: auth(default): pool auth request handler: 104 / 4080 bytes
Jul 2 15:52:15 engtest03 last message repeated 137 times
Jul 2 15:52:15 engtest03 dovecot: auth(default): pool passwd_file: 56 / 10224 bytes
Jul 2 15:52:15 engtest03 dovecot: auth(default): pool Environment: 224 / 2032 bytes
Jul 2 15:52:15 engtest03 dovecot: auth(default): pool auth: 1432 / 2032 bytes
So I guess it's back to square one: the leak is not happening in the pools (or at least it's not visible here), but it clearly seems to happen (just?) with LDAP.
Regards,
Christian
On Mon, 2007-07-02 at 15:20 +0900, Christian Balzer wrote:
Jul 2 14:12:30 engtest03 dovecot: auth(default): pool auth request handler: 104 / 4080 bytes
Jul 2 14:12:30 engtest03 last message repeated 128 times
Auth request handler is created for each imap-login connection. So if you have 128 imap-login processes this isn't a leak.
Hmm. Does this help: http://hg.dovecot.org/dovecot-1.0/rev/50c79521e8f5
Oh, and HUP'ing the master is not an option here; I guess the system load triggers a race condition in dovecot, because several times when doing so I got this:
Jun 22 15:08:58 mb11 dovecot: listen(143) failed: Interrupted system call
Did you use killall? I think this happens only with it.
On Mon, 02 Jul 2007 17:37:05 +0300 Timo Sirainen <tss@iki.fi> wrote:
On Mon, 2007-07-02 at 15:20 +0900, Christian Balzer wrote:
Jul 2 14:12:30 engtest03 dovecot: auth(default): pool auth request handler: 104 / 4080 bytes
Jul 2 14:12:30 engtest03 last message repeated 128 times
Auth request handler is created for each imap-login connection. So if you have 128 imap-login processes this isn't a leak.
At that point in time only POP3 had been tried, since it is by far the most used protocol here and rabid defaults to it anyway. But there were indeed plenty of pop3-login processes, enough to make up that number combined with the IMAP ones. Which is interesting, as this does NOT happen on the production servers; I guess rabid can dish out even more stress than my users (and cause these login processes to be left hanging around).
But that's not the issue anyway: with identical pool outputs, the local-DB incarnation retains its size (I have an internal IMAP server with 1.0.0 and PAM and a few dozen intense users, which also shows no sign of a growing dovecot-auth) while the LDAP one keeps growing, with nothing to show for it in that pool debug output.
Hmm. Does this help: http://hg.dovecot.org/dovecot-1.0/rev/50c79521e8f5
Will try that tomorrow if I can.
Oh, and HUP'ing the master is not an option here; I guess the system load triggers a race condition in dovecot, because several times when doing so I got this:
Jun 22 15:08:58 mb11 dovecot: listen(143) failed: Interrupted system call
Did you use killall? I think this happens only with it.
Nope, this is a Debian/Linux show and I did HUP just the master process. It only happened some of the times on the (then) busiest node, but it clearly is a race condition of sorts. Set up a test environment with about 30-50 logins/second and I'm sure you can reproduce it. ;)
Regards,
Christian
On Tue, 2007-07-03 at 00:17 +0900, Christian Balzer wrote:
Jun 22 15:08:58 mb11 dovecot: listen(143) failed: Interrupted system call
Did you use killall? I think this happens only with it.
Not even with it actually. I was probably thinking about something else.
Nope, this is a Debian/Linux show and I did HUP just the master process. It only happened some of the times on the (then) busiest node, but it clearly is a race condition of sorts. Set up a test environment with about 30-50 logins/second and I'm sure you can reproduce it. ;)
Not even with 1100 logins/sec. :)
And in the code I'm already catching EINTR. It exits with that error only if listen() fails 10 times with EINTR, and between those 10 listen() calls there is a 1-second sleep. But of course, if the process is getting signals all the time, the sleep also fails with EINTR. Still, a single HUP signal shouldn't interrupt more than a single syscall, so I've really no idea what's happening.
Please find enclosed two patch files (one for configure.in and one for src/mech/auth-gssapi.c) that fix a problem in the GSSAPI code which prevented Dovecot from successfully authenticating when using cross-realm Kerberos credentials.
Unfortunately I think it'll only work on Solaris (it uses the undocumented call __gss_userok(); however, I think there have been proposals to make that into an official one called gss_userok(), so perhaps some operating systems already have it). It basically calls the underlying krb5_userok() function that does the right stuff.
Our setup is like this:
Windows XP PC that authenticates against a normal Windows 2003 AD server in the AD realm "AD.IFM.LIU.SE"
Solaris 10 server that authenticates against a normal Kerberos KDC in the Kerberos realm "IFM.LIU.SE"
Cross-realm trust between AD.IFM.LIU.SE and IFM.LIU.SE has been configured (various steps needed, both on the Unix side and on the Windows side: on the AD servers and on the client).
On the Solaris server, the /etc/krb5/krb5.conf file has been configured like this:

[libdefaults]
        default_realm = IFM.LIU.SE

[realms]
        IFM.LIU.SE = {
                kdc = as-master.ifm.liu.se
                kdc = as-slave.ifm.liu.se
                admin_server = as-master.ifm.liu.se
                auth_to_local_realm = AD.IFM.LIU.SE
        }

        AD.IFM.LIU.SE = {
                kdc = ad-master.ad.ifm.liu.se
                kdc = ad-slave.ad.ifm.liu.se
                admin_server = ad-master.ifm.liu.se
                kpasswd_protocol = SET_CHANGE
        }
I use Thunderbird 2 on the PC to connect to the Dovecot server with the AD-acquired credentials, and with the patch above it works correctly.
Without the patch, Dovecot will fail, since the code that does the GSSAPI authentication fails at gss_compare_names() (that code only works if the credentials used are from the local realm).
(The cross-realm setup also works with Quest Putty so you can do passwordless SSH logins from your AD-connected PC to the Unix servers, and get a delegated AD ticket so that Secure NFS also works).
Storing of delegated tickets is something that my patch above doesn't solve. It shouldn't be that hard to implement though: you basically just have to call gss_store_cred() at the right place in Dovecot (once you've switched to the right userid, and preferably after having called some PAM setup functions). This can be useful if you ever want to run Dovecot and access a Secure NFS protected remote filesystem...
- Peter
Oh, another "hack" that's needed to build Dovecot on Solaris 10 with the native GSS library is a very simple dummy "krb5-config" script. Please find it enclosed as well (it will probably not be needed once Solaris 10 update 4 is released).
- Peter
Peter Eriksson wrote:
Please find enclosed two patch files (one for configure.in and one for src/mech/auth-gssapi.c) that fixes a problem in the GSSAPI code that
- Peter
#!/bin/ksh
while [ "$1" != "" ]; do
    case "$1" in
    --version)
        echo "1.4"
        ;;
    --libs)
        case "$2" in
        gssapi)
            echo "-lgss"
            shift
            ;;
        esac
        ;;
    --cflags)
        case "$2" in
        gssapi)
            echo ""
            shift
            ;;
        esac
        ;;
    esac
    shift
done
exit 0
On Mon, 2007-07-02 at 18:08 +0200, Peter Eriksson wrote:
AC_CHECK_LIB(gss, __gss_userok, [
AC_DEFINE(HAVE___GSS_USEROK,,
Define if you have __gss_userok())
])
Instead of changing $LIBS, I think you can use:
AC_CHECK_LIB(gss, __gss_userok, [ ...
],, `krb5-config --libs gssapi`)
+#if 0 /* Added 070702 Peter Eriksson <peter@ifm.liu.se> - ref cyrus-sasl
+	  Solaris 8/9 has problems with NO_OID */
+			     GSS_C_NT_USER_NAME,
+#else
			     GSS_C_NO_OID,
+#endif
Is this a common problem? Should it be a setting? Although with your change this code is never even executed, right?
	name = p_strndup(request->auth_request.pool,
			 (unsigned char *)outbuf.value + 4,
			 outbuf.length - 4);
	if (!name) {
		auth_request_log_error(&request->auth_request, "gssapi",
				       "Invalid response size");
		auth_request_fail(&request->auth_request);
		return;
	}
name is never NULL.
free(name);
And you shouldn't free() it either.
Timo Sirainen skrev:
On Mon, 2007-07-02 at 18:08 +0200, Peter Eriksson wrote:
AC_CHECK_LIB(gss, __gss_userok, [
AC_DEFINE(HAVE___GSS_USEROK,,
Define if you have __gss_userok())
])
Instead of changing $LIBS, I think you can use:
AC_CHECK_LIB(gss, __gss_userok, [ ... ],,
	`krb5-config --libs gssapi`)
Ok, I'll try that instead (I changed $LIBS since something similar was done earlier with CFLAGS) :-)
Btw, I think the Heimdal implementation of GSSAPI has a similar function, but called gss_userok() (without the two leading "_" characters), that does something similar. Are there any Heimdal users around who can confirm or deny that?
+#if 0 /* Added 070702 Peter Eriksson <peter@ifm.liu.se> - ref cyrus-sasl
+	  Solaris 8/9 has problems with NO_OID */
+			     GSS_C_NT_USER_NAME,
+#else
			     GSS_C_NO_OID,
+#endif
Is this a common problem? Should it be a setting? Although with your change this code is never even executed, right?
Dunno, actually. When I was reading the Cyrus-SASL GSSAPI code I saw that they have a configure check for GSS_C_NT_USER_NAME in their code base and a comment that Solaris 8/9 doesn't like GSS_C_NO_OID in that call, so at first I thought that was the reason things didn't work. However, as I later figured out, it wasn't the problem...
Anyway, as you can see, I disabled that change (and with the __gss_userok() code it's not used anyway).
	name = p_strndup(request->auth_request.pool,
			 (unsigned char *)outbuf.value + 4,
			 outbuf.length - 4);
	if (!name) {
		auth_request_log_error(&request->auth_request, "gssapi",
				       "Invalid response size");
		auth_request_fail(&request->auth_request);
		return;
	}
name is never NULL.
Ok, I'll remove those checks.
free(name);
And you shouldn't free() it either.
And I'll remove the free() calls too... :-)
I've uploaded new versions of the patches to:
ftp://ftp.ifm.liu.se/pub/unix/dovecot/
I've also put a Solaris 10 SMF manifest there (dovecot.xml).
Btw, is it intentional that the first line in configure.in starts with an "x"?
- Peter
On Mon, 2007-07-02 at 19:55 +0200, Peter Eriksson wrote:
Btw, I think the Heimdal implementation of GSSAPI has a similar function, but called gss_userok() (without the two leading "_" characters), that does something similar. Are there any Heimdal users around who can confirm or deny that?
I don't see anything like that in Debian's heimdal-dev package (0.7.2.dfsg.1-10).
I do see gss_userok() in GNU's GSSLib though: http://www.gnu.org/software/gss/manual/html_node/Extended-GSS-API.html
ftp://ftp.ifm.liu.se/pub/unix/dovecot/
Committed with small changes: http://hg.dovecot.org/dovecot/rev/dfdedb187b26
..and.. http://hg.dovecot.org/dovecot/rev/159929f53161
Btw, is it intentional that the first line in configure.in starts with an "x"?
No. Thanks, removed.
On Mon, 02 Jul 2007 17:37:05 +0300 Timo Sirainen <tss@iki.fi> wrote:
Hmm. Does this help: http://hg.dovecot.org/dovecot-1.0/rev/50c79521e8f5
We have a winner! Auth process grows to the same size as with a local DB and stays there.
Now I just have to get this into a security maintained Debian package... (looks around for the official package masters and backports maintainers)
Am I correct in assuming that this code did not change since 0.99, that is, that the leak I saw for the last 4 years was the same thing? ;)
Regards,
Christian
On Tue, 2007-07-03 at 11:08 +0900, Christian Balzer wrote:
On Mon, 02 Jul 2007 17:37:05 +0300 Timo Sirainen <tss@iki.fi> wrote:
Hmm. Does this help: http://hg.dovecot.org/dovecot-1.0/rev/50c79521e8f5
We have a winner! Auth process grows to the same size as with a local DB and stays there.
Now I just have to get this into a security maintained Debian package... (looks around for the official package masters and backports maintainers)
Am I correct in assuming that this code did not change since 0.99, that is, that the leak I saw for the last 4 years was the same thing? ;)
Yea, I had never known about ber_free() before. Also, I hadn't noticed the auth process growing in my own LDAP stress tests either, so I thought the problem was something else.
participants (3):
- Christian Balzer
- Peter Eriksson
- Timo Sirainen