[Dovecot] Can't connect to auth server at default: Connection refused
We've being running with Dovecot 1.0-stable since Tuesday, and on the whole it's gone very well (meaning nobody's noticed!). My best estimate is we're using half the disk accesses and 2/3 the CPU of UW-IMAP/Qpopper.
We've quite a mixture of error messages in the logs, the most worrying of which is that several times we've had a few seconds of "xxx-login: Can't connect to auth server at default: Connection refused" (where xxx is pop or imap). It seems to sort itself out shortly afterwards. We're using userdb=passwd, passdb=passwd (authenticating via NIS). From Tuesday we'll be using passdb=pam (pam_ldap to Active Directory).
My first guess is that some dovecot-auth process stops responding and gets killed off. Is this right?
Other errors we're seeing include presumably some of these are fixed in alpha2.
- pread_full() failed with mbox file /export/mail/c/37/sau04djp/INBOX: No such file or directory (when there is!)
- mbox sync: UID inserted in the middle of mailbox (especially with .prayer folders - configuration folders for our Webmail)
- file mbox-lock.c: line 505 (mbox_lock): assertion failed: (lock_type == F_RDLCK || ibox->mbox_lock_type != F_RDLCK)
Best Wishes, Chris
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
On Sun, 2005-09-18 at 16:02 +0100, Chris Wakelin wrote:
We've being running with Dovecot 1.0-stable since Tuesday, and on the whole it's gone very well (meaning nobody's noticed!). My best estimate is we're using half the disk accesses and 2/3 the CPU of UW-IMAP/Qpopper.
We've quite a mixture of error messages in the logs, the most worrying of which is that several times we've had a few seconds of "xxx-login: Can't connect to auth server at default: Connection refused" (where xxx is pop or imap). It seems to sort itself out shortly afterwards.
Probably means that dovecot-auth is getting a lot of connections at the time and kernel just returns connection refused. With 1.0alphas I've already done one change for this, committing to 1.0-stable CVS too: diff -u -r1.1.1.1 auth-server-connection.c --- src/lib-auth/auth-server-connection.c 31 Jan 2005 16:37:55 -0000 1.1.1.1 +++ src/lib-auth/auth-server-connection.c 23 Sep 2005 12:49:57 -0000 @@ -192,9 +192,17 @@ struct auth_server_connection *conn; const char *handshake; pool_t pool; - int fd; + int fd, try; - fd = net_connect_unix(path); + /* max. 1 second wait here. */ + for (try = 0; try < 10; try++) { + fd = net_connect_unix(path); + if (fd != -1 || errno != EAGAIN) + break; + + /* busy. wait for a while. */ + usleep(((rand() % 10) + 1) * 10000); + } if (fd == -1) { i_error("Can't connect to auth server at %s: %m", path); return NULL; Another thing that helps is to increase LISTEN_BACLOG in src/lib/network.c (or maybe just in net_listen_unix()). Looks like with 1.0alpha I'm using 16, but even that should probably be increased.. Changed to 64 now.
Other errors we're seeing include 1) pread_full() failed with mbox file /export/mail/c/37/sau04djp/INBOX: No such file or directory (when there is!)
The error message is wrong. Should have been "X-IMAPbase uid-last unexpectedly points outside mbox file". Some problems related this this are fixed in 1.0alphas without having been backported to 1.0stable..
2) mbox sync: UID inserted in the middle of mailbox (especially with .prayer folders - configuration folders for our Webmail)
These are always a bit strange. Some might be fixed in 1.0alpha, but I know there are still some problems left..
3) file mbox-lock.c: line 505 (mbox_lock): assertion failed: (lock_type == F_RDLCK || ibox->mbox_lock_type != F_RDLCK)
I still see this in 1.0alpha too. One way to cause it is to copy mail from a mailbox within itself, but there probably are some other ways too.
We've been getting more authentication problems today. This lunchtime I put in a version of 1.0-stable, including Timo's fix below, which may have helped, but still we've had, e.g:
dovecot: Sep 27 16:04:27 Warning: auth(default): Login process has too old (126s) requests, killing it. dovecot: Sep 27 16:04:27 Error: auth(default): file mech.c: line 117 (auth_request_destroy): assertion failed: (request->refcount > 0) dovecot: Sep 27 16:04:27 Error: child 21726 (auth) killed with signal 6 dovecot: Sep 27 16:04:27 Error: imap-login: Can't connect to auth server at default21726: Connection refused dovecot: Sep 27 16:04:27 Error: imap-login: Can't connect to auth server at default21726: Connection refused dovecot: Sep 27 16:04:27 Error: pop3-login: Can't connect to auth server at default21726: Connection refused dovecot: Sep 27 16:04:27 Error: pop3-login: Can't connect to auth server at default21726: Connection refused dovecot: Sep 27 16:04:27 Error: pop3-login: Can't connect to auth server at default21726: Connection refused dovecot: Sep 27 16:04:27 Error: pop3-login: Can't connect to auth server at default21726: Connection refused dovecot: Sep 27 16:04:28 Error: imap-login: Can't connect to auth server at default21726: Connection refused dovecot: Sep 27 16:05:00 Error: child 21727 (login) returned error 89 dovecot: Sep 27 16:05:00 Error: imap-login: Master sent reply with unknown tag 500 dovecot: Sep 27 16:05:00 Error: child 27189 (auth) killed with signal 11 dovecot: Sep 27 16:08:41 Warning: auth(default): Login process has too old (123s) requests, killing it. dovecot: Sep 27 16:08:41 Error: child 27747 (auth) killed with signal 11 dovecot: Sep 27 16:12:01 Warning: auth(default): Login process has too old (122s) requests, killing it. dovecot: Sep 27 16:15:02 Error: child 24460 (login) returned error 89 dovecot: Sep 27 16:15:02 Error: imap-login: Master sent reply with unknown tag 551 dovecot: Sep 27 16:15:18 Error: auth(default): file mech.c: line 117 (auth_request_destroy): assertion failed: (request->refcount > 0) dovecot: Sep 27 16:15:19 Error: child 4051 (auth) killed with signal 6 dovecot: Sep 27 16:21:44 Error: imap-login: Master sent reply with unknown tag 1106 dovecot: Sep 27 16:21:44 Error: pop3-login: Master sent reply with unknown tag 399 dovecot: Sep 27 16:21:44 Error: child 20665 (login) returned error 89 dovecot: Sep 27 16:21:44 Error: child 27748 (login) returned error 89 dovecot: Sep 27 16:22:03 Error: imap-login: Master sent reply with unknown tag 517 dovecot: Sep 27 16:22:03 Error: child 9735 (login) returned error 89 dovecot: Sep 27 16:22:07 Error: child 9950 (auth) killed with signal 11 dovecot: Sep 27 16:23:38 Error: child 20590 (login) returned error 89 dovecot: Sep 27 16:23:38 Error: imap-login: Master sent reply with unknown tag 68 dovecot: Sep 27 16:23:38 Warning: auth(default): Authentication client 20590: Transmit buffer full, killing it dovecot: Sep 27 16:23:39 Error: child 20599 (auth) killed with signal 11 dovecot: Sep 27 16:49:33 Warning: auth(default): Login process has too old (137s) requests, killing it. dovecot: Sep 27 16:49:33 Error: auth(default): file mech.c: line 117 (auth_request_destroy): assertion failed: (request->refcount > 0) dovecot: Sep 27 16:49:33 Error: child 21549 (auth) killed with signal 6 dovecot: Sep 27 16:52:05 Warning: auth(default): Login process has too old (122s) requests, killing it.
We've noticed that the server "feels" sluggish (as in typing into an ssh window is laggy) but top, vmstat etc. reckon it's 85% idle! I guess we're hitting contention somewhere subtle (the system is a 4-CPU Sun Enterprise 450, running Solaris 8).
I've played around with "rabid" (part of "postal") load-generator on an Ultra 5 just doing POP logins (no downloads) and the best performance was with Dovecot-1.0.alpha3 with authentication cache enabled and using userdb set to "static" (so eliminating NIS lookups and caching the PAM to LDAP to Active Directory ones).
We could look at upgrading to alpha3, but we've not tested it with lots of different users yet. We could also make the server a NIS slave to see if that helps, and play around with nscd (like turn it off!).
Anybody have any ideas?
Chris
Timo Sirainen wrote:
On Sun, 2005-09-18 at 16:02 +0100, Chris Wakelin wrote:
We've being running with Dovecot 1.0-stable since Tuesday, and on the whole it's gone very well (meaning nobody's noticed!). My best estimate is we're using half the disk accesses and 2/3 the CPU of UW-IMAP/Qpopper.
We've quite a mixture of error messages in the logs, the most worrying of which is that several times we've had a few seconds of "xxx-login: Can't connect to auth server at default: Connection refused" (where xxx is pop or imap). It seems to sort itself out shortly afterwards.
Probably means that dovecot-auth is getting a lot of connections at the time and kernel just returns connection refused. With 1.0alphas I've already done one change for this, committing to 1.0-stable CVS too:
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
On Tue, 2005-09-27 at 17:30 +0100, Chris Wakelin wrote:
We've been getting more authentication problems today. This lunchtime I put in a version of 1.0-stable, including Timo's fix below, which may have helped, but still we've had, e.g:
dovecot: Sep 27 16:04:27 Warning: auth(default): Login process has too old (126s) requests, killing it. dovecot: Sep 27 16:04:27 Error: auth(default): file mech.c: line 117 (auth_request_destroy): assertion failed: (request->refcount > 0) dovecot: Sep 27 16:04:27 Error: child 21726 (auth) killed with signal 6 dovecot: Sep 27 16:04:27 Error: imap-login: Can't connect to auth server at default21726: Connection refused
dovecot-auth crashes and gets restarted, that's why these connection errors happen.
The crashing most likely happens because passdb (or maybe userdb?) lookup hangs long enough to cause Dovecot timeout the results, and the code in 1.0-stable doesn't handle that well.
I looked into these crashes last weekend but looks like they don't exist in 1.0-alphas anymore so I didn't do anything about them to 1.0-stable either.. Anyway, the overly long lookup times are the real problem you're having.
Thanks for the hints.
I've now moved userdb from "passwd" to "passwd-file" pointing to a munged file created overnight from our NIS password file. (We don't need the actual passwords, of course, as we're authenticating via pam and pam_ldap to Active Directory.)
Dovecot now uses no NIS, and apparently hashes/caches the passwd-file in memory making it just as quick as using userdb=static, but with the advantage that processes run as the user (and I don't need to chgrp everything). I can even tweak the mail environment per user so have some switched to Maildir, for example!
I managed 2600+ logins per minute benchmarking using "rabid" with empty mailboxes and 19 test accounts.
I'm still occassionally seeing "Login process has too old requests" but they're not causing a problem.
I've also turned on "mbox_very_dirty_syncs" which seems to have reduced the load further (half the CPU, 1/3 again of the characters read, 2/3 of the disk blocks accessed).
The biggest test starts on Monday when term officially starts and everybody (staff and students) is at work, but it's looking really good!
Best Wishes, Chris
Timo Sirainen wrote:
On Tue, 2005-09-27 at 17:30 +0100, Chris Wakelin wrote:
We've been getting more authentication problems today. This lunchtime I put in a version of 1.0-stable, including Timo's fix below, which may have helped, but still we've had, e.g:
dovecot: Sep 27 16:04:27 Warning: auth(default): Login process has too old (126s) requests, killing it. dovecot: Sep 27 16:04:27 Error: auth(default): file mech.c: line 117 (auth_request_destroy): assertion failed: (request->refcount > 0) dovecot: Sep 27 16:04:27 Error: child 21726 (auth) killed with signal 6 dovecot: Sep 27 16:04:27 Error: imap-login: Can't connect to auth server at default21726: Connection refused
dovecot-auth crashes and gets restarted, that's why these connection errors happen.
The crashing most likely happens because passdb (or maybe userdb?) lookup hangs long enough to cause Dovecot timeout the results, and the code in 1.0-stable doesn't handle that well.
I looked into these crashes last weekend but looks like they don't exist in 1.0-alphas anymore so I didn't do anything about them to 1.0-stable either.. Anyway, the overly long lookup times are the real problem you're having.
-- --+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+- Christopher Wakelin, c.d.wakelin@reading.ac.uk IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439 Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094
participants (2)
-
Chris Wakelin
-
Timo Sirainen