Thanks for the hints.
I've now moved userdb from "passwd" to "passwd-file" pointing to a munged file created overnight from our NIS password file. (We don't need the actual passwords, of course, as we're authenticating via pam and pam_ldap to Active Directory.)
Dovecot now uses no NIS, and apparently hashes/caches the passwd-file in memory, making it just as quick as using userdb=static, but with the advantage that processes run as the user (and I don't need to chgrp everything). I can even tweak the mail environment per user, so I have some users switched to Maildir, for example!
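As a rough illustration of the overnight munging step (not the actual script used here), a minimal Python sketch might look like the following. It assumes that "munged" simply means replacing the password field of each seven-field passwd-style entry with a placeholder, since authentication happens elsewhere; the function names and the "x" placeholder are hypothetical.

```python
def munge_passwd_line(line, placeholder="x"):
    """Return a passwd-style line with its password field replaced.

    Returns None for blank lines, comments, and malformed entries.
    Assumption: entries use the usual seven colon-separated fields
    (name:password:uid:gid:gecos:home:shell).
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split(":")
    if len(fields) < 7:
        return None  # skip anything that is not a full passwd entry
    fields[1] = placeholder  # the real password is never needed here
    return ":".join(fields)


def munge_passwd(lines, placeholder="x"):
    """Munge an iterable of passwd lines, dropping unusable ones."""
    munged = []
    for line in lines:
        entry = munge_passwd_line(line, placeholder)
        if entry is not None:
            munged.append(entry)
    return munged
```

The input could come from a dump of the NIS map (for example the output of `ypcat passwd`), with the result written to whatever file the passwd-file userdb points at.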
I managed 2600+ logins per minute benchmarking using "rabid" with empty mailboxes and 19 test accounts.
I'm still occasionally seeing "Login process has too old requests" but they're not causing a problem.
I've also turned on "mbox_very_dirty_syncs" which seems to have reduced the load further (half the CPU, 1/3 again of the characters read, 2/3 of the disk blocks accessed).
The biggest test starts on Monday when term officially starts and everybody (staff and students) is at work, but it's looking really good!
Best Wishes, Chris
Timo Sirainen wrote:
On Tue, 2005-09-27 at 17:30 +0100, Chris Wakelin wrote:
We've been getting more authentication problems today. This lunchtime I put in a version of 1.0-stable, including Timo's fix below, which may have helped, but still we've had, e.g:
dovecot: Sep 27 16:04:27 Warning: auth(default): Login process has too old (126s) requests, killing it.
dovecot: Sep 27 16:04:27 Error: auth(default): file mech.c: line 117 (auth_request_destroy): assertion failed: (request->refcount > 0)
dovecot: Sep 27 16:04:27 Error: child 21726 (auth) killed with signal 6
dovecot: Sep 27 16:04:27 Error: imap-login: Can't connect to auth server at default21726: Connection refused
dovecot-auth crashes and gets restarted, that's why these connection errors happen.
The crashing most likely happens because the passdb (or maybe userdb?) lookup hangs long enough to make Dovecot time out the requests, and the code in 1.0-stable doesn't handle that well.
I looked into these crashes last weekend, but it looks like they don't exist in the 1.0-alphas anymore, so I didn't do anything about them in 1.0-stable either. Anyway, the overly long lookup times are the real problem you're having.
-- 
--+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+-
Christopher Wakelin, c.d.wakelin@reading.ac.uk
IT Services Centre, The University of Reading, Tel: +44 (0)118 378 8439
Whiteknights, Reading, RG6 2AF, UK Fax: +44 (0)118 975 3094