[dovecot] Failure in the presence of too many connections
Had an event today where dovecot refused new connections and would not accept more until restarted, whereupon it worked for a few minutes and locked up again. Looking at the logs it appeared that it had run out of file descriptors. I increased the open files limit and started it again-- which worked but it only pushes off the problem.
Does dovecot keep an open file descriptor for every imap or pop3 child? If so, this would argue more towards a tcpserver sort of control. If not, perhaps something is leaking..
-mm-
PS: assume you saw recent RFCs 3501 and 3502 ?
On Wed, 2003-03-26 at 20:28, Mark E. Mallett wrote:
Had an event today where dovecot refused new connections and would not accept more until restarted, whereupon it worked for a few minutes and locked up again. Looking at the logs it appeared that it had run out of file descriptors. I increased the open files limit and started it again-- which worked but it only pushes off the problem.
Which process ran out of fds? "dovecot"?
Does dovecot keep an open file descriptor for every imap or pop3 child?
No, except imap-login <-> imap does keep one open for SSL connections.
If so, this would argue more towards a tcpserver sort of control. If not, perhaps something is leaking..
I haven't noticed any leaks. stracing dovecot process kept the fds pretty much the same.
PS: assume you saw recent RFCs 3501 and 3502 ?
Yes, I've read both as drafts.
On Wed, Mar 26, 2003 at 08:45:03PM +0200, Timo Sirainen wrote:
On Wed, 2003-03-26 at 20:28, Mark E. Mallett wrote:
Had an event today where dovecot refused new connections and would not accept more until restarted, whereupon it worked for a few minutes and locked up again. Looking at the logs it appeared that it had run out of file descriptors. I increased the open files limit and started it again-- which worked but it only pushes off the problem.
Which process ran out of fds? "dovecot"?
Sorry for being non-specific. dovecot-auth was the program that logged the error, like this:
dovecot-auth: getpwnam(xxxx) failed: Too many open files
mm
On Wed, 2003-03-26 at 21:17, Mark E. Mallett wrote:
Sorry for being non-specific. dovecot-auth was the program that logged the error, like this:
dovecot-auth: getpwnam(xxxx) failed: Too many open files
What passdb are you using? PAM? Something is leaking fds there, but I don't think it's getpwnam() itself.
On Sat, Mar 29, 2003 at 11:05:23AM +0200, Timo Sirainen wrote:
On Wed, 2003-03-26 at 21:17, Mark E. Mallett wrote:
Sorry for being non-specific. dovecot-auth was the program that logged the error, like this:
dovecot-auth: getpwnam(xxxx) failed: Too many open files
What passdb are you using? PAM? Something is leaking fds there, but I don't think it's getpwnam() itself.
Hi-
Plain old passwd file checking.
auth = default
auth_mechanisms = plain
auth_userdb = passwd
auth_passdb = passwd
mm
On Sat, 2003-03-29 at 16:39, Mark E. Mallett wrote:
dovecot-auth: getpwnam(xxxx) failed: Too many open files
What passdb are you using? PAM? Something is leaking fds there, but I don't think it's getpwnam() itself.
Plain old passwd file checking.
auth = default
auth_mechanisms = plain
auth_userdb = passwd
auth_passdb = passwd
Well, that's strange. I don't think getpwnam() should leave any fds open. Auth process does keep one fd per login process, but that's also closed when client logs in or closes the connection. Maybe using passwd-file instead would help?
On 31 Mar 2003, Timo Sirainen wrote:
On Sat, 2003-03-29 at 16:39, Mark E. Mallett wrote:
dovecot-auth: getpwnam(xxxx) failed: Too many open files
What passdb are you using? PAM? Something is leaking fds there, but I don't think it's getpwnam() itself.
Plain old passwd file checking.
auth = default
auth_mechanisms = plain
auth_userdb = passwd
auth_passdb = passwd
Well, that's strange. I don't think getpwnam() should leave any fds open. Auth process does keep one fd per login process, but that's also closed when client logs in or closes the connection. Maybe using passwd-file instead would help?
If you are running this on linux then just:
ls -l /proc/xxx/fd/*
will show you where the file descriptors are going to.
-- Charlie
If you are running this on linux then just:
ls -l /proc/xxx/fd/*
will show you where the file descriptors are going to.
no procfs here- yeah I could have done an 'fstat' or 'lsof' and traced it further but did not think to do so.
mm
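The /proc listing suggested above can also be scripted. The following is a minimal sketch, assuming a Linux system with procfs mounted (which, as noted, was not the case on the machine in question; there 'fstat' or 'lsof' would be the tools to use). The function name list_open_fds is hypothetical and not part of dovecot.

```python
import os


def list_open_fds(pid="self"):
    """Return {fd_number: target} for a process, read from /proc/<pid>/fd.

    Linux procfs only; raises FileNotFoundError where /proc is unavailable.
    """
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return result


if __name__ == "__main__":
    # Print every descriptor of the current process and what it points to.
    for fd, target in sorted(list_open_fds().items()):
        print(fd, "->", target)
```

Passing the pid of the dovecot-auth process instead of "self" (with sufficient privileges) would show whether the descriptors are sockets to login processes or something else.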
On Mon, Mar 31, 2003 at 11:59:21AM +0300, Timo Sirainen wrote:
On Sat, 2003-03-29 at 16:39, Mark E. Mallett wrote:
dovecot-auth: getpwnam(xxxx) failed: Too many open files
What passdb are you using? PAM? Something is leaking fds there, but I don't think it's getpwnam() itself.
Plain old passwd file checking.
auth = default
auth_mechanisms = plain
auth_userdb = passwd
auth_passdb = passwd
Well, that's strange. I don't think getpwnam() should leave any fds open. Auth process does keep one fd per login process, but that's also closed when client logs in or closes the connection. Maybe using passwd-file instead would help?
My suspicion (without any basis, admittedly) was that the fds were related to the interprocess communication, not the passwd file access. I was surprised when it happened, as the system ran fine for weeks before I saw this error, and then I saw it recur quickly when I stopped and restarted the dovecot processes before running it again with a higher openfiles limit. I think the fd exhaustion is related to a sudden burst of connections to the POP or IMAP services, not a long-term leakage of FDs (although I did say something like that in my initial mail).
The point was not so much that this happens-- but that it doesn't recover when it does happen (thus the subject line). The only way to make the services start responding again is to stop and restart the dovecot suite of control processes. Raising the openfiles limit certainly pushes off the problem, and maybe that's a good enough workaround (as long as there's always a higher limit available...)
mm
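For reference, the openfiles limit discussed here is the per-process RLIMIT_NOFILE resource limit. The sketch below shows how a process can inspect it and raise its own soft limit up to the hard limit. It only illustrates the workaround and is not how dovecot sets its limits; the function name raise_soft_nofile_limit and the value 256 are assumptions chosen for the example.

```python
import resource


def raise_soft_nofile_limit(wanted):
    """Raise the soft RLIMIT_NOFILE towards `wanted`, capped at the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process cannot go above the hard limit.
        wanted = min(wanted, hard)
    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return soft, hard  # already high enough, nothing to do
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("after: ", raise_soft_nofile_limit(256))
```

The cap at the hard limit is the practical form of the caveat above: there is not always a higher limit available.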
On Mon, 31 Mar 2003, Mark E. Mallett wrote:
The only way to make the services start responding again is to stop and restart the dovecot suite of control processes.
Or you can run things my way (under tcpserver/stunnel/imapfront-auth) and there's one dovecot process for each connection, each unrelated. This illustrates my point about re-using already well trusted simple programs to do as much of the task as possible.
Raising the openfiles limit certainly pushes off the problem, and maybe that's a good enough workaround (as long as there's always a higher limit available...)
... which is something that you can *never* assume.
-- Charlie
On Mon, Mar 31, 2003 at 02:19:27PM -0500, Charlie Brady wrote:
On Mon, 31 Mar 2003, Mark E. Mallett wrote:
The only way to make the services start responding again is to stop and restart the dovecot suite of control processes.
Or you can run things my way (under tcpserver/stunnel/imapfront-auth) and there's one dovecot process for each connection, each unrelated. This illustrates my point about re-using already well trusted simple programs to do as much of the task as possible.
yep- this kind of brings us full circle back to my original message :-) (I had mentioned that this kind of problem may be less likely with a tcpserver approach).
Although the distinction is not between using well-trusted simple programs vs large monolithic ones, but how you access those simple programs. Do you use a long-running auth process and talk to it via a UNIX socket (or other interface), or fire up a new auth process for each need? Personally I'm with you: unless there's an awful lot of state or caching or other long-term need that you lose by creating a new auth process each time (e.g. like innd's "actived" process), I'd vote for a one-time short-running auth process.
Raising the openfiles limit certainly pushes off the problem, and maybe that's a good enough workaround (as long as there's always a higher limit available...)
... which is something that you can *never* assume.
Amen
mm
On Mon, 2003-03-31 at 22:19, Charlie Brady wrote:
Or you can run things my way (under tcpserver/stunnel/imapfront-auth) and there's one dovecot process for each connection, each unrelated. This illustrates my point about re-using already well trusted simple programs to do as much of the task as possible.
Didn't you just say imapfront connected to separate auth process via unix socket?
On 31 Mar 2003, Timo Sirainen wrote:
On Mon, 2003-03-31 at 22:19, Charlie Brady wrote:
Or you can run things my way (under tcpserver/stunnel/imapfront-auth) and there's one dovecot process for each connection, each unrelated. This illustrates my point about re-using already well trusted simple programs to do as much of the task as possible.
Didn't you just say imapfront connected to separate auth process via unix socket?
That's one of the options, but you can also run a standalone command. See http://untroubled.org/cvm/cvm.html.
In any case, these are still well trusted simple programs :-)
-- Charlie
On Mon, 2003-03-31 at 19:51, Mark E. Mallett wrote:
The point was not so much that this happens-- but that it doesn't recover when it does happen (thus the subject line). The only way to make the services start responding again is to stop and restart the dovecot suite of control processes.
I don't really see anything in code that would prevent dovecot-auth from working again after some of the connections to login processes die and free file descriptors. I guess I'll have to try myself.
Raising the openfiles limit certainly pushes off the problem, and maybe that's a good enough workaround (as long as there's always a higher limit available...)
How much was it before? Dovecot limits the number of login processes to 384 by default (max_logging_users 256 + login_max_processes_count 128).
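The expectation that a process can work again once descriptors are freed can be checked at the operating-system level. The sketch below is a stand-alone illustration, not dovecot-auth code: it lowers its own soft limit, opens files until it gets EMFILE ("Too many open files"), closes one descriptor, and retries. The function name demo_emfile_recovery and the limit of 64 are assumptions for the example, and it assumes the process starts with fewer than `limit` descriptors open.

```python
import errno
import os
import resource


def demo_emfile_recovery(limit=64):
    """Hit EMFILE under a lowered soft limit, free one fd, then retry.

    Returns (hit_emfile, reopened_ok).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = limit if soft == resource.RLIM_INFINITY else min(limit, soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    held = []
    hit_emfile = False
    reopened_ok = False
    try:
        try:
            while True:
                held.append(os.open(os.devnull, os.O_RDONLY))
        except OSError as exc:
            if exc.errno != errno.EMFILE:
                raise
            hit_emfile = True
        if held:
            # Free a single descriptor, as happens when a connection closes.
            os.close(held.pop())
            held.append(os.open(os.devnull, os.O_RDONLY))
            reopened_ok = True
    finally:
        for fd in held:
            os.close(fd)
        # Restore the original soft limit.
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    return hit_emfile, reopened_ok


if __name__ == "__main__":
    print(demo_emfile_recovery())
```

If the kernel side recovers like this, a service that stays stuck after EMFILE is more likely failing in how it handles the error (for example, never closing or never retrying) than in the limit itself.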
Raising the openfiles limit certainly pushes off the problem, and maybe that's a good enough workaround (as long as there's always a higher limit available...)
How much was it before? Dovecot limits the number of login processes to 384 by default (max_logging_users 256 + login_max_processes_count 128).
openfiles limit was 64 when the issue occurred. I then raised it to some much larger value.
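Putting the numbers from this exchange side by side makes the mismatch visible. The arithmetic below uses the defaults quoted above and the earlier statement that the auth process keeps one fd per login process; it is a rough estimate that ignores the descriptors the auth process needs for its own logging, sockets, and passwd lookups.

```python
# Defaults quoted in the thread.
max_logging_users = 256
login_max_processes_count = 128

# Limit in effect when the failure occurred.
openfiles_limit = 64

# Upper bound on login processes, and so (assuming one fd per login
# process) on the descriptors dovecot-auth may need for them alone.
max_login_processes = max_logging_users + login_max_processes_count

# How far the old limit falls short of that worst case.
shortfall = max_login_processes - openfiles_limit

print(max_login_processes, shortfall)
```

Under these assumptions, a limit of 64 is reached long before dovecot's own cap on login processes is, which fits the observation that a burst of connections triggered the failure.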
participants (3): Charlie Brady, Mark E. Mallett, Timo Sirainen