[Dovecot] Dovecot 1.1.11 failing to accept SSL/TLS connections
Greetings - We are running Dovecot 1.1.11 on our servers and have been gradually migrating people off our old (UW-based) systems onto the new (Dovecot-based) ones. As a result the new systems are seeing more connections from more people.
We have started seeing problems reported by users of our webmail service: they get an error indicating that the webmail software (Prayer) has failed to establish an IMAP connection to Dovecot using TLS. Investigation shows it is not just the webmail service that is affected but all mail clients; it's just that other clients retry the connection, whereas Prayer fails the login and shows the error message:
TLS/SSL failure for username.imap.york.ac.uk: SSL negotiation failed
It seems to be related to one of Dovecot's imap-login processes accumulating a large number of open file descriptors. We initially spawn 10 imap-login processes and have each handle 256 connections. The full "dovecot -n" output is at the end of this message, but the relevant settings are:
login_process_size = 64
login_process_per_connection = no
login_processes_count = 10
login_max_processes_count = 128
login_max_connections = 256
We have grown to suspect it is to do with one of the imap-login processes having a large number of files open. Killing the process seems to get rid of the problem.
For example, we currently have 11 imap-login processes running, one of which has 518 open files -- process 12436 in the list below. I suspect that, as on previous occasions we've encountered the problem, killing this process will alleviate it. (I'll be doing this later on this evening.)
It is slightly odd that the imap-login processes have a very skewed distribution of open files, almost as if the algorithm for allocating connections to a process favours some over others. For example the current counts of open files are:
Pid = Open files count
12430 = 42
26818 = 237
12431 = 90
12433 = 12
12438 = 304
12437 = 106
12435 = 190
12432 = 14
12436 = 518
12434 = 32
12429 = 12
Process 12436 was one of the 10 imap-login processes initially created back on March 3rd. (Process 26818 was the additional imap-login process spawned a little later on March 4th.)
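(For reference, the per-process counts above can be gathered on Solaris 10 roughly as follows. This is only a sketch: the pgrep pattern and the reliance on pfiles printing one "S_IF..." type line per descriptor are assumptions, not taken from our actual monitoring.)

    #!/bin/sh
    # Sketch: count open file descriptors per imap-login process.
    # Assumes pgrep matches the imap-login processes and that pfiles
    # prints one "S_IF..." type line per open descriptor.
    for pid in `pgrep -f imap-login`; do
        count=`pfiles "$pid" 2>/dev/null | grep -c 'S_IF'`
        echo "$pid = $count"
    done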
I don't believe the problem lies with the available file descriptors on the system: in the script which starts Dovecot I do
date >/var/run/dovecot-limits
ulimit -Sa >>/var/run/dovecot-limits
echo "----------" >>/var/run/dovecot-limits
ulimit -Ha >>/var/run/dovecot-limits
ulimit -Sn 10128
echo "==========" >>/var/run/dovecot-limits
ulimit -Sa >>/var/run/dovecot-limits
echo "----------" >>/var/run/dovecot-limits
ulimit -Ha >>/var/run/dovecot-limits
$DOVECOT >>/var/run/dovecot-limits 2>&1
(The magic 10128 number came from Dovecot 1.1.11 itself, complaining that the number I had originally was too low.)
Likewise the output of the pfiles command on process 12436 (which is the one I believe to be problematic) indicates its limit still has some available -- I'm guessing Dovecot has reduced the limit down to 533 from the 10128 set in the startup script:
Current rlimit: 533 file descriptors
We originally saw this problem with Dovecot 1.0.3, which we were running until a couple of weeks ago. From there I upgraded first to 1.0.15 and then to 1.1.11. I'd hoped that upgrading would fix the problem: I see it has been mentioned on the list before, but not for a year or two.
Can anyone give any help, please?
Cheers, Mike Brudenell
Configuration info
Platform: Solaris 10
dovecot -n output is:

# 1.1.11: /usr/local/dovecot-1.1.11/etc/dovecot.conf
# OS: SunOS 5.10 i86pc ufs
base_dir: /var/run/dovecot/
log_path: /logfiles/mail/live/dovecot
listen: *:143
ssl_listen: *:993
ssl_cert_file: /usr/local/dovecot/ssl/certs/imapservice-bundle-2007.pem
ssl_key_file: /usr/local/dovecot/ssl/private/imapservice-key-2007.pem
disable_plaintext_auth: no
shutdown_clients: no
login_dir: /var/run/dovecot/login
login_executable: /usr/local/dovecot-1.1.11/libexec/dovecot/imap-login
login_log_format_elements: user=<%Lu> method=%m rip=%r lip=%l %c
login_process_per_connection: no
login_greeting_capability: yes
login_processes_count: 10
max_mail_processes: 10000
mail_max_userip_connections: 20
mail_location: maildir:/mailstore/messages/%1Ln/%Ln/Maildir:INDEX=/mailstore/index/%1Ln/%Ln:CONTROL=/mailstore/control/%1Ln/%Ln
mail_plugins: quota imap_quota fts fts_squat
mail_log_prefix: [%p]%Us(%Lu):
imap_client_workarounds: delay-newmail
namespace:
  type: private
  separator: /
  inbox: yes
  list: yes
  subscriptions: yes
auth default:
  mechanisms: plain login
  cache_size: 1024
  cache_ttl: 600
  username_chars: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890
  username_format: %Ln
  passdb:
    driver: shadow
  userdb:
    driver: passwd
plugin:
  fts: squat
  fts_squat: partial=4 full=4
  quota: fs:user
On Mon, 2009-03-09 at 17:41 +0000, Mike Brudenell wrote:
We have grown to suspect it is to do with one of the imap-login processes having a large number of files open. Killing the process seems to get rid of the problem.
You didn't mention if you actually saw "Too many open files" errors in the log file. If you didn't, the problem isn't with ulimits.
It is slightly odd that the imap-login processes have a very skewed distribution of open files, almost as if the algorithm for allocating connections to a process favours some over others.
That's because of the kernel. Dovecot doesn't currently even attempt to distribute the connections. Instead each process simply waits for new connections, and whichever process is quickest gets it.
Likewise the output of the pfiles command on process 12436 (which is the one I believe to be problematic) indicates its limit still has some available -- I'm guessing Dovecot has reduced the limit down to 533 from the 10128 set in the startup script:
Current rlimit: 533 file descriptors
Yes, v1.1 drops the number of fds to the maximum number that it needs. Since you had login_max_connections=256, it doesn't need more than twice as many. The 12436 process was probably very close to the 256 connections, and after reaching that it would have stopped accepting more.
But there do seem to be bugs related to reaching login_max_connections. I'm not entirely sure what bugs exactly though. It's just better not to reach it. Perhaps you should change the settings to something like:
login_processes_count = 2
login_max_connections = 1024
login_processes_count should be about the same as the number of CPUs/cores on the system (maybe +1).
Hi! Sorry for the delay in replying: I was waiting for the problem to recur so I could double-check the logs and the states of the imap-login processes.
2009/3/13 Timo Sirainen tss@iki.fi
On Mon, 2009-03-09 at 17:41 +0000, Mike Brudenell wrote:
We have grown to suspect it is to do with one of the imap-login processes having a large number of files open. Killing the process seems to get rid of the problem.
You didn't mention if you actually saw "Too many open files" errors in the log file. If you didn't, the problem isn't with ulimits.
No, there's no sign of the "Too many open files" error message in the logfiles.
Likewise the output of the pfiles command on process 12436 (which is the one I believe to be problematic) indicates its limit still has some available -- I'm guessing Dovecot has reduced the limit down to 533 from the 10128 set in the startup script:
Current rlimit: 533 file descriptors
Yes, v1.1 drops the number of fds to the maximum number that it needs. Since you had login_max_connections=256, it doesn't need more than twice as many. The 12436 process was probably very close to the 256 connections, and after reaching that it would have stopped accepting more.
Ah, I see.
When I upgraded from 1.0.15, 1.1.11 told me off for having the fd limit set too low at 2048 when I started Dovecot, and told me to raise the limit to at least 10128, so I did. Hence I was a bit surprised to find the limit lowered to 533 when it had told me it wanted the higher number.
But there do seem to be bugs related to reaching login_max_connections. I'm not entirely sure what bugs exactly though. It's just better not to reach it. Perhaps you should change the settings to something like:
login_processes_count = 2
login_max_connections = 1024
login_processes_count should be about the same as the number of CPUs/cores on the system (maybe +1).
We're running a pair of servers, each with 8 CPUs. So I'm guessing my
login_processes_count = 10
should be OK?
The servers are handling a LOT of client machines. For example I've just checked the two machines and as I write there are 1881 "imap" processes on one, and 1808 on the other.
I'm guessing that if I increase login_max_connections from its current 256 to 1024 this might delay the problem occurring? And perhaps I could restart Dovecot in the small hours of the night every few days?
Or is an alternative workaround to change login_process_per_connection from no to yes?
...If I were to do this, am I right in thinking that imap-login then plays no part in SSL-connected IMAP sessions? As it's imap-login that seems to be having the problem, anything I can do to reduce the number of connections it's handling would presumably help?
If it's any help in working out what the problem might be, I have the output from the Solaris "pfiles" command, which lists all the open files each process has. The output for a "rogue" imap-login process shows lots of these as S_IFSOCK and connected to clients, as expected. There are also lots which are AF_UNIX -- I'm guessing that's the proxying of SSL connections through imap-login to the imap process? I can send you (Timo) this file privately if you think it might help.
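(For reference, the rough tallies described above can be pulled out of a pfiles dump along these lines. This is only a sketch: the pid is just the example one from earlier, and the exact sockname/peername labels in pfiles output are assumptions.)

    # Tally socket types held by one imap-login process
    # (12436 is used here purely as an example pid).
    pfiles 12436 > /tmp/pfiles.12436
    grep -c 'S_IFSOCK' /tmp/pfiles.12436           # all sockets
    grep -c 'sockname: AF_UNIX' /tmp/pfiles.12436  # UNIX-domain legs towards imap
    grep -c 'sockname: AF_INET' /tmp/pfiles.12436  # client TCP connections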
Cheers, Mike B-)
Hi again...
2009/3/19 Mike Brudenell pmb1@azilo.me.uk
2009/3/13 Timo Sirainen tss@iki.fi
On Mon, 2009-03-09 at 17:41 +0000, Mike Brudenell wrote:
We have grown to suspect it is to do with one of the imap-login processes having a large number of files open. Killing the process seems to get rid of the problem.
You didn't mention if you actually saw "Too many open files" errors in the log file. If you didn't, the problem isn't with ulimits.
No, there's no sign of the "Too many open files" error message in the logfiles.
However I have just spotted that this is getting logged each time it happens:
dovecot: Mar 19 13:29:34 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
There appears to be one of these for each failed connection attempt to the IMAP server.
Grepping the code suggests this is presumably coming from src/imap-login/client.c
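(For anyone following along, the message can be located in an unpacked dovecot-1.1.11 source tree with a plain search; a sketch only, assuming the log text is a literal string in the source:)

    cd dovecot-1.1.11
    find src -name '*.c' -exec grep -l 'Connection queue full' {} \;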
I'm now peering at the code trying to see if I can spot anything, but have to confess to not being too wonderful in the area of sockets etc.
Cheers, Mike B-)
Hi (again!) Timo,

2009/3/19 Mike Brudenell pmb1@azilo.me.uk
Grepping the code suggests this is presumably coming from src/imap-login/client.c
I'm now peering at the code trying to see if I can spot anything, but have to confess to not being too wonderful in the area of sockets etc.
Cheers, Mike B-)
I've been peering at the code in src/imap-login/client.c and in particular at client_destroy_oldest() trying to see if there's a way it could end up killing itself, thinking it was the oldest.
So far I've not got anywhere, other than giving myself a headache, and it's time for me to go home now.
Just one slight oddity I can't quite fathom...
If I understand client_destroy_oldest() correctly it should destroy a number of clients when it gets called. In our case login_max_connections is 256, so I'm thinking that
    destroy_count = max_connections > CLIENT_DESTROY_OLDEST_COUNT*2 ?
        CLIENT_DESTROY_OLDEST_COUNT : I_MIN(max_connections/2, 1);
should set destroy_count to CLIENT_DESTROY_OLDEST_COUNT, which is 16. So I was expecting to see 16 messages in the log saying "Disconnected: Connection queue full": one as each client was disconnected, and that these would therefore appear in very quick succession.
Yet in reality, although the message is logged frequently, I wouldn't say it comes in quick bursts of 16. I'm including an extract of these grepped from the log (with IP addresses obfuscated for privacy) so you can see what I mean from their timestamps:
% fgrep 'Connection queue full' dovecot | tail -32
dovecot: Mar 19 13:34:55 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:34:55 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:34:57 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N, TLS
dovecot: Mar 19 13:34:57 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:34:57 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N, TLS handshaking
dovecot: Mar 19 13:35:04 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=82.132.N.N, lip=144.32.N.N, TLS handshaking
dovecot: Mar 19 13:35:09 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=82.132.N.N, lip=144.32.N.N, TLS
dovecot: Mar 19 13:35:09 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:36:06 Info: imap-login: Disconnected: Connection queue full (auth failed, 1 attempts): method=PLAIN, rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:36:08 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:36:11 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:36:15 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:36:15 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:36:31 Info: imap-login: Disconnected: Connection queue full (auth failed, 1 attempts): user=<USERNAME>, method=PLAIN, rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:36:31 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N, TLS
dovecot: Mar 19 13:36:54 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:36:54 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:37:02 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:37:26 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:37:39 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=82.132.N.N, lip=144.32.N.N, TLS
dovecot: Mar 19 13:37:50 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N, TLS
dovecot: Mar 19 13:38:08 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N, TLS
dovecot: Mar 19 13:38:08 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=86.129.N.N, lip=144.32.N.N, TLS handshaking
dovecot: Mar 19 13:38:09 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N, TLS
dovecot: Mar 19 13:38:09 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N, TLS handshaking
dovecot: Mar 19 13:38:18 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:38:20 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:38:27 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:38:30 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:38:30 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N, TLS handshaking
dovecot: Mar 19 13:38:35 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
dovecot: Mar 19 13:38:36 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
%
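(A quick way to check whether those disconnects arrive in bursts of 16 or just trickle in is to bucket them by minute; a sketch only, assuming the "dovecot: Mar 19 13:29:34 Info: ..." timestamp layout shown above:)

    # Count "Connection queue full" disconnects per minute.
    # Columns 2-4 are "Mar 19 13:29:34"; the first 12 characters are the minute.
    fgrep 'Connection queue full' dovecot | cut -d' ' -f2-4 | cut -c1-12 | sort | uniq -c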
Just thought I'd mention it in case it gives you any ideas.
Cheers, Mike B-)
On Thu, 2009-03-19 at 16:56 +0000, Mike Brudenell wrote:
    destroy_count = max_connections > CLIENT_DESTROY_OLDEST_COUNT*2 ?
        CLIENT_DESTROY_OLDEST_COUNT : I_MIN(max_connections/2, 1);
should set destroy_count to CLIENT_DESTROY_OLDEST_COUNT, which is 16. So I was expecting to see 16 messages in the log saying "Disconnected: Connection queue full": one as each client was disconnected, and that these would therefore appear in very quick succession.
Yet in reality, although the message is logged frequently, I wouldn't say it comes in quick bursts of 16.
I'm guessing that there just aren't many imap_clients: most connections are SSL connections that are being proxied, and there's probably just one or a few clients actually logging in.
Wonder if this helps: http://hg.dovecot.org/dovecot-1.1/rev/c91b73564611
On Thu, 2009-03-19 at 15:10 +0000, Mike Brudenell wrote:
dovecot: Mar 19 13:29:34 Info: imap-login: Disconnected: Connection queue full (no auth attempts): rip=144.32.N.N, lip=144.32.N.N
This is the reason why I suggested increasing the max connection count. There has to be some bug related to this, because it's supposed to stop listening for new connections when it reaches the limit, unless the master tells it that all processes are full. So either the bug is related to the master process telling it that everyone is full, or the process internally doesn't stop listening early enough. What about with this patch, does it start logging those warnings?

diff -r d06f56f82928 src/login-common/master.c
--- a/src/login-common/master.c Wed Mar 18 17:36:41 2009 -0400
+++ b/src/login-common/master.c Thu Mar 19 13:28:35 2009 -0400
@@ -41,6 +41,7 @@
 	if (reply->tag == 0 && !process_per_connection) {
 		/* this means we have to start listening again.
 		   we've reached maximum number of login processes. */
+		i_warning("All login processes are full, starting to kill old connections");
 		main_listen_start();
 		return;
 	}
On Thu, 2009-03-19 at 14:01 +0000, Mike Brudenell wrote:
When I upgraded from 1.0.15, 1.1.11 told me off for having the fd limit set too low at 2048 when I started Dovecot, and told me to raise the limit to at least 10128, so I did. Hence I was a bit surprised to find the limit lowered to 533 when it had told me it wanted the higher number.
It's the "dovecot" process that has trouble with the fd limits, because it needs one fd for each child process (for logging).
login_processes_count should be about the same as the number of CPUs/cores on the system (maybe +1).
We're running a pair of servers, each with 8 CPUs. So I'm guessing my
login_processes_count = 10
should be OK?
Seems like a good value.
I'm guessing that if I increase login_max_connections from its current 256 to 1024 this might delay the problem occurring?
I was originally thinking that if you used fewer login processes the load would be more evenly balanced between processes, and might then never reach the max limits.
And perhaps I could restart Dovecot in the small hours of the night every few days?
kill -HUP dovecot would be enough.
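(If anyone does go the scheduled-reload route, it amounts to something like the following crontab entry; a sketch only, assuming master.pid sits under the configured base_dir of /var/run/dovecot/:)

    # Hypothetical crontab entry: send the master a HUP nightly at 03:30.
    30 3 * * * kill -HUP `cat /var/run/dovecot/master.pid`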
Or is an alternative workaround to change login_process_per_connection from no to yes?
That would cause you to get thousands of login processes. Probably not a good idea.
...If I were to do this am I right in thinking that imap-login then plays no part in SSL-connected IMAP sessions?
imap-login always proxies SSL connections. Then you'd have one process proxying one SSL connection.
On Thu, 19 Mar 2009, Timo Sirainen wrote:
It's the "dovecot" process that has trouble with the fd limits, because it needs one fd for each child process (for logging).
I don't know if this is related or helpful; I still use v1.10. When I had login_max_processes_count set too low, once in a while the dovecot process had lots of "socket" fds (displayed with lsof on Linux) for which there was no counterpart in the system (lsof -U). The number of fds did not decrease when the number of logged-in users decreased (i.e. overnight) -- I made a graph to monitor the fd and user counts. Doing a "reload" solved the problem only temporarily. We have peaks of user logins at specific times of the year, so the problem came up every 4 or 5 days during those phases, but not for months otherwise.
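(The kind of fd/user monitoring described above can be scripted roughly like this, e.g. run from cron every few minutes; a sketch only -- the pgrep patterns and output file are assumptions, not Steffen's actual setup:)

    #!/bin/sh
    # Sketch: sample the dovecot master's open-fd count and the number of
    # logged-in imap processes, so the two curves can be graphed over time.
    pid=`pgrep -xo dovecot`
    fds=`lsof -p "$pid" 2>/dev/null | wc -l`
    users=`pgrep -xc imap`
    echo "`date '+%Y-%m-%d %H:%M'` fds=$fds imap=$users" >> /var/log/dovecot-fd-stats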
Bye,
Steffen Kaiser
participants (3):
- Mike Brudenell
- Steffen Kaiser
- Timo Sirainen