[Dovecot] Restarting dovecot-auth stops authentication
Hi,
The Dovecot wiki states that Dovecot's master restarts all died processes, which is good for availability. But when I kill dovecot/auth (to simulate an error condition which happened on a machine), the authentication fails with: Nov 16 14:32:40 be dovecot: imap: Error: net_connect_unix(auth-master) failed: No such file or directory
It seems -albeit it gets restarted- dovecot/auth doesn't re-create its socket file. Before: # ls /var/run/dovecot/ anvil auth-worker doveadm-server anvil-auth-penalty config dovecot.conf auth-client dict empty auth-login director-admin lmtp auth-master director-userdb login auth-userdb dns-client master.pid # ps auwx | grep dovecot/auth dovecot 87455 0.0 0.0 20024 3832 ?? S 2:34PM 0:00.01 dovecot/auth # rm /var/run/dovecot/auth-master; kill 87455 # ps auwx | grep dovecot/auth dovecot 88815 0.0 0.0 20024 3776 ?? S 2:36PM 0:00.01 dovecot/auth # ls /var/run/dovecot/ anvil config dovecot.conf anvil-auth-penalty dict empty auth-client director-admin lmtp auth-login director-userdb login auth-userdb dns-client master.pid auth-worker doveadm-server
I've deleted the auth-master socket because if I don't, you can't see that it's not re-created. :)
Is this a normal behaviour? I understand that killing dovecot/auth is not, but Dovecot could survive this easily, if recreating and re-using the new socket file would work. And loosing dovecot/auth happens sometimes (I don't yet now why).
Thanks,
On 11/16/10 14:40, Attila Nagy wrote:
The Dovecot wiki states that Dovecot's master restarts all died processes, which is good for availability. But when I kill dovecot/auth (to simulate an error condition which happened on a machine), the authentication fails with: Nov 16 14:32:40 be dovecot: imap: Error: net_connect_unix(auth-master) failed: No such file or directory Of course I forgot to tell it's 2.0.6. BTW, sending SIGUSR2 to dovecot/auth doesn't lot anything, while sending SIGHUP logs the "clearing cache" message. The wiki says on USR2 it should log cache statistics.
On Tue, 2010-11-16 at 14:52 +0100, Attila Nagy wrote:
Nov 16 14:32:40 be dovecot: imap: Error: net_connect_unix(auth-master) failed: No such file or directory Of course I forgot to tell it's 2.0.6.
2.0.7 fixed this.
BTW, sending SIGUSR2 to dovecot/auth doesn't lot anything, while sending SIGHUP logs the "clearing cache" message. The wiki says on USR2 it should log cache statistics.
Works here:
Nov 16 17:26:25 auth: Info: Authentication cache hits 0/2 (0%) Nov 16 17:26:25 auth: Info: Authentication cache inserts: positive: 2 95B, negative: 0 0B
So .. Since SIGHUP works, I don't really know. They should be using exactly the same code right next to each others. I guess something could disable SIGUSR2 somewhere somehow. What passdb/userdb do you use?
On 11/16/10 18:29, Timo Sirainen wrote:
On Tue, 2010-11-16 at 14:52 +0100, Attila Nagy wrote:
Nov 16 14:32:40 be dovecot: imap: Error: net_connect_unix(auth-master) failed: No such file or directory Of course I forgot to tell it's 2.0.6. 2.0.7 fixed this. Thanks, I've upgraded to it.
BTW, I have these in batches: Nov 17 11:42:10 be dovecot: pop3-login: Internal login failure (auth failed, 1 attempts): user=<user1>, method=PLAIN, rip=172.28.16.20, lip=172.16.253.13 Nov 17 11:42:10 be dovecot: pop3-login: Internal login failure (auth failed, 1 attempts): user=<user2>, method=PLAIN, rip=172.28.16.20, lip=172.16.253.13 [...] 22 from this in the same second, then nothing for hours. This time this wasn't because the auth process disappeared. I suspected LDAP errors, but Dovecot is so effective in LDAP caching that there are no 22 LDAP queries in the same second. How could I figure out what causes these errors? I don't see any more verbosity in the source code in the place, where this comes from, and I have pretty much connections, so doing a verbose log for days isn't an option... Config: # 2.0.7: /usr/local/etc/dovecot/dovecot.conf # OS: FreeBSD 8.1-STABLE amd64 auth_cache_negative_ttl = 0 auth_cache_size = 100 M auth_cache_ttl = 1 days default_process_limit = 2000 disable_plaintext_auth = no mail_fsync = never mail_gid = 999 mail_location = maildir:~/Maildir mail_plugins = " quota" mail_uid = 999 managesieve_notify_capability = mailto managesieve_sieve_capability = fileinto reject envelope encoded-character vacation subaddress comparator-i;ascii-numeric relational regex imap4flags copy include variables body enotify environment mailbox date passdb { args = /usr/local/etc/dovecot/dovecot-ldap.conf.ext driver = ldap } plugin { autocreate = INBOX.Trash autocreate2 = INBOX.Drafts autocreate3 = INBOX.Sent autocreate4 = INBOX.Spam autosubscribe = INBOX.Trash autosubscribe2 = INBOX.Drafts autosubscribe3 = INBOX.Sent autosubscribe4 = INBOX.Spam mail_log_events = delete undelete expunge copy mailbox_delete mailbox_rename flag_change save mailbox_create mail_log_fields = uid box msgid size flags vsize from subject quota = maildir:User quota } protocols = pop3 imap lmtp service anvil { client_limit = 8192 } service auth { client_limit = 8192 unix_listener auth-userdb { mode = 0600 user = qmailldap } } service imap-login { client_limit = 8 process_min_avail = 16 service_count = 0 vsz_limit = 64 M } service imap { client_limit = 8 process_min_avail = 16 service_count = 0 } service lmtp { inet_listener lmtp { port = 24 } user = qmailldap } service pop3-login { client_limit = 8 process_min_avail = 16 service_count = 0 } service pop3 { client_limit = 8 process_min_avail = 16 service_count = 0 } ssl = no userdb { driver = prefetch } userdb { args = /usr/local/etc/dovecot/dovecot-ldap.conf.ext driver = ldap } protocol lmtp { mail_plugins = " quota mail_log notify" } protocol imap { mail_max_userip_connections = 1024 mail_plugins = " quota imap_quota autocreate" } protocol pop3 { mail_max_userip_connections = 1024 mail_plugins = " quota autocreate" } but the process' size barely grows, regardless the large number of connections and users: dovecot 21600 0.9 0.0 32304 14604 ?? S 9:24PM 6:06.91 dovecot/auth
BTW, sending SIGUSR2 to dovecot/auth doesn't lot anything, while sending SIGHUP logs the "clearing cache" message. The wiki says on USR2 it should log cache statistics. Works here:
Nov 16 17:26:25 auth: Info: Authentication cache hits 0/2 (0%) Nov 16 17:26:25 auth: Info: Authentication cache inserts: positive: 2 95B, negative: 0 0B
So .. Since SIGHUP works, I don't really know. They should be using exactly the same code right next to each others. I guess something could disable SIGUSR2 somewhere somehow. What passdb/userdb do you use?
LDAP. procstat -i says it's OK: PID COMM SIG FLAGS 21600 auth HUP --C 21600 auth INT --C 21600 auth QUIT --- 21600 auth ILL --- 21600 auth TRAP --- 21600 auth ABRT --- [...] 21600 auth USR1 --- 21600 auth USR2 --C
On Wed, 2010-11-17 at 13:45 +0100, Attila Nagy wrote:
Nov 17 11:42:10 be dovecot: pop3-login: Internal login failure (auth failed, 1 attempts): user=<user1>, method=PLAIN, rip=172.28.16.20, lip=172.16.253.13
Something else really should have been logged just before this. An error or a warning. There were a few situations where it might not have, which are fixed by this patch: http://hg.dovecot.org/dovecot-2.0/rev/aec1f1614028
On 11/17/2010 06:55 PM, Timo Sirainen wrote:
On Wed, 2010-11-17 at 13:45 +0100, Attila Nagy wrote:
Nov 17 11:42:10 be dovecot: pop3-login: Internal login failure (auth failed, 1 attempts): user=<user1>, method=PLAIN, rip=172.28.16.20, lip=172.16.253.13 Something else really should have been logged just before this. An error or a warning. There were a few situations where it might not have, which are fixed by this patch: http://hg.dovecot.org/dovecot-2.0/rev/aec1f1614028
I'm sorry, I was in a rush and I have a lot of logs. pop3-login: Error: net_connect_unix(pop3) failed: Connection refused This precedes all the above log entries. I've checked, auth's start time is yesterday, so it wasn't restarted. I guess what remains is the resource limit (client_limit maybe?). The strange thing is according to tcpdump, dovecot does very few LDAP lookups, and they are answered in a reasonable time, but I haven't done a strict correlation with the above errors. What's the best way to find out what causes this connection refused batches once or twice a day?
How does dovecot logs timed out LDAP lookups?
On Wed, 2010-11-17 at 21:17 +0100, Attila Nagy wrote:
pop3-login: Error: net_connect_unix(pop3) failed: Connection refused
Right. This is the main problem. So the question becomes why is the connection being refused.
I've checked, auth's start time is yesterday, so it wasn't restarted. I guess what remains is the resource limit (client_limit maybe?).
If that happens, there should be a log entry. You can grep process_limit and client_limit from logs.
The strange thing is according to tcpdump, dovecot does very few LDAP lookups, and they are answered in a reasonable time, but I haven't done a strict correlation with the above errors.
This shouldn't have anything to do with authentication.
How does dovecot logs timed out LDAP lookups?
"Connection appears to be hanging, reconnecting"
I usually log errors to a different log file. Normally that file should be empty, so you can easily see all the errors and warnings that could be causing problems.
log_path = /var/log/dovecot-errors.log info_log_path = /var/log/dovecot-info.log
On Wed, 2010-11-17 at 21:17 +0100, Attila Nagy wrote:
pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Right. This is the main problem. So the question becomes why is the connection being refused. I would really love to solve this now. :) What connects here to what? pop3-login to the pop3 service? My current config is: # 2.0.8: /usr/local/etc/dovecot/dovecot.conf # OS: FreeBSD 8.2-PRERELEASE amd64 auth_cache_negative_ttl = 0 auth_cache_size = 100 M auth_cache_ttl = 1 days default_process_limit = 2000 disable_plaintext_auth = no info_log_path = syslog log_path = /var/log/dovecot-errors.log mail_fsync = never mail_gid = 999 mail_location = maildir:~/Maildir mail_plugins = " quota" mail_uid = 999 managesieve_notify_capability = mailto managesieve_sieve_capability = fileinto reject envelope encoded-character vacation subaddress comparator-i;ascii-numeric relational regex imap4flags copy include variables body enotify environment mailbox date
On 11/18/2010 06:45 PM, Timo Sirainen wrote: passdb { args = /usr/local/etc/dovecot/dovecot-ldap.conf.ext driver = ldap } plugin { autocreate = INBOX.Trash autocreate2 = INBOX.Drafts autocreate3 = INBOX.Sent autocreate4 = INBOX.Spam autosubscribe = INBOX.Trash autosubscribe2 = INBOX.Drafts autosubscribe3 = INBOX.Sent autosubscribe4 = INBOX.Spam mail_log_events = delete undelete expunge copy mailbox_delete mailbox_rename flag_change save mailbox_create mail_log_fields = uid box msgid size flags vsize from subject quota = maildir:User quota } protocols = pop3 imap lmtp service anvil { client_limit = 8192 } service auth { client_limit = 8192 unix_listener auth-userdb { mode = 0600 user = qmailldap } } service imap-login { client_limit = 8 process_min_avail = 16 service_count = 0 } service imap { client_limit = 8 process_min_avail = 16 service_count = 0 } service lmtp { inet_listener lmtp { port = 24 } user = qmailldap } service pop3-login { client_limit = 8 process_min_avail = 16 service_count = 0 } service pop3 { client_limit = 8 process_min_avail = 32 service_count = 0 } ssl = no userdb { driver = prefetch } userdb { args = /usr/local/etc/dovecot/dovecot-ldap.conf.ext driver = ldap } protocol lmtp { mail_plugins = " quota mail_log notify" } protocol imap { mail_max_userip_connections = 1024 mail_plugins = " quota imap_quota autocreate" } protocol pop3 { mail_max_userip_connections = 1024 mail_plugins = " quota autocreate" }
I've raised pop3's process_min_avail, but I still get these errors. There is nothing more in the error log.
I've checked, auth's start time is yesterday, so it wasn't restarted. I guess what remains is the resource limit (client_limit maybe?). If that happens, there should be a log entry. You can grep process_limit and client_limit from logs. Nothing with those. How does dovecot logs timed out LDAP lookups? "Connection appears to be hanging, reconnecting"
I usually log errors to a different log file. Normally that file should be empty, so you can easily see all the errors and warnings that could be causing problems.
log_path = /var/log/dovecot-errors.log info_log_path = /var/log/dovecot-info.log Done that, but I only have these in the errors.log: Dec 21 12:41:06 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:41:06 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:29 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Dec 21 12:43:30 pop3-login: Error: net_connect_unix(pop3) failed: Connection refused
What else should I check? I'm not completely sure about that yet, but it seems these errors come, when there are a larger amount of cache flushes towards the disks, so maybe there is a congestion in some processes. But that would be fine if the authentication waits for these flushes, but this shouldn't propagate to the user as an error.
Thanks,
On 21.12.2010, at 13.51, Attila Nagy wrote:
On 11/18/2010 06:45 PM, Timo Sirainen wrote:
On Wed, 2010-11-17 at 21:17 +0100, Attila Nagy wrote:
pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Right. This is the main problem. So the question becomes why is the connection being refused. I would really love to solve this now. :) What connects here to what? pop3-login to the pop3 service?
Yes. For each login.
service imap-login { client_limit = 8
Why imap-login with this small client_limit? The default should be ok (1000).
service imap { client_limit = 8
This definitely doesn't help. I'm not sure if setting client_limit=1 fixes this problem entirely or not, but try that first.
service pop3-login { client_limit = 8 .. service pop3 { client_limit = 8
Same for pop3.
On 12/21/2010 01:31 PM, Timo Sirainen wrote:
On 21.12.2010, at 13.51, Attila Nagy wrote:
On 11/18/2010 06:45 PM, Timo Sirainen wrote:
On Wed, 2010-11-17 at 21:17 +0100, Attila Nagy wrote:
pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Right. This is the main problem. So the question becomes why is the connection being refused. I would really love to solve this now. :) What connects here to what? pop3-login to the pop3 service? Yes. For each login. Yes, that's clear. service imap-login { client_limit = 8 Why imap-login with this small client_limit? The default should be ok (1000). Because I think that Dovecot's processes block on IO and not just on distinct IO operations, but larger tasks, like opening a maildir with a lot of e-mails without indexes. Am I wrong? Or partly wrong, because it uses blocking IO, but it can multiplex them, so while one user struggles with the file system to build indexes of his maildir, an other client in the same process can happily do POP/IMAP stuff? The rationale is to spread IO (and users) amongst processes, because the OS can schedule them concurrently, but don't have too many processes, because that eats a lot of memory and other resources.
service imap { client_limit = 8 This definitely doesn't help. I'm not sure if setting client_limit=1 fixes this problem entirely or not, but try that first. I've already done experiments with that, but had to switch context and forgot the results. Will do again.
On 21.12.2010, at 19.37, Attila Nagy wrote:
service imap-login { client_limit = 8 Why imap-login with this small client_limit? The default should be ok (1000). Because I think that Dovecot's processes block on IO and not just on distinct IO operations, but larger tasks, like opening a maildir with a lot of e-mails without indexes.
*-login processes do no disk I/O.
Am I wrong? Or partly wrong, because it uses blocking IO, but it can multiplex them, so while one user struggles with the file system to build indexes of his maildir, an other client in the same process can happily do POP/IMAP stuff? The rationale is to spread IO (and users) amongst processes, because the OS can schedule them concurrently, but don't have too many processes, because that eats a lot of memory and other resources.
This applies to imap/pop3 service, and that's why only client_limit=1 works well with them for now.
On 12/21/2010 06:41 PM, Timo Sirainen wrote:
On 21.12.2010, at 19.37, Attila Nagy wrote:
service imap-login { client_limit = 8 Why imap-login with this small client_limit? The default should be ok (1000). Because I think that Dovecot's processes block on IO and not just on distinct IO operations, but larger tasks, like opening a maildir with a lot of e-mails without indexes. *-login processes do no disk I/O. Oh, I've had problem with authentication previously, and that stuck. Removed, thanks for noticing. Am I wrong? Or partly wrong, because it uses blocking IO, but it can multiplex them, so while one user struggles with the file system to build indexes of his maildir, an other client in the same process can happily do POP/IMAP stuff? The rationale is to spread IO (and users) amongst processes, because the OS can schedule them concurrently, but don't have too many processes, because that eats a lot of memory and other resources. This applies to imap/pop3 service, and that's why only client_limit=1 works well with them for now.
Apart from these errors, it works fine. It would be interesting to see a response time statistics for both settings, to see how worse it gets with raising client_limit.
On 12/21/2010 01:31 PM, Timo Sirainen wrote:
On 21.12.2010, at 13.51, Attila Nagy wrote:
On 11/18/2010 06:45 PM, Timo Sirainen wrote:
On Wed, 2010-11-17 at 21:17 +0100, Attila Nagy wrote:
pop3-login: Error: net_connect_unix(pop3) failed: Connection refused Right. This is the main problem. So the question becomes why is the connection being refused. I would really love to solve this now. :) What connects here to what? pop3-login to the pop3 service? Yes. For each login. service imap { client_limit = 8 This definitely doesn't help. I'm not sure if setting client_limit=1 fixes this problem entirely or not, but try that first. It seems it does, at least I haven't got any connection refused errors since I set client_limit to 1 on the imap and pop3 services. BTW, the difference is about +10k open file handles, and 2-3 (judging from munin graphs, which aren't so precise) times "active" memory usage (this is on FreeBSD). Are there any chances to solve this, or should I forget running in this mode until you implement async IO?
Thanks,
participants (2)
-
Attila Nagy
-
Timo Sirainen