[Dovecot] MySQL server has gone away
I'm running 2.0.17 and I'm still seeing a decent amount of "MySQL server has gone away" errors, despite having multiple hosts defined in my auth userdb 'connect'. This is Debian Lenny 32-bit and I'm seeing the same thing with 2.0.16 on Debian Squeeze 64-bit.
E.g.:
Jan 12 20:30:33 auth-worker: Error: mysql: Query failed, retrying: MySQL server has gone away
Our mail mysql servers are busy enough that wait_timeout is set to a whopping 30 seconds. On my regular boxes, I see a good deal of these in the logs. I've been doing a lot of mucking with doveadm/dsync (working on maildir->mdbox migration finally, yay!) on test boxes (same dovecot package & version) and when I get this error, despite the log saying it's retrying, it doesn't seem to be. Instead I get:
dsync(root): Error: user ...: Auth USER lookup failed dsync(root): Fatal: User lookup failed: Internal error occurred. Refer to server log for more information.
Watching tcpdump at the same time, it looks like it's going through some of the mysql servers, but all of them have by now disconnected and are in CLOSE_WAIT.
Here's an (edited) example after doing a dsync that completes without errors, with tcpdump running in the background:
# sleep 30; netstat -ant | grep 3306; dsync -C^ -u mailbox@test.com backup mdbox:~/mdbox
tcp 1 0 10.1.15.129:57436 10.1.52.48:3306 CLOSE_WAIT tcp 1 0 10.1.15.129:49917 10.1.52.49:3306 CLOSE_WAIT tcp 1 0 10.1.15.129:35904 10.1.52.47:3306 CLOSE_WAIT
20:49:59.725005 IP 10.1.15.129.35904 > 10.1.52.47.3306: F 1126:1126(0) ack 807 win 1004 <nop,nop,timestamp 312603858 766667259> 20:49:59.725459 IP 10.1.52.47.3306 > 10.1.15.129.35904: . ack 1127 win 123 <nop,nop,timestamp 766667998 312603858> 20:49:59.725568 IP 10.1.15.129.57436 > 10.1.52.48.3306: F 1126:1126(0) ack 807 win 1004 <nop,nop,timestamp 312603858 1842560856> 20:49:59.725779 IP 10.1.52.48.3306 > 10.1.15.129.57436: . ack 1127 win 123 <nop,nop,timestamp 1842561225 312603858>
dsync(root): Error: user mailbox@test.com: Auth USER lookup failed dsync(root): Fatal: User lookup failed: Internal error occurred. Refer to server log for more information.
10.1.15.129 in this case is the dovecot server, and the 10.1.52.0/24 boxes are mysql servers. That's the same pattern I've seen almost every time. Just a FIN packet to two of the servers (ack'd by the mysql server) and then it fails.
Is the retry mechanism supposed to transparently start a new connection, or is this how it works? In connecting remotely to these same servers (which aren't getting production traffic, so I'm the only person connecting to them), I get seemingly random disconnects via IMAP, always coinciding with a "MySQL server has gone away" error in the logs.
This is non-production, so I'm happy to turn on whatever debugging would be useful.
Here's doveconf -n from the box the tcpdump was on. This box is just configured for lmtp (but have seen the same thing on one configured for IMAP/POP as well), so it's pretty small, config-wise:
# 2.0.17: /etc/dovecot/dovecot/dovecot.conf # OS: Linux 3.0.9-nx i686 Debian 5.0.9 auth_cache_negative_ttl = 0 auth_cache_ttl = 0 auth_debug = yes auth_failure_delay = 0 base_dir = /var/run/dovecot/ debug_log_path = /var/log/dovecot/debug.log default_client_limit = 3005 default_internal_user = doveauth default_process_limit = 1500 deliver_log_format = M=%m, F=%f, S="%s" => %$ disable_plaintext_auth = no first_valid_uid = 199 last_valid_uid = 201 lda_mailbox_autocreate = yes listen = * log_path = /var/log/dovecot/mail.log mail_debug = yes mail_fsync = always mail_location = maildir:~/Maildir:INDEX=/var/cache/dovecot/%2Mu/%2.2Mu/%u mail_nfs_index = yes mail_nfs_storage = yes mail_plugins = zlib quota mail_privileged_group = mail mail_uid = 200 managesieve_notify_capability = mailto managesieve_sieve_capability = fileinto reject envelope encoded-character vacation subaddress comparator-i;ascii-numeric relational regex imap4flags copy include variables body enotify environment mailbox date ihave mdbox_rotate_interval = 1 days mmap_disable = yes namespace { hidden = no inbox = yes list = yes location = prefix = INBOX. separator = . subscriptions = yes type = private } passdb { args = /opt/dovecot/etc/lmtp/sql.conf driver = sql } plugin { info_log_path = /var/log/dovecot/dovecot-deliver.log log_path = /var/log/dovecot/dovecot-deliver.log quota = maildir:User quota quota_rule = *:bytes=25M quota_rule2 = INBOX.Trash:bytes=+10%% quota_rule3 = *:messages=3000 sieve = ~/sieve/dovecot.sieve sieve_before = /etc/dovecot/scripts/spam.sieve sieve_dir = ~/sieve/ zlib_save = gz zlib_save_level = 3 } protocols = lmtp sieve service auth-worker { unix_listener auth-worker { mode = 0666 } user = doveauth } service auth { client_limit = 8000 unix_listener login/auth { mode = 0666 } user = doveauth } service lmtp { executable = lmtp -L process_min_avail = 10 unix_listener lmtp { mode = 0666 } } ssl = no userdb { driver = prefetch } userdb { args = /opt/dovecot/etc/lmtp/sql.conf driver = sql } verbose_proctitle = yes protocol lmtp { mail_plugins = zlib quota sieve }
Thanks!
On 1/12/2012 6:00 PM, Mark Moseley wrote:
Jan 12 20:30:33 auth-worker: Error: mysql: Query failed, retrying: MySQL server has gone away
I've actually been meaning to send a similar message for the last couple of months :).
We run dovecot solely as a sasl authentication provider to postfix for smtp authentication. We're currently running 2.0.15 with a handful of patches from a few months ago when Timo fixed mysql failover.
We also see sporadic messages like that in the logs:
Jan 11 01:00:57 sparky dovecot: auth-worker: Error: mysql: Query failed, retrying: MySQL server has gone away
We do have a timeout on the mysql servers, so I don't necessarily mind this message, except we also see some number of these:
Jan 11 01:00:57 sparky dovecot: auth-worker: Error: sql(clgeurts,108.38.64.98): Password query failed: MySQL server has gone away
The mysql servers have never been down or unresponsive, if it retries, it should succeed. I'm not sure what's happening here, perhaps it tries the query on one mysql server connection (we have two configured) which has timed out, and then tries the other one, and if the other one has also timed out just fails?
I also see some auth timeouts:
Jan 11 22:06:02 sparky dovecot: auth: CRAM-MD5(?,200.37.175.14): Request 10232.28 timeouted after 150 secs, state=2
I'm not sure if they're related to the mysql timeouts.
There are also some postfix auth errors:
Jan 11 23:55:41 sparky postfix/smtpd[20994]: warning: unknown[200.37.175.14]: SASL CRAM-MD5 authentication failed: Connection lost to authentication server
Which I think happen when dovecot takes too long to respond.
I haven't had time to dig into it or get any debugging info, but just thought I'd pipe up when I saw your similar question :).
-- Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/ Operating Systems and Network Analyst | henson@csupomona.edu California State Polytechnic University | Pomona CA 91768
On 13.1.2012, at 4.00, Mark Moseley wrote:
I'm running 2.0.17 and I'm still seeing a decent amount of "MySQL server has gone away" errors, despite having multiple hosts defined in my auth userdb 'connect'. This is Debian Lenny 32-bit and I'm seeing the same thing with 2.0.16 on Debian Squeeze 64-bit.
E.g.:
Jan 12 20:30:33 auth-worker: Error: mysql: Query failed, retrying: MySQL server has gone away
Our mail mysql servers are busy enough that wait_timeout is set to a whopping 30 seconds. On my regular boxes, I see a good deal of these in the logs. I've been doing a lot of mucking with doveadm/dsync (working on maildir->mdbox migration finally, yay!) on test boxes (same dovecot package & version) and when I get this error, despite the log saying it's retrying, it doesn't seem to be. Instead I get:
dsync(root): Error: user ...: Auth USER lookup failed
Try with only one host in the "connect" string? My guess: Both the connections have timed out, and the retrying fails as well (there is only one retry). Although if the retrying lookup fails, there should be an error logged about it also (you don't see one?)
Also another idea to avoid them in the first place:
service auth-worker { idle_kill = 20 }
On Fri, Jan 13, 2012 at 1:36 AM, Timo Sirainen <tss@iki.fi> wrote:
On 13.1.2012, at 4.00, Mark Moseley wrote:
I'm running 2.0.17 and I'm still seeing a decent amount of "MySQL server has gone away" errors, despite having multiple hosts defined in my auth userdb 'connect'. This is Debian Lenny 32-bit and I'm seeing the same thing with 2.0.16 on Debian Squeeze 64-bit.
E.g.:
Jan 12 20:30:33 auth-worker: Error: mysql: Query failed, retrying: MySQL server has gone away
Our mail mysql servers are busy enough that wait_timeout is set to a whopping 30 seconds. On my regular boxes, I see a good deal of these in the logs. I've been doing a lot of mucking with doveadm/dsync (working on maildir->mdbox migration finally, yay!) on test boxes (same dovecot package & version) and when I get this error, despite the log saying it's retrying, it doesn't seem to be. Instead I get:
dsync(root): Error: user ...: Auth USER lookup failed
Try with only one host in the "connect" string? My guess: Both the connections have timed out, and the retrying fails as well (there is only one retry). Although if the retrying lookup fails, there should be an error logged about it also (you don't see one?)
Also another idea to avoid them in the first place:
service auth-worker { idle_kill = 20 }
With just one 'connect' host, it seems to reconnect just fine (using the same tests as above) and I'm not seeing the same error. It worked every time that I tried, with no complaints of "MySQL server has gone away".
If there are multiple hosts, it seems like the most robust thing to do would be to exhaust the existing connections and if none of those succeed, then start a new connection to one of them. It will probably result in much more convoluted logic but it'd probably match better what people expect from a retry.
Alternatively, since in all my tests, the mysql server has closed the connection prior to this, is the auth worker not recognizing its connection is already half-closed (in which case, it probably shouldn't even consider it a legitimate connection and just automatically reconnect, i.e. try #1, not the retry, which would happen after another failure).
I'll give the idle_kill a try too. I kind of like the idea of idle_kill for auth processes anyway, just to free up some connections on the mysql server.
Am 13.01.2012 19:29, schrieb Mark Moseley:
On Fri, Jan 13, 2012 at 1:36 AM, Timo Sirainen <tss@iki.fi> wrote:
On 13.1.2012, at 4.00, Mark Moseley wrote:
I'm running 2.0.17 and I'm still seeing a decent amount of "MySQL server has gone away" errors, despite having multiple hosts defined in my auth userdb 'connect'. This is Debian Lenny 32-bit and I'm seeing the same thing with 2.0.16 on Debian Squeeze 64-bit.
E.g.:
Jan 12 20:30:33 auth-worker: Error: mysql: Query failed, retrying: MySQL server has gone away
Our mail mysql servers are busy enough that wait_timeout is set to a whopping 30 seconds. On my regular boxes, I see a good deal of these in the logs. I've been doing a lot of mucking with doveadm/dsync (working on maildir->mdbox migration finally, yay!) on test boxes (same dovecot package & version) and when I get this error, despite the log saying it's retrying, it doesn't seem to be. Instead I get:
dsync(root): Error: user ...: Auth USER lookup failed
Try with only one host in the "connect" string? My guess: Both the connections have timed out, and the retrying fails as well (there is only one retry). Although if the retrying lookup fails, there should be an error logged about it also (you don't see one?)
Also another idea to avoid them in the first place:
service auth-worker { idle_kill = 20 }
With just one 'connect' host, it seems to reconnect just fine (using the same tests as above) and I'm not seeing the same error. It worked every time that I tried, with no complaints of "MySQL server has gone away".
If there are multiple hosts, it seems like the most robust thing to do would be to exhaust the existing connections and if none of those succeed, then start a new connection to one of them. It will probably result in much more convoluted logic but it'd probably match better what people expect from a retry.
Alternatively, since in all my tests, the mysql server has closed the connection prior to this, is the auth worker not recognizing its connection is already half-closed (in which case, it probably shouldn't even consider it a legitimate connection and just automatically reconnect, i.e. try #1, not the retry, which would happen after another failure).
I'll give the idle_kill a try too. I kind of like the idea of idle_kill for auth processes anyway, just to free up some connections on the mysql server.
by the way , if you use sql for auth have you tried auth caching ?
http://wiki.dovecot.org/Authentication/Caching
i.e.
# Authentication cache size (e.g. 10M). 0 means it's disabled. Note that # bsdauth, PAM and vpopmail require cache_key to be set for caching to be used.
auth_cache_size = 10M
# Time to live for cached data. After TTL expires the cached record is no # longer used, *except* if the main database lookup returns internal failure. # We also try to handle password changes automatically: If user's previous # authentication was successful, but this one wasn't, the cache isn't used. # For now this works only with plaintext authentication.
auth_cache_ttl = 1 hour
# TTL for negative hits (user not found, password mismatch). # 0 disables caching them completely.
auth_cache_negative_ttl = 0
-- Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
On Fri, Jan 13, 2012 at 11:38 AM, Robert Schetterer <robert@schetterer.org> wrote:
Am 13.01.2012 19:29, schrieb Mark Moseley:
On Fri, Jan 13, 2012 at 1:36 AM, Timo Sirainen <tss@iki.fi> wrote:
On 13.1.2012, at 4.00, Mark Moseley wrote:
I'm running 2.0.17 and I'm still seeing a decent amount of "MySQL server has gone away" errors, despite having multiple hosts defined in my auth userdb 'connect'. This is Debian Lenny 32-bit and I'm seeing the same thing with 2.0.16 on Debian Squeeze 64-bit.
E.g.:
Jan 12 20:30:33 auth-worker: Error: mysql: Query failed, retrying: MySQL server has gone away
Our mail mysql servers are busy enough that wait_timeout is set to a whopping 30 seconds. On my regular boxes, I see a good deal of these in the logs. I've been doing a lot of mucking with doveadm/dsync (working on maildir->mdbox migration finally, yay!) on test boxes (same dovecot package & version) and when I get this error, despite the log saying it's retrying, it doesn't seem to be. Instead I get:
dsync(root): Error: user ...: Auth USER lookup failed
Try with only one host in the "connect" string? My guess: Both the connections have timed out, and the retrying fails as well (there is only one retry). Although if the retrying lookup fails, there should be an error logged about it also (you don't see one?)
Also another idea to avoid them in the first place:
service auth-worker { idle_kill = 20 }
With just one 'connect' host, it seems to reconnect just fine (using the same tests as above) and I'm not seeing the same error. It worked every time that I tried, with no complaints of "MySQL server has gone away".
If there are multiple hosts, it seems like the most robust thing to do would be to exhaust the existing connections and if none of those succeed, then start a new connection to one of them. It will probably result in much more convoluted logic but it'd probably match better what people expect from a retry.
Alternatively, since in all my tests, the mysql server has closed the connection prior to this, is the auth worker not recognizing its connection is already half-closed (in which case, it probably shouldn't even consider it a legitimate connection and just automatically reconnect, i.e. try #1, not the retry, which would happen after another failure).
I'll give the idle_kill a try too. I kind of like the idea of idle_kill for auth processes anyway, just to free up some connections on the mysql server.
by the way , if you use sql for auth have you tried auth caching ?
http://wiki.dovecot.org/Authentication/Caching
i.e.
# Authentication cache size (e.g. 10M). 0 means it's disabled. Note that # bsdauth, PAM and vpopmail require cache_key to be set for caching to be used.
auth_cache_size = 10M
# Time to live for cached data. After TTL expires the cached record is no # longer used, *except* if the main database lookup returns internal failure. # We also try to handle password changes automatically: If user's previous # authentication was successful, but this one wasn't, the cache isn't used. # For now this works only with plaintext authentication.
auth_cache_ttl = 1 hour
# TTL for negative hits (user not found, password mismatch). # 0 disables caching them completely.
auth_cache_negative_ttl = 0
Yup, we have caching turned on for our production boxes. On this particular box, I'd just shut off caching so that I could work on a script for converting from maildir->mdbox and run it repeatedly on the same mailbox. I got tired of restarting dovecot between each test :)
On Fri, Jan 13, 2012 at 11:38:28AM -0800, Robert Schetterer wrote:
by the way , if you use sql for auth have you tried auth caching ?
Hmm, hadn't tried that, but flipped it on to see how it might work out. The only tradeoff is a potential delay between when an account is disabled and when it can stop authenticating. I set the timeout to 10 minutes for now, with an hour timeout for negative caching.
That page says you can send a USR2 signal to the auth process for cache stats? That doesn't seem to work. OTOH, that page is for version 1, not 2; is there some other way to generate cache stats in version 2?
Thanks...
-- Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/ Operating Systems and Network Analyst | henson@csupomona.edu California State Polytechnic University | Pomona CA 91768
Am 14.01.2012 01:19, schrieb Paul B. Henson:
On Fri, Jan 13, 2012 at 11:38:28AM -0800, Robert Schetterer wrote:
by the way , if you use sql for auth have you tried auth caching ?
Hmm, hadn't tried that, but flipped it on to see how it might work out. The only tradeoff is a potential delay between when an account is disabled and when it can stop authenticating. I set the timeout to 10 minutes for now, with an hour timeout for negative caching.
dont know if i unserstand you right
perhaps this is what you mean, i use this with/cause fail2ban
# TTL for negative hits (user not found, password mismatch). # 0 disables caching them completely.
auth_cache_negative_ttl = 0
That page says you can send a USR2 signal to the auth process for cache stats? That doesn't seem to work. OTOH, that page is for version 1, not 2; is there some other way to generate cache stats in version 2?
auth cache works with dove 2, no idea about dove 1 ,didnt test, but i guess it does
Thanks...
-- Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
On Sat, Jan 14, 2012 at 12:01:12AM -0800, Robert Schetterer wrote:
Hmm, hadn't tried that, but flipped it on to see how it might work out. The only tradeoff is a potential delay between when an account is disabled and when it can stop authenticating. I set the timeout to 10 minutes for now, with an hour timeout for negative caching.
dont know if i unserstand you right
Before I turned on auth caching, every attempted authentication hit our mysql database, which in addition to the password itself contains a flag indicating whether or not the account is enabled. So if somebody was abusing smtp authentication, our helpdesk could disable their account, and it would *immediately* stop working. Whereas with authentication caching enabled, there is a window the size of the ttl where an account that has been disabled can continue to successfully authenticate.
That page says you can send a USR2 signal to the auth process for cache stats? That doesn't seem to work. OTOH, that page is for version 1, not 2; is there some other way to generate cache stats in version 2?
auth cache works with dove 2, no idea about dove 1 ,didnt test, but i guess it does
I'm using dovecot 2; my question was that the documentation for dovecot 1 described a way to make dovecot dump the authentication cache statistics that doesn't seem to work for dovecot 2, and if there was some other way to get the cache statistics in dovecot 2.
Thanks...
On 14.1.2012, at 2.19, Paul B. Henson wrote:
On Fri, Jan 13, 2012 at 11:38:28AM -0800, Robert Schetterer wrote:
by the way , if you use sql for auth have you tried auth caching ?
That page says you can send a USR2 signal to the auth process for cache stats? That doesn't seem to work. OTOH, that page is for version 1, not 2; is there some other way to generate cache stats in version 2?
Works for me. Are you maybe sending it to wrong auth process (auth worker instead of master)?
On 1/20/2012 9:17 AM, Timo Sirainen wrote:
Works for me. Are you maybe sending it to wrong auth process (auth worker instead of master)?
I had tried sending it to both; but the underlying problem turned out to be that the updated config hadn't actually been deployed yet 8-/ oops. Once I fixed that, sending the signal did generate the log output. Evidently nothing is printed out in the case where the authentication caching isn't enabled; maybe you should make it print out something like "Hey idiot, caching isn't turned on" ;).
Thanks...
-- Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/ Operating Systems and Network Analyst | henson@csupomona.edu California State Polytechnic University | Pomona CA 91768
On 1/13/2012 10:29 AM, Mark Moseley wrote:
connection prior to this, is the auth worker not recognizing its connection is already half-closed (in which case, it probably shouldn't even consider it a legitimate connection and just automatically reconnect, i.e. try #1, not the retry, which would happen after another failure).
I don't think there's any way to tell from the mysql api that the server has closed the connection short of trying to use it and getting that specific error. I suppose that specific error could be special cased as an immediate "try again with no penalty" rather than considered a failure.
-- Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/ Operating Systems and Network Analyst | henson@csupomona.edu California State Polytechnic University | Pomona CA 91768
On 13.1.2012, at 20.29, Mark Moseley wrote:
If there are multiple hosts, it seems like the most robust thing to do would be to exhaust the existing connections and if none of those succeed, then start a new connection to one of them. It will probably result in much more convoluted logic but it'd probably match better what people expect from a retry.
Am 28.01.2012 21:07, schrieb Timo Sirainen:
On 13.1.2012, at 20.29, Mark Moseley wrote:
If there are multiple hosts, it seems like the most robust thing to do would be to exhaust the existing connections and if none of those succeed, then start a new connection to one of them. It will probably result in much more convoluted logic but it'd probably match better what people expect from a retry.
Hi Timo
doc/example-config/dovecot-sql.conf.ext from hg has something like
# Database connection string. This is driver-specific setting. # HA / round-robin load-balancing is supported by giving multiple host # settings, like: host=sql1.host.org host=sql2.host.org
but i dont find it in http://wiki2.dovecot.org/AuthDatabase/SQL
-- Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
On 28.1.2012, at 23.06, Robert Schetterer wrote:
doc/example-config/dovecot-sql.conf.ext from hg has something like
# Database connection string. This is driver-specific setting. # HA / round-robin load-balancing is supported by giving multiple host # settings, like: host=sql1.host.org host=sql2.host.org
but i dont find it in http://wiki2.dovecot.org/AuthDatabase/SQL
I added something about it there.
Am 28.01.2012 23:17, schrieb Timo Sirainen:
On 28.1.2012, at 23.06, Robert Schetterer wrote:
doc/example-config/dovecot-sql.conf.ext from hg has something like
# Database connection string. This is driver-specific setting. # HA / round-robin load-balancing is supported by giving multiple host # settings, like: host=sql1.host.org host=sql2.host.org
but i dont find it in http://wiki2.dovecot.org/AuthDatabase/SQL
I added something about it there.
cool thanks !
-- Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
Am 28.01.2012 22:06, schrieb Robert Schetterer:
Am 28.01.2012 21:07, schrieb Timo Sirainen:
On 13.1.2012, at 20.29, Mark Moseley wrote:
If there are multiple hosts, it seems like the most robust thing to do would be to exhaust the existing connections and if none of those succeed, then start a new connection to one of them. It will probably result in much more convoluted logic but it'd probably match better what people expect from a retry.
Hi Timo
doc/example-config/dovecot-sql.conf.ext from hg has something like
# Database connection string. This is driver-specific setting. # HA / round-robin load-balancing is supported by giving multiple host # settings, like: host=sql1.host.org host=sql2.host.org
but i dont find it in http://wiki2.dovecot.org/AuthDatabase/SQL
Hi Timo,just to make sure i have an extra dovecot-dict-quota-sql.conf.ext ( dove 2.0.18 )
connect = host=192.168.123.150 dbname=.. user=... password=...
map { pattern = priv/quota/storage table = quota2 username_field = username value_field = bytes } map { pattern = priv/quota/messages table = quota2 username_field = username value_field = messages }
is it possible to have i.e
connect = host=192.168.123.150 host=127.0.0.1 dbname=.. user=... password=...
there too?
-- Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
On Wed, 2012-02-15 at 09:10 +0100, Robert Schetterer wrote:
Hi Timo,just to make sure i have an extra dovecot-dict-quota-sql.conf.ext ( dove 2.0.18 )
connect = host=192.168.123.150 dbname=.. user=... password=... .. is it possible to have i.e
connect = host=192.168.123.150 host=127.0.0.1 dbname=.. user=... password=...
there too?
Assuming your SQL database is configured to do synchronous multi-master replication, sure.
Am 15.02.2012 14:07, schrieb Timo Sirainen:
On Wed, 2012-02-15 at 09:10 +0100, Robert Schetterer wrote:
Hi Timo,just to make sure i have an extra dovecot-dict-quota-sql.conf.ext ( dove 2.0.18 )
connect = host=192.168.123.150 dbname=.. user=... password=... .. is it possible to have i.e
connect = host=192.168.123.150 host=127.0.0.1 dbname=.. user=... password=...
there too?
Assuming your SQL database is configured to do synchronous multi-master replication, sure.
jep ,thx Timo
-- Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
Am 15.02.2012 17:10, schrieb Robert Schetterer:
Am 15.02.2012 14:07, schrieb Timo Sirainen:
On Wed, 2012-02-15 at 09:10 +0100, Robert Schetterer wrote:
Hi Timo,just to make sure i have an extra dovecot-dict-quota-sql.conf.ext ( dove 2.0.18 )
connect = host=192.168.123.150 dbname=.. user=... password=... .. is it possible to have i.e
connect = host=192.168.123.150 host=127.0.0.1 dbname=.. user=... password=...
there too?
Assuming your SQL database is configured to do synchronous multi-master replication, sure.
jep ,thx Timo
Hi Timo, sorry for some more question what is the default behave if dict quota over sql cant be reached i.e with lmtp
something like log warning and deliver anyway ?
-- Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
On 15.2.2012, at 20.22, Robert Schetterer wrote:
sorry for some more question what is the default behave if dict quota over sql cant be reached i.e with lmtp
something like log warning and deliver anyway ?
I think it tempfails. Try.
Am 15.02.2012 19:43, schrieb Timo Sirainen:
On 15.2.2012, at 20.22, Robert Schetterer wrote:
sorry for some more question what is the default behave if dict quota over sql cant be reached i.e with lmtp
something like log warning and deliver anyway ?
I think it tempfails. Try.
hm, test servers arent up recently cant do it on production
but i will test before i go try dove 2.1 and report, @list perhaps anyone other having it tested before?
-- Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
Am 15.02.2012 20:10, schrieb Robert Schetterer:
Am 15.02.2012 19:43, schrieb Timo Sirainen:
On 15.2.2012, at 20.22, Robert Schetterer wrote:
sorry for some more question what is the default behave if dict quota over sql cant be reached i.e with lmtp
something like log warning and deliver anyway ?
I think it tempfails. Try.
hm, test servers arent up recently cant do it on production
but i will test before i go try dove 2.1 and report, @list perhaps anyone other having it tested before?
ok the test server was good enough for small tests with 2.0.13 dove
private/dovecot-lmtp] said: 451 4.2.0 <....@....com> Internal error occurred. Refer to server log for more information.
dict: Error: mysql(localhost2): Connect failed to database (postfix): Unknown MySQL server host '...' (1) - waiting for 125 seconds before retry
youre right looks like default is tempfail 451
so it stays in postfix queue and is permanent retry
after sql comming back , its getting delivered i.e force with Flush Queue
so thats good enough for me i think
Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
On Sat, Jan 28, 2012 at 12:07 PM, Timo Sirainen <tss@iki.fi> wrote:
On 13.1.2012, at 20.29, Mark Moseley wrote:
If there are multiple hosts, it seems like the most robust thing to do would be to exhaust the existing connections and if none of those succeed, then start a new connection to one of them. It will probably result in much more convoluted logic but it'd probably match better what people expect from a retry.
Excellent, thanks!
On Fri, Jan 13, 2012 at 01:36:38AM -0800, Timo Sirainen wrote:
Also another idea to avoid them in the first place:
service auth-worker { idle_kill = 20 }
Ah, set the auth-worker timeout to less than the mysql timeout to prevent a stale mysql connection from ever being used. I'll try that, thanks.
-- Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/ Operating Systems and Network Analyst | henson@csupomona.edu California State Polytechnic University | Pomona CA 91768
On Fri, Jan 13, 2012 at 2:46 PM, Paul B. Henson <henson@acm.org> wrote:
On Fri, Jan 13, 2012 at 01:36:38AM -0800, Timo Sirainen wrote:
Also another idea to avoid them in the first place:
service auth-worker { idle_kill = 20 }
Ah, set the auth-worker timeout to less than the mysql timeout to prevent a stale mysql connection from ever being used. I'll try that, thanks.
I gave that a try. Sometimes it seems to kill off the auth-worker but not till after a minute or so (with idle_kill = 20). Other times, the worker stays around for more like 5 minutes (I gave up watching), despite being idle -- and I'm the only person connecting to it, so it's definitely idle. Does auth-worker perhaps only wake up every so often to check its idle status?
To test, I kicked off a dsync, then grabbed a netstat:
tcp 0 0 10.1.15.129:40070 10.1.52.47:3306 ESTABLISHED 29146/auth worker [ tcp 0 0 10.1.15.129:33369 10.1.52.48:3306 ESTABLISHED 29146/auth worker [ tcp 0 0 10.1.15.129:54083 10.1.52.49:3306 ESTABLISHED 29146/auth worker [
then kicked off this loop:
# while true; do date; ps p 29146 |tail -n1; sleep 1; done
Fri Jan 13 18:05:14 EST 2012 29146 ? S 0:00 dovecot/auth worker [0 wait, 0 passdb, 0 userdb] Fri Jan 13 18:05:15 EST 2012 29146 ? S 0:00 dovecot/auth worker [0 wait, 0 passdb, 0 userdb]
.... More lines of the loop ...
Fri Jan 13 18:05:35 EST 2012 29146 ? S 0:00 dovecot/auth worker [0 wait, 0 passdb, 0 userdb] 18:05:36.252976 IP 10.1.52.48.3306 > 10.1.15.129.33369: F 77:77(0) ack 92 win 91 <nop,nop,timestamp 1850213473 320254609> 18:05:36.288549 IP 10.1.15.129.33369 > 10.1.52.48.3306: . ack 78 win 913 <nop,nop,timestamp 320257515 1850213473> Fri Jan 13 18:05:36 EST 2012 29146 ? S 0:00 dovecot/auth worker [0 wait, 0 passdb, 0 userdb] 18:05:37.196204 IP 10.1.52.49.3306 > 10.1.15.129.54083: F 806:806(0) ack 1126 win 123 <nop,nop,timestamp 1534230122 320254609> 18:05:37.228594 IP 10.1.15.129.54083 > 10.1.52.49.3306: . ack 807 win 1004 <nop,nop,timestamp 320257609 1534230122> 18:05:37.411955 IP 10.1.52.47.3306 > 10.1.15.129.40070: F 806:806(0) ack 1126 win 123 <nop,nop,timestamp 774321777 320254650> 18:05:37.448573 IP 10.1.15.129.40070 > 10.1.52.47.3306: . ack 807 win 1004 <nop,nop,timestamp 320257631 774321777> Fri Jan 13 18:05:37 EST 2012 29146 ? S 0:00 dovecot/auth worker [0 wait, 0 passdb, 0 userdb]
... more lines of the loop ...
Fri Jan 13 18:10:13 EST 2012 29146 ? S 0:00 dovecot/auth worker [0 wait, 0 passdb, 0 userdb] Fri Jan 13 18:10:14 EST 2012 29146 ? S 0:00 dovecot/auth worker [0 wait, 0 passdb, 0 userdb] ^C
at which point I bailed out. Looking again a couple of minutes later, it was gone. Nothing else was going on and the logs don't show any activity between 18:05:07 and 18:10:44.
On 14.1.2012, at 1.19, Mark Moseley wrote:
Also another idea to avoid them in the first place:
service auth-worker { idle_kill = 20 }
Ah, set the auth-worker timeout to less than the mysql timeout to prevent a stale mysql connection from ever being used. I'll try that, thanks.
I gave that a try. Sometimes it seems to kill off the auth-worker but not till after a minute or so (with idle_kill = 20). Other times, the worker stays around for more like 5 minutes (I gave up watching), despite being idle -- and I'm the only person connecting to it, so it's definitely idle. Does auth-worker perhaps only wake up every so often to check its idle status?
This is fixed in v2.1 hg. The default idle_kill of 60 seconds seems to have gotten rid of the "MySQL server has gone away" errors completely. So I guess the problem was that during some peak times a ton of auth worker processes were created, but afterwards they weren't used until the next peak happened, and then they failed.
http://hg.dovecot.org/dovecot-2.1/rev/3963862a4086 http://hg.dovecot.org/dovecot-2.1/rev/58556a90259f
On 20.1.2012, at 19.16, Timo Sirainen wrote:
This is fixed in v2.1 hg. The default idle_kill of 60 seconds seems to have gotten rid of the "MySQL server has gone away" errors completely. So I guess the problem was that during some peak times a ton of auth worker processes were created, but afterwards they weren't used until the next peak happened, and then they failed.
Hmh. Still doesn't work 100%:
auth-worker(28788): Error: mysql: Query failed, retrying: MySQL server has gone away (idled for 181 secs) auth-worker(7413): Error: mysql: Query failed, retrying: MySQL server has gone away (idled for 298 secs)
I'm not really sure why it's not killing itself after 60 seconds of idling. Probably related to how mysql code tracks idle time and how idle_kill tracks it.. Anyway, those errors are much more rare now.
On 1/20/2012 1:06 PM, Timo Sirainen wrote:
Hmh. Still doesn't work 100%:
auth-worker(28788): Error: mysql: Query failed, retrying: MySQL server has gone away (idled for 181 secs) auth-worker(7413): Error: mysql: Query failed, retrying: MySQL server has gone away (idled for 298 secs)
I'm not really sure why it's not killing itself after 60 seconds of idling. Probably related to how mysql code tracks idle time and how idle_kill tracks it.. Anyway, those errors are much more rare now.
The mysql server starts tracking idle time as beginning after the last network communication with the client. So presumably if the auth worker gets marked as not idle by anything not involving interaction with the mysql server, they could get out of sync.
Before you posted a potential fix to the idle timeout, I was looking at other possible ways to resolve the issue. Currently, an authentication request is tried exactly twice -- one initial try, and one retry.
Looking at driver-sqlpool.c:
if (result->failed_try_retry && !request->retried) {
Currently, retried is a boolean. What if retried was an integer instead, and a new configuration variable allowed you to specify how many times an authentication attempt should be retried? The default could be 2, which would result in exactly the same behavior. But then you could set it to 3 or 4 to prevent a request from hitting a timed out connection twice and failing completely.
Ideally, a better fix would be for the client not to consider a "MySQL server has gone away" return as a failure, but instead immediately reconnect and try again without marking it as a retry. However, from reviewing the code, that would be a much more difficult and invasive change. Changing the existing retried variable to an integer count rather than a boolean is pretty simple.
-- Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/ Operating Systems and Network Analyst | henson@csupomona.edu California State Polytechnic University | Pomona CA 91768
On Fri, Jan 20, 2012 at 09:16:57AM -0800, Timo Sirainen wrote:
This is fixed in v2.1 hg. The default idle_kill of 60 seconds seems to have gotten rid of the "MySQL server has gone away" errors completely. So I guess the problem was that during some peak times a ton of auth worker processes were created, but afterwards they weren't used until the next peak happened, and then they failed.
http://hg.dovecot.org/dovecot-2.1/rev/3963862a4086 http://hg.dovecot.org/dovecot-2.1/rev/58556a90259f
Hmm, I tried to apply this to 2.0.17, and that didn't really work out. Before I spend too much time trying to hand port the changes, do you know off hand if they simply won't apply to 2.0.17 due to other changes made since then? It looks like 2.1 might be out soon, I guess maybe I should just wait for that.
Thanks...
-- Paul B. Henson | (909) 979-6361 | http://www.csupomona.edu/~henson/ Operating Systems and Network Analyst | henson@csupomona.edu California State Polytechnic University | Pomona CA 91768
participants (4)
-
Mark Moseley
-
Paul B. Henson
-
Robert Schetterer
-
Timo Sirainen