On Fri, 2011-09-09 at 19:33 -0700, Paul B. Henson wrote:
According to the sample SQL configuration file "HA / round-robin load-balancing is supported by giving multiple host settings, like: host=sql1.host.org host=sql2.host.org".
However, as far as I can tell, dovecot only connects to the first listed host and processes all queries through it. There does not appear to be any load-balancing going on.
The current code creates a connection to the second server only when the first connection is already busy with an SQL query, or when it's not working. Once there is more than one connection, it starts doing round-robin lookups.
This works okay enough with PostgreSQL because it does asynchronous lookups, so two simultaneous lookups create a second connection. MySQL does synchronous lookups though, so the second connection is normally never created.
I suppose the fix to this would be to always connect to all SQL servers at startup.
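To make the behaviour described above concrete, here is a minimal sketch in Python. It is a toy model, not Dovecot's actual code: the names Conn, Pool, pick, run_sync and connect_all_at_startup are all made up for illustration. It only models the rule "open another connection when every existing one is busy or down, otherwise round-robin over what is already open".

```python
from dataclasses import dataclass, field


@dataclass
class Conn:
    host: str
    busy: bool = False
    up: bool = True


@dataclass
class Pool:
    """Toy model of the connection pool (hypothetical, not Dovecot code)."""
    hosts: list
    connect_all_at_startup: bool = False
    conns: list = field(default_factory=list)
    next_idx: int = 0

    def __post_init__(self):
        # Current behaviour: only the first host is connected initially.
        # Proposed fix: connect to every listed host right away.
        count = len(self.hosts) if self.connect_all_at_startup else 1
        for host in self.hosts[:count]:
            self.conns.append(Conn(host))

    def pick(self):
        """Round-robin over open connections that are up and idle."""
        n = len(self.conns)
        for i in range(n):
            conn = self.conns[(self.next_idx + i) % n]
            if conn.up and not conn.busy:
                self.next_idx = (self.next_idx + i + 1) % n
                return conn
        # Every open connection is busy or down: open the next host, if any.
        if n < len(self.hosts):
            conn = Conn(self.hosts[n])
            self.conns.append(conn)
            return conn
        return None


def run_sync(pool, lookups):
    """Synchronous lookups: each one finishes before the next one starts."""
    used = []
    for _ in range(lookups):
        conn = pool.pick()
        conn.busy = True
        used.append(conn.host)
        conn.busy = False
    return used
```

With synchronous lookups the pool never sees a busy connection, so run_sync(Pool(["sql1", "sql2"]), 4) only ever uses sql1. Marking the first connection busy before the next pick() (the asynchronous case) opens sql2, and passing connect_all_at_startup=True alternates between both hosts from the first lookup.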
That's not necessarily a dealbreaker; however, high-availability does not appear to be working either.
If I shut down the first mysql server, dovecot starts to log connection failures:
Sep 9 15:47:34 tweak dovecot: auth: Error: mysql(mysql-1.unx.csupomona.edu): Connect failed to database (idmgmt): Can't connect to MySQL server on 'mysql-1.unx.csupomona.edu' (111) - waiting for 1 seconds before retry
Sep 9 15:47:39 tweak dovecot: auth: Error: mysql(mysql-1.unx.csupomona.edu): Connect failed to database (idmgmt): Can't connect to MySQL server on 'mysql-1.unx.csupomona.edu' (111) - waiting for 25 seconds before retry
Those are intentional.
And postfix starts to fail authentications:
Sep 9 15:47:35 tweak postfix/smtpd[5119]: warning: bender.iitsys.csupomona.edu[134.71.250.134]: SASL DIGEST-MD5 authentication failed: Connection lost to authentication server
It should have created the second connection here instead of failing.
Now and again the authentication process dies:
Sep 9 15:47:39 tweak dovecot: auth: Panic: file auth-request-handler.c: line 697 (auth_request_handler_flush_failures): assertion failed: (auth_request->state == AUTH_REQUEST_STATE_FINISHED)
And this of course shouldn't happen either.
Requests start to pile up:
Sep 9 15:51:46 tweak dovecot: auth: Warning: auth workers: Auth request was queued for 25 seconds, 45 left in queue
Lookups time out:
Sep 9 15:57:22 tweak dovecot: auth: Error: auth worker: Aborted request: Lookup timed out
These are the result of the previous failures.
This occasionally pops up:
Sep 9 15:58:38 tweak dovecot: auth: Fatal: net_connect_unix(auth-worker) failed: Resource temporarily unavailable
Probably this too.
And sometimes the auth process gets temporarily disabled:
Sep 9 15:58:57 tweak dovecot: master: Error: service(auth): command startup failed, throttling
Most likely related to the crash, although I think this still shouldn't have happened.
I don't think all authentications fail during the scenario, but I think the majority do. Based on the network traffic, dovecot is almost continuously trying to connect to the first listed server. It sometimes connects to the second listed server, but when it does, the connection does not persist; it goes away almost immediately.
There are multiple auth-worker processes, each with its own internal MySQL connections and separate retry counters.
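A rough Python sketch of why separate per-worker retry counters can look like near-continuous reconnecting on the wire. Everything here is assumed for illustration: the function names, the worker start times, the delay cap, and the multiply-by-5 schedule, which is picked only because it is consistent with the 1 and 25 second waits in the logs above. The real schedule is not stated in this thread.

```python
def attempt_times(start, count, first_delay=1, factor=5, max_delay=300):
    """Seconds at which ONE worker tries the dead host (assumed schedule)."""
    times, t, delay = [], start, first_delay
    for _ in range(count):
        times.append(t)
        t += delay                       # wait before the next retry
        delay = min(delay * factor, max_delay)
    return times


def merged_attempts(worker_starts, count):
    """All workers' attempts on one timeline, as a packet capture sees them."""
    return sorted(t for s in worker_starts for t in attempt_times(s, count))


# One worker backs off quickly...
print(attempt_times(0, 4))               # [0, 1, 6, 31]
# ...but three workers starting 10 s apart keep the first host under fire.
print(merged_attempts([0, 10, 20], 3))   # [0, 1, 6, 10, 11, 16, 20, 21, 26]
```

Each worker backs off on its own, but since no state is shared, a newly started or newly affected worker begins again from the shortest delay.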
I'll try to debug this soon.