On 26/01/2012 14:37, Mark Zealey wrote:
I've tried reproducing by having long running auth queries in the sql and KILLing them on the server, restarting the mysql service, and setting max auth workers to 1 and running 2 sessions at the same time (with long-running auth queries), but to no effect. There must be something else going on here; I saw it in particular when exim on our frontend servers had queued a large number of messages and suddenly released them all at once hence the auth-worker hypothesis although the log messages do not support this. I'll try to see if I can trigger this manually although we have been doing some massively parallel testing previously and not seen this.
Could it be a *timeout* rather than lack of worker processes? Theory would be that disk starvation causes other processes to take a long time to respond, hence the worker is *alive*, but doesn't return a response quickly enough, which in turn causes the "unknown user" message?
You could try a different disk io scheduler, or ionice to control the effect of these big bursts of disk activity on other processes?
(Most MTA programs such as postfix and qmail do a lot of fsyncs - this will cause a lot of IO activity and could easily starve other processes on the same box?)
Good luck
Ed W