[Dovecot] auth-worker temporary failures causing lmtp 500 rejection
Hi there,
I'm using dovecot 2.0.16 with a mysql user database. From time to time when we have a big influx of messages (perhaps more than 30 concurrent rcpt to:<> sessions at the same time so no auth-workers free?) or when we have a transient issue connecting to the database server, we see the message:
Jan 25 16:38:23 mailbox dovecot: auth-worker: sql(foo@bar.com,1.2.3.4): Unknown user
and the lmtp process returns:
550 5.1.1 <foo@bar.com> User doesn't exist: foo@bar.com
This would be correct for a permanent error where the user doesn't exist in our database, however it seems to be doing this on transient errors too. Is this an issue with the code or perhaps some setting I have missed?
Thanks,
Mark
On 26.1.2012, at 12.14, Mark Zealey wrote:
I'm using dovecot 2.0.16 with a mysql user database. From time to time when we have a big influx of messages (perhaps more than 30 concurrent rcpt to:<> sessions at the same time so no auth-workers free?) or when we have a transient issue connecting to the database server, we see the message:
Jan 25 16:38:23 mailbox dovecot: auth-worker: sql(foo@bar.com,1.2.3.4): Unknown user
This happens only when the SQL query doesn't return any rows, but does return success.
and the lmtp process returns:
550 5.1.1 <foo@bar.com> User doesn't exist: foo@bar.com
This would be correct for a permanent error where the user doesn't exist in our database, however it seems to be doing this on transient errors too. Is this an issue with the code or perhaps some setting I have missed?
The problem is that temporary errors are returning "unknown user". Can you reproduce this somehow? Like if you stop MySQL it always returns that "Unknown user"?
I've tried reproducing by having long running auth queries in the sql and KILLing them on the server, restarting the mysql service, and setting max auth workers to 1 and running 2 sessions at the same time (with long-running auth queries), but to no effect. There must be something else going on here; I saw it in particular when exim on our frontend servers had queued a large number of messages and suddenly released them all at once hence the auth-worker hypothesis although the log messages do not support this. I'll try to see if I can trigger this manually although we have been doing some massively parallel testing previously and not seen this.
Mark
From: Timo Sirainen [tss@iki.fi] Sent: 26 January 2012 12:31 To: Mark Zealey Cc: dovecot@dovecot.org Subject: Re: [Dovecot] auth-worker temporary failures causing lmtp 500 rejection
On 26.1.2012, at 12.14, Mark Zealey wrote:
I'm using dovecot 2.0.16 with a mysql user database. From time to time when we have a big influx of messages (perhaps more than 30 concurrent rcpt to:<> sessions at the same time so no auth-workers free?) or when we have a transient issue connecting to the database server, we see the message:
Jan 25 16:38:23 mailbox dovecot: auth-worker: sql(foo@bar.com,1.2.3.4): Unknown user
This happens only when the SQL query doesn't return any rows, but does return success.
and the lmtp process returns:
550 5.1.1 <foo@bar.com> User doesn't exist: foo@bar.com
This would be correct for a permanent error where the user doesn't exist in our database, however it seems to be doing this on transient errors too. Is this an issue with the code or perhaps some setting I have missed?
The problem is that temporary errors are returning "unknown user". Can you reproduce this somehow? Like if you stop MySQL it always returns that "Unknown user"?
On 26/01/2012 14:37, Mark Zealey wrote:
I've tried reproducing by having long running auth queries in the sql and KILLing them on the server, restarting the mysql service, and setting max auth workers to 1 and running 2 sessions at the same time (with long-running auth queries), but to no effect. There must be something else going on here; I saw it in particular when exim on our frontend servers had queued a large number of messages and suddenly released them all at once hence the auth-worker hypothesis although the log messages do not support this. I'll try to see if I can trigger this manually although we have been doing some massively parallel testing previously and not seen this.
Could it be a *timeout* rather than lack of worker processes? Theory would be that disk starvation causes other processes to take a long time to respond, hence the worker is *alive*, but doesn't return a response quickly enough, which in turn causes the "unknown user" message?
You could try a different disk io scheduler, or ionice to control the effect of these big bursts of disk activity on other processes?
(Most MTA programs such as postfix and qmail do a lot of fsyncs - this will cause a lot of IO activity and could easily starve other processes on the same box?)
Good luck
Ed W
On 26.1.2012, at 18.06, Ed W wrote:
Could it be a *timeout* rather than lack of worker processes?
The message in log was "Unknown user". The only reason this happens is if MySQL library's query functions returned success without any rows. No timeouts, crashes, or anything else can give that error message. So I'd the problem is either in MySQL library or MySQL server.
Try if the attached patch gives any crashes. If it does, it means that mysql library returned mysql_errno()=0 (success) even though it should have returned a failure. Or you could even change it to only:
i_assert(result->result != NULL);
if you're not using MySQL for anything else than auth. The other possibility is if in driver_mysql_result_next_row() the mysql_fetch_row() returns NULL, but also there I'm checking mysql_errno().
participants (4)
-
Ed W
-
Mark Zealey
-
Mark Zealey
-
Timo Sirainen