[Dovecot] Deliver: command time exceed limit
We use Dovecot on a cluster (10'000 domains, 30'000 account) without any problem with pop/imap. We use Dovecot 1.0 RC10
Now we use virtual transport from postfix for mail delivery.
We need the Dovecot LDA for sieve support.
We test this solution on some domains with success.
But if we change the configuration of all our domains, we have a lots of bounce with this error in postfix log:
postfix/pipe[24573]: 4F6D7CF11: to=info@XXXX.com, relay=dovecot, delay=1200, status=bounced (Command time limit exceeded: "/usr/local/dovecot/libexec/dovecot/deliver")
If I watch the deliver activity with ps/awk/watch, I see that if deliver duration time exceed some seconds, the deliver process stay in memory but do nothing. Here a strace output of this case:
Process 349 attached - interrupt to quit gettimeofday({1161965973, 768478}, {0, 0}) = 0 poll(
Strace output nothing before the SIGKILL from postfix (command time exceed 1200 second), the end of the strace:
[{fd=5, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=9, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}], 2, 2147483647) = -1 EINTR (Interrupted system call) +++ killed by SIGKILL +++
I'm not sure is this problem has a relation with our deliver problem. But in the deliver log /var/log/dovecot/deliver, I have a lots of message:
deliver(info@labomex.com): 2006.10.27 18:29:28 Error: file_dotlock_replace(/var/mail/labomex.com/mails/info/.dovecot.lda-dupes) failed: No such file or directory deliver(info@labomex.com): 2006.10.27 18:29:28 Error: rename(/var/mail/labomex.com/mails/info/.dovecot.lda-dupes.lock, /var/mail/labomex.com/mails/info/.dovecot.lda-dupes) failed: No such file or directory
Our dovecot.conf (lda parts):
protocol lda { postmaster_address = postmaster@clm.net4all.ch #hostname = mail_plugins = cmusieve mail_plugin_dir = /usr/local/dovecot/lib/dovecot/lda sendmail_path = /usr/sbin/sendmail auth_socket_path = /var/run/dovecot/auth-master log_timestamp = %Y.%m.%d %H:%M:%S%t log_path = /var/log/dovecot/deliver info_log_path = /var/log/dovecot/deliver.info }
We use on UID/GID per domain so every account of a domain use the same UID/GID.
The mail storage is on a Debian Sarge NFS server with 2.4.27-3-686-smp. Indexes are stored localy on each POP/IMAP/LDA server.
I don't understand why we have this problem. If somebody can help us, thanks a lots.
Dominique Feyer
After a lots of test, I found that deliver (LDA Dovecot) put the message in the mailbox of the user, but the process dont quit. After the max command time (from postfix), postfix kill the process, return an error in the log and bounce the message.
But always no idea, why this deliver process dont quit after delivery.
Thanks
Le Fri, 27 Oct 2006 18:47:37 +0200, Dominique Feyer dfeyer@net4all.ch a écrit :
We use Dovecot on a cluster (10'000 domains, 30'000 account) without any problem with pop/imap. We use Dovecot 1.0 RC10
Now we use virtual transport from postfix for mail delivery.
We need the Dovecot LDA for sieve support.
We test this solution on some domains with success.
But if we change the configuration of all our domains, we have a lots of bounce with this error in postfix log:
postfix/pipe[24573]: 4F6D7CF11: to=info@XXXX.com, relay=dovecot, delay=1200, status=bounced (Command time limit exceeded: "/usr/local/dovecot/libexec/dovecot/deliver")
If I watch the deliver activity with ps/awk/watch, I see that if deliver duration time exceed some seconds, the deliver process stay in memory but do nothing. Here a strace output of this case:
Process 349 attached - interrupt to quit gettimeofday({1161965973, 768478}, {0, 0}) = 0 poll(
Strace output nothing before the SIGKILL from postfix (command time exceed 1200 second), the end of the strace:
[{fd=5, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}, {fd=9, events=POLLIN|POLLPRI|POLLERR|POLLHUP|POLLNVAL}], 2, 2147483647) = -1 EINTR (Interrupted system call) +++ killed by SIGKILL +++
I'm not sure is this problem has a relation with our deliver problem. But in the deliver log /var/log/dovecot/deliver, I have a lots of message:
deliver(info@labomex.com): 2006.10.27 18:29:28 Error: file_dotlock_replace(/var/mail/labomex.com/mails/info/.dovecot.lda-dupes) failed: No such file or directory deliver(info@labomex.com): 2006.10.27 18:29:28 Error: rename(/var/mail/labomex.com/mails/info/.dovecot.lda-dupes.lock, /var/mail/labomex.com/mails/info/.dovecot.lda-dupes) failed: No such file or directory
Our dovecot.conf (lda parts):
protocol lda { postmaster_address = postmaster@clm.net4all.ch #hostname = mail_plugins = cmusieve mail_plugin_dir = /usr/local/dovecot/lib/dovecot/lda sendmail_path = /usr/sbin/sendmail auth_socket_path = /var/run/dovecot/auth-master log_timestamp = %Y.%m.%d %H:%M:%S%t log_path = /var/log/dovecot/deliver info_log_path = /var/log/dovecot/deliver.info }
We use on UID/GID per domain so every account of a domain use the same UID/GID.
The mail storage is on a Debian Sarge NFS server with 2.4.27-3-686-smp. Indexes are stored localy on each POP/IMAP/LDA server.
I don't understand why we have this problem. If somebody can help us, thanks a lots.
Dominique Feyer
On Mon, 2006-10-30 at 14:43 +0100, Dominique Feyer wrote:
After a lots of test, I found that deliver (LDA Dovecot) put the message in the mailbox of the user, but the process dont quit. After the max command time (from postfix), postfix kill the process, return an error in the log and bounce the message.
This conflicts with the only reason that I can see for this:
Process 349 attached - interrupt to quit gettimeofday({1161965973, 768478}, {0, 0}) = 0 poll(
This should only happen at startup when it's connecting to dovecot-auth. So my guess would have been that dovecot-auth is busy and not answering to our requests.
I guess I should put some kind of a timeout to this myself also..
deliver(info@labomex.com): 2006.10.27 18:29:28 Error: file_dotlock_replace(/var/mail/labomex.com/mails/info/.dovecot.lda-dupes) failed: No such file or directory deliver(info@labomex.com): 2006.10.27 18:29:28 Error: rename(/var/mail/labomex.com/mails/info/.dovecot.lda-dupes.lock, /var/mail/labomex.com/mails/info/.dovecot.lda-dupes) failed: No such file or directory
Hmm.. Something seems to be overriding or deleting the dotlocks.. Probably because the deliver hangs for a long time somewhere. Possibly when trying to send mails?
On Thu, 2006-11-02 at 20:04 +0200, Timo Sirainen wrote:
Process 349 attached - interrupt to quit gettimeofday({1161965973, 768478}, {0, 0}) = 0 poll(
This should only happen at startup when it's connecting to dovecot-auth. So my guess would have been that dovecot-auth is busy and not answering to our requests.
I guess I should put some kind of a timeout to this myself also..
Done, see if this starts giving timeout errors (or wait for rc11 which should come in a couple of days):
http://dovecot.org/list/dovecot-cvs/2006-November/006644.html
After upgrade to the RC12 no more time limit exceed in postfix log, but in the deliver log, I have:
Error: User request from dovecot-auth timed out
How can I improve the performance of the dovecot-auth ?
Can I use the dovecot-auth cache ? It's disable in our configuration ?
Thanks
Le jeudi 02 novembre 2006 à 20:11 +0200, Timo Sirainen a écrit :
On Thu, 2006-11-02 at 20:04 +0200, Timo Sirainen wrote:
Process 349 attached - interrupt to quit gettimeofday({1161965973, 768478}, {0, 0}) = 0 poll(
This should only happen at startup when it's connecting to dovecot-auth. So my guess would have been that dovecot-auth is busy and not answering to our requests.
I guess I should put some kind of a timeout to this myself also..
Done, see if this starts giving timeout errors (or wait for rc11 which should come in a couple of days):
http://dovecot.org/list/dovecot-cvs/2006-November/006644.html
On Wed, 2006-11-08 at 13:44 +0100, Dominique Feyer wrote:
After upgrade to the RC12 no more time limit exceed in postfix log, but in the deliver log, I have:
Error: User request from dovecot-auth timed out
How can I improve the performance of the dovecot-auth ?
Can I use the dovecot-auth cache ? It's disable in our configuration ?
If the deliver is giving such timeouts, then I'd guess that you'll get similar problems for pop/imap logins as well.
What passdb and userdb are you using? auth-cache might help, but I'm guessing the problem is that you're using some blocking passdb/userdb which causes the whole process to hang while waiting for a reply from somewhere.
How many dovecot-auth processes do you have? The default is one (auth section, count=1). Growing that might help. How large is the CPU usage by the dovecot-auth?
Here is my configuration (auth parts):
auth_cache_size = 0 auth_cache_ttl = 3600 auth default { mechanisms = plain login passdb sql { args = /etc/dovecot/dovecot-pgsql.conf } userdb sql { args = /etc/dovecot/dovecot-pgsql.conf } socket listen { master { path = /var/run/dovecot/auth-master mode = 0660 user = vmail group = mail } } user = dovecot count = 1 #ssl_require_client_cert = no #ssl_username_from_cert = no }
The socket is use by deliver ? Right ? So I can't set count to more than 1, if I do that dovecto don't start, the process try to create the socket, but after the first process start, the socket exist.
I try to add an other section without socket to have more dovecot-auth process:
auth normal { mechanisms = plain login passdb sql { args = /etc/dovecot/dovecot-pgsql.conf } userdb sql { args = /etc/dovecot/dovecot-pgsql.conf } user = dovecot count = 48 #ssl_require_client_cert = no #ssl_username_from_cert = no }
Our backend i a pgsql server with a average load of 0.10-0.15
/etc/dovecot/dovecot-pgsql.conf
driver = pgsql connect = host=10.#.#.# dbname=mail user=mailreader password=secret default_pass_scheme = CRYPT password_query = SELECT password FROM users WHERE userid = '%u' user_query = SELECT '/var/mail/'||home AS home, uid, gid FROM users WHERE userid = '%u'
The CPU usage by dovecot-auth is quiet normal. The server has load always under 0.8.
Thanks a lots for your help,
Le mercredi 08 novembre 2006 à 14:51 +0200, Timo Sirainen a écrit :
On Wed, 2006-11-08 at 13:44 +0100, Dominique Feyer wrote:
After upgrade to the RC12 no more time limit exceed in postfix log, but in the deliver log, I have:
Error: User request from dovecot-auth timed out
How can I improve the performance of the dovecot-auth ?
Can I use the dovecot-auth cache ? It's disable in our configuration ?
If the deliver is giving such timeouts, then I'd guess that you'll get similar problems for pop/imap logins as well.
What passdb and userdb are you using? auth-cache might help, but I'm guessing the problem is that you're using some blocking passdb/userdb which causes the whole process to hang while waiting for a reply from somewhere.
How many dovecot-auth processes do you have? The default is one (auth section, count=1). Growing that might help. How large is the CPU usage by the dovecot-auth?
On Wed, 2006-11-08 at 13:30 +0000, Dominique Feyer wrote:
passdb sql { args = /etc/dovecot/dovecot-pgsql.conf } userdb sql { args = /etc/dovecot/dovecot-pgsql.conf } .. The socket is use by deliver ? Right ? So I can't set count to more than 1, if I do that dovecto don't start, the process try to create the socket, but after the first process start, the socket exist.
Yea, right, it won't work if external sockets have been created..
But you're using PostgreSQL which doesn't do any blocking, so it shouldn't have these kind of timeout problems. I'm not really sure how to debug this further without knowing exactly what's happening in dovecot-auth when deliver is hanging.
Can you reproduce this with any tests? Such as delivering lots of mails in a short time? If so, could you do a strace -tt of the dovecot-auth and preferrably also the hanging deliver process?
I've also attached a patch, which you could try to see if it changes anything, but I don't think it should.
participants (2)
-
Dominique Feyer
-
Timo Sirainen