On 08 Mar 2011, at 13:24, Chris Wilson wrote:
Hi Thierry,
On Tue, 8 Mar 2011, Thierry de Montaudry wrote:
On 07 Mar 2011, at 19:15, Timo Sirainen wrote:
On Mon, 2011-03-07 at 19:03 +0200, Thierry de Montaudry wrote:
Mar 7 11:19:51 xxx dovecot: pop3-login: Error: net_connect_unix(pop3) failed: Resource temporarily unavailable
[...]
As it is happening at least once a day, is there anything I can do to trace it? And whatever I do, will it slow down those machines?
Set verbose_proctitle=yes (it won't slow things down) and get a list of all Dovecot processes when it happens. Also check how much user and system CPU time it's using and what the load is.
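The second half of that advice (user/system CPU, load, and the Dovecot process list) can also be captured by a small script. A minimal sketch, assuming a Linux host with /proc. It samples the aggregate cpu line of /proc/stat twice to get user and system percentages (roughly what top shows as %us and %sy), reads /proc/loadavg, and lists processes whose command line mentions "dovecot". With verbose_proctitle=yes, those titles carry the extra detail. The function names are illustrative only and are not part of Dovecot; plain `ps aux` and `top` remain the simpler route.

```python
import os
import time
from typing import List, Tuple


def parse_cpu_line(line: str) -> Tuple[int, int, int]:
    """Return (user, system, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest/guest_nice are already included in user/nice, so only the first 8 are summed.
    return values[0], values[2], sum(values[:8])


def cpu_percent(before: Tuple[int, int, int], after: Tuple[int, int, int]) -> Tuple[float, float]:
    """User and system CPU percentages between two samples of parse_cpu_line()."""
    elapsed = after[2] - before[2]
    if elapsed <= 0:
        return 0.0, 0.0
    return (100.0 * (after[0] - before[0]) / elapsed,
            100.0 * (after[1] - before[1]) / elapsed)


def parse_loadavg(text: str) -> Tuple[float, float, float]:
    """1-, 5- and 15-minute load averages from the contents of /proc/loadavg."""
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return one, five, fifteen


def dovecot_processes(proc: str = "/proc") -> List[Tuple[int, str]]:
    """(pid, command line) for every process whose command line mentions 'dovecot'."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:  # the process exited while we were scanning
            continue
        cmd = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        if "dovecot" in cmd:
            found.append((int(entry), cmd))
    return sorted(found)


if __name__ == "__main__":
    with open("/proc/stat") as f:
        first = parse_cpu_line(f.readline())
    time.sleep(1.0)
    with open("/proc/stat") as f:
        second = parse_cpu_line(f.readline())
    user, system = cpu_percent(first, second)
    with open("/proc/loadavg") as f:
        load = parse_loadavg(f.read())
    print(f"user {user:.1f}%  system {system:.1f}%  load {load}")
    for pid, cmd in dovecot_processes():
        print(pid, cmd)
```

Running it from cron or a loop during an incident would give a timestamped record without the overhead of leaving top open.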
I got the same problem this morning. Here are the CPU usage and the ps aux output for Dovecot, plus the different errors I could pick up in the log; most of them are repeated a couple of times.
I suspect it's a problem with system resources, but I can't find any message telling me which. Mail is stored on 17 NFS servers (CentOS), plus 3 servers for indexes only.
CPU load is very high, but mainly from httpd running our webmail interface, which uses the local IMAP server. [...]

top - 11:10:14 up 14 days, 12:04, 2 users, load average: 55.04, 29.13, 14.55
Tasks: 474 total, 60 running, 414 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.6%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 16439812k total, 16353268k used, 86544k free, 33268k buffers
Swap: 4192956k total, 140k used, 4192816k free, 8228744k cached
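The top summary above can be read field by field. The sketch below is only an illustration and is not part of any tool mentioned in the thread. It assumes the older procps top layout shown here, where "cached" is printed on the Swap line. For this snapshot it picks out 99.6% user CPU with 0.0% idle and 0.0% iowait, swap use of only 140k, and free + buffers + cached of 8348556k. The last figure is a rough estimate of memory the kernel could give back, not an exact "available" value.

```python
import re
from typing import Dict, Tuple

# Summary lines copied from the top snapshot in the message.
SNAPSHOT = """\
top - 11:10:14 up 14 days, 12:04, 2 users, load average: 55.04, 29.13, 14.55
Tasks: 474 total, 60 running, 414 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.6%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 16439812k total, 16353268k used, 86544k free, 33268k buffers
Swap: 4192956k total, 140k used, 4192816k free, 8228744k cached
"""


def load_averages(text: str) -> Tuple[float, float, float]:
    """The three load averages from top's first summary line."""
    m = re.search(r"load average: ([\d.]+), ([\d.]+), ([\d.]+)", text)
    if m is None:
        raise ValueError("no load average found")
    return float(m.group(1)), float(m.group(2)), float(m.group(3))


def cpu_states(text: str) -> Dict[str, float]:
    """Map each CPU state (us, sy, ni, id, wa, hi, si, st) to its percentage."""
    line = next(ln for ln in text.splitlines() if ln.startswith("Cpu(s):"))
    return {name: float(value) for value, name in re.findall(r"([\d.]+)%(\w+)", line)}


def kb_fields(text: str, prefix: str) -> Dict[str, int]:
    """Map each 'NNNk name' field on the line starting with prefix to its value in kB."""
    line = next(ln for ln in text.splitlines() if ln.startswith(prefix))
    return {name: int(value) for value, name in re.findall(r"(\d+)k (\w+)", line)}


if __name__ == "__main__":
    cpu = cpu_states(SNAPSHOT)
    mem = kb_fields(SNAPSHOT, "Mem:")
    swap = kb_fields(SNAPSHOT, "Swap:")
    # Rough estimate only: page cache and buffers are mostly reclaimable.
    reclaimable = mem["free"] + mem["buffers"] + swap["cached"]
    print("load:", load_averages(SNAPSHOT))
    print(f"user {cpu['us']}%, system {cpu['sy']}%, idle {cpu['id']}%, iowait {cpu['wa']}%")
    print(f"free+buffers+cached: {reclaimable} kB; swap used: {swap['used']} kB")
```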
You're lucky this server is still alive and that you could even run top and ps on it.
There's nothing to debug in Dovecot here. Your server is overloaded about 55 times over. Buy 55 times as many servers, or do something about your webmail interface (maybe a separate webmail cluster).
Cheers, Chris.
As you can see from the numbers (55.04, 29.13, 14.55), the load was still climbing when I took this snapshot, and this was not a normal situation. Usually this machine's load is only between 1 and 4, which is quite OK for a quad core. It only happens when Dovecot starts generating errors and POP3, IMAP and HTTP get stuck. It went up to 200, and I was still able to stop the web and mail daemons, then restart them, and everything was back to normal.
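Dividing the load figures by the number of CPUs puts them on a common scale, and comparing the 1-, 5- and 15-minute averages shows the direction of travel. A minimal sketch, assuming 4 CPUs based on the "quad core" description (the exact count isn't given). On Linux the load average counts tasks that are runnable or in uninterruptible sleep, so it measures queue length rather than CPU percentage.

```python
from typing import Tuple

# Assumption: 4 CPUs, taken from the "quad core" description in the thread.
CORES = 4


def per_core(load: float, cores: int = CORES) -> float:
    """Average number of runnable/uninterruptible tasks per CPU implied by a load value."""
    return load / cores


def trend(one: float, five: float, fifteen: float) -> str:
    """'rising' if the short-term average exceeds the longer ones, 'falling' if the reverse."""
    if one > five > fifteen:
        return "rising"
    if one < five < fifteen:
        return "falling"
    return "mixed"


if __name__ == "__main__":
    snapshot: Tuple[float, float, float] = (55.04, 29.13, 14.55)
    print("trend:", trend(*snapshot))
    for label, load in [("normal low", 1.0), ("normal high", 4.0),
                        ("snapshot", snapshot[0]), ("peak", 200.0)]:
        print(f"{label:>12}: load {load:6.2f} -> {per_core(load):5.2f} per CPU")
```

With the snapshot values the 1-minute average is well above the 15-minute one, which matches the description of a load that was still climbing; the usual 1 to 4 range works out to 0.25 to 1.0 per CPU.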