Re: [Dovecot] POP3 error
On 08 Mar 2011, at 13:24, Chris Wilson wrote:
Hi Thierry,
On Tue, 8 Mar 2011, Thierry de Montaudry wrote:
On 07 Mar 2011, at 19:15, Timo Sirainen wrote:
On Mon, 2011-03-07 at 19:03 +0200, Thierry de Montaudry wrote:
>>> Mar 7 11:19:51 xxx dovecot: pop3-login: Error: net_connect_unix(pop3) failed: Resource temporarily unavailable
As it is happening at least once a day, is there anything I can do to trace it? And will whatever I do slow those machines down?
Set verbose_proctitle=yes (won't slow down) and get list of all Dovecot processes when it happens. And check how much user and system CPU it's using and what the load is.
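A minimal way to capture that snapshot when the errors appear (the /tmp path here is only an example) might be:
# with verbose_proctitle = yes set in dovecot.conf, grab a snapshot
# as soon as the net_connect_unix() errors show up in the log:
ts=$(date +%s)
top -b -n 1 | head -n 20 > /tmp/dovecot-snapshot.$ts
ps auxww | grep '[d]ovecot' >> /tmp/dovecot-snapshot.$ts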
Got the same problem this morning; here are the CPU usage and ps aux output for dovecot, plus the different errors I could pick up in the log (most of them are repeated a couple of times).
I suspect it is a problem with system resources, but I can't find any message telling me which. Mail is stored on 17 NFS servers (CentOS), plus 3 servers for indexes only.
CPU load is very high, but mainly from httpd running our webmail interface, which uses the local imap server. [...]
top - 11:10:14 up 14 days, 12:04, 2 users, load average: 55.04, 29.13, 14.55
Tasks: 474 total, 60 running, 414 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.6%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 16439812k total, 16353268k used, 86544k free, 33268k buffers
Swap: 4192956k total, 140k used, 4192816k free, 8228744k cached
You're lucky this server is still alive and that you could even run top and ps on it.
There's nothing to debug in dovecot here. Your server is overloaded by about 55 times. Buy 55 times as many servers or do something about your webmail interface (maybe a separate webmail cluster).
Cheers, Chris.
As you can see from the numbers (55.04, 29.13, 14.55), the load was still climbing when I took this snapshot; this was not a normal situation. Usually this machine's load is only between 1 and 4, which is quite OK for a quad core. It only happens when dovecot starts generating errors, and POP3, IMAP and HTTP get stuck. The load went up to 200, but I was still able to stop the web and mail daemons, restart them, and everything was back to normal.
On 2011-03-08 10:40 AM, Thierry de Montaudry wrote:
On 08 Mar 2011, at 13:24, Chris Wilson wrote:
There's nothing to debug in dovecot here. Your server is overloaded by about 55 times. Buy 55 times as many servers or do something about your webmail interface (maybe a separate webmail cluster).
As you can see from the numbers (55.04, 29.13, 14.55), the load was still climbing when I took this snapshot; this was not a normal situation. Usually this machine's load is only between 1 and 4, which is quite OK for a quad core. It only happens when dovecot starts generating errors, and POP3, IMAP and HTTP get stuck. The load went up to 200, but I was still able to stop the web and mail daemons, restart them, and everything was back to normal.
What is your webmail server (and version)? Maybe it is buggy?
--
Best regards,
Charles
On 08 Mar 2011, at 18:14, Charles Marcus wrote:
On 2011-03-08 10:40 AM, Thierry de Montaudry wrote:
On 08 Mar 2011, at 13:24, Chris Wilson wrote:
There's nothing to debug in dovecot here. Your server is overloaded by about 55 times. Buy 55 times as many servers or do something about your webmail interface (maybe a separate webmail cluster).
As you can see from the numbers (55.04, 29.13, 14.55), the load was still climbing when I took this snapshot; this was not a normal situation. Usually this machine's load is only between 1 and 4, which is quite OK for a quad core. It only happens when dovecot starts generating errors, and POP3, IMAP and HTTP get stuck. The load went up to 200, but I was still able to stop the web and mail daemons, restart them, and everything was back to normal.
What is your webmail server (and version)? Maybe it is buggy?
Using HastyMail2-1.0. But the problem only started when we moved to dovecot 2.0.9 (from 1.10.13), without changing anything else on any of our 7 machines, and now it's happening randomly on any of them. So that's why I suspect it has to do with dovecot.
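One quick check after a 1.x to 2.0 jump is to let the 2.0 tools dump and convert the configuration themselves; roughly (the old config path is just a placeholder):
# show the non-default settings the running 2.0 install actually uses
doveconf -n
# convert an old v1.x config to v2.0 syntax and review the warnings it prints
doveconf -n -c /etc/dovecot-1.x.conf > /etc/dovecot/dovecot-2.conf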
On 2011-03-08 11:49 AM, Thierry de Montaudry wrote:
Using HastyMail2-1.0. But the problem only started when we moved to dovecot 2.0.9 (from 1.10.13), without changing anything else on any of our 7 machines, and now it's happening randomly on any of them. So that's why I suspect it has to do with dovecot.
Or an interaction of the new version of Dovecot and HastyMail.
The reason I asked about your webmail server is you had specifically said that it was the httpd process that was consuming all of the CPU...
--
Best regards,
Charles
On 08 Mar 2011, at 19:11, Charles Marcus wrote:
On 2011-03-08 11:49 AM, Thierry de Montaudry wrote:
Using HastyMail2-1.0. But the problem only started when we moved to dovecot 2.0.9 (from 1.10.13), without changing anything else on any of our 7 machines, and now it's happening randomly on any of them. So that's why I suspect it has to do with dovecot.
Or an interaction of the new version of Dovecot and HastyMail.
The reason I asked about your webmail server is you had specifically said that it was the httpd process that was consuming all of the CPU...
Yes, because they were at the top of the top list.
On 2011-03-08 12:30 PM, Thierry de Montaudry wrote:
On 08 Mar 2011, at 19:11, Charles Marcus wrote:
The reason I asked about your webmail server is you had specifically said that it was the httpd process that was consuming all of the CPU...
Yes, because they were at the top of the top list.
So... if the httpd process is the one consuming all of the CPU, doesn't it stand to reason that it might be something to do with one of your web apps, and not dovecot?
--
Best regards,
Charles
On 08 Mar 2011, at 19:37, Charles Marcus wrote:
On 2011-03-08 12:30 PM, Thierry de Montaudry wrote:
On 08 Mar 2011, at 19:11, Charles Marcus wrote:
The reason I asked about your webmail server is you had specifically said that it was the httpd process that was consuming all of the CPU...
Yes, because they were at the top of the top list.
So... if the httpd process is the one consuming all of the CPU, doesn't it stand to reason that it might be something to do with one of your web apps, and not dovecot?
But then why was it fine with 1.1.13, which never once had this problem in 2 years? Or is 2.0.9 slower, or consuming more resources, and thereby creating the problem?
On 2011-03-08 12:42 PM, Thierry de Montaudry wrote:
On 08 Mar 2011, at 19:37, Charles Marcus wrote:
So... if the httpd process is the one consuming all of the CPU, doesn't it stand to reason that it might be something to do with one of your web apps, and not dovecot?
But then why was it fine with 1.1.13, which never once had this problem in 2 years? Or is 2.0.9 slower, or consuming more resources, and thereby creating the problem?
You don't see how it might be possible that 2.0.x does something that 1.1.x didn't do that your webmail app might not like, without it being a dovecot bug?
I'm not saying it is or it isn't, but I'd look there first - see if an update is available for your webmail app... since you were running an ancient version of dovecot, maybe you're also running an ancient version of it too?
--
Best regards,
Charles
On 03/08/2011 06:51 PM, Charles Marcus wrote:
On 2011-03-08 12:42 PM, Thierry de Montaudry wrote:
On 08 Mar 2011, at 19:37, Charles Marcus wrote:
So... if the httpd process is the one consuming all of the CPU, doesn't it stand to reason that it might be something to do with one of your web apps, and not dovecot?
But then why was it fine with 1.1.13, which never once had this problem in 2 years? Or is 2.0.9 slower, or consuming more resources, and thereby creating the problem?
You don't see how it might be possible that 2.0.x does something that 1.1.x didn't do that your webmail app might not like, without it being a dovecot bug?
I'm not saying it is or it isn't, but I'd look there first - see if an update is available for your webmail app... since you were running an ancient version of dovecot, maybe you're also running an ancient version of it too?
I can see similar problems (subject: "Restarting dovecot-auth stops authentication"), on a different OS, and nothing common in the webmail area.
I think this is clearly related to Dovecot. It handles load very badly (well, so it seems, at least with common OS settings): it doesn't just slow down, it starts to refuse clients. It seems obvious that the interprocess socket communication is where it fails, so this is what needs to be investigated. Sadly, doing this on a machine which is already crying for a deep breath is not always easy.
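One way to check whether the login processes really are failing on that socket (the paths and pid below are illustrative; the base_dir default is usually /var/run/dovecot) is to watch the connect() calls directly:
# list the sockets the master created
ls -l /var/run/dovecot/ /var/run/dovecot/login/
# attach to one pop3-login process and watch for connect() returning EAGAIN,
# which is what the log prints as "Resource temporarily unavailable"
strace -f -tt -e trace=connect -p <pop3-login-pid> 2>&1 | grep EAGAIN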
On 2011-03-08 3:38 PM, Attila Nagy wrote:
On 03/08/2011 06:51 PM, Charles Marcus wrote:
You don't see how it might be possible that 2.0.x does something that 1.1.x didn't do that your webmail app might not like, without it being a dovecot bug?
I'm not saying it is or it isn't, but I'd look there first - see if an update is available for your webmail app... since you were running an ancient version of dovecot, maybe you're also running an ancient version of it too?
I can see similar problems (subject: "Restarting dovecot-auth stops authentication"), on a different OS, and nothing common in the webmail area.
Similar problem? I just read that entire thread, and there was absolutely no mention of high resource usage, and it was the 4th or 5th email before you finally provided system details (which should always be provided in the first email to save time) and Timo noticed that you had changed some defaults that you shouldn't have... so I don't think that thread qualifies as being anywhere near similar.
I think this is clearly related to Dovecot. It handles load very badly
Whoa, pardner, fyi, there are many, many installations humming along smoothly.
(well, it seems at least on common OS settings), doesn't just slow down, but starts to refuse clients.
Maybe there is a bug somewhere that only becomes evident under certain circumstances, but it is also possibly due to config problems caused by...
--
Best regards,
Charles
On 03/08/2011 09:58 PM, Charles Marcus wrote:
I think this is clearly related to Dovecot. It handles load very badly
Whoa, pardner, fyi, there are many, many installations humming along smoothly.
No offense. It may be more correct to say situations where the OS can't deliver resources to Dovecot promptly, like saturated disk IO and similar stuff. I can't see such problems with moderate load, and maybe there aren't so many installations which handle a lot of traffic. I don't know. I don't think it's a bug; currently it seems to me to be a tuning/configuration issue. But maybe it's a common design-related issue which is not yet fully explored.
(well, it seems at least on common OS settings), doesn't just slow down, but starts to refuse clients.
Maybe there is a bug somewhere that only becomes evident under certain circumstances, but it is also possibly due to config problems caused by...
Sure.
On 08.03.2011 21:38, Attila Nagy wrote:
On 03/08/2011 06:51 PM, Charles Marcus wrote:
On 2011-03-08 12:42 PM, Thierry de Montaudry wrote:
On 08 Mar 2011, at 19:37, Charles Marcus wrote:
So... if the httpd process is the one consuming all of the CPU, doesn't it stand to reason that it might be something to do with one of your web apps, and not dovecot?
But then why was it fine with 1.1.13, which never once had this problem in 2 years? Or is 2.0.9 slower, or consuming more resources, and thereby creating the problem?
You don't see how it might be possible that 2.0.x does something that 1.1.x didn't do that your webmail app might not like, without it being a dovecot bug?
I'm not saying it is or it isn't, but I'd look there first - see if an update is available for your webmail app... since you were running an ancient version of dovecot, maybe you're also running an ancient version of it too?
I can see similar problems (subject: "Restarting dovecot-auth stops authentication"), on a different OS, and nothing common in the webmail area.
I think this is clearly related to Dovecot. It handles load very badly (well, so it seems, at least with common OS settings): it doesn't just slow down, it starts to refuse clients. It seems obvious that the interprocess socket communication is where it fails, so this is what needs to be investigated. Sadly, doing this on a machine which is already crying for a deep breath is not always easy.
You might upgrade to the latest 2.x code, as it's possible you're using more stuff than you had in older versions; after all, there was a long performance thread on this list, look for it in the archives.
-- Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
On 03/08/2011 10:37 PM, Robert Schetterer wrote:
On 08.03.2011 21:38, Attila Nagy wrote:
On 03/08/2011 06:51 PM, Charles Marcus wrote:
On 2011-03-08 12:42 PM, Thierry de Montaudry wrote:
On 08 Mar 2011, at 19:37, Charles Marcus wrote:
So... if the httpd process is the one consuming all of the CPU, doesn't it stand to reason that it might be something to do with one of your web apps, and not dovecot?
But then why was it fine with 1.1.13, which never once had this problem in 2 years? Or is 2.0.9 slower, or consuming more resources, and thereby creating the problem?
You don't see how it might be possible that 2.0.x does something that 1.1.x didn't do that your webmail app might not like, without it being a dovecot bug?
I'm not saying it is or it isn't, but I'd look there first - see if an update is available for your webmail app... since you were running an ancient version of dovecot, maybe you're also running an ancient version of it too?
I can see similar problems (subject: "Restarting dovecot-auth stops authentication"), on a different OS, and nothing common in the webmail area.
I think this is clearly related to Dovecot. It handles load very badly (well, so it seems, at least with common OS settings): it doesn't just slow down, it starts to refuse clients. It seems obvious that the interprocess socket communication is where it fails, so this is what needs to be investigated. Sadly, doing this on a machine which is already crying for a deep breath is not always easy.
You might upgrade to the latest 2.x code, as it's possible you're using more stuff than you had in older versions; after all, there was a long performance thread on this list, look for it in the archives.
I'm running the latest 2.x code (well, sort of: I haven't upgraded to 2.0.10 because of the LDAP bug, so I have both .9 and .11), and I've never run 1.x on these machines. I've run qmail and courier. They are pretty different in their architecture, where this kind of thing (unix socket communication between persistently running daemons) is not involved, so there can't be a problem where, for example, five thousand connections are made at the same moment to a single socket/process. There, there will be five thousand forks/execs, which won't fail with connection refused; they will be served as fast as the machine can handle them (modulo available memory, file descriptors, etc. of course).
On 8.3.2011, at 19.42, Thierry de Montaudry wrote:
So... if the httpd process is the one consuming all of the CPU, doesn't it stand to reason that it might be something to do with one of your web apps, and not dovecot?
But then why was it fine with 1.1.13, which never once had this problem in 2 years? Or is 2.0.9 slower, or consuming more resources, and thereby creating the problem?
One possibility is that maybe v2.0 works a bit differently.. Maybe it causes webmail to use a new feature that wasn't yet in v1.1, which causes more CPU?
I also just heard that apparently this "Resource temporarily unavailable" can happen if service imap/pop3-login's client_limit is too large. I'm not really sure why, but you could try reducing them to e.g. 50.
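A sketch of what that would look like in the v2.0 config, using the example value above (stock conf.d layout assumed):
service pop3-login {
  client_limit = 50
}
service imap-login {
  client_limit = 50
}
followed by a reload (dovecot reload) and doveconf -n to verify what is actually active.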
Do you remember how high the CPU usage was at peak times in v1.1? Has that changed? Is the problem maybe that v2.0 just fails in a different way by logging these failures, where v1.1 wouldn't even accept as many incoming connections?
On 09 Mar 2011, at 20:16, Timo Sirainen wrote:
On 8.3.2011, at 19.42, Thierry de Montaudry wrote:
So... if the httpd process is the one consuming all of the CPU, doesn't it stand to reason that it might be something to do with one of your web apps, and not dovecot?
But then why was it fine with 1.1.13, which never once had this problem in 2 years? Or is 2.0.9 slower, or consuming more resources, and thereby creating the problem?
One possibility is that maybe v2.0 works a bit differently.. Maybe it causes webmail to use a new feature that wasn't yet in v1.1, which causes more CPU?
Yes, possibly. I will investigate which features the webmail might be using now that it wasn't using previously.
I also just heard that apparently this "Resource temporarily unavailable" can happen if service imap/pop3-login's client_limit is too large. I'm not really sure why, but you could try reducing them to e.g. 50.
I reduced the limits with process_limit. I'm wondering if I should use client_limit as well, but I couldn't find much documentation; could you shed any light on that?
Do you remember how high the CPU usage was at peak times in v1.1? Has that changed? Is the problem maybe that v2.0 just fails in a different way by logging these failures, where v1.1 wouldn't even accept as many incoming connections?
v1.1 was about the same as now: load average between 3 and 4 from 9pm to 4am, no change on that side. It looks like it's just when there are spikes that the new version hits some limit.
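For reference on how those two settings interact in v2.0 (this is the general service model as I understand it, so the numbers are purely illustrative): each login process serves up to client_limit connections and the master starts at most process_limit of them, so the ceiling is roughly process_limit * client_limit:
service pop3-login {
  # roughly 100 * 50 = 5000 concurrent POP3 connections
  process_limit = 100
  client_limit = 50
}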
On 2011-03-08 12:30 PM, Thierry de Montaudry wrote:
On 08 Mar 2011, at 19:11, Charles Marcus wrote:
The reason I asked about your webmail server is you had specifically said that it was the httpd process that was consuming all of the CPU...
Yes, because they were at the top of the top list.
And they were at the top of the list because... they were consuming all of the CPU?
--
Best regards,
Charles
Hi Thierry,
On Tue, 8 Mar 2011, Thierry de Montaudry wrote:
On 08 Mar 2011, at 13:24, Chris Wilson wrote:
top - 11:10:14 up 14 days, 12:04, 2 users, load average: 55.04, 29.13, 14.55
Tasks: 474 total, 60 running, 414 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.6%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 16439812k total, 16353268k used, 86544k free, 33268k buffers
Swap: 4192956k total, 140k used, 4192816k free, 8228744k cached
As you can see from the numbers (55.04, 29.13, 14.55), the load was still climbing when I took this snapshot; this was not a normal situation. Usually this machine's load is only between 1 and 4, which is quite OK for a quad core. It only happens when dovecot starts generating errors, and POP3, IMAP and HTTP get stuck. The load went up to 200, but I was still able to stop the web and mail daemons, restart them, and everything was back to normal.
I don't have a definite answer, but I remember that there has been a long-running bug in the Linux kernel with schedulers behaving badly under heavy writes:
"One of the problems commonly talked about in our forums and elsewhere is the poor responsiveness of the Linux desktop when dealing with significant disk activity on systems where there is insufficient RAM or the disks are slow. The GUI basically drops to its knees when there is too much disk activity..." [http://www.phoronix.com/scan.php?page=news_item&px=ODQ3Mw] (note, it's not just the GUI, all other tasks can starve when a disk I/O queue builds up).
"There are a few options to tune the linux IO scheduler that can help a bunch... Typically CFQ stalls too long under heavy writes, especially if your disk subsystem sucks, so particularly if you have several spindles deadline is worth a try." [http://communities.vmware.com/thread/82544]
"I run Ubuntu on a moderately powerful quad-core x86-64 system and the desktop response is basically crippled whenever something is reading or writing large files as fast as it can (at normal priority)... For example, cat /path/to/LARGE_FILE > /dev/null ... Everything else gets completely unusable because of the I/O latency." [https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343371]
"I was just running mkfs.ext4 -b 4096 -E stride=128 -E stripe-width=128 -O ^has_journal /dev/sdb2 on my SSD18M connected via USB1.1, and the result was, well, absolutely, positively _DEVASTATING_. The entire system became _FULLY_ unresponsive, not even switching back down to tty1 via Ctrl-Alt-F1 worked (took 20 seconds for even this key to be respected)." [http://lkml.org/lkml/2010/4/4/86]
"This regression has been around since about the 2.6.18 timeframe and has eluded a lot of testing to isolate the root cause. The most promising fix is in the VM subsystem (mm) where the LRU scan has been changed to favor keeping executable pages active longer. Most of these symptoms come down to VM thrashing to make room for I/O pages. The key change/commit is ab4754d24a0f2e05920170c845bd84472814c6, "vmscan: make mapped executable pages the first class citizen"... This change was merged into the 2.6.31r1 kernel." [https://bugs.launchpad.net/ubuntu/+source/linux/+bug/131094/comments/235]
One possible cause is that writing to a slow device can block the write queue for other devices, causing the machine to come to a standstill when there's plenty of useful work that it could be doing.
This could cause a cascading failure in your server as soon as disk I/O write load goes over a certain point, a bit like a swap death. I'm not sure if the fact that you're using NFS makes a difference; perhaps only if you memory-map files?
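If memory-mapping over NFS turns out to matter, Dovecot does have NFS-oriented switches; the commonly suggested combination looks roughly like this (worth double-checking against the wiki for the exact version in use):
mmap_disable = yes
mail_fsync = always
mail_nfs_storage = yes
# only needed when the index files also live on NFS
mail_nfs_index = yes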
You could test this by booting with the NOOP or anticipatory scheduler instead of the default CFQ to see if it makes any difference.
Cheers, Chris.
Aptivate | http://www.aptivate.org | Phone: +44 1223 760887
The Humanitarian Centre, Fenner's, Gresham Road, Cambridge CB1 2ES
Aptivate is a not-for-profit company registered in England and Wales with company number 04980791.
On 03/08/2011 09:26 AM, Chris Wilson wrote:
Hi Thierry,
On Tue, 8 Mar 2011, Thierry de Montaudry wrote:
On 08 Mar 2011, at 13:24, Chris Wilson wrote:
top - 11:10:14 up 14 days, 12:04, 2 users, load average: 55.04, 29.13, 14.55
Tasks: 474 total, 60 running, 414 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.6%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 16439812k total, 16353268k used, 86544k free, 33268k buffers
Swap: 4192956k total, 140k used, 4192816k free, 8228744k cached
As you can see from the numbers (55.04, 29.13, 14.55), the load was still climbing when I took this snapshot; this was not a normal situation. Usually this machine's load is only between 1 and 4, which is quite OK for a quad core. It only happens when dovecot starts generating errors, and POP3, IMAP and HTTP get stuck. The load went up to 200, but I was still able to stop the web and mail daemons, restart them, and everything was back to normal.
I don't have a definite answer, but I remember that there has been a long-running bug in the Linux kernel with schedulers behaving badly under heavy writes:
"One of the problems commonly talked about in our forums and elsewhere is the poor responsiveness of the Linux desktop when dealing with significant disk activity on systems where there is insufficient RAM or the disks are slow. The GUI basically drops to its knees when there is too much disk activity..." [http://www.phoronix.com/scan.php?page=news_item&px=ODQ3Mw] (note, it's not just the GUI, all other tasks can starve when a disk I/O queue builds up).
"There are a few options to tune the linux IO scheduler that can help a bunch... Typically CFQ stalls too long under heavy writes, especially if your disk subsystem sucks, so particularly if you have several spindles deadline is worth a try." [http://communities.vmware.com/thread/82544]
"I run Ubuntu on a moderately powerful quad-core x86-64 system and the desktop response is basically crippled whenever something is reading or writing large files as fast as it can (at normal priority)... For example, cat /path/to/LARGE_FILE> /dev/null ... Everything else gets completely unusable because of the I/O latency." [https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343371]
"I was just running mkfs.ext4 -b 4096 -E stride=128 -E stripe-width=128 -O ^has_journal /dev/sdb2 on my SSD18M connected via USB1.1, and the result was, well, absolutely, positively _DEVASTATING_. The entire system became _FULLY_ unresponsive, not even switching back down to tty1 via Ctrl-Alt-F1 worked (took 20 seconds for even this key to be respected)." [http://lkml.org/lkml/2010/4/4/86]
"This regression has been around since about the 2.6.18 timeframe and has eluded a lot of testing to isolate the root cause. The most promising fix is in the VM subsystem (mm) where the LRU scan has been changed to favor keeping executable pages active longer. Most of these symptoms come down to VM thrashing to make room for I/O pages. The key change/commit is ab4754d24a0f2e05920170c845bd84472814c6, "vmscan: make mapped executable pages the first class citizen"... This change was merged into the 2.6.31r1 kernel." [https://bugs.launchpad.net/ubuntu/+source/linux/+bug/131094/comments/235]
One possible cause is that writing to a slow device can block the write queue for other devices, causing the machine to come to a standstill when there's plenty of useful work that it could be doing.
This could cause a cascading failure in your server as soon as disk I/O write load goes over a certain point, a bit like a swap death. I'm not sure if the fact that you're using NFS makes a difference; perhaps only if you memory-map files?
You could test this by booting with the NOOP or anticipatory scheduler instead of the default CFQ to see if it makes any difference.
Cheers, Chris.
You can change it on the fly with:
echo noop > /sys/block/${DEVICE}/queue/scheduler
-- -Eric 'shubes'
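To see which scheduler is currently active, and to make a change survive a reboot on a typical CentOS 5 box (device name and grub path are assumptions):
# the active scheduler is the one shown in [brackets]
cat /sys/block/sda/queue/scheduler
# for a permanent change, append elevator=deadline (or elevator=noop) to the
# kernel line in /boot/grub/grub.conf and reboot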
On 08 Mar 2011, at 18:26, Chris Wilson wrote:
Hi Thierry,
On Tue, 8 Mar 2011, Thierry de Montaudry wrote:
On 08 Mar 2011, at 13:24, Chris Wilson wrote:
top - 11:10:14 up 14 days, 12:04, 2 users, load average: 55.04, 29.13, 14.55
Tasks: 474 total, 60 running, 414 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.6%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 16439812k total, 16353268k used, 86544k free, 33268k buffers
Swap: 4192956k total, 140k used, 4192816k free, 8228744k cached
As you can see from the numbers (55.04, 29.13, 14.55), the load was still climbing when I took this snapshot; this was not a normal situation. Usually this machine's load is only between 1 and 4, which is quite OK for a quad core. It only happens when dovecot starts generating errors, and POP3, IMAP and HTTP get stuck. The load went up to 200, but I was still able to stop the web and mail daemons, restart them, and everything was back to normal.
I don't have a definite answer, but I remember that there has been a long-running bug in the Linux kernel with schedulers behaving badly under heavy writes:
"One of the problems commonly talked about in our forums and elsewhere is the poor responsiveness of the Linux desktop when dealing with significant disk activity on systems where there is insufficient RAM or the disks are slow. The GUI basically drops to its knees when there is too much disk activity..." [http://www.phoronix.com/scan.php?page=news_item&px=ODQ3Mw] (note, it's not just the GUI, all other tasks can starve when a disk I/O queue builds up).
"There are a few options to tune the linux IO scheduler that can help a bunch... Typically CFQ stalls too long under heavy writes, especially if your disk subsystem sucks, so particularly if you have several spindles deadline is worth a try." [http://communities.vmware.com/thread/82544]
"I run Ubuntu on a moderately powerful quad-core x86-64 system and the desktop response is basically crippled whenever something is reading or writing large files as fast as it can (at normal priority)... For example, cat /path/to/LARGE_FILE > /dev/null ... Everything else gets completely unusable because of the I/O latency." [https://bugs.launchpad.net/ubuntu/+source/linux/+bug/343371]
"I was just running mkfs.ext4 -b 4096 -E stride=128 -E stripe-width=128 -O ^has_journal /dev/sdb2 on my SSD18M connected via USB1.1, and the result was, well, absolutely, positively _DEVASTATING_. The entire system became _FULLY_ unresponsive, not even switching back down to tty1 via Ctrl-Alt-F1 worked (took 20 seconds for even this key to be respected)." [http://lkml.org/lkml/2010/4/4/86]
"This regression has been around since about the 2.6.18 timeframe and has eluded a lot of testing to isolate the root cause. The most promising fix is in the VM subsystem (mm) where the LRU scan has been changed to favor keeping executable pages active longer. Most of these symptoms come down to VM thrashing to make room for I/O pages. The key change/commit is ab4754d24a0f2e05920170c845bd84472814c6, "vmscan: make mapped executable pages the first class citizen"... This change was merged into the 2.6.31r1 kernel." [https://bugs.launchpad.net/ubuntu/+source/linux/+bug/131094/comments/235]
One possible cause is that writing to a slow device can block the write queue for other devices, causing the machine to come to a standstill when there's plenty of useful work that it could be doing.
This could cause a cascading failure in your server as soon as disk I/O write load goes over a certain point, a bit like a swap death. I'm not sure if the fact that you're using NFS makes a difference; perhaps only if you memory-map files?
You could test this by booting with the NOOP or anticipatory scheduler instead of the default CFQ to see if it makes any difference.
Cheers, Chris.
Hi Chris,
Thanks for your (long) comment and the technical details, but having changed nothing on the 7 machines except moving from dovecot 1.10.13 to 2.0.9, and without any increase in our traffic, I don't want to start changing tricky things in the system when it has worked fine for almost 2 years. And the fact that all mail is stored on multiple NFS servers, with every machine having 16G of RAM, makes me think it's not an I/O problem. I thought it might be the system running out of resources, but there is nothing about it in the logs... For now, we might consider reverting to 1.10.13, but that would mean losing the new features that made us upgrade, so not good.
On 08 Mar 2011, at 19:12, Charles Marcus wrote:
On 2011-03-08 12:00 PM, Thierry de Montaudry wrote:
but moving from dovecot 1.10.13 to 2.0.9
The first time, I thought it was a typo and ignored it...
There has never been a version 1.10.xxx
Maybe you mean 1.0.13?
Sorry, my mistake: 1.1.13, the version integrated in CentOS 5.
participants (7)
- Attila Nagy
- Charles Marcus
- Chris Wilson
- Eric Shubert
- Robert Schetterer
- Thierry de Montaudry
- Timo Sirainen