On 8/21/2013 5:37 AM, Kavish Karkera wrote:
We have a serious issue on our POP/IMAP servers these days. The load average on the servers spikes to 400-500 (as reported by uptime) during particular periods, mostly around noon and in the evening, but it lasts only a few minutes.
We have 2 servers running dovecot 1.1.20 behind a load balancer; we use keepalived (1.1.13) for load balancing.
Server specification:
Operating System: CentOS 5.5 64bit
CPU cores: 16
RAM: 8GB
Mail and indexes are mounted on NFS (NetApp). ...
Cpu(s): 8.3%us, 7.6%sy, 0.0%ni, 8.3%id, 75.0%wa, 0.0%hi, 0.8%si, 0.0%st
                                        ^^^^^^^
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
  408 mysql     18   0  384m  38m 4412 S 52.8  0.2  42221:44 mysqld
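Before blaming any one component, it helps to confirm which processes are actually stuck on IO during a spike. On Linux the load average counts threads in uninterruptible sleep (state D, typically waiting on disk or NFS IO) as well as runnable ones, which is how a 16-core box can report a load of 400-500 while top shows 75% iowait. Below is a minimal Python 3 sketch, assuming the standard Linux /proc layout; the function names parse_stat and blocked_threads are hypothetical, made up for this example.

```python
import os
from collections import Counter


def parse_stat(text):
    """Return (comm, state) from the contents of a /proc/<pid>/task/<tid>/stat file.

    The command name sits in parentheses and may itself contain spaces or ')',
    so split on the LAST ')' instead of on whitespace.
    """
    head, _, tail = text.rpartition(")")
    comm = head.partition("(")[2]
    state = tail.split()[0]
    return comm, state


def blocked_threads(proc="/proc"):
    """Count threads in state D (uninterruptible sleep, usually IO) per command name."""
    counts = Counter()
    for pid in os.listdir(proc):
        if not pid.isdigit():
            continue  # skip /proc/self, /proc/meminfo, etc.
        task_dir = os.path.join(proc, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    comm, state = parse_stat(f.read())
            except OSError:
                continue  # thread exited or is not readable
            if state == "D":
                counts[comm] += 1
    return counts


if __name__ == "__main__":
    # One snapshot; run it repeatedly during a load spike to see a pattern.
    for comm, n in blocked_threads().most_common():
        print(f"{n:5d} {comm}")
```

If most D-state threads belong to mysqld, that points at the database; if imap and pop3 processes dominate instead, the NFS volume itself is a likelier common bottleneck.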
This doesn't seem to be a dovecot issue. MySQL apparently has 8 (or more) threads on 8 cores, all blocked on IO. I see a few possible causes:
- The NetApp is unable to keep up with the request rate because:
a. There are too few spindles in the RAID set backing this NFS volume and/or the file(s) aren't properly striped across all spindles
b. An inappropriate RAID level. The mysql job is apparently doing large table updates, and you're experiencing massive read-modify-write (RMW) latency from RAID5/6. This is why one should never put a transactional database, or one that sees large, frequent table updates, on a parity RAID volume unless the disks are SSDs. SSDs have no mechanical parts, so the extra reads in each RMW cycle incur no seek or rotational delay, and the RMW latency penalty is far smaller.
- Apparently 8 (or more) threads are concurrently accessing the same file or files, so the massive iowait could simply be the result of filesystem and/or NFS locking, NFS client caching issues, etc.
The cause of the massive iowait could be one or all of the above, or could be something else entirely. These are the typical causes.
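For points a and b, a back-of-envelope calculation shows how spindle count and RAID level bound random write throughput. The sketch below assumes the textbook small-write penalties (2 back-end IOs per client write for RAID10, 4 for RAID5, 6 for RAID6) and a hypothetical 14 spindles at roughly 180 IOPS each with a 70% write mix. The real figures for this NetApp volume are not known here, and a filer that coalesces writes into full stripes may pay much less than the textbook penalty.

```python
# Sketch: estimate client-visible random IOPS for an array, given the
# textbook small-write penalty of each RAID level (assumed values).
WRITE_PENALTY = {
    "raid10": 2,  # write data to both mirrors
    "raid5": 4,   # read data + parity, write data + parity
    "raid6": 6,   # read data + 2 parities, write data + 2 parities
}


def frontend_iops(spindles, iops_per_disk, write_fraction, level):
    """Client IOPS F such that F * (reads + writes * penalty) equals the
    raw back-end IOPS of all spindles combined."""
    raw = spindles * iops_per_disk
    penalty = WRITE_PENALTY[level]
    read_fraction = 1.0 - write_fraction
    return raw / (read_fraction + write_fraction * penalty)


if __name__ == "__main__":
    # Hypothetical array: 14 spindles at ~180 IOPS each, 70% writes.
    for level in ("raid10", "raid5", "raid6"):
        print(level, round(frontend_iops(14, 180, 0.7, level)))
```

With these made-up inputs the same disks serve about 1482 client IOPS as RAID10, 813 as RAID5 and 560 as RAID6, which is why a write-heavy database job hurts so much more on parity RAID.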
You seem to have a database job, scheduled to run twice daily, that triggers the problem. Identify this job and figure out what it does, why it does it, how necessary it is, and whether it can be rescheduled to run at off-peak hours. If it can, you may want to simply do so, as fixing the IO latency problem may be expensive in hardware and/or labor dollars.
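If the job is launched from cron (an assumption; it could just as well be triggered some other way), one quick way to find candidates is to list crontab entries whose hour field falls inside the spike windows. A minimal Python 3 sketch follows; PEAK_HOURS is a hypothetical reading of "noon time and evening", the script paths in the sample are made up, and only plain numbers, ranges, lists and steps in the hour field are handled.

```python
# Hypothetical peak windows, based on "noon time and evening".
PEAK_HOURS = {11, 12, 13, 17, 18, 19}


def expand_hours(field):
    """Expand a cron hour field such as '*', '12', '9-17', '*/6' or '6,18'."""
    hours = set()
    for part in field.split(","):
        rng, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if rng == "*":
            lo, hi = 0, 23
        elif "-" in rng:
            lo, hi = (int(x) for x in rng.split("-", 1))
        elif step_text:
            # 'N/S' is interpreted differently by different crons; refuse it.
            raise ValueError("unsupported hour field: " + part)
        else:
            lo = hi = int(rng)
        hours.update(range(lo, hi + 1, step))
    return hours


def peak_jobs(crontab_text, peak=PEAK_HOURS):
    """Return (entry, sorted peak hours) for each entry that fires during peak hours."""
    hits = []
    for line in crontab_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "@")):
            continue  # blank, comment, or @daily-style shortcut (not checked)
        fields = line.split(None, 5)
        if len(fields) < 6 or "=" in fields[0]:
            continue  # environment assignment such as MAILTO=root
        overlap = expand_hours(fields[1]) & peak  # hour is the 2nd field
        if overlap:
            hits.append((line, sorted(overlap)))
    return hits


if __name__ == "__main__":
    sample = """MAILTO=root
30 2 * * * /usr/local/bin/backup.sh
0 12,18 * * * /usr/local/bin/rebuild_stats.sh
"""
    for entry, hours in peak_jobs(sample):
        print(hours, entry)
```

Feed it the contents of /etc/crontab and each relevant user's crontab (for example the output of crontab -l -u mysql). An entry whose hour field is * will always be reported, since it also runs during peak hours.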
-- Stan