On 10/26/2012 1:29 PM, Milan Holzäpfel wrote:
On Wed, 24 Oct 2012 09:01:24 -0500 Stan Hoeppner <stan@hardwarefreak.com> wrote:
On 10/24/2012 6:28 AM, Milan Holzäpfel wrote:
I have a problem with an inconsistent mdbox: ... four hours after the problem initially appeared, I did a hard reset of the system because it was unresponsive. ... Can anybody say something about this? Can the mdbox be repaired?
If the box is truly unresponsive, i.e. hard locked, then the corrupted indexes are only a symptom of the underlying problem, which is unrelated to Dovecot, UNLESS the lack of responsiveness was due to massive disk access, which will occur when rebuilding indexes on a 6.6GB mailbox. You need to know the difference so we have accurate information to troubleshoot with.
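As for the repair question itself: assuming Dovecot 2.x, the usual way to force a rebuild of mdbox indexes is doveadm force-resync, something like (username here is just a placeholder; for mdbox it rescans the whole storage regardless of the mailbox given):

  doveadm force-resync -u jdoe INBOX

Be aware that on a 6.6GB mailbox this generates exactly the kind of heavy disk IO discussed below.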
Thanks for your suggestion. I wasn't looking for a solution for the unresponsiveness, but I failed to make that clear.
It's likely all related. If you have already, or will continue to, hard reset the box, you will lose inflight data in the buffer cache, which may very likely corrupt your mdbox files and/or indexes. I'm a bit shocked you'd hard reset a *slow* responding server. Especially one that appears to be unresponsive due to massive disk IO. That's a recipe for disaster...
I was not patient enough to debug the unresponsiveness issue. The box was not hard locked, but any command took very long, if it completed at all. I suspect it was massive swapping, but I wouldn't expect Dovecot to be the cause.
This leads me to believe your filesystem root, swap partition, and Dovecot mailbox storage are all on the same disk, or small RAID set. Is this correct?
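A quick way to confirm the layout is with the standard util-linux tools (the /var/mail path is just a guess at where your mdbox storage lives, adjust to your setup):

  lsblk
  swapon -s
  df -h /var/mail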
After the reboot, Dovecot would happily re-execute the failing index rebuild on each new incoming message, which suggests that Dovecot wasn't the cause for the unresponsiveness.
This operation is a tiny IO pattern compared to the 6.6GB re-indexing operation you mentioned before, so you can't make the simple assumption that "Dovecot wasn't the cause for the unresponsiveness". In fact, Dovecot likely instigated the problem, though it likely isn't the "cause". I'll take a stab at that below.
If there's a kernel or hardware problem, you should see related errors in dmesg. Please share those.
The kernel had messages like
INFO: task cron:2799 blocked for more than 120 seconds.
Now we're getting some meat on this plate.
in the dmesg. But again, I didn't mean to ask for a solution to this problem.
"blocked for more than 120 seconds" is a kernel warning message, not an error message. We see this quite often on the XFS list. Rarely, this is related to a kernel bug. Most often the cause of this warning is saturated IO. In this case it appears cron blocked for 120s because it couldn't read /var/cron/crontabs/[user]
The most likely cause of this is that so many IO requests are piled up in the queue that it took more than 2 minutes for the hardware (disks) to complete them before servicing the cron process' IO requests. Dovecot re-indexing a 6.6GB mailbox, with other IO occurring concurrently, could easily cause this situation if you don't have sufficient spindle IOPS. I.e. this IO pattern will bring a single SATA disk or mirror pair to its knees.
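You can watch this happening with iostat from the sysstat package; if I'm right, await and %util on the disk holding the mail storage will be pegged while the re-index runs:

  iostat -x 1    # watch the await and %util columns for the mail disk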
If you currently have everything on a single SATA disk or mirror pair, the solution for eliminating the bogging down of the system, and likely the Dovecot issues related to it, is to simply separate your root filesystem, swap, and Dovecot data files onto different physical devices. For instance, moving the root filesystem and swap to a small SSD will prevent the OS unresponsiveness, even if Dovecot is bogged down with IO to the SATA disk.
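A rough sketch of what /etc/fstab could look like after such a split (device names, filesystems, and the /var/mail mount point are all assumptions, adjust for your hardware):

  /dev/sda1   /           ext4   defaults   0 1    # root filesystem on the SSD
  /dev/sda2   none        swap   sw         0 0    # swap on the SSD
  /dev/sdb1   /var/mail   xfs    defaults   0 2    # mdbox storage on the SATA disk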
With spinning rust storage, separation of root filesystem, swap, and application data to different storage IO domains is system administration 101 kind of stuff. If you're using SSD this isn't (as) critical as it's pretty hard to saturate the IO limits of an SSD.
-- Stan