[Dovecot] Corrupted index cache file and high CPU usage
I suspect that Dovecot v2.0.16 under Linux (tested on Gentoo 3.0.4 Hardened, Gentoo 3.0.6, Gentoo 3.1.4) causes 100% CPU utilization when dovecot.index.cache is broken and doveadm is launched. Dovecot uses LDAP as the userdb and provides IMAP and LDA. Because each user has a quota, doveadm's expunge functionality is used to get rid of old trashed mail.
The mail log:
Dec 12 07:38:59 www dovecot: imap(user): Error: Corrupted index cache file /var/data/mail/domain.tld/user/mail/dovecot.index.cache: invalid record size
Doveadm is launched as:
doveadm -D expunge -A mailbox Trash savedbefore 1w
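Since the cleanup ran unattended, the cron entry would have looked along these lines (a sketch only; the schedule shown here is an assumption, the actual crontab wasn't posted):

```
# /etc/crontab - nightly Trash cleanup (time of day is illustrative)
30 4 * * * root /usr/bin/doveadm expunge -A mailbox Trash savedbefore 1w
```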
Mail is stored on ext4, and the flush kernel thread consumes all available CPU; load average: 4.00, 4.01, 4.05
ps: root 26401 98.2 0.0 0 0 ? R Dec20 1369:49 [flush-253:1]
lsblk: vg_data-mail (dm-1) 253:1 0 500M 0 /var/data/mail
dovecot -n:
# 2.0.16: /etc/dovecot/dovecot.conf
# OS: Linux 3.0.4-hardened-r5 x86_64 Gentoo Base System release 2.0.3
auth_debug = yes
auth_debug_passwords = yes
auth_verbose = yes
hostname = mail.domain.tld
lda_mailbox_autocreate = yes
lda_mailbox_autosubscribe = yes
listen = *
mail_debug = yes
mail_gid = vmail
mail_home = /var/data/mail/domain.tld/%n
mail_location = maildir:~/mail
mail_plugins = " mail_log notify zlib quota"
mail_uid = vmail
managesieve_notify_capability = mailto
managesieve_sieve_capability = fileinto reject envelope encoded-character vacation subaddress comparator-i;ascii-numeric relational regex imap4flags copy include variables body enotify environment mailbox date ihave
passdb {
  args = /etc/dovecot/dovecot-ldap.conf.ext
  driver = ldap
}
plugin {
  autocreate = Sent
  autocreate2 = Trash
  autocreate3 = Spam
  autosubscribe = Sent
  autosubscribe2 = Trash
  autosubscribe3 = Spam
  fts = squat
  fts_squat = partial=4 full=10
  quota = maildir:User quota
  quota_rule = *:storage=100M
  quota_rule2 = Trash:ignore
  quota_warning = storage=95%% quota-warning 95 %u
  quota_warning2 = storage=80%% quota-warning 80 %u
  sieve = ~/.dovecot.sieve
  sieve_dir = ~/sieve
  sieve_quota_max_scripts = 5
  sieve_quota_max_storage = 5M
}
postmaster_address = postmaster@domain.tld
protocols = imap sieve
service auth {
  unix_listener /var/spool/postfix/private/auth {
    group = postfix
    mode = 0660
    user = postfix
  }
  unix_listener auth-userdb {
    group = vmail
    mode = 0600
    user = vmail
  }
}
service quota-warning {
  executable = script /etc/dovecot/quota-warning.sh
  unix_listener quota-warning {
    user = vmail
  }
  user = vmail
}
ssl_cert =
userdb {
  args = /etc/dovecot/dovecot-ldap.conf.ext
  driver = ldap
}
verbose_proctitle = yes
verbose_ssl = yes
protocol lda {
  info_log_path =
  log_path =
  mail_plugins = " mail_log notify zlib quota sieve"
}
protocol imap {
  mail_plugins = " mail_log notify zlib quota autocreate fts fts_squat imap_zlib imap_quota"
}
dovecot-ldap.conf.ext:
hosts = localhost
dn = user
dnpass = pass
base = base
auth_bind = yes
user_attrs = homeDirectory=home,uidNumber=uid,gidNumber=gid,mailboxQuota=quota_rule=*:bytes=%$
user_filter = (&(objectClass=inetOrgPerson)(uid=%u))
pass_attrs = uid=user
pass_filter = (&(objectClass=inetOrgPerson)(uid=%u))
iterate_attrs = uid=user
iterate_filter = (objectClass=inetOrgPerson)
I suspect this has something to do with Dovecot, because after deleting the dovecot.index.cache file everything went back to normal. When this happens, I cannot unmount the drive, and a system reboot doesn't work either.
Thank you !
On 21.12.2011, at 18.38, hydra wrote:
I suspect that Dovecot v2.0.16 under Linux (tested on Gentoo 3.0.4 Hardened, Gentoo 3.0.6, Gentoo 3.1.4) causes 100% CPU utilization when dovecot.index.cache is broken and doveadm is launched. Dovecot uses LDAP as the userdb and provides IMAP and LDA. Because each user has a quota, doveadm's expunge functionality is used to get rid of old trashed mail.
The mail log: Dec 12 07:38:59 www dovecot: imap(user): Error: Corrupted index cache file /var/data/mail/domain.tld/user/mail/dovecot.index.cache: invalid record size
OK..
Mail is stored on ext4, and the flush kernel thread consumes all available CPU; load average: 4.00, 4.01, 4.05
ps: root 26401 98.2 0.0 0 0 ? R Dec20 1369:49 [flush-253:1]
That's a kernel process..
I suspect this has something to do with Dovecot, because after deleting the dovecot.index.cache file everything went back to normal. When this happens, I cannot unmount the drive, and a system reboot doesn't work either.
That's a kernel bug..
I think you're thinking it the wrong way: Dovecot isn't causing your system to break. Your system is causing Dovecot to break. Faulty hardware or faulty kernel.
Hello Timo, thank you for the reply. I was suspecting the same. However:
- the machine runs under VMware,
- I've tried 3 different kernel versions,
- I've tried 3 different SCSI controllers.
All same results.
On Wed, Dec 21, 2011 at 6:16 PM, Timo Sirainen tss@iki.fi wrote:
On 21.12.2011, at 18.38, hydra wrote:
I suspect that Dovecot v2.0.16 under Linux (tested on Gentoo 3.0.4 Hardened, Gentoo 3.0.6, Gentoo 3.1.4) causes 100% CPU utilization when dovecot.index.cache is broken and doveadm is launched. Dovecot uses LDAP as the userdb and provides IMAP and LDA. Because each user has a quota, doveadm's expunge functionality is used to get rid of old trashed mail.
The mail log: Dec 12 07:38:59 www dovecot: imap(user): Error: Corrupted index cache file /var/data/mail/domain.tld/user/mail/dovecot.index.cache: invalid record size
OK..
Mail is stored on ext4, and the flush kernel thread consumes all available CPU; load average: 4.00, 4.01, 4.05
ps: root 26401 98.2 0.0 0 0 ? R Dec20 1369:49 [flush-253:1]
That's a kernel process..
I suspect this has something to do with Dovecot, because after deleting the dovecot.index.cache file everything went back to normal. When this happens, I cannot unmount the drive, and a system reboot doesn't work either.
That's a kernel bug..
I think you're thinking it the wrong way: Dovecot isn't causing your system to break. Your system is causing Dovecot to break. Faulty hardware or faulty kernel.
On 12/22/2011 8:08 AM, hydra wrote:
Hello Timo, thank you for the reply. I was suspecting the same. However:
- the machine runs under VMware,
- I've tried 3 different kernel versions,
- I've tried 3 different SCSI controllers.
All same results.
dmesg output? Log errors?
Is your EXT4 filesystem on a VMFS volume or an RDM (SAN LUN)?
On Wed, Dec 21, 2011 at 6:16 PM, Timo Sirainen tss@iki.fi wrote:
On 21.12.2011, at 18.38, hydra wrote:
That's a kernel process..
I suspect this has something to do with Dovecot, because after deleting the dovecot.index.cache file everything went back to normal. When this happens, I cannot unmount the drive, and a system reboot doesn't work either.
System (host machine) reboot, or virtual machine reboot doesn't fix the problem? FYI, Linux doesn't unmount drives, it unmounts filesystems.
I'd say you may have a problem with your VMFS volume or RDM, or maybe just your EXT4 filesystem. Have you run an fsck on it? What result?
Or, as Timo suggests, could be a kernel bug. Or an interaction of these low level layers causing a problem. If you can't unmount a filesystem, that has nothing to do with Dovecot, and points to a much larger, more critical, problem.
Do you have this problem when booting an older kernel? Say 2.6.32? 2.6.37?
That's a kernel bug..
I think you're thinking it the wrong way: Dovecot isn't causing your system to break. Your system is causing Dovecot to break. Faulty hardware or faulty kernel.
-- Stan
On Thu, Dec 22, 2011 at 3:49 PM, Stan Hoeppner stan@hardwarefreak.com wrote:
On 12/22/2011 8:08 AM, hydra wrote:
Hello Timo, thank you for the reply. I was suspecting the same. However:
- the machine runs under VMware,
- I've tried 3 different kernel versions,
- I've tried 3 different SCSI controllers.
All same results.
dmesg output? Log errors?
Nothing there
Is your EXT4 filesystem on a VMFS volume or an RDM (SAN LUN)?
VMFS
On Wed, Dec 21, 2011 at 6:16 PM, Timo Sirainen tss@iki.fi wrote:
On 21.12.2011, at 18.38, hydra wrote:
That's a kernel process..
I suspect this has something to do with Dovecot, because after deleting the dovecot.index.cache file everything went back to normal. When this happens, I cannot unmount the drive, and a system reboot doesn't work either.
System (host machine) reboot, or virtual machine reboot doesn't fix the problem? FYI, Linux doesn't unmount drives, it unmounts filesystems.
After the virtual machine reboot, the CPU usage is normal again, but only until doveadm is launched again (it was run from cron). Sorry for mixing up the partition/drive terminology.
I'd say you may have a problem with your VMFS volume or RDM, or maybe just your EXT4 filesystem. Have you run an fsck on it? What result?
A normal system reboot wasn't possible, because the ext4 fs couldn't be unmounted (both umount and sync locked up), so I had to reboot from the vSphere Client. After the reboot, fsck brought the fs back to a consistent state, but the problem occurred again the next morning when doveadm was run from cron. So fsck didn't help.
Or, as Timo suggests, could be a kernel bug. Or an interaction of these low level layers causing a problem. If you can't unmount a filesystem, that has nothing to do with Dovecot, and points to a much larger, more critical, problem.
Do you have this problem when booting an older kernel? Say 2.6.32? 2.6.37?
The oldest available kernel is 2.6.32 so I'm going to test it.
Thank you :)
That's a kernel bug..
I think you're thinking it the wrong way: Dovecot isn't causing your system to break. Your system is causing Dovecot to break. Faulty hardware or faulty kernel.
-- Stan
On 22.12.2011, at 18.06, hydra wrote:
After the virtual machine reboot, the CPU usage is normal again, but just until doveadm is launched again (it was run from cron). Sorry for the partition/drive terminology mess up.
So it's reproducible? You could see if you can reproduce it by simply reading the file(s) that are causing problems. Like with zsh: cat **/* > /dev/null
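For servers without zsh, an equivalent read-only sweep can be sketched with find. The MAILDIR default below is just a throwaway directory so the command is safe to try anywhere; the real target would be the suspect maildir (e.g. /var/data/mail/domain.tld/user/mail):

```shell
# Read every file under a directory and discard the bytes - the same
# effect as zsh's `cat **/* > /dev/null`. If plain reads trigger the
# hang, this reproduces it without involving Dovecot at all.
MAILDIR="${MAILDIR:-$(mktemp -d)}"
: > "$MAILDIR/demo-file"            # ensure at least one file exists
find "$MAILDIR" -type f -exec cat {} + > /dev/null \
  && echo "read sweep over $MAILDIR completed"
```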
Yes, it is reproducible.
About the file, I have tarred up the original maildir before I deleted the dovecot.index.cache, so when I try to bring in the original file, a 'cat' does no harm. Moreover, I've tried to do this with all files on the virtual machine, same results - no harm.
On Thu, Dec 22, 2011 at 5:13 PM, Timo Sirainen tss@iki.fi wrote:
On 22.12.2011, at 18.06, hydra wrote:
After the virtual machine reboot, the CPU usage is normal again, but just until doveadm is launched again (it was run from cron). Sorry for the partition/drive terminology mess up.
So it's reproducible? You could see if you can reproduce it by simply reading the file(s) that are causing problems. Like with zsh: cat **/* > /dev/null
On 12/23/2011 4:12 AM, hydra wrote:
Yes, it is reproducible.
About the file, I have tarred up the original maildir before I deleted the dovecot.index.cache, so when I try to bring in the original file, a 'cat' does no harm. Moreover, I've tried to do this with all files on the virtual machine, same results - no harm.
Is SELinux running? If so, turn it off and see if that helps. And turn off anything/everything else on this Linux VM that's not running on the other Dovecot VMs you mentioned are working properly. Let's do some process of elimination here.
-- Stan
SELinux wasn't running.
I can reproduce the problem with this setup:
- restore the maildir with the broken dovecot.index.cache
- run /usr/bin/doveadm -D expunge -A mailbox Trash savedbefore 1w in an infinite loop
- use Horde webmail to access the mail over IMAP
- randomly display messages
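The expunge half of the steps above can be scripted roughly like this (a sketch: the ITERATIONS bound and the DOVEADM_CMD override are illustrative knobs of mine, the original report simply looped the doveadm call forever while a client read mail):

```shell
#!/bin/sh
# Hammer the mailbox with repeated expunges while an IMAP client
# (Horde in the report) reads messages concurrently.
DOVEADM_CMD="${DOVEADM_CMD:-/usr/bin/doveadm -D expunge -A mailbox Trash savedbefore 1w}"
i=0
while [ "$i" -lt "${ITERATIONS:-3}" ]; do
    # Warn but keep looping if one invocation fails (or doveadm is absent)
    $DOVEADM_CMD || echo "iteration $i: doveadm exited non-zero" >&2
    i=$((i + 1))
done
echo "completed $i expunge iterations"
```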
I could reproduce the problem on: Linux-2.6.37-hardened-r7, Linux-2.6.39-hardened-r8, Linux-3.0.4-hardened-r5, Linux-3.0.6-gentoo, and Linux-3.1.4-gentoo.
However, I failed to reproduce the problem on Linux-2.6.36-hardened-r9.
So it seems the problem was introduced in 2.6.37; I'm going to look into it later. Wishing you a blessed and merry Christmas, and thank you for the support ;)
On Fri, Dec 23, 2011 at 11:36 AM, Stan Hoeppner stan@hardwarefreak.com wrote:
On 12/23/2011 4:12 AM, hydra wrote:
Yes, it is reproducible.
About the file, I have tarred up the original maildir before I deleted the dovecot.index.cache, so when I try to bring in the original file, a 'cat' does no harm. Moreover, I've tried to do this with all files on the virtual machine, same results - no harm.
Is selinux running? If so, turn it off and see if that helps. And turn off anything/everything else on this Linux VM that's not running on the other Dovecot VMs you mentioned are working properly. Let's do some process of elimination here.
-- Stan
On 12/23/2011 11:08 AM, hydra wrote:
Selinux wasn't running.
I can reproduce the problem with this setup:
- restore the maildir with the broken dovecot.index.cache
- run /usr/bin/doveadm -D expunge -A mailbox Trash savedbefore 1w in an infinite loop
- use Horde webmail to access the mail over IMAP
- randomly display messages
I could reproduce the problem on: Linux-2.6.37-hardened-r7, Linux-2.6.39-hardened-r8, Linux-3.0.4-hardened-r5, Linux-3.0.6-gentoo, and Linux-3.1.4-gentoo.
However, I failed to reproduce the problem on Linux-2.6.36-hardened-r9.
So it seems the problem was introduced in 2.6.37; I'm going to look into it later. Wishing you a blessed and merry Christmas, and thank you for the support ;)
Aha, cool. Glad you're making some progress, and that you now know it's kernel-related in some way, just as Timo suspected. Keep in mind it may not necessarily be a code change in 2.6.37 onward; it could be a compiler/linker issue. It could also be that the default ext4 journal behavior changed (I haven't checked), or something similar.
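One quick check in that direction is to compare the effective ext4 mount options between the working 2.6.36 kernel and a failing one (a sketch; the MOUNTPOINT default is / so it runs anywhere, but /var/data/mail is the one of interest in this thread):

```shell
# Print the filesystem type and effective mount options for a mount
# point, straight from /proc/mounts. Run under each kernel and diff
# the output to spot changed defaults (journal mode, barriers, etc.).
MOUNTPOINT="${MOUNTPOINT:-/}"
awk -v mp="$MOUNTPOINT" '$2 == mp { print $3, $4 }' /proc/mounts
```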
Merry Christmas to you as well hydra. I don't really deserve much credit for support here. I've just nudged you in the direction you already knew you should be heading. :)
Best regards,
-- Stan
participants (3):
- hydra
- Stan Hoeppner
- Timo Sirainen