[Dovecot] Corrupted transaction log file
Hello
I run dovecot 2.0.6 on a two-machine cluster using OCFS2 as the file system. I have some error messages like these in my log:
Oct 25 01:07:50 box5 dovecot: lmtp(8886, suporte=100br.com@lmtp1.prv.f1.k8.com.br): Error: Corrupted transaction log file /var/lib/imap/user/a3/suporte=100br.com/storage/dovecot.map.index.log seq 29: Transaction log corrupted unexpectedly at 21536: Invalid size 0 (type=0) (sync_offset=21908)
Oct 25 01:07:50 box5 dovecot: lmtp(8886, suporte=100br.com@lmtp1.prv.f1.k8.com.br): Error: Index /var/lib/imap/user/a3/suporte=100br.com/storage/dovecot.map.index: Lost log for seq=29 offset=21796
Oct 25 01:07:55 box5 dovecot: lmtp(8885, suporte=100br.com@lmtp1.prv.f1.k8.com.br): Error: Log synchronization error at seq=30,offset=312 for /var/lib/imap/user/a3/suporte=100br.com/storage/dovecot.map.index: Append with UID 5404, but next_uid = 5405
Oct 25 02:37:53 box5 dovecot: lmtp(8885, suporte=hostnet.com.br@lmtp1.prv.f1.k8.com.br): Error: Log synchronization error at seq=47,offset=22944 for /var/lib/imap/user/03/suporte=hostnet.com.br/storage/dovecot.map.index: Extension record update for invalid uid=21815
These two accounts happen to be accessed simultaneously by a number of people, but there are similar errors for "normal" accounts too, just not the same amount of logs. Is it not considered safe to do this kind of simultaneous access in a distributed filesystem like OCFS2?
Is there a way to fix this error? Would a "doveadm force-resync" do it?
Thanks, Andre
On 26.10.2010, at 19.58, Andre Nathan wrote:
> I run dovecot 2.0.6 on a two-machine cluster using OCFS2 as the file system. I have some error messages like these in my log:
> Oct 25 01:07:50 box5 dovecot: lmtp(8886, suporte=100br.com@lmtp1.prv.f1.k8.com.br): Error: Corrupted transaction log file /var/lib/imap/user/a3/suporte=100br.com/storage/dovecot.map.index.log seq 29: Transaction log corrupted unexpectedly at 21536: Invalid size 0 (type=0) (sync_offset=21908)
Have you set mmap_disable=yes?
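For reference, the setting asked about here is a single line in the Dovecot configuration. The fragment below is only a sketch: the file location is an assumption (with the split configuration layout it usually lives in conf.d/10-mail.conf, but your installation may differ).

```
# Assumed location: conf.d/10-mail.conf, or dovecot.conf itself.
# Makes Dovecot use read()/write() on index files instead of mmap().
mmap_disable = yes
```

Running `doveconf mmap_disable` on each node should print the value that node is actually using.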
> These two accounts happen to be accessed simultaneously by a number of people, but there are similar errors for "normal" accounts too, just not the same amount of logs. Is it not considered safe to do this kind of simultaneous access in a distributed filesystem like OCFS2?
Only if it works correctly, and it doesn't really look like it does.
> Is there a way to fix this error? Would a "doveadm force-resync" do it?
Those errors should fix themselves automatically. Still, it's not very good if they keep happening. Sooner or later they will cause user visible problems.
On Tue, 2010-10-26 at 22:39 +0200, Timo Sirainen wrote:
> Have you set mmap_disable=yes?
Yep.
> Those errors should fix themselves automatically. Still, it's not very good if they keep happening. Sooner or later they will cause user visible problems.
They're not happening all the time. It showed up two or three times in the logs during the night.
Thanks, Andre
On Tue, 2010-10-26 at 23:52 -0200, Andre Nathan wrote:
> They're not happening all the time. It showed up two or three times in the logs during the night.
A few errors of this kind appeared in the logs last night. The "Log synchronization error" appears more frequently than "Corrupted transaction log file", but I assume the former is a consequence of the latter.
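One way to check the assumption that the synchronization errors follow from the corruption errors is to tally both per account and see which accounts log a synchronization error with no earlier corruption error. The sketch below assumes syslog lines shaped like the ones quoted in this thread; the function names are made up for illustration and this is not a Dovecot tool.

```python
import re
from collections import Counter, defaultdict

# Matches lines shaped like the ones quoted in this thread, e.g.
# "Oct 25 01:07:50 box5 dovecot: lmtp(8886, user@host): Error: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+dovecot:\s+"
    r"\w+\(\d+,\s*(?P<user>[^)]+)\):\s+Error:\s+(?P<msg>.*)$"
)

# Error kinds taken from the messages quoted in the thread.
KINDS = (
    "Corrupted transaction log file",
    "Log synchronization error",
    "Lost log",
)


def classify(msg):
    """Return the first known error kind found in msg, else 'other'."""
    for kind in KINDS:
        if kind in msg:
            return kind
    return "other"


def summarize(lines):
    """Count error kinds per user: {user: Counter({kind: n})}."""
    per_user = defaultdict(Counter)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            per_user[m.group("user")][classify(m.group("msg"))] += 1
    return per_user


def sync_without_prior_corruption(lines):
    """Users with a sync error that no corruption error preceded in the input."""
    seen_corruption = set()
    flagged = set()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        user, kind = m.group("user"), classify(m.group("msg"))
        if kind == "Corrupted transaction log file":
            seen_corruption.add(user)
        elif kind == "Log synchronization error" and user not in seen_corruption:
            flagged.add(user)
    return flagged
```

Feeding it the lines of the mail log (for example `summarize(open(path))`, where `path` is wherever your syslog writes Dovecot messages) gives per-account counts. A non-empty result from `sync_without_prior_corruption` would suggest the two errors are not always cause and effect, at least within the log window examined.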
This really seems to be related to the shared accounts. Our load balancer makes no attempt to send the same users to the same server; it just sends connections based on server load. This allows two simultaneous connections to access an account, one on each server. Given a distributed FS, this should be OK, right? At least in theory it's no different than two CPU cores accessing the same account in a single server.
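To make the routing behaviour concrete, here is a small sketch contrasting the load-based choice described above with a per-user choice that always maps an account to the same node. It is purely illustrative: the node names and both functions are hypothetical and do not correspond to any particular load balancer's configuration.

```python
import hashlib

# Hypothetical node names, for illustration only.
SERVERS = ["node-a", "node-b"]


def pick_by_load(load):
    """Load-based choice: the least loaded server wins, whoever the user is.

    `load` maps server name to its current load figure.
    """
    return min(load, key=load.get)


def pick_by_user(user, servers=SERVERS):
    """User-based choice: a stable hash of the username picks the server,
    so every connection for one account lands on the same node."""
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]
```

With `pick_by_load`, two connections for one account can end up on different nodes whenever the load figures change between them, which is the situation described above. With `pick_by_user`, they cannot, as long as the server list stays the same.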
Best regards, Andre
On Tue, 2010-10-26 at 23:52 -0200, Andre Nathan wrote:
> On Tue, 2010-10-26 at 22:39 +0200, Timo Sirainen wrote:
> > Have you set mmap_disable=yes?
> Yep.
Just out of curiosity, is this setting really needed, or is it for performance reasons? OCFS2 claims to support mmap:
http://www.oracle.com/us/technologies/linux/025995.htm
Regards, Andre
On 27.10.2010, at 15.48, Andre Nathan wrote:
> On Tue, 2010-10-26 at 23:52 -0200, Andre Nathan wrote:
> > On Tue, 2010-10-26 at 22:39 +0200, Timo Sirainen wrote:
> > > Have you set mmap_disable=yes?
> > Yep.
> Just out of curiosity, is this setting really needed, or is it for performance reasons? OCFS2 claims to support mmap:
If mmap is supported, it's not necessary, but might still be better for performance (or might not, depending on filesystem's mmap implementation..)
> This really seems to be related to the shared accounts. Our load balancer makes no attempt to send the same users to the same server; it just sends connections based on server load. This allows two simultaneous connections to access an account, one on each server. Given a distributed FS, this should be OK, right? At least in theory it's no different than two CPU cores accessing the same account in a single server.
In theory sure.. But Dovecot's indexes have already been heavily stress tested in local filesystems with multiple CPU cores and it should be stable nowadays. If you can easily break it with a clusterfs, the problem is clearly the clusterfs itself.
You could easily test this with imaptest (http://imapwiki.org/ImapTest): run it against the same mailbox on two OCFS2 nodes at the same time and see whether Dovecot immediately starts logging errors.
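A minimal sketch of that test: start one imaptest process per node at the same time, both logging in as the same test account. It assumes imaptest is installed and on PATH; the host names, account, and password in the usage note are placeholders, and the helper names are made up for illustration.

```python
import subprocess


def imaptest_argv(host, user, password, secs=60, clients=10, binary="imaptest"):
    """Build an imaptest command line using its key=value parameters."""
    return [
        binary,
        f"host={host}",        # IMAP server on one cluster node
        f"user={user}",        # same account on every node, so they share a mailbox
        f"pass={password}",
        f"clients={clients}",  # concurrent connections per imaptest process
        f"secs={secs}",        # stop after this many seconds
    ]


def run_on_all(hosts, user, password, **kwargs):
    """Start one imaptest per host simultaneously; return their exit codes."""
    procs = [
        subprocess.Popen(imaptest_argv(host, user, password, **kwargs))
        for host in hosts
    ]
    return [proc.wait() for proc in procs]
```

For example, `run_on_all(["node-a.example", "node-b.example"], "testuser", "testpass", secs=120)` would hammer the same test mailbox through both nodes for two minutes. Use a throwaway account, since imaptest appends and deletes messages, and watch the Dovecot logs on both nodes while it runs.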