On Sunday 05 October 2008 07:16:23 Timo Sirainen imposed structure on a stream of electrons, yielding:
On Wed, 2008-10-01 at 21:15 -0400, Karsten Künne wrote:
we recently upgraded dovecot from 1.0.5 to 1.1.3+latest patches. After the upgrade I saw some errors in the logs and I'm not sure whether they're a cause for concern. Our setup is 2 servers running openSUSE 11.0 and a NetApp Filer as storage backend. Following is the config:
So it's possible that two users are accessing the same mailbox from two different servers at the same time?
It's not impossible but it should be very rare. Users connect to a generic server name which resolves to two different IP numbers via round-robin DNS. It's not perfect but it has worked well for us in the past.
It was working the same way with v1.0 too? Without errors? That sounds strange, since that kind of a setup should have been much more broken with v1.0.
Well, I didn't really look for errors with 1.0 if no user complained so I can't say for sure that there were no errors. At least, no user complained :-) , and that's the most important. I haven't gotten any real complaints with 1.1 either but I'm looking a bit more now because of the new version.
mail_location: maildir:~/.Maildir:CONTROL=/var/mailctl/%n:INDEX=/var/mailctl/%n
I guess /var/mailctl is also on netapp?
Yes, but it has no quotas set.
mail_nfs_storage: yes mail_nfs_index: yes
These should have prevented these errors, but I guess they don't fully work everywhere then..
Oct 1 07:39:27 host1 dovecot: IMAP(user1): Invalid transaction log size (83856 vs 83160): /var/mailctl/corey/.INBOX/dovecot.index.log
..
None of these should ever happen.
After further examination the sequence of errors seems to be like this:
First:
Oct 5 23:10:14 host1 deliver(usera): Corrupted transaction log file /var/mailctl/usera/.INBOX/dovecot.index.log: Unexpected garbage at EOF
Second:
Oct 5 23:10:14 host1 deliver(usera): Fixed index file /var/mailctl/usera/.INBOX/dovecot.index: log_file_seq 222 -> 223
Third:
Oct 5 23:28:07 host1 deliver(usera): Transaction log file /var/mailctl/usera/.INBOX/dovecot.index.log: marked corrupted
Fourth:
Oct 5 23:28:07 host1 dovecot: IMAP(usera): Invalid transaction log size (608 vs 240): /var/mailctl/usera/.INBOX/dovecot.index.log Oct 5 23:28:07 host1 dovecot: IMAP(usera): Invalid transaction log size (608 vs 240): /var/mailctl/usera/.INBOX/dovecot.index.log ....
and so on.
I observed this pattern at least 3 times. It IS possible that the index files were already corrupted by 1.0.5. Hard to tell now.
Oct 1 09:14:55 host1 dovecot: IMAP(user2): Broken file /var/mailctl/user2/.INBOX/dovecot-uidlist line 15299: UIDs not ordered (206464 > 206464)
Oct 1 09:16:43 host1 dovecot: IMAP(user3): Broken file /var/mailctl/user3/.Trash/dovecot-uidlist line 22: UID larger than next_uid (153866 >= 153866)
These especially could cause visible problems to clients.
That's not good news. It probably doesn't matter for the Trash folder but the INBOX is definitely more critical. So far, I haven't heard any complaints.
After we upgraded our second pair of mailservers I believe I found an explanation for at least some of these errors. We did all of the upgrades by upgrading just one of the servers and then the second one the following day. So, for one day we had a mixture of 1.1.3 and 1.0.5 running, accessing the same mailstore and the same control and index files. I guess that wasn't such a great setup and could have lead to some corruption.
Oct 1 07:29:05 host1 dovecot: IMAP(user4): nfs_flush_fcntl: fcntl(/var/mailctl/user4/.INBOX/dovecot.index.cache, F_RDLCK) failed: Interrupted system call
Do you get a lot of these? This means it was trying to flush NFS read cache by read-locking the file, but it failed after waiting for 60 seconds. All of these locks should be really short, so if it's timing out things aren't working very well.
Fortunately, these errors disappeared in the last 3 days. I'll keep an eye out for them and I'll also check the NetApp logs.
Karsten.
Paul's Law: In America, it's not how much an item costs, it's how much you save.