Timo (and anyone else who feels like chiming in),
Could you tell me whether the amount of corruption I see on a daily basis is what you would consider "average" for our current setup and traffic? We migrated from Courier two months ago, and with the latest patches we are no longer experiencing any core dumps, so I'd like to know what to expect as the operational norm. We had never used Dovecot before this, so I have nothing to compare against.
Our physical setup is 10 CentOS 5.4 x86_64 IMAP/POP servers, all sharing the same NFS backend where the users' index, control, and Maildir files reside. They are accessed by direct client connections, multiple SquirrelMail web servers, and Pine users, all at the same time, with connections load-balanced by a layer 4 switch.
Each server averages about 400 connections, for a total of around 4000 concurrent connections during a normal business day. This is out of a possible user population of about 15,000.
All our dovecot servers syslog to one machine, and on average I see about 50-75 instances of file corruption per day. I'm not counting each line, since some instances of corruption generate a log message for each uid that's wrong. This is just me counting "user A was corrupted once at 10:00, user B was corrupted at 10:25" for example.
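The counting rule above can be sketched in Python roughly as follows. This is only an illustration: the syslog line layout in the regex, the five-minute grouping window, the default year, and the name `count_instances` are all assumptions and would need adjusting to the real log format.

```python
import re
from datetime import datetime, timedelta

# Assumed (hypothetical) syslog line shape, adjust to the real format:
#   "Jan 21 10:00:01 imap3 dovecot: IMAP(usera): Corrupted index cache file ..."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+dovecot:\s+"
    r"\w+\((?P<user>[^)]+)\):\s+Corrupted\s"
)

def count_instances(lines, window=timedelta(minutes=5), year=2010):
    """Count corruption instances per user.

    Lines for the same user that follow each other within `window`
    are treated as one instance, so a burst of per-uid messages
    counts once. Lines are assumed to be in chronological order.
    """
    last_seen = {}  # user -> timestamp of that user's latest corruption line
    counts = {}     # user -> number of distinct instances
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a corruption message
        # Classic syslog timestamps carry no year, so one is supplied.
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        user = m.group("user")
        prev = last_seen.get(user)
        if prev is None or ts - prev > window:
            counts[user] = counts.get(user, 0) + 1
        last_seen[user] = ts
    return counts
```

Summing the values of the returned dictionary gives a daily figure comparable to the 50-75 quoted above, provided the input is one day's worth of lines from the central syslog host.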
Examples of the corruption are as follows:
###########
Corrupted transaction log file ..../dovecot/.INBOX/dovecot.index.log seq 28: Invalid transaction log size (32692 vs 32800): ...../dovecot/.INBOX/dovecot.index.log (sync_offset=32692)
Corrupted index cache file ...../dovecot/.Sent Messages/dovecot.index.cache: Corrupted physical size for uid=624: 0 != 53490263
Corrupted transaction log file ..../dovecot/.INBOX/dovecot.index.log seq 66: Unexpected garbage at EOF (sync_offset=21608)
Corrupted transaction log file ...../dovecot/.Trash.RFA/dovecot.index.log seq 2: indexid changed 1264098644 -> 1264098664 (sync_offset=0)
Corrupted index cache file ...../dovecot/.INBOX/dovecot.index.cache: invalid record size
Corrupted index cache file ...../dovecot/.INBOX/dovecot.index.cache: field index too large (33 >= 19)
Corrupted transaction log file ..../dovecot/.INBOX/dovecot.index.log seq 40: record size too small (type=0x0, offset=5788, size=0) (sync_offset=5812)
##########
These are most of the unique messages I could find, although the majority are the same as the first two I posted. So, my question: is this normal for a setup such as ours? I've been arguing with my boss over this since the switch. My opinion is that in a setup where a user can be logged in through Thunderbird, SquirrelMail, and their BlackBerry concurrently, there will always be the occasional index/log corruption.
Unfortunately, he is of the opinion that there should rarely be any, and that there is a flaw in how Dovecot is designed to work with multiple services on an NFS backend.
What has been your experience so far?
Thanks, -Dave