[Dovecot] Maildir + NFS + multiple machines = spectacular failure

Apps Lists appslists at neonova.net
Mon May 1 19:20:51 EEST 2006


Hi Rich.

Glad to see that helped out.  I can still introduce errors with IMAP even
with that setting.

We just finished setting up GFS and the results are the same:

May  1 11:33:11 dev4 dovecot: IMAP(jtest): fcntl() failed with file
/var/mailstore/72/af/375887/Maildir/dovecot.index: No such file or
directory
May  1 11:33:13 dev4 dovecot: IMAP(jtest): Corrupted transaction log file
/var/mailstore/72/af/375887/Maildir/dovecot.index.log: Append with UID
2703, but next_uid = 2704
May  1 11:33:15 dev4 dovecot: IMAP(jtest): Corrupted transaction log file
/var/mailstore/72/af/375887/Maildir/dovecot.index.log: Append with UID
2704, but next_uid = 2705
May  1 11:35:16 dev4 dovecot: IMAP(jtest): fcntl() failed with file
/var/mailstore/72/af/375887/Maildir/dovecot.index: No such file or
directory
May  1 11:35:16 dev4 dovecot: IMAP(jtest): file mail-index.c: line 865
(mail_index_sync_from_transactions): assertion failed: (prev_seq <=
max_seq && (prev_seq != max_seq || prev_offset <= max_offset))
....
May  1 11:40:19 dev4 dovecot: IMAP(jtest): file_dotlock_open() failed with
file /var/mailstore/72/af/375887/Maildir/dovecot.index.log: Resource
temporarily unavailable
May  1 11:40:19 dev4 dovecot: child 25918 (imap) killed with signal 11

I'm not entirely sure why I'm seeing file_dotlock_open() calls.  We're
using fcntl.

For the record:  Three members of the cluster. Two machines are running
Dovecot, mounting the filesystem with GFS.  I have another random machine
that connects to both machines with IMAP.  The program connects, gets a
list of folders and APPENDs a message.  It then disconnects and repeats. 
I run one instance of this test against dovecot#1 and another against
dovecot#2, and can get the above failure messages within seconds.

This is pretty much the same behaviour we've seen with NFS.

Steve


> Adding "noac" to our /etc/fstab NFS mounts seems to have helped our
> index errors so far:
>
> nfsserver:/var/mail      /var/mail     nfs
> rw,hard,intr,rsize=8192,wsize=8192,noac 0 0
>
> Over the weekend we had 1,465,031 POP3 sessions across 3 servers, and
> only logged 15 "Corrupted" errors. They were of these types:
>
> Corrupted transaction log file /var/mail/(user)/Maildir/
> dovecot.index.log: unexpected end of file while reading header
> Corrupted transaction log file /var/mail/(user)/Maildir/
> dovecot.index.log: end_offset (236) > current sync_offset (200)
> Corrupted transaction log file /var/mail/(user)/Maildir/
> dovecot.index.log: record size too small (type=0x8080, offset=7260,
> size=0)
> Corrupted transaction log file /var/mail/(user)/Maildir/
> dovecot.index.log: hdr.size too large (0)
>
> We also had 1,929 IMAP sessions from our testing support staff, and
> only 2 errors:
>
> Corrupted transaction log file /var/mail/(user)/Maildir/
> dovecot.index.log: record size too small (type=0x5ef9, offset=200,
> size=0)
> Corrupted transaction log file /var/mail/(user)/Maildir/.Junk E-mail/
> dovecot.index.log: end_offset (15292) > current sync_offset (15236)
>
> Rich
>
> On Apr 28, 2006, at 11:27 AM, Apps Lists wrote:
>
>> Rich,
>>
>>   We're also in the process of setting up a GFS/Cluster with
>> CentOS.  I'm
>> hopeful this will eliminate some of the issues.
>>
>>   I'll keep everyone posted.
>>
>> Steve
>>
>>> Hi Steve,
>>>
>>> We'll give that a try.  I'm sure users would prefer a slightly slower
>>> mail experience if it ensured accurate mail lists and reduced the
>>> chance for lost mail.
>>>
>>> Rich Sandberg
>>> richs at whidbey.net
>>> Systems & Network Administrator
>>> Whidbey Telecom Network Operations
>>>
>>>
>>>
>>> On Apr 28, 2006, at 9:37 AM, Apps Lists wrote:
>>>
>>>> Hi Rich.
>>>>
>>>> I've seen a lot of the same IMAP errors... haven't played with POP
>>>> yet.
>>>>
>>>> You might want to try adding "noac" to the mount options... your
>>>> line..
>>>>
>>>> nfshost:/var/mail      /var/mail     nfs
>>>> rw,hard,intr,rsize=8192,wsize=8192 0 0
>>>>
>>>> .... would become:
>>>>
>>>>
>>>> nfshost:/var/mail      /var/mail     nfs
>>>> rw,hard,intr,rsize=8192,wsize=8192,noac 0 0
>>>>
>>>>
>>>> .... and see if that helps.  I haven't seen much if any difference.
>>>>
>>>> What I don't understand is how some people are running what appears
>>>> to be
>>>> the same configuration ... RHEL/CentOS ... NFS3, 2+ Dovecot
>>>> machines with
>>>> IMAP and have no problems.
>>>>
>>>> I'd love to get to the bottom of this.
>>>>
>>>> Steve
>>>>
>>>>> We're using 2.6.9-34, the latest RHEL4 kernel, over NFSv3 with 3
>>>>> Dovecot servers. Our most concerning episode was a failed APPEND
>>>>> operation (with a set of folders containing 200 messages):
>>>>>
>>>>> IMAP(username): Corrupted transaction log file /var/mail/username/
>>>>> Maildir/dovecot.index.log: unexpected end of file while reading
>>>>> header
>>>>>
>>>>> That resulted in one of the new APPENDed mail folders and content
>>>>> being lost.  We see those errors at other times too, although they
>>>>> don't always end in a noticeable failure.
>>>>>
>>>>> Continuous other errors we see that might be NFS related:
>>>>>
>>>>> IMAP(username): close() failed with index cache file /var/mail/
>>>>> username/Maildir/.Deleted Messages/dovecot.index.cache: Stale   NFS
>>>>> file
>>>>> handle
>>>>> IMAP(username): Corrupted index cache file /var/mail/username/
>>>>> Maildir/.OldMail/dovecot.index.cache: invalid record size
>>>>> IMAP(username): close() failed with index file /var/mail/username/
>>>>> Maildir/.INBOX.Security/dovecot.index: Bad file descriptor
>>>>>
>>>>> We also see errors with POP3, which would rarely benefit from
>>>>> indexes
>>>>> (since 92% of our users RETR-and-DELE messages):
>>>>>
>>>>> POP3(username): file_dotlock_create() failed with file /var/mail/
>>>>> username/Maildir/dovecot.index.log: Stale NFS file handle
>>>>> POP3(username): Corrupted transaction log file /var/mail/username/
>>>>> Maildir/dovecot.index.log: record size too small (type=0x8080,
>>>>> offset=9056, size=0)
>>>>> POP3(username): Our dotlock file /var/mail/username/Maildir/
>>>>> dovecot.index.log.lock was deleted (kept it -118 secs)
>>>>> POP3(username): Timeout while waiting for release of dotlock for
>>>>> transaction log file /var/mail/username/Maildir/dovecot.index.log
>>>>>
>>>>> We've tried manually removing "dovecot*" from all folders, allowing
>>>>> the indexes to be re-built, but the errors return.
>>>>>
>>>>> Excerpt from our config:
>>>>>
>>>>> mail_cache_fields mail_never_cache_fields mail_cache_min_mail_count
>>>>>>> 1000 mailbox_idle_check_interval = 10 mail_full_filesystem_access
>>>>> = no
>>>>> mail_read_mmaped = no
>>>>> mmap_disable = yes
>>>>> mmap_no_write = no
>>>>> lock_method = dotlock
>>>>> maildir_stat_dirs = yes
>>>>> maildir_copy_with_hardlinks = no
>>>>>
>>>>> Our NFS mount:
>>>>>
>>>>> nfshost:/var/mail      /var/mail     nfs
>>>>> rw,hard,intr,rsize=8192,wsize=8192 0 0
>>>>>
>>>>> I've read some of the list archives about disabling the index
>>>>> files.
>>>>> We realize that indexes should be beneficial, but we'd like to at
>>>>> least experiment with a "no index" install.  This would solve
>>>>> some of
>>>>> the NFS locking issues, eliminate the need to modify our MTA
>>>>> behavior, and remove (potentially) millions of files from the
>>>>> system.
>>>>>
>>>>> Did anyone discover a method to cleanly disable the indexes? At
>>>>> least
>>>>> for POP3?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Rich Sandberg
>>>>> richs at whidbey.net
>>>>>
>>>>>
>>>>> On Apr 28, 2006, at 6:59 AM, Luis Melendez wrote:
>>>>>
>>>>>> Hi all.
>>>>>> We also had problems with  client NFS locking using RHEL4, kernel
>>>>>> 2.6.9
>>>>>> Some of them were solved updating the kelnel from RedHat.
>>>>>> Hope this helps.
>>>>>>
>>>>>> Regards.
>>>>>>
>>>>>> Apps Lists wrote:
>>>>>>
>>>>>>> I'm running beta7 on two machines, with maildir on NFS.  I have
>>>>>>> lockd
>>>>>>> running on all machines.  I've found that Dovecot is highly
>>>>>>> unstable with
>>>>>>> NFS when accessing a mailbox on more than one machine at the same
>>>>>>> time.
>>>>>>>
>>>>>>> Both dovecot machines have:
>>>>>>>
>>>>>>> mmap_disable = yes
>>>>>>> lock_method = fcntl
>>>>>>>
>>>>>>> NFS is version 3, exported from a third linux machine.  All
>>>>>>> machines are
>>>>>>> running 2.6.9 kernel.
>>>>>>>
>>>>>>> Any ideas what's going wrong here?
>>>>>>>
>>>>>>> Steve
>>>>>>>
>>>>>>>
>>>>>>> *** dev4 ***
>>>>>>>
>>>>>>> Apr 28 09:17:40 dev4 dovecot: IMAP(jtest): Duplicate file in
>>>>>>> uidlist file
>>>>>>> /var/mailstore/72/af/375887/Maildir/dovecot-uidlist:
>>>>>>> 1146230258.P2889Q0M407851.dev4.neonova.net:2,
>>>>>>>
>>>>>>> Apr 28 09:17:41 dev4 dovecot: IMAP(jtest): Corrupted transaction
>>>>>>> log file
>>>>>>> /var/mailstore/72/af/375887/Maildir/dovecot.index.log: Extension
>>>>>>> introduction for unknown id 0
>>>>>>>
>>>>>>> Apr 28 09:17:41 dev4 dovecot: IMAP(jtest): file mail-index.c:
>>>>>>> line
>>>>>>> 883
>>>>>>> (mail_index_sync_from_transactions): assertion failed:
>>>>>>> (hdr.log_file_int_offset == (*map)->hdr.log_file_int_offset)
>>>>>>>
>>>>>>>
>>>>>>> Apr 28 09:17:41 dev4 dovecot: child 2889 (imap) killed with
>>>>>>> signal 6
>>>>>>>
>>>>>>>
>>>>>>> *** dev3 ***
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Apr 28 09:17:39 dev3 dovecot: IMAP(jtest): Duplicate file in
>>>>>>> uidlist file
>>>>>>> /var/mailstore/72/af/375887/Maildir/dovecot-uidlist:
>>>>>>> 1146230254.P2888Q0M651598.dev4.neonova.net
>>>>>>>
>>>>>>> Apr 28 09:17:41 dev3 dovecot: IMAP(jtest): Fixed index file
>>>>>>> /var/mailstore/72/af/375887/Maildir/dovecot.index:
>>>>>>> first_recent_uid_lowwater 4036 -> 1772
>>>>>>>
>>>>>>> Apr 28 09:17:44 dev3 dovecot: IMAP(jtest):
>>>>>>> /var/mailstore/72/af/375887/Maildir/dovecot-uidlist: next_uid was
>>>>>>> lowered
>>>>>>> (4036 -> 1774)
>>>>>>>
>>>>>>> Apr 28 09:17:45 dev3 dovecot: IMAP(jtest): Duplicate file in
>>>>>>> uidlist file
>>>>>>> /var/mailstore/72/af/375887/Maildir/dovecot-uidlist:
>>>>>>> 1146145480.P5976Q0M245663.dev3.neonova.net:2,
>>>>>>>
>>>>>>> Apr 28 09:17:46 dev3 dovecot: IMAP(jtest): Fixed index file
>>>>>>> /var/mailstore/72/af/375887/Maildir/dovecot.index:
>>>>>>> first_recent_uid_lowwater 4038 -> 1774
>>>>>>>
>>>>>>> Apr 28 09:17:47 dev3 dovecot: IMAP(jtest):
>>>>>>> /var/mailstore/72/af/375887/Maildir/dovecot-uidlist: next_uid was
>>>>>>> lowered
>>>>>>> (4038 -> 1775)
>>>>>>>
>>>>>>> Apr 28 09:17:48 dev3 dovecot: IMAP(jtest): Duplicate file in
>>>>>>> uidlist file
>>>>>>> /var/mailstore/72/af/375887/Maildir/dovecot-uidlist:
>>>>>>> 1146145480.P5976Q0M245663.dev3.neonova.net:2,
>>>>>>>
>>>>>>> Apr 28 09:17:49 dev3 dovecot: IMAP(jtest):
>>>>>>> /var/mailstore/72/af/375887/Maildir/dovecot-uidlist: next_uid was
>>>>>>> lowered
>>>>>>> (4039 -> 1775)
>>>>>>>
>>>>>>> Apr 28 09:17:50 dev3 dovecot: IMAP(jtest): Corrupted transaction
>>>>>>> log file
>>>>>>> /var/mailstore/72/af/375887/Maildir/dovecot.index.log: Extension
>>>>>>> introduction for unknown id 0
>>>>>>>
>>>>>>> Apr 28 09:17:50 dev3 dovecot: IMAP(jtest): file mail-index.c:
>>>>>>> line
>>>>>>> 881
>>>>>>> (mail_index_sync_from_transactions): assertion failed:
>>>>>>> (hdr.messages_count
>>>>>>> == (*map)->hdr.messages_count)
>>>>>>>
>>>>>>> Apr 28 09:17:50 dev3 dovecot: IMAP(jtest):
>>>>>>> /var/mailstore/72/af/375887/Maildir/dovecot-uidlist: next_uid was
>>>>>>> lowered
>>>>>>> (4039 -> 1775)
>>>>>>>
>>>>>>> Apr 28 09:17:51 dev3 dovecot: IMAP(jtest): Corrupted transaction
>>>>>>> log file
>>>>>>> /var/mailstore/72/af/375887/Maildir/dovecot.index.log: Append
>>>>>>> with
>>>>>>> UID
>>>>>>> 1775, but next_uid = 4039
>>>>>>>
>>>>>>> Apr 28 09:17:51 dev3 dovecot: IMAP(jtest): Corrupted transaction
>>>>>>> log file
>>>>>>> /var/mailstore/72/af/375887/Maildir/dovecot.index.log: Append
>>>>>>> with
>>>>>>> UID
>>>>>>> 1775, but next_uid = 4039
>>>>>>>
>>>>>>> Apr 28 09:17:52 dev3 dovecot: IMAP(jtest): Unexpected
>>>>>>> transaction log
>>>>>>> desync with index /var/mailstore/72/af/375887/Maildir/
>>>>>>> dovecot.index
>>>>>>>
>>>>>>> Apr 28 09:17:52 dev3 dovecot: IMAP(jtest): Disconnected:
>>>>>>> Mailbox   is
>>>>>>> in
>>>>>>> inconsistent state, please relogin.
>>>>>>>
>>>>>>> Apr 28 09:17:52 dev3 dovecot: IMAP(jtest): Fixed index file
>>>>>>> /var/mailstore/72/af/375887/Maildir/dovecot.index:
>>>>>>> first_recent_uid_lowwater 4039 -> 1776
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> +----------------------------------------------
>>>>>> ^-----------------------+
>>>>>> | Luis Meléndez Aganzo                         ^  Email:
>>>>>> luism at uco.es  |
>>>>>> | Servicio de Informática                      ^  Tlf: 34-(9)
>>>>>> 57-211022 |
>>>>>> | Analista - Área de Sistemas                  ^  Fax: 34-(9)
>>>>>> 57-218116 |
>>>>>> | Universidad de Córdoba (SPAIN)               ^  http://
>>>>>> www.uco.es    |
>>>>>> +----------------------------------------------
>>>>>> ^-----------------------+





More information about the dovecot mailing list