[Dovecot] Broken dovecot-uidlist files
I have just upgraded to 1.1.8, hoping to fix these types of errors:
Jan 20 10:39:27 laura deliver(xxxxxx@example.net): Broken file /srv/storage/mail/store/net/example/xx/xxxxxx/Maildir/dovecot-uidlist line 156: Invalid data:
Jan 20 10:41:19 laura dovecot: POP3(yy.yy@example.at): Broken file /srv/storage/mail/store/at/example/yy.yy/Maildir/dovecot-uidlist line 2: Invalid data:
Unfortunately, they still pop up. Since I suspect these errors to be responsible for customers complaining about downloading mails twice, I'd like to fix them.
Our setup has multiple servers sharing the same NFS-mounted directory for maildirs and indexes; we have set:

mmap_disable = yes
dotlock_use_excl = yes
fsync_disable = no
mail_nfs_index = yes
lock_method = fcntl
As far as I can tell, for at least one of these errors, there was no concurrent access to the mailbox in question.
What can I do to solve this problem, or to analyze it further?
Ulrich
On Tue, 2009-01-20 at 11:11 +0100, Ulrich Zehl wrote:
I have just upgraded to 1.1.8, hoping to fix these types of errors:
Jan 20 10:39:27 laura deliver(xxxxxx@example.net): Broken file /srv/storage/mail/store/net/example/xx/xxxxxx/Maildir/dovecot-uidlist line 156: Invalid data:
Jan 20 10:41:19 laura dovecot: POP3(yy.yy@example.at): Broken file /srv/storage/mail/store/at/example/yy.yy/Maildir/dovecot-uidlist line 2: Invalid data:
Unfortunately, they still pop up. Since I suspect these errors to be responsible for customers complaining about downloading mails twice, I'd like to fix them.
Yes, probably the reason.
Our setup has multiple servers sharing the same NFS-mounted directory for maildirs and indexes; we have set:

mmap_disable = yes
dotlock_use_excl = yes
fsync_disable = no
mail_nfs_index = yes
lock_method = fcntl
What about mail_nfs_storage=yes?
As far as I can tell, for at least one of these errors, there was no concurrent access to the mailbox in question.
Are you also using Dovecot deliver?
On Sun, Jan 25, 2009 at 09:28:27PM -0500, Timo Sirainen wrote:
On Tue, 2009-01-20 at 11:11 +0100, Ulrich Zehl wrote:
Our setup has multiple servers sharing the same NFS-mounted directory for maildirs and indexes; we have set:

mmap_disable = yes
dotlock_use_excl = yes
fsync_disable = no
mail_nfs_index = yes
lock_method = fcntl
What about mail_nfs_storage=yes?
Yes, that's set too. (Sorry, forgot about that.)
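For reference, the complete set of NFS-related settings confirmed at this point in the thread, in dovecot.conf syntax for the 1.1 series:

    mmap_disable = yes
    dotlock_use_excl = yes
    fsync_disable = no
    mail_nfs_index = yes
    mail_nfs_storage = yes
    lock_method = fcntl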
As far as I can tell, for at least one of these errors, there was no concurrent access to the mailbox in question.
Are you also using Dovecot deliver?
Yes, for almost all deliveries (see e.g. the first log entry in my original post). There's only one nightly script that delivers one message directly to some of the maildirs, but the errors also pop up on mailboxes that the script doesn't touch.
On Mon, 2009-01-26 at 14:29 +0100, Ulrich Zehl wrote:
On Sun, Jan 25, 2009 at 09:28:27PM -0500, Timo Sirainen wrote:
On Tue, 2009-01-20 at 11:11 +0100, Ulrich Zehl wrote:
Our setup has multiple servers sharing the same NFS-mounted directory for maildirs and indexes; we have set:

mmap_disable = yes
dotlock_use_excl = yes
fsync_disable = no
mail_nfs_index = yes
lock_method = fcntl
What about mail_nfs_storage=yes?
Yes, that's set too. (Sorry, forgot about that.)
Perhaps the NFS cache flushing doesn't work then for some reason. What OS (kernel) are you using on the Dovecot servers? What values have you set for the attribute cache?
As far as I can tell, for at least one of these errors, there was no concurrent access to the mailbox in question.
Are you also using Dovecot deliver?
Yes, for almost all deliveries (see e.g. the first log entry in my original post). There's only one nightly script that delivers one message directly to some of the maildirs, but the errors also pop up on mailboxes that the script doesn't touch.
I was just thinking: when you said there was no concurrent access, did you also consider deliver processes and not just imap/pop3?
On Mon, Jan 26, 2009 at 10:02:19AM -0500, Timo Sirainen wrote:
Perhaps the NFS cache flushing doesn't work then for some reason. What OS (kernel) are you using on the Dovecot servers? What values have you set for the attribute cache?
On the client side, it's Linux 2.6.23.16. All attribute cache related values are at their default, as far as I can tell. The entry in fstab reads:
nfs-server:/srv/storage /srv/storage nfs rw,nfsvers=3,hard,intr,nosuid,noexec,nodev,noatime 0 0
I was just thinking: when you said there was no concurrent access, did you also consider deliver processes and not just imap/pop3?
I just checked again. See below for an example of the logs for one mailbox.
$ grep 123456@example.net mail.log
Jan 27 08:00:17 allina dovecot: pop3-login: Login: user=<123456@example.net>, method=PLAIN, rip=80.x.x.x, lip=10.x.x.x
Jan 27 08:00:19 allina dovecot: POP3(123456@example.net): Disconnected: Logged out top=0/0, retr=1/1283428, del=0/1, size=1283408
Jan 27 08:00:57 laura deliver(123456@example.net): msgid=<497EB077.000009.01164@ERNESTO>: saved mail to INBOX
Jan 27 08:00:57 laura postfix-store/pipe[18734]: 753522531F: to=<123456@example.net>, relay=dovecot, delay=0.91, delays=0.28/0.06/0/0.57, dsn=2.0.0, status=sent (delivered via dovecot service)
Jan 27 08:04:35 laura deliver(123456@example.net): msgid=<497EB05F.000005.01164@ERNESTO>: saved mail to INBOX
Jan 27 08:04:35 laura postfix-store/pipe[19818]: 0ABB3251E0: to=<123456@example.net>, relay=dovecot, delay=0.42, delays=0.11/0.06/0/0.25, dsn=2.0.0, status=sent (delivered via dovecot service)
Jan 27 08:15:16 ellia dovecot: pop3-login: Login: user=<123456@example.net>, method=PLAIN, rip=80.x.x.x, lip=10.x.x.x
Jan 27 08:15:16 ellia dovecot: POP3(123456@example.net): Broken file /srv/storage/mail/store/net/example/12/123456/Maildir/dovecot-uidlist line 2: Invalid data:
On Tue, Jan 27, 2009 at 08:30:14AM +0100, Ulrich Zehl wrote:
On Mon, Jan 26, 2009 at 10:02:19AM -0500, Timo Sirainen wrote:
Perhaps the NFS cache flushing doesn't work then for some reason. What OS (kernel) are you using on the Dovecot servers? What values have you set for the attribute cache?
On the client side, it's Linux 2.6.23.16. All attribute cache related values are at their default, as far as I can tell. The entry in fstab reads:
nfs-server:/srv/storage /srv/storage nfs rw,nfsvers=3,hard,intr,nosuid,noexec,nodev,noatime 0 0
As much as I hate replying to myself, I'm at a loss here. What can I do to debug this issue?
Could this even be a problem on the NFS server side?
On Tue, 2009-01-27 at 08:30 +0100, Ulrich Zehl wrote:
On Mon, Jan 26, 2009 at 10:02:19AM -0500, Timo Sirainen wrote:
Perhaps the NFS cache flushing doesn't work then for some reason. What OS (kernel) are you using on the Dovecot servers? What values have you set for the attribute cache?
On the client side, it's Linux 2.6.23.16. All attribute cache related values are at their default, as far as I can tell. The entry in fstab reads:
nfs-server:/srv/storage /srv/storage nfs rw,nfsvers=3,hard,intr,nosuid,noexec,nodev,noatime 0 0
Setting actimeo=0 probably fixes this, but also probably increases the load a lot. actimeo=1 might work ok and reduce how often these problems happen, but not eliminate them completely.
Dovecot's nfs settings should avoid this problem though. You could see if upgrading your kernel helps. Some kernels have somewhat broken NFS code.
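For illustration, the fstab entry quoted above would look like this with a one-second attribute cache (actimeo=1 is just the compromise value suggested above, not a tested recommendation):

    nfs-server:/srv/storage /srv/storage nfs rw,nfsvers=3,hard,intr,nosuid,noexec,nodev,noatime,actimeo=1 0 0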
I was just thinking: when you said there was no concurrent access, did you also consider deliver processes and not just imap/pop3?
I just checked again. See below for an example of the logs for one mailbox.
$ grep 123456@example.net mail.log
Jan 27 08:00:17 allina dovecot: pop3-login: Login: user=<123456@example.net>, method=PLAIN, rip=80.x.x.x, lip=10.x.x.x
Jan 27 08:00:19 allina dovecot: POP3(123456@example.net): Disconnected: Logged out top=0/0, retr=1/1283428, del=0/1, size=1283408
Jan 27 08:00:57 laura deliver(123456@example.net): msgid=<497EB077.000009.01164@ERNESTO>: saved mail to INBOX
So allina modified dovecot-uidlist and soon afterwards laura probably was using a cached dovecot-uidlist and corrupted it.
On Fri, Feb 20, 2009 at 05:50:28PM -0500, Timo Sirainen wrote:
On Tue, 2009-01-27 at 08:30 +0100, Ulrich Zehl wrote:
On Mon, Jan 26, 2009 at 10:02:19AM -0500, Timo Sirainen wrote:
Perhaps the NFS cache flushing doesn't work then for some reason. What OS (kernel) are you using on the Dovecot servers? What values have you set for the attribute cache?
On the client side, it's Linux 2.6.23.16. All attribute cache related values are at their default, as far as I can tell. The entry in fstab reads:
nfs-server:/srv/storage /srv/storage nfs rw,nfsvers=3,hard,intr,nosuid,noexec,nodev,noatime 0 0
Setting actimeo=0 probably fixes this, but also probably increases the load a lot. actimeo=1 might work ok and reduce how often these problems happen, but not eliminate them completely.
Dovecot's nfs settings should avoid this problem though. You could see if upgrading your kernel helps. Some kernels have somewhat broken NFS code.
I did
# mount -o remount,actimeo=0 /srv/storage
around 9 this morning, but I'm still seeing these errors pop up. (I verified that there are mailboxes where this happened more than once since 10 today.)
Is it time to upgrade my kernel, or are there other options left?
$ grep 123456@example.net mail.log
Jan 27 08:00:17 allina dovecot: pop3-login: Login: user=<123456@example.net>, method=PLAIN, rip=80.x.x.x, lip=10.x.x.x
Jan 27 08:00:19 allina dovecot: POP3(123456@example.net): Disconnected: Logged out top=0/0, retr=1/1283428, del=0/1, size=1283408
Jan 27 08:00:57 laura deliver(123456@example.net): msgid=<497EB077.000009.01164@ERNESTO>: saved mail to INBOX
So allina modified dovecot-uidlist and soon afterwards laura probably was using a cached dovecot-uidlist and corrupted it.
Since the corrupted files are available for a little while (in the example, it was ~ 15 minutes), will it help if I repeatedly check all dovecot-uidlists and save those found to be corrupted to a special directory, so that we can see what the corruption actually is?
On Mon, 2009-02-23 at 16:27 +0100, Ulrich Zehl wrote:
On the client side, it's Linux 2.6.23.16. All attribute cache related values are at their default, as far as I can tell. The entry in fstab reads:
nfs-server:/srv/storage /srv/storage nfs rw,nfsvers=3,hard,intr,nosuid,noexec,nodev,noatime 0 0
Setting actimeo=0 probably fixes this, but also probably increases the load a lot. actimeo=1 might work ok and reduce how often these problems happen, but not eliminate them completely.
Dovecot's nfs settings should avoid this problem though. You could see if upgrading your kernel helps. Some kernels have somewhat broken NFS code.
I did
# mount -o remount,actimeo=0 /srv/storage
To both servers?
So allina modified dovecot-uidlist and soon afterwards laura probably was using a cached dovecot-uidlist and corrupted it.
Since the corrupted files are available for a little while (in the example, it was ~ 15 minutes), will it help if I repeatedly check all dovecot-uidlists and save those found to be corrupted to a special directory, so that we can see what the corruption actually is?
I suppose looking at a couple of those could verify if it's really just NFS caching related corruption or something else.
On Mon, Feb 23, 2009 at 12:10:29PM -0500, Timo Sirainen wrote:
On Mon, 2009-02-23 at 16:27 +0100, Ulrich Zehl wrote:
I did
# mount -o remount,actimeo=0 /srv/storage
To both servers?
There are three Dovecot servers (NFS clients), actually, and I remounted /srv/storage on all of them.
So allina modified dovecot-uidlist and soon afterwards laura probably was using a cached dovecot-uidlist and corrupted it.
Since the corrupted files are available for a little while (in the example, it was ~ 15 minutes), will it help if I repeatedly check all dovecot-uidlists and save those found to be corrupted to a special directory, so that we can see what the corruption actually is?
I suppose looking at a couple of those could verify if it's really just NFS caching related corruption or something else.
Okay, I will post results when I have them.
On Tue, Feb 24, 2009 at 08:08:52AM +0100, Ulrich Zehl wrote:
On Mon, Feb 23, 2009 at 12:10:29PM -0500, Timo Sirainen wrote:
I suppose looking at a couple of those could verify if it's really just NFS caching related corruption or something else.
Okay, I will post results when I have them.
I found a lot of files that had null bytes in them, but no other problems (which could be due to the fact that my uidlist checking script isn't all that great).
I have put two corrupted dovecot-uidlist files, the corresponding log files, and the check script up at http://www.topfen.net/user/ul/dovecot/. I hope that's enough data.
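In case that URL goes away, here is a minimal sketch of the kind of check script in question. It only tests for NUL bytes (the one corruption pattern seen so far); the store root comes from the log paths earlier in the thread, while the quarantine directory is an illustrative assumption:

    #!/bin/sh
    # Walk the mail store, find dovecot-uidlist files containing NUL
    # bytes, and keep a timestamped copy of each hit for inspection.
    STORE=/srv/storage/mail/store      # store root, per the logged paths
    QUARANTINE=/root/broken-uidlists   # assumed quarantine directory

    mkdir -p "$QUARANTINE"
    find "$STORE" -name dovecot-uidlist | while read -r f; do
        # A healthy uidlist is plain text; any NUL byte means corruption.
        if [ "$(tr -cd '\000' < "$f" | wc -c)" -gt 0 ]; then
            cp -p "$f" "$QUARANTINE/$(echo "$f" | tr '/' '_').$(date +%s)"
        fi
    done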
On Tue, 2009-02-24 at 11:59 +0100, Ulrich Zehl wrote:
On Tue, Feb 24, 2009 at 08:08:52AM +0100, Ulrich Zehl wrote:
On Mon, Feb 23, 2009 at 12:10:29PM -0500, Timo Sirainen wrote:
I suppose looking at a couple of those could verify if it's really just NFS caching related corruption or something else.
Okay, I will post results when I have them.
I found a lot of files that had null bytes in them, but no other problems (which could be due to the fact that my uidlist checking script isn't all that great).
I have put two corrupted dovecot-uidlist files, the corresponding log files, and the check script up at http://www.topfen.net/user/ul/dovecot/. I hope that's enough data.
OK. That just confirms the file-size caching issue:
- Server A read and cached the file's size
- Server B recreated it as a smaller file
- Server A wanted to append a new entry to the file, but used the cached file size, so it wrote past the end of the file, causing those NUL bytes in the middle.
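The effect is easy to reproduce by hand (a hypothetical demonstration, not taken from the thread): writing at a stale offset past the real end of a file leaves a NUL-filled gap, exactly like the gaps in the corrupted uidlists:

    $ printf '0123456789' > demo            # file is 10 bytes long
    $ printf 'NEWENTRY' | dd of=demo bs=1 seek=20 conv=notrunc 2>/dev/null
    $ od -c demo                            # bytes 10-19 read back as NULs
    0000000   0   1   2   3   4   5   6   7   8   9  \0  \0  \0  \0  \0  \0
    0000020  \0  \0  \0  \0   N   E   W   E   N   T   R   Y
    0000034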