[Dovecot] Broken dovecot-uidlist files

Ulrich Zehl

20 Jan 2009 20 Jan '09

12:11 p.m.

I have just upgraded to 1.1.8, hoping to fix these types of errors:

Jan 20 10:39:27 laura deliver(xxxxxx@example.net): Broken file /srv/storage/mail/store/net/example/xx/xxxxxx/Maildir/dovecot-uidlist line 156: Invalid data: Jan 20 10:41:19 laura dovecot: POP3(yy.yy@example.at): Broken file /srv/storage/mail/store/at/example/yy.yy/Maildir/dovecot-uidlist line 2: Invalid data:

Unfortunately, they still pop up. Since I suspect these errors to be responsible for customers complaining about downloading mails twice, I'd like to fix them.

Our setup has multiple servers sharing the same NFS-mounted directory for maildirs and indexes; we have set mmap_disable = yes dotlock_use_excl = yes fsync_disable = no mail_nfs_index = yes lock_method = fcntl

As far as I can tell, for at least one of these errors, there was no concurrent access to the mailbox in question.

What can I do to solve this problem, or to analyze it further?

Ulrich

Show replies by date

Timo Sirainen

26 Jan 26 Jan

4:28 a.m.

On Tue, 2009-01-20 at 11:11 +0100, Ulrich Zehl wrote:

...

I have just upgraded to 1.1.8, hoping to fix these types of errors:

Jan 20 10:39:27 laura deliver(xxxxxx@example.net): Broken file /srv/storage/mail/store/net/example/xx/xxxxxx/Maildir/dovecot-uidlist line 156: Invalid data: Jan 20 10:41:19 laura dovecot: POP3(yy.yy@example.at): Broken file /srv/storage/mail/store/at/example/yy.yy/Maildir/dovecot-uidlist line 2: Invalid data:

Unfortunately, they still pop up. Since I suspect these errors to be responsible for customers complaining about downloading mails twice, I'd like to fix them.

Yes, probably the reason.

...

Our setup has multiple servers sharing the same NFS-mounted directory for maildirs and indexes; we have set mmap_disable = yes dotlock_use_excl = yes fsync_disable = no mail_nfs_index = yes lock_method = fcntl

What about mail_nfs_storage=yes?

...

As far as I can tell, for at least one of these errors, there was no concurrent access to the mailbox in question.

Are you also using Dovecot deliver?

Ulrich Zehl

3:29 p.m.

On Sun, Jan 25, 2009 at 09:28:27PM -0500, Timo Sirainen wrote:

...

On Tue, 2009-01-20 at 11:11 +0100, Ulrich Zehl wrote:

...
Our setup has multiple servers sharing the same NFS-mounted directory for maildirs and indexes; we have set mmap_disable = yes dotlock_use_excl = yes fsync_disable = no mail_nfs_index = yes lock_method = fcntl

What about mail_nfs_storage=yes?

Yes, that's set too. (Sorry, forgot about that.)

...

...
As far as I can tell, for at least one of these errors, there was no concurrent access to the mailbox in question.

Are you also using Dovecot deliver?

Yes, for almost all deliveries (see e.g. the first log entry in my original post). There's only one nightly script that delivers one message directly to some of the maildirs, but the errors also pop up on mailboxes that the script doesn't touch.

Timo Sirainen

5:02 p.m.

On Mon, 2009-01-26 at 14:29 +0100, Ulrich Zehl wrote:

...

On Sun, Jan 25, 2009 at 09:28:27PM -0500, Timo Sirainen wrote:

...
On Tue, 2009-01-20 at 11:11 +0100, Ulrich Zehl wrote:

...
Our setup has multiple servers sharing the same NFS-mounted directory for maildirs and indexes; we have set mmap_disable = yes dotlock_use_excl = yes fsync_disable = no mail_nfs_index = yes lock_method = fcntl

What about mail_nfs_storage=yes?

Yes, that's set too. (Sorry, forgot about that.)

Perhaps the NFS cache flushing doesn't work then for some reason. What OS (kernel) are you using on the Dovecot servers? How big values have you set to attribute cache?

...

...
...
As far as I can tell, for at least one of these errors, there was no concurrent access to the mailbox in question.

Are you also using Dovecot deliver?

Yes, for almost all deliveries (see e.g. the first log entry in my original post). There's only one nightly script that delivers one message directly to some of the maildirs, but the errors also pop up on mailboxes that the script doesn't touch.

I was just thinking that when you said there was no concurrent access did you also consider deliver processes and not just imap/pop3?

Ulrich Zehl

27 Jan 27 Jan

9:30 a.m.

On Mon, Jan 26, 2009 at 10:02:19AM -0500, Timo Sirainen wrote:

...

Perhaps the NFS cache flushing doesn't work then for some reason. What OS (kernel) are you using on the Dovecot servers? How big values have you set to attribute cache?

On the client side, it's Linux 2.6.23.16. All attribute cache related values are at their default, as far as I can tell. The entry in fstab reads:

nfs-server:/srv/storage /srv/storage nfs rw,nfsvers=3,hard,intr,nosuid,noexec,nodev,noatime 0 0

...

I was just thinking that when you said there was no concurrent access did you also consider deliver processes and not just imap/pop3?

I just checked again. See below for an example of the logs for one mailbox.

$ grep 123456@example.net mail.log Jan 27 08:00:17 allina dovecot: pop3-login: Login: user=<123456@example.net>, method=PLAIN, rip=80.x.x.x, lip=10.x.x.x Jan 27 08:00:19 allina dovecot: POP3(123456@example.net): Disconnected: Logged out top=0/0, retr=1/1283428, del=0/1, size=1283408 Jan 27 08:00:57 laura deliver(123456@example.net): msgid=<497EB077.000009.01164@ERNESTO>: saved mail to INBOX Jan 27 08:00:57 laura postfix-store/pipe[18734]: 753522531F: to=<123456@example.net>, relay=dovecot, delay=0.91, delays=0.28/0.06/0/0.57, dsn=2.0.0, status=sent (delivered via dovecot service) Jan 27 08:04:35 laura deliver(123456@example.net): msgid=<497EB05F.000005.01164@ERNESTO>: saved mail to INBOX Jan 27 08:04:35 laura postfix-store/pipe[19818]: 0ABB3251E0: to=<123456@example.net>, relay=dovecot, delay=0.42, delays=0.11/0.06/0/0.25, dsn=2.0.0, status=sent (delivered via dovecot service) Jan 27 08:15:16 ellia dovecot: pop3-login: Login: user=<123456@example.net>, method=PLAIN, rip=80.x.x.x, lip=10.x.x.x Jan 27 08:15:16 ellia dovecot: POP3(123456@example.net): Broken file /srv/storage/mail/store/net/example/12/123456/Maildir/dovecot-uidlist line 2: Invalid data:

Ulrich Zehl

16 Feb 16 Feb

10:12 a.m.

On Tue, Jan 27, 2009 at 08:30:14AM +0100, Ulrich Zehl wrote:

...

On Mon, Jan 26, 2009 at 10:02:19AM -0500, Timo Sirainen wrote:

...
Perhaps the NFS cache flushing doesn't work then for some reason. What OS (kernel) are you using on the Dovecot servers? How big values have you set to attribute cache?

On the client side, it's Linux 2.6.23.16. All attribute cache related values are at their default, as far as I can tell. The entry in fstab reads:

nfs-server:/srv/storage /srv/storage nfs rw,nfsvers=3,hard,intr,nosuid,noexec,nodev,noatime 0 0

As much as I hate replying to myself, I'm at a loss here. What can I do to debug this issue?

Could this even be a problem on the NFS server side?

Timo Sirainen

21 Feb 21 Feb

12:50 a.m.

On Tue, 2009-01-27 at 08:30 +0100, Ulrich Zehl wrote:

...

On Mon, Jan 26, 2009 at 10:02:19AM -0500, Timo Sirainen wrote:

...
Perhaps the NFS cache flushing doesn't work then for some reason. What OS (kernel) are you using on the Dovecot servers? How big values have you set to attribute cache?

On the client side, it's Linux 2.6.23.16. All attribute cache related values are at their default, as far as I can tell. The entry in fstab reads:

nfs-server:/srv/storage /srv/storage nfs rw,nfsvers=3,hard,intr,nosuid,noexec,nodev,noatime 0 0

Setting actimeo=0 probably fixes this, but also probably increases the load a lot. actimeo=1 might work ok and reduce how often these problems happen, but not eliminate them completely.

Dovecot's nfs settings should avoid this problem though. You could see if upgrading your kernel helps. Some kernels have somewhat broken NFS code.

...

...
I was just thinking that when you said there was no concurrent access did you also consider deliver processes and not just imap/pop3?

I just checked again. See below for an example of the logs for one mailbox.

$ grep 123456@example.net mail.log Jan 27 08:00:17 allina dovecot: pop3-login: Login: user=<123456@example.net>, method=PLAIN, rip=80.x.x.x, lip=10.x.x.x Jan 27 08:00:19 allina dovecot: POP3(123456@example.net): Disconnected: Logged out top=0/0, retr=1/1283428, del=0/1, size=1283408 Jan 27 08:00:57 laura deliver(123456@example.net): msgid=<497EB077.000009.01164@ERNESTO>: saved mail to INBOX

So allina modified dovecot-uidlist and soon afterwards laura probably was using a cached dovecot-uidlist and corrupted it.

Ulrich Zehl

23 Feb 23 Feb

5:27 p.m.

On Fri, Feb 20, 2009 at 05:50:28PM -0500, Timo Sirainen wrote:

...

On Tue, 2009-01-27 at 08:30 +0100, Ulrich Zehl wrote:

...
On Mon, Jan 26, 2009 at 10:02:19AM -0500, Timo Sirainen wrote:

...
Perhaps the NFS cache flushing doesn't work then for some reason. What OS (kernel) are you using on the Dovecot servers? How big values have you set to attribute cache?

On the client side, it's Linux 2.6.23.16. All attribute cache related values are at their default, as far as I can tell. The entry in fstab reads:

nfs-server:/srv/storage /srv/storage nfs rw,nfsvers=3,hard,intr,nosuid,noexec,nodev,noatime 0 0

Setting actimeo=0 probably fixes this, but also probably increases the load a lot. actimeo=1 might work ok and reduce how often these problems happen, but not eliminate them completely.

Dovecot's nfs settings should avoid this problem though. You could see if upgrading your kernel helps. Some kernels have somewhat broken NFS code.

I did

mount -o remount,actimeo=0 /srv/storage

around 9 this morning, but I'm still seeing these errors pop up. (I verified that there are mailboxes where this happened more than once since 10 today.)

Is it time to upgrade my kernel, or are there other options left?

...

...
$ grep 123456@example.net mail.log Jan 27 08:00:17 allina dovecot: pop3-login: Login: user=<123456@example.net>, method=PLAIN, rip=80.x.x.x, lip=10.x.x.x Jan 27 08:00:19 allina dovecot: POP3(123456@example.net): Disconnected: Logged out top=0/0, retr=1/1283428, del=0/1, size=1283408 Jan 27 08:00:57 laura deliver(123456@example.net): msgid=<497EB077.000009.01164@ERNESTO>: saved mail to INBOX

So allina modified dovecot-uidlist and soon afterwards laura probably was using a cached dovecot-uidlist and corrupted it.

Since the corrupted files are available for a little while (in the example, it was ~ 15 minutes), will it help if I repeatedly check all dovecot-uidlists and save those found to be corrupted to a special directory, so that we can see what the corruption actually is?

Timo Sirainen

7:10 p.m.

On Mon, 2009-02-23 at 16:27 +0100, Ulrich Zehl wrote:

...

...
...
On the client side, it's Linux 2.6.23.16. All attribute cache related values are at their default, as far as I can tell. The entry in fstab reads:

nfs-server:/srv/storage /srv/storage nfs rw,nfsvers=3,hard,intr,nosuid,noexec,nodev,noatime 0 0

Setting actimeo=0 probably fixes this, but also probably increases the load a lot. actimeo=1 might work ok and reduce how often these problems happen, but not eliminate them completely.

Dovecot's nfs settings should avoid this problem though. You could see if upgrading your kernel helps. Some kernels have somewhat broken NFS code.

I did

mount -o remount,actimeo=0 /srv/storage

To both servers?

...

...
So allina modified dovecot-uidlist and soon afterwards laura probably was using a cached dovecot-uidlist and corrupted it.

Since the corrupted files are available for a little while (in the example, it was ~ 15 minutes), will it help if I repeatedly check all dovecot-uidlists and save those found to be corrupted to a special directory, so that we can see what the corruption actually is?

I suppose looking at a couple of those could verify if it's really just NFS caching related corruption or something else.

Ulrich Zehl

24 Feb 24 Feb

9:08 a.m.

On Mon, Feb 23, 2009 at 12:10:29PM -0500, Timo Sirainen wrote:

...

On Mon, 2009-02-23 at 16:27 +0100, Ulrich Zehl wrote:

...
I did

mount -o remount,actimeo=0 /srv/storage

To both servers?

There's three Dovecot servers (NFS clients), actually, and I remounted /srv/storage on all of them.

...

...
...
So allina modified dovecot-uidlist and soon afterwards laura probably was using a cached dovecot-uidlist and corrupted it.

Since the corrupted files are available for a little while (in the example, it was ~ 15 minutes), will it help if I repeatedly check all dovecot-uidlists and save those found to be corrupted to a special directory, so that we can see what the corruption actually is?

I suppose looking at a couple of those could verify if it's really just NFS caching related corruption or something else.

Okay, I will post results when I have them.

Ulrich Zehl

12:59 p.m.

On Tue, Feb 24, 2009 at 08:08:52AM +0100, Ulrich Zehl wrote:

...

On Mon, Feb 23, 2009 at 12:10:29PM -0500, Timo Sirainen wrote:

...
I suppose looking at a couple of those could verify if it's really just NFS caching related corruption or something else.

Okay, I will post results when I have them.

I found a lot of files that hat null bytes in them, but no other problems (which could be due to the fact that my uidlist checking script isn't all that great).

I have put two corrupted dovecot-uidlist and the corresponding log files, and the check script up at http://www.topfen.net/user/ul/dovecot/. I hope that's enough data.

Timo Sirainen

7:33 p.m.

On Tue, 2009-02-24 at 11:59 +0100, Ulrich Zehl wrote:

...

On Tue, Feb 24, 2009 at 08:08:52AM +0100, Ulrich Zehl wrote:

...
On Mon, Feb 23, 2009 at 12:10:29PM -0500, Timo Sirainen wrote:

...
I suppose looking at a couple of those could verify if it's really just NFS caching related corruption or something else.

Okay, I will post results when I have them.

I found a lot of files that hat null bytes in them, but no other problems (which could be due to the fact that my uidlist checking script isn't all that great).

I have put two corrupted dovecot-uidlist and the corresponding log files, and the check script up at http://www.topfen.net/user/ul/dovecot/. I hope that's enough data.

OK. That just confirms file size caching issue:

Server A read and cached file's size
Server B recreated it as smaller file
Server A wanted to append new entry to the file, but used the cached file size so it wrote past the end of file, causing those NUL bytes in the middle.

6072

Age (days ago)

6107

Last active (days ago)

List overview

11 comments

3 participants

participants (3)

Timo Sirainen
Ulrich Zehl
Ulrich Zehl

[Dovecot] Broken dovecot-uidlist files

Ulrich Zehl

Ulrich Zehl

mount -o remount,actimeo=0 /srv/storage

mount -o remount,actimeo=0 /srv/storage

mount -o remount,actimeo=0 /srv/storage

tags

participants (3)