[Dovecot] Maildir unreliability

Timo Sirainen

25 Oct 2004

3:02 a.m.

Looks like maildir can't be used very reliably without quite a lot of locking. Writing and scanning the directory would have to be locked, but reading wouldn't (as long as the file hasn't been renamed which would require scanning to find it). So much for "no locks needed"..

The problem is that opendir()/readdir() may temporarily not return some files if there have been changes in the directory since the opendir(). That means Dovecot thinks a message is expunged, while in fact it really isn't, and the next scan would usually show it again.

Currently when that happens, Dovecot usually prints an error message about it and rebuilds indexes. Of course, in real life clients aren't often bombing the same mailbox with tons of changes in multiple connections, which is usually needed to trigger this.

I wrote a test program which tests this:

http://dovecot.org/tmp/readdir.c

I'd like to hear if you can run it on some system without errors. I tested Linux 2.4 and 2.6 with ext2, ext3, xfs and reiser3, Solaris 8/ufs and OpenBSD 3.5/sparc64. Only OpenBSD passed the test, but I'm not sure if it's only because the computer was so slow and didn't switch between processes hard enough. I'd be especially interested in FreeBSD and various NFS systems.

If it actually works properly in some systems, I guess I'll make the extra locking configurable.
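For readers following along, here is a minimal sketch of the kind of check that makes a message look expunged: compare the names from the previous scan with the names from the current one, ignoring the flag suffix. This is not Dovecot's code and not the readdir.c test program; the names `base_len`, `same_message` and `count_missing` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the part of a Maildir file name before the ":2,flags" suffix. */
size_t base_len(const char *name)
{
	const char *colon;

	assert(name != NULL);
	colon = strchr(name, ':');
	return colon != NULL ? (size_t)(colon - name) : strlen(name);
}

/* True if two file names refer to the same message, ignoring flags. */
int same_message(const char *a, const char *b)
{
	size_t la = base_len(a), lb = base_len(b);

	return la == lb && strncmp(a, b, la) == 0;
}

/* Count messages present in the previous scan that the current scan did
 * not return. If nobody expunged anything, a non-zero result means the
 * scan missed files that still exist. */
size_t count_missing(const char *const *prev, size_t nprev,
		     const char *const *cur, size_t ncur)
{
	size_t i, j, missing = 0;

	for (i = 0; i < nprev; i++) {
		int found = 0;

		for (j = 0; j < ncur && !found; j++)
			found = same_message(prev[i], cur[j]);
		if (!found)
			missing++;
	}
	return missing;
}
```

A scanner that trusts a single readdir() pass would treat every name counted here as expunged, even when the file is only being renamed at that moment.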

Gregory Bond

25 Oct

3:13 a.m.

Timo Sirainen wrote:

...

http://dovecot.org/tmp/readdir.c

I'd like to hear if you can run it on some system without errors. I tested Linux 2.4 and 2.6 with ext2, ext3, xfs and reiser3, Solaris 8/ufs and OpenBSD 3.5/sparc64. Only OpenBSD passed the test, but I'm not sure if it's only because the computer was so slow and didn't switch between processes hard enough. I'd be especially interested in FreeBSD and various NFS systems.

FreeBSD 4.10 stable:

To home directory NFS mounted from a NetApp: about 3000 error lines
To /tmp (ufs on ATA disk): 15 errors

Solaris 8/NFS to the NetApp:

470 errors

Peter Hessler

3:37 a.m.

On Mon, 25 Oct 2004 03:02:31 +0300 Timo Sirainen <tss@iki.fi> wrote:

:Looks like maildir can't be used very reliably without quite a lot of
:locking. Writing and scanning the directory would have to be locked,
:but reading wouldn't (as long as the file hasn't been renamed which
:would require scanning to find it). So much for "no locks needed"..
:
:The problem is that opendir()/readdir() may temporarily not return some
:files if there have been changes in the directory since the opendir().
:That means Dovecot thinks a message is expunged, while in fact it
:really isn't, and the next scan would usually show it again.
:
:Currently when that happens, Dovecot usually prints an error message
:about it and rebuilds indexes. Of course, in real life clients aren't
:often bombing the same mailbox with tons of changes in multiple
:connections, which is usually needed to trigger this.
:
:I wrote a test program which tests this:
:
:http://dovecot.org/tmp/readdir.c
:
:I'd like to hear if you can run it on some system without errors. I
:tested Linux 2.4 and 2.6 with ext2, ext3, xfs and reiser3, Solaris
:8/ufs and OpenBSD 3.5/sparc64. Only OpenBSD passed the test, but I'm
:not sure if it's only because the computer was so slow and didn't
:switch between processes hard enough. I'd be especially interested
:in FreeBSD and various NFS systems.
:
:If it actually works properly in some systems, I guess I'll make the
:extra locking configurable.
:

gir.theapt.org:phessler@/usr/home/phessler> ./readdir
28751: File re-appeared: -> 983:2,S (2)
4017: File re-appeared: -> 707:2,S (2)
4017: File re-appeared: -> 724:2, (2)
4017: File re-appeared: -> 800:2,S (2)
30179: File re-appeared: 429:2,S -> 429:2, (2)

This is on OpenBSD/macppc -current, the partition is mounted as: /dev/wd0h on /usr/home type ffs (local, noatime, nodev, nosuid, softdep)

-- These days the necessities of life cost you about three times what they used to, and half the time they aren't even fit to drink.

loo

10:51 a.m.

Hi,

All of them failed on local (ffs, softdep) and NFS (hosted on OpenBSD 3.6):

OpenBSD 3.6 i386
OpenBSD 3.6 amd64
OpenBSD 3.1 i386
Darwin 7.5.0 Power Macintosh powerppc
NetBSD 2.99.10 i386

Alex S Moore

5:14 p.m.

On Mon, 2004-10-25 at 03:02 +0300, Timo Sirainen wrote:

...

Looks like maildir can't be used very reliably without quite a lot of locking. Writing and scanning the directory would have to be locked, but reading wouldn't (as long as the file hasn't been renamed which would require scanning to find it). So much for "no locks needed"..

Here is the output from a Solaris 9 sparc box, which provides home directories and the imaps service for clients. Maildir is in the home directory. The maildir is mounted via nfs by mail clients. See attached.

I do not know what this means. Is there a problem?

Thanks, Alex

Miquel van Smoorenburg

5:43 p.m.

On 2004.10.25 02:02, Timo Sirainen wrote:

...

Looks like maildir can't be used very reliably without quite a lot of locking. Writing and scanning the directory would have to be locked, but reading wouldn't (as long as the file hasn't been renamed which would require scanning to find it). So much for "no locks needed"..

The problem is that opendir()/readdir() may temporarily not return some files if there have been changes in the directory since the opendir(). That means Dovecot thinks a message is expunged, while in fact it really isn't, and the next scan would usually show it again.

Currently when that happens, Dovecot usually prints an error message about it and rebuilds indexes. Of course, in real life clients aren't often bombing the same mailbox with tons of changes in multiple connections, which is usually needed to trigger this.

I wrote a test program which tests this:

http://dovecot.org/tmp/readdir.c

Well, there is a lockless way to do readdir, but it would mean buffering the entire readdir output in memory.

do {
	/* Wait until directory is quiescent */
	while (1) {
		time(&t);
		/* chmod() invalidates the NFS attribute cache */
		fstat(dirhandle, &st);
		fchmod(dirhandle, st.st_mode);
		fstat(dirhandle, &st);
		if (st.st_mtime < t)
			break;
		usleep(100000);
	}
	t = st.st_mtime;

	free_list(&files);
	while (readdir(dirhandle)) {
		add_list(&files, ...);
		.. build list in memory ..
	}

	fstat(dirhandle, &st);
	fchmod(dirhandle, st.st_mode);
	fstat(dirhandle, &st);
} while (st.st_mtime > t);

(insert sleep() or usleep() where necessary).

This basically just retries until you've done an entire readdir() on the directory without mtime changes.

At least under Linux, chmod invalidates the NFS attribute cache. On normal files, fcntl() locking should do the same, but I'm not sure you can lock directories, and not all servers do NFS locking, so chmod is probably the better choice. Testing would be needed, though.
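A compilable take on the same retry idea might look roughly like the sketch below. It is only an illustration under some assumptions: `scan_when_quiet`, `struct name_list` and the `max_tries` bound are invented names, it calls stat() on the path instead of fstat()/fchmod() on a handle, and it leaves out the chmod trick for the NFS attribute cache entirely. It also assumes whole-second mtime granularity, which is why it waits until the directory's mtime is strictly in the past before scanning.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

struct name_list {
	char **names;
	size_t count;
};

/* Release all names and reset the list to empty. */
void name_list_free(struct name_list *l)
{
	size_t i;

	for (i = 0; i < l->count; i++)
		free(l->names[i]);
	free(l->names);
	l->names = NULL;
	l->count = 0;
}

/* Append a copy of name; returns 0 on success, -1 on allocation failure. */
static int name_list_add(struct name_list *l, const char *name)
{
	char **grown = realloc(l->names, (l->count + 1) * sizeof(*grown));

	if (grown == NULL)
		return -1;
	l->names = grown;
	l->names[l->count] = strdup(name);
	if (l->names[l->count] == NULL)
		return -1;
	l->count++;
	return 0;
}

/* Scan path into out, retrying until one complete pass finishes without
 * the directory mtime changing. Gives up after max_tries attempts.
 * Returns 0 on success, -1 on error or when the directory never settles. */
int scan_when_quiet(const char *path, int max_tries, struct name_list *out)
{
	int attempt;

	assert(path != NULL && out != NULL);
	for (attempt = 0; attempt < max_tries; attempt++) {
		struct stat before, after;
		struct dirent *e;
		DIR *d;
		int ok = 1;

		if (stat(path, &before) < 0)
			return -1;
		/* A change in the current second would not move st_mtime,
		 * so wait until the last change is strictly in the past. */
		if (before.st_mtime >= time(NULL)) {
			sleep(1);
			continue;
		}
		d = opendir(path);
		if (d == NULL)
			return -1;
		name_list_free(out);
		while ((e = readdir(d)) != NULL) {
			if (e->d_name[0] == '.')
				continue;
			if (name_list_add(out, e->d_name) < 0) {
				ok = 0;
				break;
			}
		}
		closedir(d);
		if (!ok || stat(path, &after) < 0) {
			name_list_free(out);
			return -1;
		}
		if (after.st_mtime == before.st_mtime)
			return 0;
	}
	name_list_free(out);
	return -1;
}
```

The `max_tries` bound is there because on a busy mailbox the mtime may never stop changing, in which case the caller has to fall back to something else.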

Mike.

Timo Sirainen

5:58 p.m.

On 25.10.2004, at 17:43, Miquel van Smoorenburg wrote:

...

Well, there is a lockless way to do readdir, but it would mean buffering the entire readdir output in memory.

Dovecot needs to store them into memory in any case.

...

This basically just retries until you've done an entire readdir() on the directory without mtime changes.

Sure, but I don't really see that as an acceptable solution. The problem usually shows up only when the mailbox is getting read and modified a lot at the same time. The mtime might never stop changing.

In "normal use" it might work well, but so would locking (and locking would be guaranteed to work).

It could help with external changes which don't use Dovecot's locking, but.. Still sounds pretty ugly :)

Matthias Andree

7:01 p.m.

Timo Sirainen <tss@iki.fi> writes:

...

Looks like maildir can't be used very reliably without quite a lot of locking. Writing and scanning the directory would have to be locked, but reading wouldn't (as long as the file hasn't been renamed which would require scanning to find it). So much for "no locks needed"..

The problem is that opendir()/readdir() may temporarily not return some files if there have been changes in the directory since the opendir(). That means Dovecot thinks a message is expunged, while in fact it really isn't, and the next scan would usually show it again.

I'm not sure if the claims are about locking-free scanning (but I believe DJB of Bold Yet Hollow Announcements fame just touted "no locks"); one point is locking-free delivery because if opendir/readdir misses a _new_ file, no harm is done.

qmail is so full of bugs I effectively stopped maintaining my qmail-bugs page because I grew tired of researching bugs of a system I stopped using years ago and wackos refuting the bugs <http://home.pages.de/~mandree/qmail-bugs.html>. I only recently found out that qmail-pop3d doesn't get article sizes (in LIST) right. Shame on DJB for claiming efficiency and standards compliance when his nutshell is rather a shipwreck, and has been unmaintained for six years...

-- Matthias Andree

Matthias Andree

7:17 p.m.

Timo Sirainen <tss@iki.fi> writes:

...

I wrote a test program which tests this:

http://dovecot.org/tmp/readdir.c

Test results:

File re-appeared messages on SuSE Linux 9.1 x86 (Kernel 2.6.5 patched by SuSE) on these file systems (all local; fast machine, Athlon XP 2500+):

tmpfs
xfs
ext3
reiserfs

File re-appeared messages on Solaris 9 x86 (fast machine):

swap (/tmp)
ufs logging (/var/tmp)
nfs (from above Linux server)
nfs (from below FreeBSD server)

File re-appeared messages on FreeBSD 4.10-RELEASE-p3 x86 (slow machine, K6-2/300):

mfs (/tmp)
nfs (from above Linux server)
nfs (from above Solaris server)
ufs softupdates (/var/tmp)

So I'd think I have nothing that "works" for your application profile unfortunately.

-- Matthias Andree

Geo Carncross

11:11 p.m.

This is nonsense. The problem is that the behavior of readdir() is confusing.

Why should unlink() or rename() invalidate data that your C library ALREADY READ from the directory?

This is like saying "I fgetc()'d a byte, but now lseek() shows that my offset is 1024!" - it's silly.

Just put a:

  if (stat(d->d_name, &sb) == -1)continue;

After your check for the "." in the first character of the d->d_name (about line 41) and all will be good. No amount of twiddling with USE_UNLINK or FILES is going to affect it.

So you say, "I need to stat() each entry? That's going to create a large number of syscalls!"

Of course. For readdir() to be atomic, it would need to do a system call for each directory entry. This is exactly why readdir() doesn't, so that you do one syscall for every (say) 50 entries, and if you want validity, you'll do a stat() yourself.

Now: Maildir quite obviously wasn't designed with IMAP in mind. IMAP has some (largely ridiculous) requirements that Maildir simply doesn't make easy.

The largest problem (with Maildir) is this renaming of file identifiers and moving things in and out of cur/. It's only necessary so programs don't have to open() in order to read flags (after all, they JUST did a readdir())...

Since the names aren't going to change in cur/, you can get away with just doing a stat() in there [[ after all, you just rename()'d it into cur/ if you're working on new ]]

Unfortunately, cur/ is often bigger than new/.

On Sun, 2004-10-24 at 20:02, Timo Sirainen wrote:

...

Looks like maildir can't be used very reliably without quite a lot of locking. Writing and scanning the directory would have to be locked, but reading wouldn't (as long as the file hasn't been renamed which would require scanning to find it). So much for "no locks needed"..

The problem is that opendir()/readdir() may temporarily not return some files if there have been changes in the directory since the opendir(). That means Dovecot thinks a message is expunged, while in fact it really isn't, and the next scan would usually show it again.

Currently when that happens, Dovecot usually prints an error message about it and rebuilds indexes. Of course, in real life clients aren't often bombing the same mailbox with tons of changes in multiple connections, which is usually needed to trigger this.

I wrote a test program which tests this:

http://dovecot.org/tmp/readdir.c

I'd like to hear if you can run it on some system without errors. I tested Linux 2.4 and 2.6 with ext2, ext3, xfs and reiser3, Solaris 8/ufs and OpenBSD 3.5/sparc64. Only OpenBSD passed the test, but I'm not sure if it's only because the computer was so slow and didn't switch between processes hard enough. I'd be especially interested in FreeBSD and various NFS systems.

If it actually works properly in some systems, I guess I'll make the extra locking configurable.

Geo Carncross <geocar@internetconnection.net> Internet Connection Reliable Web Hosting http://www.internetconnection.net/

Timo Sirainen

11:32 p.m.

On 25.10.2004, at 23:11, Geo Carncross wrote:

...

This is nonsense. The problem is that the behavior of readdir() is confusing.

Why should unlink() or rename() invalidate data that your C library ALREADY READ from the directory?

Why do you think it was already read? It wasn't. That's the problem. An existing renamed file may never be returned by one opendir() .. readdir() .. closedir() loop.

...

  if (stat(d->d_name, &sb) == -1)continue;
After your check for the "." in the first character of the d->d_name (about line 41) and all will be good. No amount of twiddling with USE_UNLINK or FILES is going to affect it.

Right. Because the stat() always fails so the whole thing does nothing. If you actually do the correct check:

             sprintf(path, PATH"/%s", d->d_name);
             if (stat(path, &sb) < 0)
                     continue;

Then it's just as broken as before, but works more slowly.
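To spell out why the bare d_name check always fails: stat() resolves a relative name against the current working directory, not against the directory being scanned, so the name has to be joined with the directory path first. A small sketch of that, with `join_path` and `entry_exists` as illustrative names only:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Join dir and name into buf. Returns 0 on success, -1 if the result
 * would not fit into size bytes. */
int join_path(char *buf, size_t size, const char *dir, const char *name)
{
	int n;

	assert(buf != NULL && dir != NULL && name != NULL);
	n = snprintf(buf, size, "%s/%s", dir, name);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Returns 1 if dir/name exists at the moment of the call, 0 otherwise. */
int entry_exists(const char *dir, const char *name)
{
	char path[4096];
	struct stat sb;

	if (join_path(path, sizeof(path), dir, name) < 0)
		return 0;
	return stat(path, &sb) == 0;
}
```

This only filters out names that have already gone away by the time of the stat(). It does nothing for the case in this thread, where readdir() never returns a file that does exist.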

...

Of course. For readdir() to be atomic, it would need to do a system call for each directory entry. This is exactly why readdir() doesn't, so that you do one syscall for every (say) 50 entries, and if you want validity, you'll do a stat() yourself.

I don't have a problem with readdir() returning a file that doesn't exist anymore. I have a problem of readdir() not returning an existing file. The exact opposite.

...

Now: Maildir quite obviously wasn't designed with IMAP in mind. IMAP has some (largely ridiculous) requirements that Maildir simply doesn't make easy.

UIDs mostly.

...

The largest problem (with Maildir) is this renaming of file identifiers and moving things in and out of cur/. It's only necessary so programs don't have to open() in order to read flags (after all, they JUST did a readdir())...

Out of cur/? open() to read flags? I don't understand.

...

Since the names aren't going to change in cur/, you can get away with just doing a stat() in there [[ after all, you just rename()'d it into cur/ if you're working on new ]]

Unfortunately, cur/ is often bigger than new/.

Are you trying to say that files wouldn't be allowed to be renamed inside cur/ to change their flags?

Geo Carncross

26 Oct

2:35 a.m.

On Mon, 2004-10-25 at 16:32, Timo Sirainen wrote:

...

On 25.10.2004, at 23:11, Geo Carncross wrote:

...
This is nonsense. The problem is that the behavior of readdir() is confusing.

Why should unlink() or rename() invalidate data that your C library ALREADY READ from the directory?

Why do you think it was already read? It wasn't. That's the problem. An existing renamed file may never be returned by one opendir() .. readdir() .. closedir() loop.

Because strace says so. If you use getdents() directly, the problem magnifies significantly [see below].

...

...
  if (stat(d->d_name, &sb) == -1)continue;
After your check for the "." in the first character of the d->d_name (about line 41) and all will be good. No amount of twiddling with USE_UNLINK or FILES is going to affect it.
Right. Because the stat() always fails so the whole thing does nothing. If you actually do the correct check:
             sprintf(path, PATH"/%s", d->d_name);
             if (stat(path, &sb) < 0)
                     continue;
Then it's just as broken as before, but works more slowly.

Bah. I typed it correctly on my end :)

My problem here is I got lucky three times in a row, and misread your post.

...

...
Of course. For readdir() to be atomic, it would need to do a system call for each directory entry. This is exactly why readdir() doesn't, so that you do one syscall for every (say) 50 entries, and if you want validity, you'll do a stat() yourself.

I don't have a problem with readdir() returning a file that doesn't exist anymore. I have a problem of readdir() not returning an existing file. The exact opposite.

But it _doesn't_ exist. The opendir() gets a file descriptor- we don't get to the old-data yet, but the new-data isn't necessarily put ahead of our current offset (it isn't actually put anyplace reachable...).

I think I understand better what the problem is:

If you don't attempt to detect it internally (and instead just print the file number, with the :.* stripped off) you'll see that

./readdir | sort -bn | uniq -c | sort -nr

produces entries LESS THAN 11- which you're hoping it wouldn't.

The only ways the operating system could do this are:

1. be notified of name changes ([DI]notify - what courier does)
2. change the semantics of directories in the kernel such that NEW NAMES always appear at the end of the directory.

#2 would be awful hard, but #1 could be handled right here, although it wouldn't be portable [then, neither would #2, but read on...]

...

...
Now: Maildir quite obviously wasn't designed with IMAP in mind. IMAP has some (largely ridiculous) requirements that Maildir simply doesn't make easy.

UIDs mostly.

Agreed.

{{ although if UIDS were 64-bit, or better still- simply numeric, AND didn't have that always-incrementing rule, anything from an mbox file offset to an inode number would be satisfactory. }}

...

...
The largest problem (with Maildir) is this renaming of file identifiers and moving things in and out of cur/. It's only necessary so programs don't have to open() in order to read flags (after all, they JUST did a readdir())...

Out of cur/? open() to read flags? I don't understand.

Many flags can be answered with the contents of d_name. This fact may make programs like mailx and pop3d very simple, but it makes generating a mapping between Mark Crispin's UID and filenames very complex.
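As a concrete illustration of reading flags straight from d_name, here is a small sketch that looks for the ":2," marker and checks for a flag letter after it. The function names `maildir_flags` and `maildir_has_flag` are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Return a pointer to the flag letters after the ":2," marker of a
 * Maildir file name, or NULL if the name carries no such marker. */
const char *maildir_flags(const char *name)
{
	const char *info;

	assert(name != NULL);
	info = strrchr(name, ':');
	if (info == NULL || strncmp(info, ":2,", 3) != 0)
		return NULL;
	return info + 3;
}

/* True if the file name carries the given flag letter, e.g. 'S' for seen. */
bool maildir_has_flag(const char *name, char flag)
{
	const char *flags = maildir_flags(name);

	return flag != '\0' && flags != NULL && strchr(flags, flag) != NULL;
}
```

No open() is needed for this, which is the convenience being described. The cost is that every flag change is a rename() inside cur/.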

...

...
Since the names aren't going to change in cur/, you can get away with just doing a stat() in there [[ after all, you just rename()'d it into cur/ if you're working on new ]]

Unfortunately, cur/ is often bigger than new/.

Are you trying to say that files wouldn't be allowed to be renamed inside cur/ to change their flags?

No. I'm saying they don't have flags in new/. The real, legitimate problem can't happen in new/ - only in cur/ because messages aren't renamed within new/.

Although it wouldn't be compatible with djb-Maildir, it would certainly avoid this problem if rename() were never called in cur/....

[[ surely there are other places to store flags... ]]

-- Geo Carncross <geocar@internetconnection.net> Internet Connection Reliable Web Hosting http://www.internetconnection.net/

Matthias Andree

3:21 a.m.

New subject: [Dovecot] Re: Maildir unreliability

Timo Sirainen <tss@iki.fi> writes:

...

I don't have a problem with readdir() returning a file that doesn't exist anymore. I have a problem of readdir() not returning an existing file. The exact opposite.

What rename/move operation exactly is problematic? Files _should_ be travelling from new/ to cur/, not vice versa, although marking a mail as "new" in mutt for instance might cause the "reverse" move.

So the problem appears to be that readdir misses a file renamed from cur/ back into new/ behind our backs, no?

I haven't read your code yet, is a renamed file in the same directory hiding from readdir() the problem?

...

...
Unfortunately, cur/ is often bigger than new/.

Are you trying to say that files wouldn't be allowed to be renamed inside cur/ to change their flags?

Certainly not.

-- Matthias Andree

Geo Carncross

8 p.m.

New subject: [Dovecot] Re: Maildir unreliability

On Mon, 2004-10-25 at 20:21, Matthias Andree wrote:

...

Timo Sirainen <tss@iki.fi> writes:

...
I don't have a problem with readdir() returning a file that doesn't exist anymore. I have a problem of readdir() not returning an existing file. The exact opposite.

What rename/move operation exactly is problematic? Files _should_ be travelling from new/ to cur/, not vice versa, although marking a mail as "new" in mutt for instance might cause the "reverse" move.

So the problem appears to be that readdir misses a file renamed from cur/ back into new/ behind our backs, no?

No.

...

I haven't read your code yet, is a renamed file in the same directory hiding from readdir() the problem?

Yes.

P1: opendir(dir1)
P1: readdir()
P2: rename(dir1/f, dir1/f:2)
P1: readdir() <- never sees dir1/f _OR_ dir1/f:2

The problem only happens in cur/ because files aren't renamed *in* new/, but they are renamed *in* cur/ -- this doesn't have to do with renaming messages FROM new/ to cur/.

-- Geo Carncross <geocar@internetconnection.net> Internet Connection Reliable Web Hosting http://www.internetconnection.net/

Gregory Bond

27 Oct

2:19 a.m.

New subject: [Dovecot] Re: Maildir unreliability

Geo Carncross wrote:

...

P1: opendir(dir1)
P1: readdir()
P2: rename(dir1/f, dir1/f:2)
P1: readdir() <- never sees dir1/f _OR_ dir1/f:2

This is an artifact of the directory entry creation "find first fit" algorithm, and is likely to only be a problem on directories of > 1 block in size (so the readdir is not atomic), with lots of deletes and renames (so lots of free space at the front of the directory). Like, oh, for example, large active Maildir folders. And it would only happen when renaming to longer names (and hence moving the directory entry around); shortening the name or editing it would (usually!) happen in-place.

Fixing this in general is going to need serious kernel hacking, is likely to be very hard, and will have ugly performance or space tradeoffs.

Hmmm.... dunno if this is acceptable for Maildir, but can you pad the filenames somehow so that the renames never lengthen the filename?
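To make the padding idea concrete, a purely hypothetical sketch: reserve a fixed-width flag field and fill the unused positions, so a flag change never changes the name length. This is not the Maildir format. `FLAG_WIDTH`, `PAD_CHAR` and `padded_name` are inventions for illustration, and existing Maildir readers would not understand such names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define FLAG_WIDTH 6   /* hypothetical: room for the letters D, F, P, R, S, T */
#define PAD_CHAR '_'   /* hypothetical filler, not part of any Maildir spec */

/* Write "<base>:2,<flags><padding>" into buf so that the length of the
 * name depends only on base, never on how many flags are set.
 * Returns 0 on success, -1 if flags are too long or buf is too small. */
int padded_name(char *buf, size_t size, const char *base, const char *flags)
{
	size_t nflags, len;

	assert(buf != NULL && base != NULL && flags != NULL);
	nflags = strlen(flags);
	if (nflags > FLAG_WIDTH)
		return -1;
	len = strlen(base) + 3 + FLAG_WIDTH;
	if (len + 1 > size)
		return -1;
	snprintf(buf, size, "%s:2,%s", base, flags);
	/* Fill the rest of the fixed-width field and terminate. */
	memset(buf + strlen(base) + 3 + nflags, PAD_CHAR, FLAG_WIDTH - nflags);
	buf[len] = '\0';
	return 0;
}
```

With this, "429" with no flags and "429" with flags "RS" both come out twelve characters long, so a rename between them would not need a larger directory entry. Whether the filesystem then really updates the entry in place is still up to the filesystem.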

Matthias Andree

10:59 a.m.

New subject: [Dovecot] Re: Maildir unreliability

On Wed, 27 Oct 2004, Gregory Bond wrote:

...

Geo Carncross wrote:

...
P1: opendir(dir1)
P1: readdir()
P2: rename(dir1/f, dir1/f:2)
P1: readdir() <- never sees dir1/f _OR_ dir1/f:2
...

Fixing this in general is going to need serious kernel hacking, is likely to be very hard, and will have ugly performance or space tradeoffs.

Hmmm.... dunno if this is acceptable for Maildir, but can you pad the filenames somehow so that the renames never lengthen the filename?

This is not how Maildir works unfortunately.

-- Matthias Andree
