[Dovecot] How to get rid of locks
Although Dovecot is already read-lockless and it uses only short-lived
write locks, it'd be really nice to just get rid of the locking
completely. :)
I just figured out that O_APPEND is pretty great. If the operating
system updates seek position after writing to a file opened with
O_APPEND, writes to Dovecot's transaction log file can be made
lockless. I see that this works with Linux and Solaris, but not with
OS X. Could you BSD people try if it works there?
http://dovecot.org/tmp/append.c and see if it says "offset = 0" (bad) or
non-zero (yay).
The O_APPEND at least doesn't work with NFS, so it'll have to be
optional anyway.
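
For readers without access to append.c, here is a minimal sketch of the kind of check described above. It is not the original tester: the function name append_and_tell and the 5-byte "hello" record are assumptions, chosen so that the output resembles the "offset = 5, 10, 15" results reported in the replies.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not the real append.c): append `count` 5-byte
 * records to `path` through an O_APPEND descriptor and return the seek
 * position reported after the last write, or -1 on error. */
static off_t append_and_tell(const char *path, int count)
{
    assert(count > 0);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND, 0600);
    if (fd == -1)
        return -1;

    off_t offset = -1;
    for (int i = 0; i < count; i++) {
        if (write(fd, "hello", 5) != 5) {
            offset = -1;
            break;
        }
        /* A system that updates the seek position after an O_APPEND
         * write reports a growing offset here; one that does not
         * keeps reporting 0. */
        offset = lseek(fd, 0, SEEK_CUR);
        printf("offset = %lld\n", (long long)offset);
    }
    close(fd);
    return offset;
}
```

Calling append_and_tell("append.tmp", 3) on a local filesystem and looking at the printed offsets is enough to tell the two behaviours apart.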
Currently Dovecot always updates the dovecot.index file after it has made
any changes. This isn't really necessary, because the changes are
already in the transaction log, so the dovecot.index file can be read to
memory and the new changes applied on top of it from the transaction log
(this is pretty much how mmap_disable=yes works). So I'm going to
change this so that dovecot.index is updated only if a)
there are enough changes in the transaction log (e.g. 8kB or so) and b) it
can be write-locked without waiting.
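
The two conditions could be checked roughly as follows. This is only a sketch: it assumes fcntl() locking (Dovecot's lock method is configurable), and the name index_try_begin_update and the exact threshold constant are made up for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative threshold only; the message says "8kB or so". */
#define INDEX_REWRITE_THRESHOLD 8192

/* Hypothetical helper: returns true, with a write lock held on
 * index_fd, only if (a) enough unsynced data has accumulated in the
 * transaction log and (b) the lock was available without waiting. */
static bool index_try_begin_update(int index_fd, off_t unsynced_log_bytes)
{
    assert(index_fd >= 0);

    if (unsynced_log_bytes < INDEX_REWRITE_THRESHOLD)
        return false;

    struct flock fl = {0};
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0; /* 0 = lock the whole file */

    /* F_SETLK fails immediately instead of blocking when another
     * process holds a conflicting lock. */
    return fcntl(index_fd, F_SETLK, &fl) == 0;
}
```

When the function returns false the caller simply keeps going with the in-memory view built from the transaction log.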
Maildir then. It has this annoying problem that readdir() can skip
files if another process is rename()ing them, causing Dovecot to
think that the message was expunged. The only way I could avoid this
is by locking the maildir while synchronizing it. Today I noticed that
this doesn't happen with OS X. I'm not sure if I was just lucky or if
there really is something special implemented in it, because it
doesn't work anywhere else. I'm not sure if this is tied to HFS+, or
if it will work with ZFS also (Solaris+ZFS didn't work). So perhaps
the locking could be disabled while running with OS X.
More importantly, I figured out that it can also be avoided with
Linux+inotify. As long as the inotify event buffer doesn't overflow, the
full list of files can be read by combining the readdir() output and
files listed by inotify events. If the inotify buffer overflows
(highly unlikely), the operation can just be retried and it most
likely works the next time.
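
A rough, Linux-only sketch of the readdir()+inotify combination follows. The function and callback names are hypothetical, and the choice of IN_MOVED_TO | IN_CREATE as the event mask is an assumption about which events matter. A name can be reported twice (once by readdir(), once by an event), so the caller is expected to de-duplicate.

```c
#include <assert.h>
#include <dirent.h>
#include <sys/inotify.h>
#include <unistd.h>

typedef void name_callback(const char *name, void *context);

/* Example callback: just counts the reported names. */
static void count_name(const char *name, void *context)
{
    (void)name;
    ++*(int *)context;
}

/* Hypothetical helper. Returns 0 on success, 1 if the inotify queue
 * overflowed (the caller should retry the scan), -1 on error. */
static int scan_dir_with_inotify(const char *path, name_callback *cb,
                                 void *context)
{
    assert(path != NULL && cb != NULL);

    int ifd = inotify_init1(IN_NONBLOCK);
    if (ifd == -1)
        return -1;
    /* Start watching before reading the directory, so that a rename()
     * racing with readdir() still shows up as an event. */
    if (inotify_add_watch(ifd, path, IN_MOVED_TO | IN_CREATE) == -1) {
        close(ifd);
        return -1;
    }

    DIR *dir = opendir(path);
    if (dir == NULL) {
        close(ifd);
        return -1;
    }
    struct dirent *d;
    while ((d = readdir(dir)) != NULL) {
        if (d->d_name[0] != '.')
            cb(d->d_name, context);
    }
    closedir(dir);

    /* Drain the events queued during the scan; the non-blocking read()
     * fails with EAGAIN once the queue is empty, which ends the loop. */
    char buf[4096]
        __attribute__((aligned(__alignof__(struct inotify_event))));
    ssize_t len;
    int ret = 0;
    while ((len = read(ifd, buf, sizeof(buf))) > 0) {
        for (char *p = buf; p < buf + len; ) {
            struct inotify_event *ev = (struct inotify_event *)p;
            if (ev->mask & IN_Q_OVERFLOW)
                ret = 1;
            else if (ev->len > 0)
                cb(ev->name, context);
            p += sizeof(*ev) + ev->len;
        }
    }
    close(ifd);
    return ret;
}
```

A caller would loop while the function returns 1, which matches the "just retry" idea above.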
So with these changes in place, changing a message flag or expunging
a message would usually result in:
- lockless write() call to dovecot.index.log
- lockless read()ing (or looking into mmaped) dovecot.index.log to
see if there's some new data besides what we just wrote that needs to
be synchronized to maildir
- rename() or unlink() calls to maildir. If a call returns ENOENT,
the maildir needs to be readdir()ed with inotify enabled to find the
new filename.
Not a single lock in the operation, assuming that dovecot.index file
wasn't updated.
Assigning UIDs to newly delivered mails would require locking though.
dovecot-uidlist needs to be locked, and the UIDs need to be written
to dovecot.index.log file in the correct order, which can also be
done with dovecot-uidlist locking.
Actually a single write() to dovecot.index.log isn't enough. I think
there needs to be some kind of a flag written to the beginning of the
transaction which marks the transaction as truly finished. If the
flag isn't there, any reader knows to stop and wait until the flag is
set. So this means that the writer needs to:
- Do a single O_APPENDed write() call writing the whole transaction
- Get the current offset with lseek(fd, 0, SEEK_CUR) (this is what
the append.c tester checks)
- pwrite() the finished-flag to beginning of the transaction
Except at least with Linux pwrite() doesn't work if O_APPEND is
enabled: the data is appended to the end of the file regardless of the
given offset. There are two ways to work around this:
a) fcntl(disable O_APPEND) + pwrite() + fcntl(enable O_APPEND)
b) Keep two file descriptors open for the transaction log. First
with O_APPEND flag and second without. pwrite() to the second one.
a) is probably better because it doesn't waste file descriptors.
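
The writer steps with workaround a) might look something like this. It is a sketch under stated assumptions: the record layout (a single leading flag byte, 0 = unfinished, 1 = finished) and the name log_append_transaction are invented for illustration and are not Dovecot's actual transaction log format. writev() is used so that the flag byte and the payload still go out in one call.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: append one transaction to an O_APPEND
 * descriptor and then mark it finished. Returns 0 on success, -1 on
 * error (a short write is treated as an error here). */
static int log_append_transaction(int fd, const void *data, size_t size)
{
    assert(fd >= 0 && data != NULL);

    unsigned char flag = 0; /* written as "not finished yet" */
    struct iovec iov[2] = {
        { .iov_base = &flag, .iov_len = 1 },
        { .iov_base = (void *)data, .iov_len = size },
    };
    ssize_t total = (ssize_t)size + 1;

    /* 1. One appending write for the whole transaction. */
    if (writev(fd, iov, 2) != total)
        return -1;

    /* 2. The seek position now points just past our transaction
     *    (on systems where the O_APPEND test reports non-zero). */
    off_t end = lseek(fd, 0, SEEK_CUR);
    if (end == (off_t)-1)
        return -1;
    off_t start = end - total;

    /* 3. Workaround a): drop O_APPEND, pwrite() the flag, restore. */
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1 || fcntl(fd, F_SETFL, flags & ~O_APPEND) == -1)
        return -1;
    flag = 1;
    ssize_t n = pwrite(fd, &flag, 1, start);
    if (fcntl(fd, F_SETFL, flags) == -1 || n != 1)
        return -1;
    return 0;
}
```

A reader that finds a 0 flag byte at the start of a record would stop there and wait, as described above.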
On Apr 7, 2007, at 12:30, Timo Sirainen wrote:
Although Dovecot is already read-lockless and it uses only short-lived
write locks, it'd be really nice to just get rid of the locking
completely. :)
I just figured out that O_APPEND is pretty great. If the operating
system updates seek position after writing to a file opened with
O_APPEND, writes to Dovecot's transaction log file can be made
lockless. I see that this works with Linux and Solaris, but not
with OS X. Could you BSD people try if it works there?
http://dovecot.org/tmp/append.c and see if it says "offset = 0" (bad) or
non-zero (yay). The O_APPEND at least doesn't work with NFS, so
it'll have to be optional anyway.
I tested on Mac OS X. Works on Intel (offset = 5), but not PowerPC
(offset = 0)... Both are running 10.4.9.
Kernel versions:
Darwin Kernel Version 8.9.0: Thu Feb 22 20:54:07 PST 2007;
root:xnu-792.17.14~1/RELEASE_PPC
Darwin Kernel Version 8.9.1: Thu Feb 22 20:55:00 PST 2007;
root:xnu-792.18.15~1/RELEASE_I386
-jim
On Sat, Apr 07, 2007 at 10:30:25PM +0300, Timo Sirainen wrote:
I just figured out that O_APPEND is pretty great. If the operating
system updates seek position after writing to a file opened with
O_APPEND, writes to Dovecot's transaction log file can be made
lockless. I see that this works with Linux and Solaris, but not with
OS X. Could you BSD people try if it works there?
http://dovecot.org/tmp/append.c and see if it says "offset = 0" (bad) or
non-zero (yay).
FreeBSD 5.2: 5,10,15 etc, so yay
ancient BSD/OS: ditto
[my FreeBSD 6.2 system is unavailable at the moment, but I can't imagine
that they broke it there]
mm
Timo Sirainen wrote:
I see that this works with Linux and Solaris, but not with OS X. Could you BSD people try if it works there? http://dovecot.org/tmp/append.c and see if it says "offset = 0" (bad) or non-zero (yay).
OpenBSD 3.9 and 4.0 (i386) both give me 5, 10, 15 and so on, so yay!
/Johan
Timo Sirainen wrote on 7-4-2007 21:30:
Could you BSD people try if it works there? http://dovecot.org/tmp/append.c and see if it says "offset = 0" (bad) or non-zero (yay). The O_APPEND at least doesn't work with NFS, so it'll have to be optional anyway.
5.4-RELEASE-p6: yay
6.0-RELEASE: yay
Greets,
Nils
Hi Timo,
OS X. Could you BSD people try if it works there? http://dovecot.org/tmp/append.c and see if it says "offset = 0" (bad) or non-zero (yay).
FreeBSD 6.2: offset = 5
FreeBSD 4.10: offset = 5
FreeBSD 4.7: offset = 5
NetBSD 3.0.1: offset = 5
Cor
Timo Sirainen wrote:
Although Dovecot is already read-lockless and it uses only short-lived write locks, it'd be really nice to just get rid of the locking completely. :)
I just figured out that O_APPEND is pretty great. If the operating system updates seek position after writing to a file opened with O_APPEND, writes to Dovecot's transaction log file can be made lockless. I see that this works with Linux and Solaris, but not with OS X. Could you BSD people try if it works there? http://dovecot.org/tmp/append.c and see if it says "offset = 0" (bad) or non-zero (yay). The O_APPEND at least doesn't work with NFS, so it'll have to be optional anyway.
Currently Dovecot always updates dovecot.index file after it has done any changes. This isn't really necessary, because the changes are already in transaction log, so the dovecot.index file can be read to memory and the new changes applied on top of it from transaction log (this is pretty much how mmap_disable=yes works). So I'm going to change this to work so that the dovecot.index is updated only if a) there are enough changes in transaction log (eg. 8kB or so) and b) it can be write-locked without waiting.
Maildir then. It has this annoying problem that readdir() can skip files if another process is rename()ing them, causing Dovecot to think that the message was expunged. The only way I could avoid this is by locking the maildir while synchronizing it. Today I noticed that this doesn't happen with OS X. I'm not sure if I was just lucky or if there really is something special implemented in it, because it doesn't work anywhere else. I'm not sure if this is tied to HFS+, or if it will work with ZFS also (Solaris+ZFS didn't work). So perhaps the locking could be disabled while running with OS X.
More importantly I figured out that it can also be avoided with Linux+inotify. As long as the inotify event buffer doesn't overflow, the full list of files can be read by combining the readdir() output and files listed by inotify events. If the inotify buffer overflows (highly unlikely), the operation can just be retried and it most likely works the next time.
So with these changes in place, changing a message flag or expunging a message would usually result in:
- lockless write() call to dovecot.index.log
- lockless read()ing (or looking into mmaped) dovecot.index.log to see if there's some new data besides what we just wrote that needs to be synchronized to maildir
- rename() or unlink() calls to maildir. If a call returns ENOENT, the maildir needs to be readdir()ed with inotify enabled to find the new filename.
Not a single lock in the operation, assuming that dovecot.index file wasn't updated.
Assigning UIDs to newly delivered mails would require locking though. dovecot-uidlist needs to be locked, and the UIDs need to be written to dovecot.index.log file in the correct order, which can also be done with dovecot-uidlist locking.
Actually a single write() to dovecot.index.log isn't enough. I think there needs to be some kind of a flag written to the beginning of the transaction which marks the transaction as truly finished. If the flag isn't there, any reader knows to stop and wait until the flag is set. So this means that the writer needs to:
- Do a single O_APPENDed write() call writing the whole transaction
- Get the current offset with lseek(fd, 0, SEEK_CUR) (this is what the append.c tester checks)
- pwrite() the finished-flag to beginning of the transaction
Except at least with Linux pwrite() doesn't work if O_APPEND is enabled. There are two ways to work around this:
a) fcntl(disable O_APPEND) + pwrite() + fcntl(enable O_APPEND)
b) Keep two file descriptors open for the transaction log. First with O_APPEND flag and second without. pwrite() to the second one.
a) is probably better because it doesn't waste file descriptors.
This is probably a scary thought, but . . . what would it take for the indexing part of Dovecot to be implemented via an API/plug-in model?
I'm curious about the effect of using an external SQL engine (my vote would be Firebird) for processing these, and using an open plug-in method would allow for that without binding Dovecot to a particular implementation.
-- Daniel
On 8.4.2007, at 10.29, Daniel L. Miller wrote:
This is probably a scary thought, but . . . what would it take for
the indexing part of Dovecot to be implemented via an API/plug-in
model? I'm curious about the effect of using an external SQL
engine (my vote would be Firebird) for processing these, and using
an open plug-in method would allow for that without binding Dovecot
to a particular implementation.
Well.. It would be possible to make the lib-index API completely
virtualized, but I don't think there's much point. The lib-index API
actually doesn't have all that much to do with reading/writing index
files. It's much more about easily manipulating mailbox metadata in
memory.
For example the way I was planning on implementing SQL mail storage
was to create an in-memory index and keep it updated by reading the
data from SQL. The same metadata is in SQL, but it still needs to be
stored into Dovecot's internal structures (== the indexes).
Then there is the dovecot.index.cache file, however. It's a pretty simple
database, so replacing it with SQL would make more sense. The cache
file API isn't virtualizable yet either, but I was planning on doing
that if I ever got around to making the SQL mail storage plugin
really usable.
Making an SQL cache replacement work would be a bit tricky, however.
Currently the lib-storage API works like:
- mail_alloc() is done first. It tells what fields it most likely
wants to fetch.
- mail_set_seq() can be used to switch to whatever message in mailbox
- mail_get_*() functions can be used to fetch the message data.
The simplest SQL implementation would just do a SQL query for each
mail_get_*() call, but this would also be the slowest implementation.
A bit better would be to use one SQL query to fetch all the data
specified by mail_alloc() in the first mail_get_*() function call.
Then if something extra is fetched that would generate extra SQL
queries.
However most of the time mail_set_seq() isn't used randomly. It's
mostly done only when building a reply for THREAD command. Usually
searching is used:
- mailbox_search_init() specifies search arguments. For FETCH
commands this is simple "sequences 1-10".
- mailbox_search_next() finds the next match and calls mail_set_seq()
for that mail
So hooking into these functions you could figure out in the _init()
that you want to do the SQL query for messages 1-10, and the first
call to _next() tells you the mail structure where you can get the
list of wanted fields. So IMAP command:
UID FETCH 1:5,10:20 (ENVELOPE BODY INTERNALDATE)
Could be done with a single SQL query, something like:
select envelope, body, internaldate from message_cache where uid
between 1 and 5 or uid between 10 and 20;
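
As an illustration of turning the UID set from _init() and the wanted fields from the first _next() into that one query, here is a small sketch. The struct uid_range type and the build_fetch_query function are hypothetical and not part of lib-storage; the message_cache table name is taken from the example query above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical representation of one range of an IMAP UID set. */
struct uid_range {
    unsigned int first, last;
};

/* Hypothetical helper: writes a single SELECT covering all ranges into
 * dest. Returns 0 on success, -1 if dest is too small. */
static int build_fetch_query(char *dest, size_t size, const char *fields,
                             const struct uid_range *ranges, size_t count)
{
    assert(count > 0); /* an empty UID set would produce invalid SQL */

    int n = snprintf(dest, size, "select %s from message_cache where",
                     fields);
    if (n < 0 || (size_t)n >= size)
        return -1;
    size_t used = (size_t)n;

    for (size_t i = 0; i < count; i++) {
        n = snprintf(dest + used, size - used, "%s uid between %u and %u",
                     i == 0 ? "" : " or", ranges[i].first, ranges[i].last);
        if (n < 0 || (size_t)n >= size - used)
            return -1;
        used += (size_t)n;
    }

    if (used + 2 > size) /* room for ';' and the terminating NUL */
        return -1;
    strcpy(dest + used, ";");
    return 0;
}
```

For the ranges {1,5} and {10,20} with the fields "envelope, body, internaldate" this produces the same query text as shown above.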
On Sat, 2007-04-07 at 22:30 +0300, Timo Sirainen wrote:
Although Dovecot is already read-lockless and it uses only short-lived
write locks, it'd be really nice to just get rid of the locking
completely. :)
I just figured out that O_APPEND is pretty great. If the operating
system updates seek position after writing to a file opened with
O_APPEND, writes to Dovecot's transaction log file can be made
lockless.
Does this mean there's even less chance of indexes working on NFS (where O_APPEND doesn't really work)?
That's a pity, as a lot of larger sites use NFS. They are already forced to use indexes on local disk - and Dovecot seems to rely on indexes more and more (fulltext index, shared mailboxes, etc).
And mailbox formats like dbox don't even work without an index.
I had hoped that the existing index code would be made more reliable and network-filesystem safe first.
Mike.
On 8.4.2007, at 12.41, Miquel van Smoorenburg wrote:
On Sat, 2007-04-07 at 22:30 +0300, Timo Sirainen wrote:
Although Dovecot is already read-lockless and it uses only short-lived write locks, it'd be really nice to just get rid of the locking completely. :)
I just figured out that O_APPEND is pretty great. If the operating system updates seek position after writing to a file opened with O_APPEND, writes to Dovecot's transaction log file can be made lockless.
Does this mean there's even less chance of indexes working on NFS
(where O_APPEND doesn't really work)?
No. I haven't forgotten NFS users. You missed this part:
The O_APPEND at least doesn't work with NFS, so it'll have to be
optional anyway.
I'm now trying to think of ways to simplify the index file handling.
That allows me to then implement NFS workarounds more easily, such as
forcing attribute cache flushing when it's needed.
On Sat, 2007-04-07 at 22:30 +0300, Timo Sirainen wrote:
I just figured out that O_APPEND is pretty great. If the operating
system updates seek position after writing to a file opened with
O_APPEND, writes to Dovecot's transaction log file can be made
lockless.
Well, almost. Log rotation isn't possible without some sort of locking. But the locks could still be reduced:
- Normally keep the .log file read-locked all the time (multiple processes can have it read-locked)
- Write to it with O_APPEND
- If you notice that the log is going to be rotated soon, drop the read lock and acquire it only for the duration of appends
- When the log needs to be rotated, try to get a write-lock. If it fails, try again later. If it succeeds, it's safe to rotate the log.
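
The lock usage in that scheme could be sketched with fcntl() locks as below. The helper names are hypothetical and fcntl() is an assumed lock method; note also that fcntl() locks are per process, so the rotation attempt only fails when a different process holds the read lock.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Apply a whole-file lock of the given type using the given command. */
static int log_set_lock(int fd, short type, int cmd)
{
    assert(fd >= 0);

    struct flock fl = {0};
    fl.l_type = type;
    fl.l_whence = SEEK_SET; /* l_start = 0, l_len = 0: whole file */
    return fcntl(fd, cmd, &fl);
}

/* Normal state: a shared lock that many processes can hold at once. */
static int log_read_lock(int fd)
{
    return log_set_lock(fd, F_RDLCK, F_SETLKW);
}

static int log_unlock(int fd)
{
    return log_set_lock(fd, F_UNLCK, F_SETLK);
}

/* Rotation attempt: returns 1 if the exclusive lock was obtained (safe
 * to rotate), 0 if another process still holds a lock (try again
 * later), -1 on any other error. Never waits. */
static int log_try_rotate_lock(int fd)
{
    if (log_set_lock(fd, F_WRLCK, F_SETLK) == 0)
        return 1;
    return (errno == EACCES || errno == EAGAIN) ? 0 : -1;
}
```

A process that notices rotation is near would call log_unlock() and take log_read_lock() only around each append, which gives the rotating process a chance to get its write lock.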
participants (8)
- Cor Bosman
- Daniel L. Miller
- Jim Maenpaa
- Johan Fredin
- Mark E. Mallett
- Miquel van Smoorenburg
- Nils Vogels
- Timo Sirainen