[Dovecot] Design: Asynchronous I/O for single/multi-dbox

Sun Mar 14 01:59:56 EET 2010

The long term plan is to get all of Dovecot disk I/O asynchronous. The
first step to that direction would be to make dbox/mdbox I/O
asynchronous. This might also allow mbox/maildir to become partially
asynchronous.

http://www.dovecot.org/list/dovecot/2009-December/045481.html already
started describing how the lib-storage API could be changed to support
high-latency storages. Adding support for all of the paralellism would
be nice, but probably not necessary as the first step. Then again, the
API changes are necessary, so the parallelism optimizations probably
aren't a huge task on top of that.

Besides the API changes described in the above link, another change
that's necessary is to add mail_get_nonblocking_stream(). The returned
stream can return EAGAIN whenever it would need to block. Whenever
sending message data to client, this stream would have to be used. It
would also be used internally by all of message parsing/searching code.
Luckily all of that already supports nonblocking input or can be easily
modified to support it.

Below are some thoughts how to go forward with this. I originally
thought about writing a more specific plan, but I think this is good
enough now to start coding. The target Dovecot version is v2.1, not
v2.0.

Filesystem API
==============

The idea is to first abstract out all POSIX filesystem accessing in
dbox/mdbox code. Non-blocking I/O works pretty nicely for socket I/O, so
I thought I'd use a similar API for disk I/O as well:

handle = open(path, mode)
 - this function can't fail. it's executed asynchronously.
 - mode=read-only: no writing to file
 - mode=append: appending to file is allowed

handle = create(path, mode, permissions)
 - this function can't fail. it's executed asynchronously.
 - mode=fail-if-exists: commit() fails if file already exists
 - mode=replace-if-exists: commit() replaces file if it already exists
 - permissions: i'm not entirely sure how this works yet. but it should
contain mode and gid in some format

set_input_callback(handle, callback, context)
set_output_callback(handle, callback, context)
 - call the callback when more data can be read/written

ret = pread(handle, buf, size, offset)
 - just like pread(), but can fail with EAGAIN if there are no bytes
already buffered. so the idea is that the backend implementation would
buffer/readahead data, which would be returned by this call. this would
require memcpy()ing all data, but it might get too complex/fragile if it
was asynchronously written to given buffer.

ret = write(handle, buf, size)
 - append data to given file and return how many bytes were actually
added to write buffer. works in a similar way than writing to socket.
data can only be appended to files, there is no support for overwriting
data.
 - no writes will be visible to reads until commit() is called

ret = commit(handle, [filename])
 - commit all previous writes to disk. either returns success/EAGAIN.
 - if filename is given and a new file is being created, the filename is
changed to the given one instead of using the original path's filename.
this is needed because e.g. mdbox saving can write many temp files in a
single transaction and only at commit stage it locks the index files and
knows what the filenames will be.

rollback(handle)
 - rollback all previous writes.

close(handle)
 - if file was created and not committed, the temp file will be deleted
 - does implicit rollback

ret = try_lock(handle)
 - this isn't an asynchronous operation! it assumes that locking state
is kept in memory, so that the operation will be fast. if backend
doesn't support locking or it's slow, single-dbox should be used
(instead of multi-dbox), because it doesn't need locking.
 - returns success or "already locked"
 - only exclusive locking is possible

Async IO streams
================

Async input streams are created with FS API handle, so it's possible to
start reading from them before the file has even been open()ed. The
callers must be aware of this and realize that read() can fail with
ENOENT, etc.

Async input streams' read() would work exactly as file_istream works for
non-blocking sockets: It would return data that is already buffered in
memory. If there's nothing, it returns EAGAIN. The FS API's
set_input_callback() can be used to set a callback function that is
called whenever there's more data available in the buffer.

Async output streams also work the same as non-blocking file_ostreams:
write() returns the number of bytes added to buffer. When buffer becomes
full, it starts returning EAGAIN. The ostream handles flushing
internally the same way as file_ostreams does, although instead of using
io_add(IO_WRITE) it uses FS API's set_output_callback(). If callers need
to know when more data can be written or when all of the data has been
written, it can override the ostream's flush_callback, just like with
file_ostreams.

Async IO for FS API backend
===========================

So now that all of the APIs have been designed, all that's needed to do
is to write a simple FS API implementation using kernel's async IO API,
right? Wrong. There is no usable async IO API in Linux, and probably
nowhere else either:

 * POSIX AIO isn't supported by Linux kernel. And even if it was, it
only supports async reads/writes, not async open().

 * Linux kernel has its own native AIO implementation! ..But it only
works with direct IO, which makes it pretty much useless for almost
everyone. There have been many different attempts to get buffered AIO
support to Linux, but all of them have failed.

So for now the only practical way is to implement it with threads. There
are several libraries that could make this easier.. But all of them
enable (and require) full thread safety for libc calls. I don't really
like that. Dovecot isn't using threads, I shouldn't pay the penalty of
using extra locking when it's really not necessary. So I was thinking
about doing the async IO in two ways:

1) For Linux/x86/x86-64 (and maybe others) implement a version that
creates threads with clone() and uses lockless queues for communicating
between the async io worker threads. The threads won't use malloc() or
any other unsafe calls, so this should be pretty nice.

2) Fallback version that uses pthreads with mutexes.

Is 2) going to be noticeably slower than 1)? Probably not.. In both
cases there is also the problem of how many worker threads to create.
I've really no idea, kernel would be so much better in deciding that.. I
guess that also depends on how many processes Dovecot is configured to
run on (and how busy they are at the moment). Maybe it could be just a
globally configurable number of threads/process, defaulting to something
like 10.

dbox changes
============

Most of the dbox code shouldn't be too difficult to change to using the
FS API. The one exception is writing messages. Currently the format
looks like:

<dbox header magic (2 bytes)>
<dbox message header, which contains message size><lf>
<message>
<dbox metadata magic (2 bytes)><lf>
<dbox metadata>
<lf>

The problem is that the message size isn't always known before the
message is written. So currently the code just pwrite()s the size
afterwards, but this is no longer possible with the new FS API. One
possibility would be to buffer the message in memory, but that could
waste a lot of memory since messages can be large.

So the only good solution is to change the dbox file format to contain
the message size after the message. At the same time the dbox format
could be cleaned up from old broken ideas as well. The reading code
could support the old version format as well, because reading isn't a
problem.

The new dbox format could look like:

<dbox header magic (2 bytes)>
<dbox unique separator (random 16 bytes or so)><lf>
<message>
<dbox metadata magic (2 bytes)>
<dbox unique separator, same as above><message size><lf>
<dbox metadata, last item being metadata size>
<lf>

The unique separator exists there primarily for fixing corrupted dbox
storage, so it can (quite) reliably recognize where a message ends.

Multi-dbox's indexes contain already the message offset + size of
(message+metadata), i.e. offset+size = offset of next message. The size
is primarily used to figure out if more data can be appended to the file
(so it won't grow too large). This size could just be changed to become
message's size and the check just changed to assume metadata size would
be e.g. 512 bytes. It won't really matter in practise. So then the
reading code could get the message's size from the index.

Single-dbox doesn't have such index. There are two ways to solve the
problem there (both actually would work for multi-dbox too):

1) Metadata block can be found by reading backwards from end of file.
For example read the last 512 bytes of the file and find the metadata
magic block. Get the message's size from there.

2) Create an istream that simply reads until it reaches dbox metadata
magic and unique separator. And to be absolutely sure it reached the
correct end of message, it can also check that the rest of the data in
the stream contains only a valid metadata block.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part
Url : http://dovecot.org/pipermail/dovecot/attachments/20100314/59002349/attachment-0001.bin