v1.1.0 is finally getting closer, so it's time to start planning what happens after it. Below is a list of some of the larger features I'm planning on implementing. I'm not yet sure in which order, so I think it's time to ask again what features companies would be willing to pay for?
Sponsoring Dovecot development gets you:
- listed in the Credits on www.dovecot.org
- listed in AUTHORS file
- you can tell me how you want to use the feature and I'll make sure that it's possible (within reasonable limits)
- allows you to change the order in which features are implemented (more or less)
As you can see below, some features are easier to implement when other features are implemented first. Several of them depend on the v2.0 master rewrite, so I'm thinking that maybe it's time to finally finish that first.
I know several companies have requested replication support, but since it kind of depends on several other features, it might be better to leave it for v2.1. But of course, sponsoring development of the features that replication depends on gets you replication support faster. :)
I'm still trying to go to school (and this year I started taking CS classes as well as biotech classes..), so it might take a while for some features to get implemented. In any case, I'm hoping for a v1.2 or v2.0 release before next summer.
So, the list..
Implemented, but slightly buggy so not included in v1.1 (yet):
- Mailbox list indexes
- THREAD REFERENCES indexes
- THREAD X-REFERENCES2 algorithm where threads are sorted by their latest message instead of the thread root message
Could be implemented for v1.2:
- Shared mailboxes and ACLs http://www.dovecot.org/list/dovecot/2007-April/021624.html
- Virtual mailboxes:
http://www.dovecot.org/list/dovecot/2007-May/022828.html
- Depends on mailbox list indexes
dbox plans, could be implemented for v1.2:
- Support for single instance attachment storage
- Global dbox storage, so copying messages from one mailbox to another involves only small metadata updates
- Support some kind of hashed directories for storing message files, instead of storing everything in a single directory.
- Finish support for storing multiple messages in a file
v2.0 master/config rewrite
It's mostly working already. One of the larger problems left is how to handle plugin settings. Currently Dovecot just passes everything in the plugin {} section to mail processes without doing any checks. I think plugin settings should be validated just like the main settings. Also, plugins should be able to create new sections instead of having to use plain key=value settings for everything.
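For illustration, a sectioned plugin configuration could look something like the sketch below. Note that the "plugin quota {}" and "rule" syntax here is purely hypothetical, something the rewrite could allow; it is not an actual Dovecot config format:

```
# today: everything inside plugin {} is an unchecked key=value
plugin {
  quota = maildir:User quota
}

# hypothetical sectioned style, with settings the plugin declares
# and the config process can validate:
plugin quota {
  backend = maildir
  rule inbox {
    limit = 100M
  }
}
```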
v2.0 currently runs a Perl script on Dovecot's source tree to find settings in different places and then creates a single all-settings.c for the config process. This obviously doesn't work for out-of-tree plugins, so there needs to be some other way for them to declare their allowed settings. There are two possibilities:
a) Read them from separate files. This is a bit annoying, because it's no longer possible to distribute simple plugin.c files and have them compiled into plugin.so files.
b) Read them from the plugin itself. This would work by loading the plugin and then reading the variable containing the configuration. I'm a bit worried about the security implications though, because a plugin could execute code while it's being loaded. But I guess a new unprivileged process could be created for doing this. Of course admins normally won't run any user-requested plugins, so this might not be a real issue. :)
Index file optimizations
Since these are not backwards compatible changes, the major version number should be increased whenever these get implemented. So I think these should be combined with v2.0 master/config rewrite. I'd rather not release v3.0 soon after v2.0 just for these..
- dovecot.index.log: Make it take less space! Get rid of the current "extension introduction" records, which are large and written over and over again. Compress UID ranges using methods similar to what Squat uses now. Try all kinds of other methods to reduce space usage.
Write transaction boundaries, so the reader can process entire transactions at a time. This makes some things easier, such as avoiding NFS data cache flushes and replication support, and also allows lockless writes using O_APPEND.
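As a rough illustration of the kind of compression that helps here, sorted UID ranges can be stored as deltas encoded with base-128 varints, so most numbers take a single byte. This is just a sketch of the technique; the function names are mine, not Dovecot's:

```c
#include <stdint.h>
#include <stddef.h>

/* Encode value as a base-128 varint (7 data bits per byte,
   high bit set on all but the last byte). Returns bytes written. */
static size_t varint_encode(uint32_t value, uint8_t *buf)
{
	size_t len = 0;

	while (value >= 0x80) {
		buf[len++] = (value & 0x7f) | 0x80;
		value >>= 7;
	}
	buf[len++] = value;
	return len;
}

/* Decode one varint from buf into *value_r. Returns bytes consumed. */
static size_t varint_decode(const uint8_t *buf, uint32_t *value_r)
{
	uint32_t value = 0;
	size_t len = 0;
	unsigned int shift = 0;

	do {
		value |= (uint32_t)(buf[len] & 0x7f) << shift;
		shift += 7;
	} while ((buf[len++] & 0x80) != 0);
	*value_r = value;
	return len;
}

/* Encode sorted (start, end) UID range pairs as deltas: the gap from
   the previous range's end, then the range's length. Small gaps and
   short ranges stay one byte each. Returns total bytes written. */
static size_t uid_ranges_encode(const uint32_t *ranges, unsigned int count,
				uint8_t *buf)
{
	uint32_t prev_end = 0;
	size_t pos = 0;
	unsigned int i;

	for (i = 0; i < count; i++) {
		uint32_t start = ranges[i*2], end = ranges[i*2 + 1];

		pos += varint_encode(start - prev_end, buf + pos);
		pos += varint_encode(end - start, buf + pos);
		prev_end = end;
	}
	return pos;
}
```

For example, the ranges 1:5 and 100:1000 encode into just 5 bytes, versus 16 bytes for four raw uint32 values.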
- dovecot.index: Instead of keeping a single message record array, split it into two: a compressed list of message UIDs and an array containing the rest of the data. This makes the file smaller, and keeping UIDs separate allows better CPU cache (and maybe disk I/O) utilization for UID -> sequence lookups. At least that's the theory; I should benchmark this before committing to the design. :)
There could also be a separate expunged UIDs list. UIDs existing in there would be removed from the existing UIDs list, but they wouldn't affect calculating offsets to record data in the file. So while currently updating the dovecot.index file after expunges requires recreating the whole file, with this method it would be possible to just write 4 bytes to the expunged UIDs list. The space reserved for the expunged UIDs list would be pretty small though, so when it gets full the file would be recreated.
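With the UIDs in their own packed array, a UID -> sequence lookup becomes a binary search that touches only that array, which is what makes the CPU cache argument plausible. A minimal sketch (illustrative only, not the actual Dovecot index code):

```c
#include <stdint.h>

/* Look up the 1-based message sequence for the given UID in a sorted,
   packed UID array. Returns -1 if the UID doesn't exist (e.g. it was
   expunged). Only the small UID array is read, so it stays cache-hot. */
static int32_t uid_to_seq(const uint32_t *uids, uint32_t count, uint32_t uid)
{
	uint32_t lo = 0, hi = count;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;

		if (uids[mid] < uid)
			lo = mid + 1;
		else if (uids[mid] > uid)
			hi = mid;
		else
			return (int32_t)mid + 1; /* sequences are 1-based */
	}
	return -1;
}
```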
- dovecot.index.cache: Better disk I/O and CPU cache utilization by keeping data close together in the file when it's commonly accessed together. For example, message sizes and dates for all messages could be close to each other to optimize SORTing by message size/date. If clients use conflicting access patterns, the data could either be duplicated to optimize both patterns, or just left slower for the other pattern.
Deliver / LMTP server
Currently deliver parses dovecot.conf using its own parser. This has caused all kinds of problems. v2.0's master rewrite helps with this, because deliver can then just ask for the configuration from the config process the same way other processes do.
Another problem is that people who use multiple user IDs have had to make deliver setuid-root. I really hate this, but currently there's no better way. A better fix would be to use an LMTP server instead. There could be either a separate LMTP client, or deliver could support the LMTP protocol as well. I'm not sure which one is better.
How would LMTP server then get around the multiple user IDs issue? There are two possibilities:
a) Run as root and temporarily set the effective UID to the user whose mails we are currently delivering. This has the problem that if users have direct access to their mailboxes and index files, the mailbox handling code has to be bulletproof to avoid regular users gaining root privileges by making special modifications to index files -- possibly at the same time as the mail is being delivered, to exploit some race condition with mmapped index files..
b) Create a child process for each different UID and drop its privileges completely. If all users have different UIDs, this usually means forking a new process for each mail delivery, making the performance much worse than with method a). It would still be possible to delay destroying the child processes in case a new message comes in for the same UID. This would probably help mostly when UIDs are shared by multiple users, such as each domain having its own UID.
I think I'll implement both and let admin choose which method to use.
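The reuse idea in method b) boils down to a small cache of delivery children keyed by UID, so consecutive deliveries to the same UID skip the fork. A sketch of just that caching logic (the struct and function names are hypothetical, and the actual fork + setuid + privilege-drop step is only noted in comments):

```c
#include <sys/types.h>
#include <stddef.h>

/* Hypothetical cache of per-UID delivery child processes. A real
   implementation would fork a child, call setgid()/setuid() to drop
   privileges to the target UID, and reap idle children after a timeout. */

#define DELIVERY_CHILD_CACHE_SIZE 8

struct delivery_child {
	uid_t uid;
	pid_t pid;	/* 0 = unused slot */
};

static struct delivery_child child_cache[DELIVERY_CHILD_CACHE_SIZE];

/* Return the cached child pid for uid, or 0 if a new child
   must be forked for this delivery. */
static pid_t delivery_child_lookup(uid_t uid)
{
	unsigned int i;

	for (i = 0; i < DELIVERY_CHILD_CACHE_SIZE; i++) {
		if (child_cache[i].pid != 0 && child_cache[i].uid == uid)
			return child_cache[i].pid;
	}
	return 0;
}

/* Remember a newly forked child so later deliveries to the
   same UID can reuse it. */
static void delivery_child_add(uid_t uid, pid_t pid)
{
	unsigned int i;

	for (i = 0; i < DELIVERY_CHILD_CACHE_SIZE; i++) {
		if (child_cache[i].pid == 0) {
			child_cache[i].uid = uid;
			child_cache[i].pid = pid;
			return;
		}
	}
	/* cache full: real code would reap the oldest idle child here */
}
```

This helps most exactly in the shared-UID case described above: with one UID per domain, a burst of deliveries to one domain pays for a single fork.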
This same problem exists for several other things besides the deliver/LMTP server. For example, method a) is already implemented for the expire plugin's expire-tool in v1.1. So this code needs to be somewhat generic, so it won't have to be duplicated in multiple places.
All of this pretty much requires the new v2.0 master rewrite. Adding these to v1.x would be ugly.
Proxying
These could be implemented for v1.2.
Log in normally (no proxying) if destination IP is the server itself.
Support for per-namespace proxying:
namespace public {
  prefix = Public/
  location = proxy:public.example.org
}
There are two choices for this: dumb IMAP proxying, or handling it with a mail storage backend. Dumb IMAP proxying would just let the remote server whose mailbox is currently selected handle most of the commands and their replies. I'm just afraid that this would become problematic in the future.
For example if the proxying server wants to send the client an event, such as "new message was added to this other mailbox", it would have to be sure that the remote server isn't currently in the middle of sending something. And the only way to be sure would be to parse its input, which in turn would make the proxying more complex. Then there would be problems if the servers support different IMAP extensions. And maybe some extensions will require knowing the state of multiple mailboxes, which will make handling them even more complex if the mailboxes are on different servers.
Implementing this as a mail storage backend would mean that the proxying server parses all the IMAP commands internally and uses Dovecot's mail storage API to process them. The backend would then request data from the remote server when needed and cache it internally to avoid asking for the same data over and over again. Besides caching, this would mean that there are no problems with extensions, since Dovecot handles them exactly the same way as for other backends, requiring no additional code.
To get better performance, the storage backend would have to be smarter than most other Dovecot backends. For example, while the current backends handle searches by reading through all the messages, the proxy backend should just send the search request to the remote server. The same goes for SORT. And for THREAD, which the current storage API doesn't handle..
And how would the proxying server and the remote server communicate? Again, two possibilities: either the standard IMAP protocol, possibly with some optional Dovecot extensions to improve performance, or a bandwidth-efficient protocol. The main advantage of the IMAP protocol is that it could also be used for proxying to other servers. A bandwidth-efficient protocol could be based on the dovecot.index.log format, making it faster/easier to parse and generate.
Once we're that far with proxying, we're close to having..:
Replication
The plans haven't changed much from: http://www.dovecot.org/list/dovecot/2007-May/022792.html
The roadmap to complete replication support would be basically:
Implement proxying with mail storage API.
Implement asynchronous replication:
- A replication master process collects changes from other Dovecot processes via UNIX sockets and forwards them to slaves. Note that this requires that only Dovecot modifies mailboxes: mails must be delivered with Dovecot's deliver!
- If some slaves are down, save the changes to files in case they come back up. If they don't, and we reach some specific maximum log file size, also keep track of which mailboxes have changed, and once the slaves do come back, make sure those mailboxes are synced.
- If the replication master is down and there have been changes to mailboxes.. Well, I haven't really thought about this. Maybe refuse to make any changes to mailboxes? Maybe have mail processes write to some "mailboxes changed" log file directly? Probably not that big of a deal, since the replication master just shouldn't be down. :)
- Support for manual resync of all mailboxes, in case something was done outside Dovecot.
- Replication slave processes would read the incoming changes. I think this process should just make sure that the changes are read as fast as possible, so during sudden bursts of activity the slave itself, rather than the master, would save the changes to a log.
- Replication mailbox writer processes would receive changes from the main slave process via UNIX sockets. They'd be responsible for actually saving the data to mailboxes. If multiple user IDs are used, this has the same problems as LMTP server.
- Conflict resolution. If a mailbox writer process notices that the UID already exists in the mailbox, both the old and the new message must be given new UIDs. This involves telling the replication master about the problem, so both servers can figure out the new UIDs together. I haven't thought this through yet.
- Conflict resolution for message flags/keywords. The protocol would normally send flag changes incrementally, such as "add flag1, remove flag2". Besides that, there could be an 8-bit checksum of all the resulting flags/keywords. If the checksum doesn't match, request all the current flags. It would of course be possible to always send the current flags instead of incremental updates, but that would waste space when there are lots of keywords, and multi-master replication works better when the updates are incremental.
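The checksum idea above could look roughly like this: each side computes an 8-bit summary of the flags and keywords it ended up with after applying the incremental update, and a mismatch triggers a full flag resync. The mixing function here is an illustrative XOR/rotate, not any defined Dovecot wire format:

```c
#include <stdint.h>

/* Compute an 8-bit checksum over a 32-bit flags bitmask and a list of
   keyword indexes. Both servers run this after applying an incremental
   flag update; if the checksums disagree, the full flag state is
   requested instead of trusting the incremental stream. */
static uint8_t flags_checksum(uint32_t flags, const uint8_t *keywords,
			      unsigned int keyword_count)
{
	uint8_t sum = 0;
	unsigned int i;

	/* fold the four flag bytes together */
	for (i = 0; i < 4; i++)
		sum ^= (flags >> (i * 8)) & 0xff;
	/* rotate-and-xor each keyword index so order changes the sum */
	for (i = 0; i < keyword_count; i++)
		sum = (uint8_t)((sum << 1) | (sum >> 7)) ^ keywords[i];
	return sum;
}
```

One byte per update is cheap compared with always shipping the full keyword list, which is the space argument made above.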
- Synchronous replication:
- When saving messages, don't report success until at least one replication slave has acknowledged that it has also saved the message. This requires adding bidirectional communication with the replication master process. The message could be sent to slaves in smaller parts, so if the message is large, the wait at the end would basically consist of sending "finished saving UID 1234, please ack".
- Synchronous expunges also? Maybe optional. The main reason for this is that it wouldn't be nice to break IMAP state by having expunged messages come back.
- Multi-master replication:
Almost everything is already in place. The main problem here is IMAP UID allocation. UIDs must be allocated incrementally, so each mailbox must have a globally increasing UID counter. Allocation goes like this:
1. Send the entire message body, using a globally unique ID, to all servers. No need to wait for their acknowledgement.
2. Increase the global IMAP UID counter for this mailbox.
3. Send the global UID => IMAP UID mapping to all servers.
4. Wait for one of them to reply with "OK".
Because messages sent by different servers can become visible at different times, each server must wait until it has seen all the previous UIDs before notifying clients of a new UID. If the sending server crashes between steps 2 and 3, this never happens. So the replication processes must be able to handle that, and also handle requesting missing UIDs in case they had lost connections to other servers.
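The "wait until all previous UIDs are seen" rule can be modeled as tracking the lowest UID not yet received: a newly arrived UID is only exposed to clients once every lower UID has also arrived. A minimal single-mailbox sketch with invented names (a real implementation would also need the timeout/resync path for permanently missing UIDs):

```c
#include <stdint.h>
#include <stdbool.h>

#define UID_WINDOW 64

struct uid_tracker {
	uint32_t next_uid;	/* lowest UID not yet seen */
	bool seen[UID_WINDOW];	/* arrival flags for UIDs >= next_uid */
};

/* Record that uid has arrived from replication. Returns the highest
   UID that may now be shown to IMAP clients, i.e. the largest UID up
   to which there are no gaps. */
static uint32_t uid_tracker_add(struct uid_tracker *t, uint32_t uid)
{
	if (uid >= t->next_uid && uid < t->next_uid + UID_WINDOW)
		t->seen[uid % UID_WINDOW] = true;

	/* advance past every contiguously seen UID */
	while (t->seen[t->next_uid % UID_WINDOW]) {
		t->seen[t->next_uid % UID_WINDOW] = false;
		t->next_uid++;
	}
	return t->next_uid - 1;
}
```

So if UID 2 arrives before UID 1, nothing is announced; once UID 1 shows up, both become visible at the same time, preserving the same order on every server.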
I only just had the idea of global UID counters instead of global locks, so I haven't yet thought about how the counters would best be implemented. They should be easier to implement than global locks though.
Non-incremental changes are somewhat problematic and could actually require global locks. The base IMAP protocol supports replacing flags, so if there is no locking and conflicting flag replacement commands are sent to two servers, which one of those commands should win? Timestamps would probably be enough for flag changes, but they may not be enough for some IMAP extensions (CONDSTORE?).
If global locks are required, they would work by passing the lock from one server to another when requested. So as long as only a single server is modifying the mailbox, it would hold the global lock and there would be no need to keep requesting it for each operation.
- Automation:
- Automatically move/replicate users between servers when needed. This will probably be implemented separately long after 4.
The replication processes would be ugly to implement for v1.x master, so this pretty much depends on v2.0 master rewrite.