[Dovecot] dbox redesign

12 Feb 2009

      This is about how to implement multiple msgs/file dbox format. The
current v1.1's one msg/file design would stay pretty much the same and
it would be compatible with this new design.
dbox directories with multiple msgs/file would be like:
~/dbox/storage/ has the actual mail data for all mailboxes
~/dbox/mailboxes/ has subdirectories containing mailboxes and their
indexes
Also since dbox supports already the single msg per file, those files
would be stored in the mailboxes/ directory. So the idea would be that
either you use multiple msgs per file using a global storage, or you use
single msg per file without a global storage (or it's also possible to
be in a mixed setup with some mails in storage/ and some in mailboxes/,
mainly to allow migration between those configurations).
The storage/ directory would have a new "map index" which is a regular
dovecot index (dovecot.index and dovecot.index.log). So the mailbox
index would point to mails using an intermediary "map UID". This way if
mails are moved to another file only the map index needs to be updated.
GUID would be a globally unique 128 bit ID for messages. So if map
indexes get corrupted for any reason it's possible to rebuild it by
finding the mails using GUIDs.
v1.1 dbox has this "dbox.index" file which I was originally planning on
using with multiple msgs/file. It had complex file range locking stuff.
Now I'm thinking that it's pretty much useless. The only reason for its
existence with the new design is for listing metadata for files
converted from Maildir.
Map index record would contain:

32 bit map UID
8 bit flags (MAIL_DELETED flag = message marked as expunged)
8 bit unused wasted space
16 bit refcount
32 bit file sequence
32 bit file offset
--> total 128 bits/msg

Mailbox index:

IMAP UID, flags, keywords, etc.
32 bit map UID
128 bit GUID

dbox file metadata:

128 bit GUID
size, vsize, received time, saved time, etc.
initial mailbox name (if all indexes get trashed, we can still figure
out at least one mailbox where to put the mail. copies would get lost
though.)
(- no map UID, no imap UID)

How to save a message with multiple msgs/file:

Find dbox file where to append to:
1.1. Look up the last message from map index
1.2. Is the file "too old"? (or doesn't exist at all)

Yes -> Create new dbox file
1.3. Is the file "too large"?
Yes -> Look at the previous file (one sequence less) and goto 1.2.
1.4. Try to lock the file.
Fail -> Look at prev file and goto 1.2.

Now we have a locked/new dbox file where we can write to. Because 1.4.
step only tries to lock the file, there's no waiting on locks. This also
means that if e.g. two processes are writing new messages rapidly they
may be appending actively to two different files. I don't think that's a
problem, better than waiting for locks.
2a) We're using an existing file and we need to find the append offset.
Since we found the file by finding the last msg in the file, we also
know the last message's offset. I wasn't really planning on saving the
message sizes in the index file, so to get the append offset I guess it
needs to do an extra read on the last msg's header to find the size and
skip over it. Hmm. Or would it be less disk I/O to store the size on the
index so it could be found directly? I'm not really sure..
In any case, after we find the append offset, check to see if it's at
EOF. If not, that means that either another process just saved a new
message there or a process crashed previously and left garbage lying
around. Refresh the map index to see if this file+offset exists in it.
If not, truncate the file and just continue writing there. If it exists,
figure out the new append offset and see again if the file limit would
be reached. If the file would become too large, unlock the file and goto
step 1.
2b) We're writing to a new file. No need to worry about anything in 2a)

Write the message and its metadata to dbox file (including generated
128 bit GUID).

Assign map UIDs for the written mails and write APPEND records to map
index's transaction log. The record would contain the map UID, file seq,
offset, refcount=1. The transaction is saved with a "weak" flag (wonder
if there's a better name for this) and its offset is remembered.

If we're creating a new dbox file, it's assigned the file seq and
rename()d to the final file name while the map index is locked.

Write APPEND record to mailbox index's transaction log with IMAP UID,
map UID and GUID (and flags, keywords, etc).

Write "commit offset=x" record where x is the offset remembered in
step 4. This marks the 4's weak transaction as being fully finished.

dbox file is unlocked (if we weren't creating a new file).

When reading the index and we see a weak transaction without a commit
record, call a resolve() function in dbox code. It finds the dbox file
in the weak transaction and tries to lock it. If it can't lock it, it
(probably) means that there's still a process that's going to finish the
save. If a locking succeeded, run a fixup code that goes through all
files that are referenced in weak transactions and makes sure all the
refcounts and such are valid. Then it writes commit records for each
such transaction.
Copying can be done by locking source dbox file, writing a weak
"refcount++" transaction and continuing from saving step 5.
The only step I can think of where there's a small problem is if data is
written to dbox file, then process dies, but no saving is attempted to
the same file later. The data is left there as garbage until maybe some
day later when a message is expunged from that file and cleanup code
notices that there's extra data. I thought about reordering steps so
that the records are written to map index first, but that's less
efficient when writing multiple messages, because each one would need to
be saved separately to the map index (where each save would be lock +
fstat (+ read) + write + unlock). This is probably a rare problem which
doesn't really matter..
Expunging:
The "expunge" and "cleanup to remove deleted space" are separate steps.
Preferably the cleanup would be done when the server is "less busy", but
I guess it could also be configured to happen either immediately or
during logout (or once an hour or so).

Write "want to expunge" record to mailbox's transaction log.
Write a weak "refcount--" to map index's transaction log.
Write "expunged" record to mailbox's transaction log.
Write "commit" for the weak refcount--.

(The two-stage want-to-expunge(1)/expunged(3) is already done for
maildir/mbox also.)
I was also thinking maybe this could be just 3 steps without a weak
transaction. Especially since without a fsync() between 2 and 3 there's
really no guarantee about the order in which 2 and 3 happen. And
fsync()s are kind of annoying. But then again if that isn't done and it
crashes between 2 and 3 the expunge is tried again later which now
decrements the refcount again. If the message's original refcount was 1
it would now be negative. That's not really a problem. More problematic
is if the message had been copied to another mailbox and refcount=2, now
the refcount would be 0 and the message in the other mailbox would get
unexpectedly lost. Not good.
Space deletion cleanup:

Check from map index header to see there are any refcount=0 mails.
Although having this step probably requires some annoying dbox-specific
code in index handling code. I guess I'll have to see later if there's a
non-horribly-ugly way to do this.

Go through all records in the map index. Get a list of files that
have refcount=0 records. Also make a copy of file seq+offsets and sort
them (so at step 3 we can easily find what mails exist).

For each file that has refcount=0 records:

3.1. Try to lock the file.
-> Fail: Skip to next file. Either someone's appending to it or
cleaning it up already.
3.2. Refresh map index. Add any newly added records to the sorted
records list (rare).
3.2. Find the first refcount=0 mail.
3.3. Copy all mails after that with refcount>0 to a new file.
3.4. Lock map index, assign new file seq for the new file, rename() it,
for each mail we just copied, using a weak transaction write the seq
+offset updates to map index's transaction log. Unlock map index.
3.5. Depending on if first refcount=0 mail was the first mail, either
unlink() the file or ftruncate() it.
3.6. Write "commit" record to finish the weak transaction.
That's a bit inefficient when cleaning up multiple dbox files, but I'm
not sure if there's a much better way unless the map index is locked for
the entire time. Since the copying might take a while, a long map index
lock is bad since it prevents messages from being saved.
Reading mails can be done without locking. But since a file can be
truncated unexpectedly, the code must be able to find the new file and
offset in that situation and continue reading from there. If the message
we were reading was just expunged+cleaned up, the client will have to be
just disconnected. Since the cleanup shouldn't normally happen
immediately after expunge, this should happen rarely/never.
Oh and about all these new temp files that are created. When dbox is
opened the dir is stat()ed and if its atime is older than 1 week
(currently hardcoded) it readdir()s through the dir to see if there are
any old temp* files and then unlinks them.
With the new design flags/keywords would never be written to actual dbox
files. This helps with backups and also avoids the horribleness of
having to increase metadata area. To avoid complete data loss due to
index corruption (which really shouldn't be happening nowadays), the
indexes are also once in a while backed up to e.g. dovecot.index.backup
file. If Dovecot detects missing/broken index, it uses this backup copy
to rebuild the index as well as it can.
I'm also wondering if it's better for each mailbox to have its separate
dovecot.index.cache file or if there should be one cache file for the
map index. The upside with one index is that if a mail is copied there's
no need to duplicate its cached data. The downside is that I'm worried
it would cause more disk I/O, since mailboxes are typically accessed
separately, so it might be better that the cache data for one mailbox is
stored together rather than all around one bigger cache file.

[Dovecot] dbox redesign

Timo Sirainen