[Dovecot] dbox redesign

12 May 2007

      I don't think anyone uses dbox currently, so the whole format could
still be redesigned. So I was thinking about doing two major changes:

Rely on index files a lot more. The flags are already stored in index
files, so there's no need to waste I/O updating them to dbox files all
the time. They could still be updated (if indexes get deleted, the flags
aren't all gone), but less often.

Require fcntl() locking. Currently dbox uses dotlocks which is slow.

Cydir could be a good alternative also once index file code is made a
bit more robust. Perhaps I could implement single instance attachments
for cydir too..
Locking
The current dbox "index" file would be gone. It's pretty useless.
Replace it with a whole new index file. Or perhaps it should be called
"locks" file or something.
The locks file would contain records: <file id> <lock timestamp>
<status>. So something like:
<header>
1 4645a60f 0
2 00000000 N
3 00000000 D
That would mean that the first file is locked by some process, either
for appending or expunging. The 2nd and 3rd files aren't locked. New
messages can't be appended to 2nd file anymore. 3rd file is already
deleted and this record needs to be removed when rewriting the file.
Locking a dbox file for either appends or expunges is done like:

See if timestamp is zero

If not, see if it's older than .. let's say a day or so ..
a) Yes: Continue to 2.
b) No: Assume the file is locked

Do fcntl() byte range lock over the record line in the locks file.

If it failed, the record is locked

Write the timestamp.
Compare stat() and fstat() inodes to see if the file was rebuilt

If yes, reopen the file and goto 1

File is now locked. Do the append/expunge.
Write timestamp to zero.
Unlock the byte range.

If a file is locked, append will try another file and expunge will mark
the message as expunged instead of actually expunging it yet.
Note that the locks file is read without locking. This is safe because
data is never moved within the file, and it doesn't matter if the
timestamp isn't read correctly always. The timestamp check is only an
optimization. Actually I'm not sure if it would be better not to have
the timestamp at all.
Deleted records will stay in the file until the file is rebuilt. If a
deleted record is noticed in the file, the process tries to lock the
whole locks file. If it succeeds, it proceeds with writing the
non-deleted records to a temporary file and rename()ing it over the
locks file.
Appending

Find the first file in the locks file that has appendable=0

If no such file was found, go to "create a new file" logic as
described below

Lock the file record
Verify from the file's headers that this file can actually be
appended to

If messages have been expunged from a dbox file, it can't be safely
appended to anymore.
Other reasons include eg. configurable max. file size and daily
rotations

Write the mail
Lock locks file's header
Get the UID from "next uid" field and update it
Unlock the header
Unlock the file record
Update index file

Create a new file logic:

Create a temporary file
Write the messages there
Lock locks file's header (including fstat() / stat() rebuild check)
See what the latest file ID is in the file
rename() temp fail to msg.<latest file id+1>
Lock locks file for the range of the to-be-written record below
Write the new record to locks file
Go to step 6 in the original append logic

Syncing / expunging
If locks file's header's "next uid" doesn't match the one currently in
index file, the appending crashed between steps 6 and 9. Find the new
message(s) and append them to index file.
If the locks file is completely gone , rebuild it by going through all
the msg.* files in the directory.
If "expunge counter" (see below) doesn't match in locks file's header
vs. index file header, go through all the msg.* files to see if a
message exists in multiple files. If found, remove the duplicates.
Typically neither of the above happens, so the only thing to do here is
to write changes from index file to dbox files. This may mean flag
changes once in a while, but most importantly expunges will always be
synced.
Initially figure out what files require expunging. Try to create a
single lock range that includes all of them. If there are non-zero lock
timestamps in that range, create multiple ranges.
If a file couldn't be locked, the expunge is done by updating
expunge-flag in the file. This can be done without locking (see below
for flag updates).
If a file was successfully locked, the expunging is done by:

Update "expunge offset" in file header to the offset of the first
expunged message. If expunge offset is non-zero, the file is treated as
non-appendable. Also when rebuilding and finding the same message from
multiple files, this field is used to figure out which file should be
truncated.

Copy the rest of the non-expunged messages to a new temporary file
and add the file to locks file using the "Create a new file" logic
described for appends. Instead of updating "next uid" field, update
"expunge counter" field.

ftruncate() the old file.

Update "expunge counter" in index file.

Sync locking
If there is one big "sync lock", then flags can be updated in any file
with relying only on that lock. Some files may be locked, but they're
locked for appends so flag updates (and other record updates as well)
are safe.
It would be interesting to make it work without that sync lock though,
so that multiple processes could be syncing dbox at the same time. But
there probably isn't much point. If the sync is locked, there's no real
requirement to do the syncing at all then, it's just left to be done
later.
If multiple processes were syncing at the same time, the only problem
would come from flag updates. Either the files would have to be locked
for flag updates (with waiting locks, bad bad), or it would need to be
able to handle the situation when a flag was written just when another
process expunged messages from the file and truncated it. That could be
done by checking (one way or another) after the flag write if a
truncation had happened, and if so do the truncation again.
Preventing out of sync disconnects
Reading messages from dbox files requires no locking, so it's possible
that while one session is reading a message another session goes and
expunges the message by truncating the file. The only way to handle this
situation is to disconnect the client.
Also even if the message wasn't in the middle of being read, this kind
of a out-of-sync could happen if a session hasn't synced itself yet (no
NOOP/IDLE/etc. for a while) and tries to begin reading an expunged
message.
These could be avoided if the expunging process knew that there are
other processes that still know about the message, so it could just mark
the message as expunged instead of actually doing it.
I think this could be done with yet another file that tracks what
messages are expunged. The file would contain "dovecot.index.log file
sequence + offset" records. Whenever message(s) are expunged, a new seq
+offset is written containing the position in the log file for that
transaction.
All the processes accessing the mailbox would then read lock the file
from eof to infinity. The lock is updated whenever mailbox is synced
(and being careful not to have race conditions there).
So syncing process would then try if it can write lock the file from
"last real expunge" log position to end of file. If not, then using some
binary search find the position how far it can be locked. If it managed
to update the write lock position at least some, the messages up to that
far in the transaction log can be expunged safely.
One problem with this is that the file needs to be rotated once in a
while so that it won't grow infinitely. This can be done only if the
write lock can include the whole file (including infinity). And that can
be done only if there's a single process having the mailbox open, which
may never happen with shared mailboxes.
I think I'll leave this feature to be implemented later. It's extra
overhead that's probably rarely useful.
Single instance attachments
I'll change the dbox messages to have variable sized headers. The header
could then contain information such as "at offset 5000 insert 11241242
bytes from file AB123427852".
Fast copying
Would be nice if copying a message from one mailbox to another wouldn't
require actually reading+writing the whole message contents. But I can't
really figure out how to implement this without requiring that there is
only a single dbox storage which contains the mails for all the
mailboxes, and the mailboxes themselves are just Dovecot's index files
containing pointers to the dbox storage.
The problem with having everything in one storage is that if the index
files are broken, the messages can't be placed into correct mailboxes
anymore.
Although one possibility would be treat mailboxes a bit similarly than
keywords. So that when a message is copied to another mailbox, the
message in dbox file is updated to contain information that it exists in
such and such mailboxes. Hmm. Perhaps that would be good enough, yes.

[Dovecot] dbox redesign

Timo Sirainen

Locking

Appending

Syncing / expunging

Sync locking

Preventing out of sync disconnects

Single instance attachments

Fast copying