[Dovecot] new mailbox format v2

Tue Sep 27 21:25:48 EEST 2005

Updated version. The previous one was a bit too complicated with flag
updates for pretty minimal benefits. Changes:

 - writing/reading mail flags can be disabled to get better performance
without it requiring any changes to how the files are normally handled
(so mixed more works fine).
 - writing flags now requires getting shared locks. so there's the
overhead of locking but flag writers don't have to wait for each others.
 - reading requires no locking anymore. the comment in the previous
version about reading only half a mail being a problem doesn't actually
matter, since if a client tries to read an expunged mail Dovecot already
disconnects the client immediately (since most clients can't handle such
situation well anyway).
 - no more problems with mailbox readers locking out expungers
 - note about fcntl() locking being able to provide even higher
concurrency
 - works just fine with plain dotlocks, so works fine with NFS

And looks like I forgot to explain the "mail header padding" before. I'm
not exactly sure if it's useful or not, but it doesn't hurt to have it
there if it's wanted. The idea was anyway that if you're writing more
than one flag at a time, having a 512 byte padding would make sure that
when writing the few bytes the OS will only have to read one sector from
disk (512 bytes in all todays OSes I believe?) to get the non-written
parts of the sector instead of possibly reading two sectors. So it might
help with trading disk reads for some wasted disk space.

Generic
-------

The point is to have a mailbox format where the mailbox can consist of one
or more files. Grouping multiple files in a single file makes it faster to
read, but it's slower to expunge mails from the beginning of the file. So
this format would allow sysadmin to specify rules on how large the files
would be allowed to grow.

Index File
----------

Named "index".

Index file gives a mapping between UIDs <-> filenames:

Header: <uidvalidity> <last-used-uid>
Record: <uid-first> <uid-last> <filename>
Record: ..

eg.:

  1127468129 5
  1 3 msg.1
  4 4 msg.4

All of the UIDs don't have to actually exist in the file if they have been
expunged. The record should be removed once all of the UIDs have been removed
from it (and the file itself been deleted).

The same file may appear multiple times in the index. For example:

  1 2 msg.1
  3 3 msg.3
  4 5 msg.1

This means that UIDs 1, 2, 4 and 5 are in file msg.1, but UID 3 is in file
msg.3.

The filename doesn't matter as long as it's unique, but probably wouldn't
hurt to include its first UID in the filename so humans can find mails more
easily. The filename however must not end with ".lock".

The file is modified by creating first exclusively owned "index.lock" file,
updating it and then rename()ing it over the index file. This lock file
shouldn't be held for a long time, so if it exists, and it doesn't get 
modified at all within 30 seconds, it can be overwritten. Before
rename()ing over the index file, you should check with stat() that the
index.lock file is still the same as you expect it to be (in case your
process was temporarily stopped for over 30 secs).

The index file's mtime timestamp must grow after each modification. This
makes it easy to check if new mail has been added to the mailbox.

Mail File
---------

File header:

Bytes Format Description
8     hex    Header size
16    hex    Append offset
4     hex    Mail header size
4     hex    Mail header padding 
4     hex    Keyword count in mail headers

Mail header:

Bytes Format Description
8     hex    UID
16    hex    Mail size
1     1/0    Answered flag
1     1/0    Flagged flag
1     1/0    Deleted flag
1     1/0    Seen flag
1     1/0    Draft flag
1     1/0    Expunged flag
n     1/0    Keywords

The file layout then goes:

<File header>
  - possible padding due to Header size
<Mail header #1>
  - possible padding due to Mail header size
<Mail data #1>
  - possible padding due to Mail header padding (ie. next Mail header's
    offset is divisible with Mail header padding)
<Mail header #2>
etc..

The UIDs must be ascending within the file (but as said above, some UIDs
between them may exist in other files).

If mail has been marked with expunged-flag, it should be treated as if it
was already expunged. The mail will really be expunged later in some
nightly run.

Reading the files requires no locking. If while reading you happen to hit
unexpected end of file, it means that mails were expunged from the file.
After that you'll look at the index file again to see which file the mail
went and continue from there on. If the mail isn't found, it means it got
expunged. 

Expunging happens by locking the file and then copying all non-expunged
mails after the first expunged one to new file. Then update the index file
with the new location for the UIDs, and at last ftruncate() at the position
of first expunged mail in the original file.

Flag updates require locking the file as well, otherwise we might try to
write past the end of file when another process is expunging the mails.
Shared locking can be used for flag update locking since each flag update
can be atomically changed by writing one byte.

Appending mails can be done by either locking existing file and appending
to it, or creating a new file. If appending to a new file, create it with
.lock suffix and then check that the file without .lock wasn't already just
created to prevent race conditions.

When appending to existing file, check "Append offset" from file's header.
If that offset is at the end of file, just append there. Otherwise read the
mail's header from there. If it has a non-zero UID and index file says that
the same UID exists in this file, jump over the mail and do the same checks
for all the rest of the mails in the file. Once the check doesn't match or
you reach EOF, update the append offset in the header to that offset and
write your mail there.

Initially write the UID in mail's header as zeroes ('0', not NUL). Once the
append is completely finished, you'll have to update it with the new UID.
You'll do this by locking index file, reading the latest last-used-uid and
using the next value for your UID. Once you've written it, rewrite the
index file with updated last-used-uid field and your UID added to it. After
this, update the append offset in the mail file and unlock it.

If fcntl() locking is used, locking only specific ranges can give more
concurrency. Flag updates only need to lock the part of the file they're
modifying. Expunges need to lock the part which it's removing. Appending
needs to lock from <end of file-1> to end of file (ie. l_len=0).

Keywords file
-------------

Named "index.keywords". This contains mappings between keyword numbers and
names. TODO: write how it's updated

Recent file
-----------

"index.recent.<uid>" file tells what the last non-recent UID was. The file
doesn't have any data inside it.

When opening the mailbox you should find and rename() the existing
index.recent.<uid> file to the new index.recent.<last-uid>. A successful
rename() means the mails between <uid>+1 .. <last-uid> are recent for you.
A failed rename() means that someone else managed to do it. You should look
for the file again and if it was renamed to less than <last-uid>, you
should try again.

Nightly cleanups
----------------

Because expunged mails may be left lying around for a while, there probably
should be some nightly check that cleans them up. Other cleanups that may be
necessary after crashes are:

 - .lock files
 - expunged data at the end of files (can be checked with same rules as how
   appending determines the offset where to append)

Alternative ways to use
-----------------------

Expunging mails isn't very filesystem quota friendly since it requires
creating another file to copy the expunged mails to. This could also be
done by just moving the data within the file. This however requires that
all file reads must also be shared-locked. Also it has the problem of
leaving the mailbox corrupted if the copying gets aborted in the middle.

For the best performance the mail flags could be stored only in Dovecot's
index files. In that case the flag update locking isn't slowing down
expunges (and if fcntl() locks can't be used, doesn't slow down appends
either). Also there's no need for keywords/recent files.

Compatibility
-------------

If needed, it would be possible to create new/ and tmp/ directories under
the mailbox to allow mail deliveries to be in maildir format. The maildir
files would be then either just moved or maybe be appended to existing mail
files.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : http://dovecot.org/pipermail/dovecot/attachments/20050927/e2df3910/attachment.pgp