New subject: [Dovecot] Re: New mailbox format

23 Sep 2005 · _IF_

      I've been doing some thinking on a new better mailbox format that'd be
faster than mbox and maildir for most common uses. Comments welcome.
It won't come to Dovecot v1.0, but I'll probably create another branch
to Dovecot CVS where I'll start implementing it and some other possibly
destabilizing changes.
Generic
The point is to have a mailbox format where the mailbox can consist of one
or more files. Grouping multiple files in a single file makes it faster to
read, but it's slower to expunge mails from the beginning of the file. So
this format would allow sysadmin to specify rules on how large the files
would be allowed to grow.
This format is mostly designed to work nicely with separate index files
such as Dovecot has. The flags are stored in the mailbox files mostly as a
backup in case the index gets lost or corrupted.
Index File
Named "index".
Index file gives a mapping between UIDs <-> filenames:
Header: <uidvalidity> <last-used-uid>
Record: <uid-first> <uid-last> <filename>
Record: ..
eg.:
1127468129 5
1 3 msg.1
4 4 msg.4
All of the UIDs don't have to actually exist in the file if they have been
expunged. The record should be removed once all of the UIDs have been removed
from it (and the file itself been deleted).
The same file may appear multiple times in the index. For example:
1 2 msg.1
3 3 msg.3
4 5 msg.1
This means that UIDs 1, 2, 4 and 5 are in file msg.1, but UID 3 is in file
msg.3.
The filename doesn't matter as long as it's unique, but probably wouldn't
hurt to include its first UID in the filename so humans can find mails more
easily. The filename however must not end with ".lock".
The file is modified by creating first exclusively owned "index.lock" file,
updating it and then rename()ing it over the index file. This lock file
shouldn't be held for a long time, so if it exists, and it doesn't get
modified at all within 30 seconds, it can be overwritten. Before
rename()ing over the index file, you should check with stat() that the
index.lock file is still the same as you expect it to be (in case your
process was temporarily stopped for over 30 secs).
The index file's mtime timestamp must grow after each modification. This
makes it easy to check if new mail has been added to the mailbox.
Mail File
File header:
Bytes Format Description
8     hex    Header size
16    hex    Append offset
4     hex    Mail header size
4     hex    Mail header padding
4     hex    Keyword count in mail headers
Mail header:
Bytes Format Description
8     hex    UID
16    hex    Mail size
1     1/0    Answered flag
1     1/0    Flagged flag
1     1/0    Deleted flag
1     1/0    Seen flag
1     1/0    Draft flag
1     1/0    Expunged flag
n     1/0    Keywords
The file layout then goes:
<File header>

possible padding due to Header size
<Mail header #1>
possible padding due to Mail header size
<Mail data #1>
possible padding due to Mail header padding (ie. next Mail header's
offset is divisible with Mail header padding)
<Mail header #2>
etc..

The UIDs must be ascending within the file (but as said above, some UIDs
between them may exist in other files).
When the file is opened, it must be shared locked with flock(). Expunging
mails requires exclusive flock(), so expunges can't happen while someone is
still reading the file. Mails can however be marked with expunged-flag, and
such mails should be treated as if they are already expunged. Expunged flag
must not be removed. If the mail for some reason is wanted to be
un-expunged, it must be copied as a new mail with new UID to the mailbox.
Modifying flags doesn't require any locks. You'll simply write over those
flags you wish to change. That's why each flag takes one byte.
Safest way to expunge is to copy all non-expunged mails after the first
expunged one to new file. Then update the index file with the new location
for the UIDs, and at last ftruncate() at the position of first expunged
mail in the original file.
Alternative way to expunge is to just move the data over the expunged mails
and fruncate() at the end. No index file updates necessary. This is more
filesystem quota friendly, but if the expunging gets aborted in the middle
it will corrupt the mails.
Appending mails can be done by either locking existing file with a dotlock
and appending to it, or creating a new file. If appending to existing file,
also flock() it so expunging won't happen in the middle of append. If
appending to new file, create it with .lock suffix and then check that the
file without .lock wasn't already just created.
When appending to existing file, check "Append offset" from file's header.
If that offset is at the end of file, just append there. Otherwise read the
mail's header from there. If it has a non-zero UID and index file says that
the same UID exists in this file, jump over the mail and do the same checks
for all the rest of the mails in the file. Once the check doesn't match or
you reach EOF, update the append offset in the header to that offset and
write your mail there.
Initially write the UID in mail's header as zeroes ('0', not NUL). Once the
append is completely finished, you'll have to update it with the new UID.
You'll do this by locking index file, reading the latest last-used-uid and
using the next value for your UID. Once you've written it, rewrite the
index file with updated last-used-uid field and your UID added to it. After
this, update the append offset in the mail file and unlock it.
Keywords file
Named "index.keywords". This contains mappings between keyword numbers and
names. TODO: may be problematic with lockless flag writes..
Recent file
"index.recent.<uid>" file tells what the last non-recent UID was. The file
doesn't have any data inside it.
When opening the mailbox you should find and rename() the existing
index.recent.<uid> file to the new index.recent.<last-uid>. A successful
rename() means the mails between <uid>+1 .. <last-uid> are recent for you.
A failed rename() means that someone else managed to do it. You should look
for the file again and if it was renamed to less than <last-uid>, you
should try again.
Nightly cleanups
Because expunged mails may be left lying around for a while, there probably
should be some nightly check that cleans them up. Other cleanups that may be
necessary after crashes are:

.lock files
expunged data at the end of files (can be checked with same rules as how
appending determines the offset where to append)

One problem with nightly expunging is how to expunge mails if the file is
still locked by some process. Forced shutdowns may be annoying to users and
can be difficult to implement..
Alternative locking strategy
The locking strategy as described above is optimized for high concurrency:
Only if mail file is in the middle of being expunged, we're waiting on
locks. With appending we may be waiting for existing appends to finish if
we don't wish to create a new file for some reason. Other than those,
reading and flag updating are completely lockless.
The two downsides to this are:

flock() isn't available everywhere. It could however be replaced with
fcntl() if that's available. NFS may not have either.

Expunges can't happen while multiple processes are accessing the mailbox
at a same time. This may be a problem if the mailbox never has less than
two processes accessing it.

The alternative to flock()ing would be to simply handle unexpected
end-of-file reads as a sign of expunge, which would require re-reading
index file to see where the mails went (or if they were really expunged).
This has two problems however. One is that flag writes couldn't catch
expunges without locking, and another is that if you've already sent half a
mail to IMAP client, you have no other choice than to disconnect the client
if you can't send the rest of the mail.
It might be worth it optionally. Especially for NFS people. Then you'd
basically have the dotlock file that locks appends, expunges and flag
updates. Reading would still be lockless.
Also for the best performance the mail flags could be stored only in
Dovecot's index files. In that case there's no need to update flags and the
only downside with this locking strategy would be the potential
disconnections, which probably aren't all that bad.
Compatibility
If needed, it would be possible to create new/ and tmp/ directories under
the mailbox to allow mail deliveries to be in maildir format. The maildir
files would be then either just moved or maybe be appended to existing mail
files.

[Dovecot] New mailbox format

Generic

Index File

Mail File

Keywords file

Recent file

Nightly cleanups

Alternative locking strategy

Compatibility

John Peacock

John Peacock

Brian Bartholomew

Geo Carncross

Geo Carncross

Tomi Hakala

Quest

As I understand, it is built on GoogleFS and Google's map/reduce algorithm for eliminating duplicated storage. There are also some interesting FOSS hashing implementations optimized for things that smells like message headers.

dean gaudet

Brian Bartholomew

Tomi Hakala

Jakob Hirsch

Dan Hollis

Kenneth Porter

Dan Hollis

Tom Metro

dean gaudet

Compatibility

dean gaudet

Todd Vierling

dean gaudet

Marc Perkel

dean gaudet

Marc Perkel

dean gaudet

Marc Perkel

Dan Hollis

dean gaudet

Dan Hollis

Cor Bosman

dean gaudet

Brian Bartholomew

Dan Hollis

Tom Metro

Curtis Maloney

Todd Vierling

nodata

Todd Vierling

Jack Bailey

dean gaudet

Dan Hollis

Jack Bailey

Cor Bosman

Tom Metro

Index File

Dan Hollis

Tom Metro

Index File

Oliver Schulze L.

Generic

Index File

Mail File

Keywords file

Recent file

Nightly cleanups

Alternative locking strategy

Compatibility

tags

participants (19)