Updated version. The previous one was a bit too complicated with flag updates for pretty minimal benefits. Changes:
- writing/reading mail flags can be disabled to get better performance without it requiring any changes to how the files are normally handled (so mixed more works fine).
- writing flags now requires getting shared locks. so there's the overhead of locking but flag writers don't have to wait for each others.
- reading requires no locking anymore. the comment in the previous version about reading only half a mail being a problem doesn't actually matter, since if a client tries to read an expunged mail Dovecot already disconnects the client immediately (since most clients can't handle such situation well anyway).
- no more problems with mailbox readers locking out expungers
- note about fcntl() locking being able to provide even higher concurrency
- works just fine with plain dotlocks, so works fine with NFS
And looks like I forgot to explain the "mail header padding" before. I'm not exactly sure if it's useful or not, but it doesn't hurt to have it there if it's wanted. The idea was anyway that if you're writing more than one flag at a time, having a 512 byte padding would make sure that when writing the few bytes the OS will only have to read one sector from disk (512 bytes in all todays OSes I believe?) to get the non-written parts of the sector instead of possibly reading two sectors. So it might help with trading disk reads for some wasted disk space.
Generic
The point is to have a mailbox format where the mailbox can consist of one or more files. Grouping multiple files in a single file makes it faster to read, but it's slower to expunge mails from the beginning of the file. So this format would allow sysadmin to specify rules on how large the files would be allowed to grow.
Index File
Named "index".
Index file gives a mapping between UIDs <-> filenames:
Header: <uidvalidity> <last-used-uid> Record: <uid-first> <uid-last> <filename> Record: ..
eg.:
1127468129 5 1 3 msg.1 4 4 msg.4
All of the UIDs don't have to actually exist in the file if they have been expunged. The record should be removed once all of the UIDs have been removed from it (and the file itself been deleted).
The same file may appear multiple times in the index. For example:
1 2 msg.1 3 3 msg.3 4 5 msg.1
This means that UIDs 1, 2, 4 and 5 are in file msg.1, but UID 3 is in file msg.3.
The filename doesn't matter as long as it's unique, but probably wouldn't hurt to include its first UID in the filename so humans can find mails more easily. The filename however must not end with ".lock".
The file is modified by creating first exclusively owned "index.lock" file, updating it and then rename()ing it over the index file. This lock file shouldn't be held for a long time, so if it exists, and it doesn't get modified at all within 30 seconds, it can be overwritten. Before rename()ing over the index file, you should check with stat() that the index.lock file is still the same as you expect it to be (in case your process was temporarily stopped for over 30 secs).
The index file's mtime timestamp must grow after each modification. This makes it easy to check if new mail has been added to the mailbox.
Mail File
File header:
Bytes Format Description 8 hex Header size 16 hex Append offset 4 hex Mail header size 4 hex Mail header padding 4 hex Keyword count in mail headers
Mail header:
Bytes Format Description 8 hex UID 16 hex Mail size 1 1/0 Answered flag 1 1/0 Flagged flag 1 1/0 Deleted flag 1 1/0 Seen flag 1 1/0 Draft flag 1 1/0 Expunged flag n 1/0 Keywords
The file layout then goes:
<File header> - possible padding due to Header size <Mail header #1> - possible padding due to Mail header size <Mail data #1> - possible padding due to Mail header padding (ie. next Mail header's offset is divisible with Mail header padding) <Mail header #2> etc..
The UIDs must be ascending within the file (but as said above, some UIDs between them may exist in other files).
If mail has been marked with expunged-flag, it should be treated as if it was already expunged. The mail will really be expunged later in some nightly run.
Reading the files requires no locking. If while reading you happen to hit unexpected end of file, it means that mails were expunged from the file. After that you'll look at the index file again to see which file the mail went and continue from there on. If the mail isn't found, it means it got expunged.
Expunging happens by locking the file and then copying all non-expunged mails after the first expunged one to new file. Then update the index file with the new location for the UIDs, and at last ftruncate() at the position of first expunged mail in the original file.
Flag updates require locking the file as well, otherwise we might try to write past the end of file when another process is expunging the mails. Shared locking can be used for flag update locking since each flag update can be atomically changed by writing one byte.
Appending mails can be done by either locking existing file and appending to it, or creating a new file. If appending to a new file, create it with .lock suffix and then check that the file without .lock wasn't already just created to prevent race conditions.
When appending to existing file, check "Append offset" from file's header. If that offset is at the end of file, just append there. Otherwise read the mail's header from there. If it has a non-zero UID and index file says that the same UID exists in this file, jump over the mail and do the same checks for all the rest of the mails in the file. Once the check doesn't match or you reach EOF, update the append offset in the header to that offset and write your mail there.
Initially write the UID in mail's header as zeroes ('0', not NUL). Once the append is completely finished, you'll have to update it with the new UID. You'll do this by locking index file, reading the latest last-used-uid and using the next value for your UID. Once you've written it, rewrite the index file with updated last-used-uid field and your UID added to it. After this, update the append offset in the mail file and unlock it.
If fcntl() locking is used, locking only specific ranges can give more concurrency. Flag updates only need to lock the part of the file they're modifying. Expunges need to lock the part which it's removing. Appending needs to lock from <end of file-1> to end of file (ie. l_len=0).
Keywords file
Named "index.keywords". This contains mappings between keyword numbers and names. TODO: write how it's updated
Recent file
"index.recent.<uid>" file tells what the last non-recent UID was. The file doesn't have any data inside it.
When opening the mailbox you should find and rename() the existing index.recent.<uid> file to the new index.recent.<last-uid>. A successful rename() means the mails between <uid>+1 .. <last-uid> are recent for you. A failed rename() means that someone else managed to do it. You should look for the file again and if it was renamed to less than <last-uid>, you should try again.
Nightly cleanups
Because expunged mails may be left lying around for a while, there probably should be some nightly check that cleans them up. Other cleanups that may be necessary after crashes are:
- .lock files
- expunged data at the end of files (can be checked with same rules as how appending determines the offset where to append)
Alternative ways to use
Expunging mails isn't very filesystem quota friendly since it requires creating another file to copy the expunged mails to. This could also be done by just moving the data within the file. This however requires that all file reads must also be shared-locked. Also it has the problem of leaving the mailbox corrupted if the copying gets aborted in the middle.
For the best performance the mail flags could be stored only in Dovecot's index files. In that case the flag update locking isn't slowing down expunges (and if fcntl() locks can't be used, doesn't slow down appends either). Also there's no need for keywords/recent files.
Compatibility
If needed, it would be possible to create new/ and tmp/ directories under the mailbox to allow mail deliveries to be in maildir format. The maildir files would be then either just moved or maybe be appended to existing mail files.