I don't think anyone uses dbox currently, so the whole format could still be redesigned. So I was thinking about doing two major changes:
Rely on index files a lot more. The flags are already stored in index files, so there's no need to waste I/O updating them to dbox files all the time. They could still be updated (if indexes get deleted, the flags aren't all gone), but less often.
Require fcntl() locking. Currently dbox uses dotlocks which is slow.
Cydir could be a good alternative also once index file code is made a bit more robust. Perhaps I could implement single instance attachments for cydir too..
Locking
The current dbox "index" file would be gone. It's pretty useless. Replace it with a whole new index file. Or perhaps it should be called "locks" file or something.
The locks file would contain records: <file id> <lock timestamp> <status>. So something like:
<header> 1 4645a60f 0 2 00000000 N 3 00000000 D
That would mean that the first file is locked by some process, either for appending or expunging. The 2nd and 3rd files aren't locked. New messages can't be appended to 2nd file anymore. 3rd file is already deleted and this record needs to be removed when rewriting the file.
Locking a dbox file for either appends or expunges is done like:
- See if timestamp is zero
- If not, see if it's older than .. let's say a day or so .. a) Yes: Continue to 2. b) No: Assume the file is locked
- Do fcntl() byte range lock over the record line in the locks file.
- If it failed, the record is locked
- Write the timestamp.
- Compare stat() and fstat() inodes to see if the file was rebuilt
- If yes, reopen the file and goto 1
- File is now locked. Do the append/expunge.
- Write timestamp to zero.
- Unlock the byte range.
If a file is locked, append will try another file and expunge will mark the message as expunged instead of actually expunging it yet.
Note that the locks file is read without locking. This is safe because data is never moved within the file, and it doesn't matter if the timestamp isn't read correctly always. The timestamp check is only an optimization. Actually I'm not sure if it would be better not to have the timestamp at all.
Deleted records will stay in the file until the file is rebuilt. If a deleted record is noticed in the file, the process tries to lock the whole locks file. If it succeeds, it proceeds with writing the non-deleted records to a temporary file and rename()ing it over the locks file.
Appending
- Find the first file in the locks file that has appendable=0
- If no such file was found, go to "create a new file" logic as described below
- Lock the file record
- Verify from the file's headers that this file can actually be appended to
- If messages have been expunged from a dbox file, it can't be safely appended to anymore.
- Other reasons include eg. configurable max. file size and daily rotations
- Write the mail
- Lock locks file's header
- Get the UID from "next uid" field and update it
- Unlock the header
- Unlock the file record
- Update index file
Create a new file logic:
- Create a temporary file
- Write the messages there
- Lock locks file's header (including fstat() / stat() rebuild check)
- See what the latest file ID is in the file
- rename() temp fail to msg.<latest file id+1>
- Lock locks file for the range of the to-be-written record below
- Write the new record to locks file
- Go to step 6 in the original append logic
Syncing / expunging
If locks file's header's "next uid" doesn't match the one currently in index file, the appending crashed between steps 6 and 9. Find the new message(s) and append them to index file.
If the locks file is completely gone , rebuild it by going through all the msg.* files in the directory.
If "expunge counter" (see below) doesn't match in locks file's header vs. index file header, go through all the msg.* files to see if a message exists in multiple files. If found, remove the duplicates.
Typically neither of the above happens, so the only thing to do here is to write changes from index file to dbox files. This may mean flag changes once in a while, but most importantly expunges will always be synced.
Initially figure out what files require expunging. Try to create a single lock range that includes all of them. If there are non-zero lock timestamps in that range, create multiple ranges.
If a file couldn't be locked, the expunge is done by updating expunge-flag in the file. This can be done without locking (see below for flag updates).
If a file was successfully locked, the expunging is done by:
Update "expunge offset" in file header to the offset of the first expunged message. If expunge offset is non-zero, the file is treated as non-appendable. Also when rebuilding and finding the same message from multiple files, this field is used to figure out which file should be truncated.
Copy the rest of the non-expunged messages to a new temporary file and add the file to locks file using the "Create a new file" logic described for appends. Instead of updating "next uid" field, update "expunge counter" field.
ftruncate() the old file.
Update "expunge counter" in index file.
Sync locking
If there is one big "sync lock", then flags can be updated in any file with relying only on that lock. Some files may be locked, but they're locked for appends so flag updates (and other record updates as well) are safe.
It would be interesting to make it work without that sync lock though, so that multiple processes could be syncing dbox at the same time. But there probably isn't much point. If the sync is locked, there's no real requirement to do the syncing at all then, it's just left to be done later.
If multiple processes were syncing at the same time, the only problem would come from flag updates. Either the files would have to be locked for flag updates (with waiting locks, bad bad), or it would need to be able to handle the situation when a flag was written just when another process expunged messages from the file and truncated it. That could be done by checking (one way or another) after the flag write if a truncation had happened, and if so do the truncation again.
Preventing out of sync disconnects
Reading messages from dbox files requires no locking, so it's possible that while one session is reading a message another session goes and expunges the message by truncating the file. The only way to handle this situation is to disconnect the client.
Also even if the message wasn't in the middle of being read, this kind of a out-of-sync could happen if a session hasn't synced itself yet (no NOOP/IDLE/etc. for a while) and tries to begin reading an expunged message.
These could be avoided if the expunging process knew that there are other processes that still know about the message, so it could just mark the message as expunged instead of actually doing it.
I think this could be done with yet another file that tracks what messages are expunged. The file would contain "dovecot.index.log file sequence + offset" records. Whenever message(s) are expunged, a new seq +offset is written containing the position in the log file for that transaction.
All the processes accessing the mailbox would then read lock the file from eof to infinity. The lock is updated whenever mailbox is synced (and being careful not to have race conditions there).
So syncing process would then try if it can write lock the file from "last real expunge" log position to end of file. If not, then using some binary search find the position how far it can be locked. If it managed to update the write lock position at least some, the messages up to that far in the transaction log can be expunged safely.
One problem with this is that the file needs to be rotated once in a while so that it won't grow infinitely. This can be done only if the write lock can include the whole file (including infinity). And that can be done only if there's a single process having the mailbox open, which may never happen with shared mailboxes.
I think I'll leave this feature to be implemented later. It's extra overhead that's probably rarely useful.
Single instance attachments
I'll change the dbox messages to have variable sized headers. The header could then contain information such as "at offset 5000 insert 11241242 bytes from file AB123427852".
Fast copying
Would be nice if copying a message from one mailbox to another wouldn't require actually reading+writing the whole message contents. But I can't really figure out how to implement this without requiring that there is only a single dbox storage which contains the mails for all the mailboxes, and the mailboxes themselves are just Dovecot's index files containing pointers to the dbox storage.
The problem with having everything in one storage is that if the index files are broken, the messages can't be placed into correct mailboxes anymore.
Although one possibility would be treat mailboxes a bit similarly than keywords. So that when a message is copied to another mailbox, the message in dbox file is updated to contain information that it exists in such and such mailboxes. Hmm. Perhaps that would be good enough, yes.