This is about how to implement multiple msgs/file dbox format. The current v1.1's one msg/file design would stay pretty much the same and it would be compatible with this new design.
dbox directories with multiple msgs/file would be like:
~/dbox/storage/ has the actual mail data for all mailboxes ~/dbox/mailboxes/ has subdirectories containing mailboxes and their indexes
Also since dbox supports already the single msg per file, those files would be stored in the mailboxes/ directory. So the idea would be that either you use multiple msgs per file using a global storage, or you use single msg per file without a global storage (or it's also possible to be in a mixed setup with some mails in storage/ and some in mailboxes/, mainly to allow migration between those configurations).
The storage/ directory would have a new "map index" which is a regular dovecot index (dovecot.index and dovecot.index.log). So the mailbox index would point to mails using an intermediary "map UID". This way if mails are moved to another file only the map index needs to be updated.
GUID would be a globally unique 128 bit ID for messages. So if map indexes get corrupted for any reason it's possible to rebuild it by finding the mails using GUIDs.
v1.1 dbox has this "dbox.index" file which I was originally planning on using with multiple msgs/file. It had complex file range locking stuff. Now I'm thinking that it's pretty much useless. The only reason for its existence with the new design is for listing metadata for files converted from Maildir.
Map index record would contain:
- 32 bit map UID
- 8 bit flags (MAIL_DELETED flag = message marked as expunged)
- 8 bit unused wasted space
- 16 bit refcount
- 32 bit file sequence
- 32 bit file offset --> total 128 bits/msg
Mailbox index:
- IMAP UID, flags, keywords, etc.
- 32 bit map UID
- 128 bit GUID
dbox file metadata: though.)
- 128 bit GUID
- size, vsize, received time, saved time, etc.
- initial mailbox name (if all indexes get trashed, we can still figure out at least one mailbox where to put the mail. copies would get lost
(- no map UID, no imap UID)
How to save a message with multiple msgs/file:
- Find dbox file where to append to: 1.1. Look up the last message from map index 1.2. Is the file "too old"? (or doesn't exist at all)
- Yes -> Create new dbox file 1.3. Is the file "too large"?
- Yes -> Look at the previous file (one sequence less) and goto 1.2. 1.4. Try to lock the file.
- Fail -> Look at prev file and goto 1.2.
Now we have a locked/new dbox file where we can write to. Because 1.4. step only tries to lock the file, there's no waiting on locks. This also means that if e.g. two processes are writing new messages rapidly they may be appending actively to two different files. I don't think that's a problem, better than waiting for locks.
2a) We're using an existing file and we need to find the append offset. Since we found the file by finding the last msg in the file, we also know the last message's offset. I wasn't really planning on saving the message sizes in the index file, so to get the append offset I guess it needs to do an extra read on the last msg's header to find the size and skip over it. Hmm. Or would it be less disk I/O to store the size on the index so it could be found directly? I'm not really sure..
In any case, after we find the append offset, check to see if it's at EOF. If not, that means that either another process just saved a new message there or a process crashed previously and left garbage lying around. Refresh the map index to see if this file+offset exists in it. If not, truncate the file and just continue writing there. If it exists, figure out the new append offset and see again if the file limit would be reached. If the file would become too large, unlock the file and goto step 1.
2b) We're writing to a new file. No need to worry about anything in 2a)
Write the message and its metadata to dbox file (including generated 128 bit GUID).
Assign map UIDs for the written mails and write APPEND records to map index's transaction log. The record would contain the map UID, file seq, offset, refcount=1. The transaction is saved with a "weak" flag (wonder if there's a better name for this) and its offset is remembered.
- If we're creating a new dbox file, it's assigned the file seq and rename()d to the final file name while the map index is locked.
Write APPEND record to mailbox index's transaction log with IMAP UID, map UID and GUID (and flags, keywords, etc).
Write "commit offset=x" record where x is the offset remembered in step 4. This marks the 4's weak transaction as being fully finished.
dbox file is unlocked (if we weren't creating a new file).
When reading the index and we see a weak transaction without a commit record, call a resolve() function in dbox code. It finds the dbox file in the weak transaction and tries to lock it. If it can't lock it, it (probably) means that there's still a process that's going to finish the save. If a locking succeeded, run a fixup code that goes through all files that are referenced in weak transactions and makes sure all the refcounts and such are valid. Then it writes commit records for each such transaction.
Copying can be done by locking source dbox file, writing a weak "refcount++" transaction and continuing from saving step 5.
The only step I can think of where there's a small problem is if data is written to dbox file, then process dies, but no saving is attempted to the same file later. The data is left there as garbage until maybe some day later when a message is expunged from that file and cleanup code notices that there's extra data. I thought about reordering steps so that the records are written to map index first, but that's less efficient when writing multiple messages, because each one would need to be saved separately to the map index (where each save would be lock + fstat (+ read) + write + unlock). This is probably a rare problem which doesn't really matter..
Expunging:
The "expunge" and "cleanup to remove deleted space" are separate steps. Preferably the cleanup would be done when the server is "less busy", but I guess it could also be configured to happen either immediately or during logout (or once an hour or so).
- Write "want to expunge" record to mailbox's transaction log.
- Write a weak "refcount--" to map index's transaction log.
- Write "expunged" record to mailbox's transaction log.
- Write "commit" for the weak refcount--.
(The two-stage want-to-expunge(1)/expunged(3) is already done for maildir/mbox also.)
I was also thinking maybe this could be just 3 steps without a weak transaction. Especially since without a fsync() between 2 and 3 there's really no guarantee about the order in which 2 and 3 happen. And fsync()s are kind of annoying. But then again if that isn't done and it crashes between 2 and 3 the expunge is tried again later which now decrements the refcount again. If the message's original refcount was 1 it would now be negative. That's not really a problem. More problematic is if the message had been copied to another mailbox and refcount=2, now the refcount would be 0 and the message in the other mailbox would get unexpectedly lost. Not good.
Space deletion cleanup:
Check from map index header to see there are any refcount=0 mails. Although having this step probably requires some annoying dbox-specific code in index handling code. I guess I'll have to see later if there's a non-horribly-ugly way to do this.
Go through all records in the map index. Get a list of files that have refcount=0 records. Also make a copy of file seq+offsets and sort them (so at step 3 we can easily find what mails exist).
For each file that has refcount=0 records:
3.1. Try to lock the file. -> Fail: Skip to next file. Either someone's appending to it or cleaning it up already. 3.2. Refresh map index. Add any newly added records to the sorted records list (rare). 3.2. Find the first refcount=0 mail. 3.3. Copy all mails after that with refcount>0 to a new file. 3.4. Lock map index, assign new file seq for the new file, rename() it, for each mail we just copied, using a weak transaction write the seq +offset updates to map index's transaction log. Unlock map index. 3.5. Depending on if first refcount=0 mail was the first mail, either unlink() the file or ftruncate() it. 3.6. Write "commit" record to finish the weak transaction.
That's a bit inefficient when cleaning up multiple dbox files, but I'm not sure if there's a much better way unless the map index is locked for the entire time. Since the copying might take a while, a long map index lock is bad since it prevents messages from being saved.
Reading mails can be done without locking. But since a file can be truncated unexpectedly, the code must be able to find the new file and offset in that situation and continue reading from there. If the message we were reading was just expunged+cleaned up, the client will have to be just disconnected. Since the cleanup shouldn't normally happen immediately after expunge, this should happen rarely/never.
Oh and about all these new temp files that are created. When dbox is opened the dir is stat()ed and if its atime is older than 1 week (currently hardcoded) it readdir()s through the dir to see if there are any old temp* files and then unlinks them.
With the new design flags/keywords would never be written to actual dbox files. This helps with backups and also avoids the horribleness of having to increase metadata area. To avoid complete data loss due to index corruption (which really shouldn't be happening nowadays), the indexes are also once in a while backed up to e.g. dovecot.index.backup file. If Dovecot detects missing/broken index, it uses this backup copy to rebuild the index as well as it can.
I'm also wondering if it's better for each mailbox to have its separate dovecot.index.cache file or if there should be one cache file for the map index. The upside with one index is that if a mail is copied there's no need to duplicate its cached data. The downside is that I'm worried it would cause more disk I/O, since mailboxes are typically accessed separately, so it might be better that the cache data for one mailbox is stored together rather than all around one bigger cache file.