[Dovecot] dbox redesign
This is about how to implement multiple msgs/file dbox format. The current v1.1's one msg/file design would stay pretty much the same and it would be compatible with this new design.
dbox directories with multiple msgs/file would be like:
~/dbox/storage/ has the actual mail data for all mailboxes ~/dbox/mailboxes/ has subdirectories containing mailboxes and their indexes
Also since dbox supports already the single msg per file, those files would be stored in the mailboxes/ directory. So the idea would be that either you use multiple msgs per file using a global storage, or you use single msg per file without a global storage (or it's also possible to be in a mixed setup with some mails in storage/ and some in mailboxes/, mainly to allow migration between those configurations).
The storage/ directory would have a new "map index" which is a regular dovecot index (dovecot.index and dovecot.index.log). So the mailbox index would point to mails using an intermediary "map UID". This way if mails are moved to another file only the map index needs to be updated.
GUID would be a globally unique 128 bit ID for messages. So if map indexes get corrupted for any reason it's possible to rebuild it by finding the mails using GUIDs.
v1.1 dbox has this "dbox.index" file which I was originally planning on using with multiple msgs/file. It had complex file range locking stuff. Now I'm thinking that it's pretty much useless. The only reason for its existence with the new design is for listing metadata for files converted from Maildir.
Map index record would contain:
- 32 bit map UID
- 8 bit flags (MAIL_DELETED flag = message marked as expunged)
- 8 bit unused wasted space
- 16 bit refcount
- 32 bit file sequence
- 32 bit file offset --> total 128 bits/msg
Mailbox index:
- IMAP UID, flags, keywords, etc.
- 32 bit map UID
- 128 bit GUID
dbox file metadata: though.)
- 128 bit GUID
- size, vsize, received time, saved time, etc.
- initial mailbox name (if all indexes get trashed, we can still figure out at least one mailbox where to put the mail. copies would get lost
(- no map UID, no imap UID)
How to save a message with multiple msgs/file:
- Find dbox file where to append to: 1.1. Look up the last message from map index 1.2. Is the file "too old"? (or doesn't exist at all)
- Yes -> Create new dbox file 1.3. Is the file "too large"?
- Yes -> Look at the previous file (one sequence less) and goto 1.2. 1.4. Try to lock the file.
- Fail -> Look at prev file and goto 1.2.
Now we have a locked/new dbox file where we can write to. Because 1.4. step only tries to lock the file, there's no waiting on locks. This also means that if e.g. two processes are writing new messages rapidly they may be appending actively to two different files. I don't think that's a problem, better than waiting for locks.
2a) We're using an existing file and we need to find the append offset. Since we found the file by finding the last msg in the file, we also know the last message's offset. I wasn't really planning on saving the message sizes in the index file, so to get the append offset I guess it needs to do an extra read on the last msg's header to find the size and skip over it. Hmm. Or would it be less disk I/O to store the size on the index so it could be found directly? I'm not really sure..
In any case, after we find the append offset, check to see if it's at EOF. If not, that means that either another process just saved a new message there or a process crashed previously and left garbage lying around. Refresh the map index to see if this file+offset exists in it. If not, truncate the file and just continue writing there. If it exists, figure out the new append offset and see again if the file limit would be reached. If the file would become too large, unlock the file and goto step 1.
2b) We're writing to a new file. No need to worry about anything in 2a)
Write the message and its metadata to dbox file (including generated 128 bit GUID).
Assign map UIDs for the written mails and write APPEND records to map index's transaction log. The record would contain the map UID, file seq, offset, refcount=1. The transaction is saved with a "weak" flag (wonder if there's a better name for this) and its offset is remembered.
- If we're creating a new dbox file, it's assigned the file seq and rename()d to the final file name while the map index is locked.
Write APPEND record to mailbox index's transaction log with IMAP UID, map UID and GUID (and flags, keywords, etc).
Write "commit offset=x" record where x is the offset remembered in step 4. This marks the 4's weak transaction as being fully finished.
dbox file is unlocked (if we weren't creating a new file).
When reading the index and we see a weak transaction without a commit record, call a resolve() function in dbox code. It finds the dbox file in the weak transaction and tries to lock it. If it can't lock it, it (probably) means that there's still a process that's going to finish the save. If a locking succeeded, run a fixup code that goes through all files that are referenced in weak transactions and makes sure all the refcounts and such are valid. Then it writes commit records for each such transaction.
Copying can be done by locking source dbox file, writing a weak "refcount++" transaction and continuing from saving step 5.
The only step I can think of where there's a small problem is if data is written to dbox file, then process dies, but no saving is attempted to the same file later. The data is left there as garbage until maybe some day later when a message is expunged from that file and cleanup code notices that there's extra data. I thought about reordering steps so that the records are written to map index first, but that's less efficient when writing multiple messages, because each one would need to be saved separately to the map index (where each save would be lock + fstat (+ read) + write + unlock). This is probably a rare problem which doesn't really matter..
Expunging:
The "expunge" and "cleanup to remove deleted space" are separate steps. Preferably the cleanup would be done when the server is "less busy", but I guess it could also be configured to happen either immediately or during logout (or once an hour or so).
- Write "want to expunge" record to mailbox's transaction log.
- Write a weak "refcount--" to map index's transaction log.
- Write "expunged" record to mailbox's transaction log.
- Write "commit" for the weak refcount--.
(The two-stage want-to-expunge(1)/expunged(3) is already done for maildir/mbox also.)
I was also thinking maybe this could be just 3 steps without a weak transaction. Especially since without a fsync() between 2 and 3 there's really no guarantee about the order in which 2 and 3 happen. And fsync()s are kind of annoying. But then again if that isn't done and it crashes between 2 and 3 the expunge is tried again later which now decrements the refcount again. If the message's original refcount was 1 it would now be negative. That's not really a problem. More problematic is if the message had been copied to another mailbox and refcount=2, now the refcount would be 0 and the message in the other mailbox would get unexpectedly lost. Not good.
Space deletion cleanup:
Check from map index header to see there are any refcount=0 mails. Although having this step probably requires some annoying dbox-specific code in index handling code. I guess I'll have to see later if there's a non-horribly-ugly way to do this.
Go through all records in the map index. Get a list of files that have refcount=0 records. Also make a copy of file seq+offsets and sort them (so at step 3 we can easily find what mails exist).
For each file that has refcount=0 records:
3.1. Try to lock the file. -> Fail: Skip to next file. Either someone's appending to it or cleaning it up already. 3.2. Refresh map index. Add any newly added records to the sorted records list (rare). 3.2. Find the first refcount=0 mail. 3.3. Copy all mails after that with refcount>0 to a new file. 3.4. Lock map index, assign new file seq for the new file, rename() it, for each mail we just copied, using a weak transaction write the seq +offset updates to map index's transaction log. Unlock map index. 3.5. Depending on if first refcount=0 mail was the first mail, either unlink() the file or ftruncate() it. 3.6. Write "commit" record to finish the weak transaction.
That's a bit inefficient when cleaning up multiple dbox files, but I'm not sure if there's a much better way unless the map index is locked for the entire time. Since the copying might take a while, a long map index lock is bad since it prevents messages from being saved.
Reading mails can be done without locking. But since a file can be truncated unexpectedly, the code must be able to find the new file and offset in that situation and continue reading from there. If the message we were reading was just expunged+cleaned up, the client will have to be just disconnected. Since the cleanup shouldn't normally happen immediately after expunge, this should happen rarely/never.
Oh and about all these new temp files that are created. When dbox is opened the dir is stat()ed and if its atime is older than 1 week (currently hardcoded) it readdir()s through the dir to see if there are any old temp* files and then unlinks them.
With the new design flags/keywords would never be written to actual dbox files. This helps with backups and also avoids the horribleness of having to increase metadata area. To avoid complete data loss due to index corruption (which really shouldn't be happening nowadays), the indexes are also once in a while backed up to e.g. dovecot.index.backup file. If Dovecot detects missing/broken index, it uses this backup copy to rebuild the index as well as it can.
I'm also wondering if it's better for each mailbox to have its separate dovecot.index.cache file or if there should be one cache file for the map index. The upside with one index is that if a mail is copied there's no need to duplicate its cached data. The downside is that I'm worried it would cause more disk I/O, since mailboxes are typically accessed separately, so it might be better that the cache data for one mailbox is stored together rather than all around one bigger cache file.
Timo Sirainen wrote:
This is about how to implement multiple msgs/file dbox format. The current v1.1's one msg/file design would stay pretty much the same and it would be compatible with this new design.
Out of curiosity, what's the advantage to going to multiple messages per file? Wouldn't this have the same problems as mbox?
~Seth
On Wed, 2009-02-11 at 14:32 -0800, Seth Mattinen wrote:
Timo Sirainen wrote:
This is about how to implement multiple msgs/file dbox format. The current v1.1's one msg/file design would stay pretty much the same and it would be compatible with this new design.
Out of curiosity, what's the advantage to going to multiple messages per file? Wouldn't this have the same problems as mbox?
Multiple per file, not everything in one file. As long as the file size is set "right", it's probably faster than one per file. We'll see :)
On Wed, 2009-02-11 at 17:35 -0500, Timo Sirainen wrote:
On Wed, 2009-02-11 at 14:32 -0800, Seth Mattinen wrote:
Timo Sirainen wrote:
This is about how to implement multiple msgs/file dbox format. The current v1.1's one msg/file design would stay pretty much the same and it would be compatible with this new design.
Out of curiosity, what's the advantage to going to multiple messages per file? Wouldn't this have the same problems as mbox?
Multiple per file, not everything in one file. As long as the file size is set "right", it's probably faster than one per file. We'll see :)
Also there are no locking issues since reading doesn't require locking and write locks are very short lived. Corruption isn't possible because data is never copied within a file. A crash can happen at any point and Dovecot will be able to recover from it 100%. The worst that can happen is that some extra garbage is left lying around for some time wasting disk space.
I would add that having fewer, larger files should make backups much more feasible. There's a certain amount of overhead for each file operation (especially for us GFS people!) and reducing the number of files will reduce that overhead.
Right now our backups (done via rsync) take a pretty scary amount of time, only to get worse as the size of the mailstore (currently 200G) grows.
Personally I'm pretty excited about dbox.
Allen
Timo Sirainen wrote:
On Wed, 2009-02-11 at 14:32 -0800, Seth Mattinen wrote:
Timo Sirainen wrote:
This is about how to implement multiple msgs/file dbox format. The current v1.1's one msg/file design would stay pretty much the same and it would be compatible with this new design.
Out of curiosity, what's the advantage to going to multiple messages per file? Wouldn't this have the same problems as mbox?
Multiple per file, not everything in one file. As long as the file size is set "right", it's probably faster than one per file. We'll see :)
-- Allen Belletti allen@isye.gatech.edu 404-894-6221 Phone Industrial and Systems Engineering 404-385-2988 Fax Georgia Institute of Technology
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On Thu, 12 Feb 2009, Allen Belletti wrote:
I would add that having fewer, larger files should make backups much more feasible. There's a certain amount of overhead for each file
That's true for full backups. I don't defend Maildir, esp. because it changes the filename, which is a new file for any backup software (which are usually not Maildir aware in my case).
operation (especially for us GFS people!) and reducing the number of files will reduce that overhead.
Also, makes partial recoveries problematic.
Right now our backups (done via rsync) take a pretty scary amount of
Yep.
Bye,
Steffen Kaiser -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux)
iQEVAwUBSZqWHHWSIuGy1ktrAQL7lggAnWUfIIPwVj3xV4csIVl0ayVCn2lgBGzG lRIg+OzGRbpZx9uvwJpPRtJS7TphFTBmctvdvL22NROGaXJh0bmvdfgeXnUf+IG/ YJqNEGN5j/yUAHON4l9hMnv9JjgWwSIFKCUZJ7MFVJpohpPLXJoxDt+AYyb+d+44 GImHDgpcyr0089Asv6FN8Q4rzGIxQAvdIx/n/nMeQ77ZVnbTJDtrcuNUywcV7Hqq lYbEX83ikq206QSJjmwM1j6w5n+PAsHWE8UJdmmpP/7vemsg3KDVkhaMcCfhLyL4 FqDZOwuhhsVEjykLfgbx6onJ1bon7u987eqJt5yv1NYRGU+NoCVUDQ== =+s98 -----END PGP SIGNATURE-----
Hi Timo
I have a few comments. Please just disregard them if I have misunderstood your design.
Regarding your storage plan I find it very important that users can be stored in different locations because:
- Discount users could be placed on cheap storage while others are offered premium service on expensive hardware
- It's easy to scale if you just add another LUN from your SAN or mount from NAS
- In order to avoing huge directories you can put users into subdirs with each subdir containing only say 1000 users each All this is very easy to achieve in 1.1 because you can return individual storage dirs for indexes and data from the user db. I'm not sure from reading your post whether this will still be possible but I believe it’s a very important thing.
Regarding 7. I very much for all the self healing you describe. There is nothing worse than huge complex systems that fail just because of some minor error that could easily be fixed without manual intervention. But also I'm a little worried in this regard.
Maildir is so robust that nothing can really go wrong. But here you have index files and data files located in different places. Imagine the index file being on one NFS mount whilst the data resides on another. Or if the administrator is purposely loading a different index file or data file from a backup.
Worst case scenario is that the self healing takes a manual operation for a failure and breaks something. It should be very resilient to temporarily losing access to all files in this operation (could happen very often on NFS mounts).
Also I imagine the self-healing going into loops if it doesn't understand what’s going on. If the data changes dues to manual intervention or par of the file system can be accessed you could imagine the self healing process trying again and again to fix something that isn't its job to fix. In that case it would be better if it just skipped the apparent failures.
I'm also wondering if it's better for each mailbox to have its separate dovecot.index.cache file or if there should be one cache file for the map index. I think you should consider more files as the general choice (not only regarding cache files). Imagine many dovecot servers accessing the same storage simultaneously. I figure it would be a lot easier if they weren’t all trying to read/update one essential file at the same time (with only one file, load can’t be spread across multiple mounts and everything goes down if
Timo wrote: the mount with the essential file is inaccessible). If there is serious data corruption and you have only one file then all operations are paused while the self healing is trying to figure out what went wrong (and what happens if different servers decide to do self-healing on this one file at the same time?). With one file per maildir only a small portion of the users are affected, the load is spread and really bad file corruption doesn’t break everything for thousands of users.
Other than that I’m just really glad that dbox is progressing. I consider it the feature. Dbox is the email administrator’s wet dream. I’m already dreaming of completely avoiding the scalability issues of large Maildirs (which is the biggest challenge today in my opinion) and reducing the IO. Buying more IO is an order of magnitude more expensive than getting more RAM or CPU power (and dovecot barely needs any RAM and CPU anyway).
Best wishes, Mikkel
On Thu, 2009-02-12 at 11:29 +0100, Mikkel wrote:
Hi Timo
I have a few comments. Please just disregard them if I have misunderstood your design.
Regarding your storage plan I find it very important that users can be stored in different locations because:
This you misunderstood. The mails of a single user are stored in one dbox directory, not all users.
Regarding 7. I very much for all the self healing you describe. There is nothing worse than huge complex systems that fail just because of some minor error that could easily be fixed without manual intervention. But also I'm a little worried in this regard.
Maildir is so robust that nothing can really go wrong.
Yes. If you don't care that much about performance Maildir is going to be more reliable, especially when recovering from filesystem corruption.
It should be very resilient to temporarily losing access to all files in this operation (could happen very often on NFS mounts).
I/O errors and such are treated differently than corrupted/missing files. So as long as reading gives an error it doesn't try to repair anything.
Also I imagine the self-healing going into loops if it doesn't understand what’s going on. If the data changes dues to manual intervention or par of the file system can be accessed you could imagine the self healing process trying again and again to fix something that isn't its job to fix. In that case it would be better if it just skipped the apparent failures.
I'm not really sure what you're thinking about here. Assuming there aren't bugs in the fixup code, it should be able to fix things. If someone manually goes and breaks things again, then sure it fixes them again later, but there's really no automatic looping. Also Dovecot already does index file fixing if it notices corruption, so this won't be all that much different.
If there is serious data corruption and you have only one file then all operations are paused while the self healing is trying to figure out what went wrong
There will be multiple files even per user, but yes, if corruption is noticed then the user is blocked until the corruption is fixed.
(and what happens if different servers decide to do self-healing on this one file at the same time?).
The same as if two processes in one server decide to self-heal: Locking prevents it from happening.
participants (5)
-
Allen Belletti
-
Mikkel
-
Seth Mattinen
-
Steffen Kaiser
-
Timo Sirainen