I just had to write this even though it's not in CVS yet; it's only a few fixes away from being committed.
My year-long dream has finally come true :)
[Summary: Next release will have REALLY kickass indexes]
Previously the biggest problem with caching message data in indexes was that we had to do it while syncing the mailbox, before the client even saw those messages. This made index rebuilds very slow, since Dovecot had to go through all the mails and cache all of them, even if the client would never access them. The alternative was to not cache anything at all, but that wasn't really a good idea either.
Now Dovecot caches the data when the client actually requests it. That way it never does any extra work caching something that the client isn't currently interested in. There's still anticipatory caching for data that the client is likely to be interested in later, but that is done only when it doesn't cause extra disk I/O.
The reason it took this long is that it was previously too difficult with the read/write locking that the indexes required. I couldn't have relied on it working, since upgrading from a read lock to a write lock could deadlock, and dropping the lock in the middle would have caused even more problems.
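Here is a minimal sketch of the on-demand idea, in Python rather than Dovecot's C. All the names (`OnDemandCache`, `read_raw`, `parse_headers`) are made up for illustration and are not Dovecot's API. A field is cached only once a client asks for it, and other fields that clients have asked for before are cached along the way only when the message is already in memory, so the anticipatory part costs no extra disk read.

```python
def parse_headers(raw):
    """Parse a raw message header into a dict (folded lines are ignored for brevity)."""
    headers = {}
    for line in raw.split("\n"):
        if not line:
            break  # an empty line ends the header
        if line[0] in " \t":
            continue  # continuation of a folded header, skipped in this sketch
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


class OnDemandCache:
    """Hypothetical cache that fills itself only when a client requests data."""

    def __init__(self, read_raw):
        self._read_raw = read_raw  # read_raw(uid) stands in for the expensive disk read
        self._cache = {}           # (uid, field) -> value
        self._wanted = set()       # fields that clients have asked for so far
        self.disk_reads = 0

    def fetch(self, uid, field):
        field = field.lower()
        self._wanted.add(field)
        key = (uid, field)
        if key in self._cache:
            return self._cache[key]
        self.disk_reads += 1
        headers = parse_headers(self._read_raw(uid))
        # Anticipatory caching: the message is already in memory, so caching the
        # other wanted fields now causes no additional disk I/O.
        for name in self._wanted:
            self._cache[(uid, name)] = headers.get(name)
        return self._cache[key]
```

Nothing is read or cached for a message until somebody fetches from it, which is the whole point.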
So, what I did was to rewrite the cache file handling completely:
- It doesn't require any locking to read from it.
- It will be NFS-safe without being too costly.
- The file format itself is now architecture-independent.
- It uses less space than before.
- OpenBSD support is now worse than before, even worse than NFS support. They should get that unified cache done.
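To illustrate what "architecture-independent" means for a file format, here is a rough sketch using Python's `struct` module. The record layout (uid, flags, size) is invented for the example and is not the real cache file format; the point is only that fixing the byte order and field widths gives the same bytes on every machine.

```python
import struct

# Hypothetical record: uid (uint32), flags (uint32), message size (uint64).
# "<" forces little-endian byte order, standard field sizes and no padding,
# so the bytes do not depend on the CPU or compiler that wrote them.
RECORD = struct.Struct("<IIQ")


def pack_record(uid, flags, size):
    """Encode one record into its fixed 16-byte on-disk form."""
    return RECORD.pack(uid, flags, size)


def unpack_record(data):
    """Decode a 16-byte record back into (uid, flags, size)."""
    return RECORD.unpack(data)
```

A file written this way on one machine can be read back on another with a different word size or byte order.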
Cached data is also selected better:
- Specifically requested message headers are cached (FETCH, SEARCH, SORT, THREAD, etc.)
- IMAP ENVELOPE isn't treated in any special way anymore. It's treated just as if you had requested HEADER.FIELDS (Date Subject From ...etc.).
- Maildir filenames aren't constantly updated in the cache file anymore. This takes more memory now, but should reduce disk I/O.
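As a sketch of the ENVELOPE change: a request for ENVELOPE can simply be expanded into the header names it is built from (the list below is the set of envelope fields from RFC 3501), and after that it is handled like any other HEADER.FIELDS request. The function name and the simplified string form of the FETCH items are assumptions for this example, not Dovecot code.

```python
# Header fields that make up an IMAP ENVELOPE (RFC 3501).
ENVELOPE_HEADERS = (
    "Date", "Subject", "From", "Sender", "Reply-To",
    "To", "Cc", "Bcc", "In-Reply-To", "Message-ID",
)


def headers_to_cache(fetch_items):
    """Return the lowercased header names that the given FETCH items need.

    fetch_items is a simplified list of strings such as "ENVELOPE" or
    "HEADER.FIELDS (Subject X-Spam)"; this is not a full IMAP parser.
    """
    wanted = []
    for item in fetch_items:
        upper = item.upper()
        if upper == "ENVELOPE":
            names = ENVELOPE_HEADERS
        elif upper.startswith("HEADER.FIELDS") and "(" in item and ")" in item:
            names = item[item.index("(") + 1:item.rindex(")")].split()
        else:
            names = ()  # items that do not need cached headers
        for name in names:
            name = name.lower()
            if name not in wanted:
                wanted.append(name)
    return wanted
```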
There are a few more tweaks that I'll probably add later:
- Maildir: When compressing the cache file, we update the mail's filename to the current one, so if a message's flags aren't changed often, it won't use any extra memory either.
- A normal text/plain message doesn't need to have its IMAP BODYSTRUCTURE cached. It only needs a single bit set in flags. That should reduce the used space quite a lot.
- A message with only one body part doesn't need to have its body structure stored at all, since it's mostly just useful for fetching body parts. Plus the body structure could be generated if message sizes were known.
All this work was only for the index cache file (the former .data file). There's still the main .imap.index file and the modify log files.
The modify log should be easy to make NFS-safe. I'm currently using some stupid file locking to figure out when the file is safe to overwrite, but I should have simply replaced the old file with rename(). That will solve some other problems as well as make the code simpler. It should also be possible to make reads from it lockless.
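The rename() approach looks roughly like this in Python, where `os.replace` wraps rename(). The helper name `replace_file` is made up, and this is only a sketch of the general technique, not what the modify log code will actually look like. The new contents go into a temporary file in the same directory, which is then renamed over the old file, so a reader opening the path sees either the complete old file or the complete new one.

```python
import os
import tempfile


def replace_file(path, data):
    """Replace the file at path with data (bytes) using write-then-rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be on the same filesystem as the target,
    # otherwise the rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the data is on disk before the rename
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A process that already has the old file open keeps reading the old contents until it reopens the path, so no lock is needed just to decide when overwriting is safe.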
The main index file is more problematic though. Its header contains a lot of changing fields, such as the number of messages, the number of seen messages, etc. I can't think of any way to make it safe to read these fields without locks (my previous ideas didn't actually work).
Lockless reads are pretty much a must for scalable NFS-safety. I think it could be done by simply removing all the constantly changing header fields. If you want to know how many seen messages there are, just read all the records in the file and count them. Expunging would be done by rewriting the file and rename()ing it over the old file. Appending new messages is a bit tricky to do safely, but I think I know how to do that too.
Read/write locks allow the changing header fields, but there are lock contention problems with shared mailboxes.
I think I'll make it optional which of these two approaches is used.
Oh, and I also thought about how indexes would work with shared mailboxes. You could use one shared cache file, but each user would have their own main index and modify log. That would allow storing per-user flags in the index file and also expunging (hiding) messages by removing them from that user's index.
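A quick sketch of counting instead of keeping counters in the header. The record layout and the `FLAG_SEEN` bit are invented for the example, and ignoring a partial trailing record is my own assumption about how a reader might cope with a half-written append; it is not the append scheme hinted at above.

```python
import struct

# Hypothetical fixed-size index record: uid (uint32), flags (uint32).
RECORD = struct.Struct("<II")
FLAG_SEEN = 0x01


def count_messages(index_bytes):
    """Return (total, seen) by scanning every record, with no header counters."""
    # Assumption: a trailing partial record is an append still in progress,
    # so it is skipped rather than treated as corruption.
    usable = len(index_bytes) - len(index_bytes) % RECORD.size
    total = seen = 0
    for _uid, flags in RECORD.iter_unpack(index_bytes[:usable]):
        total += 1
        if flags & FLAG_SEEN:
            seen += 1
    return total, seen
```

This costs a full scan for each count, which is the price paid for not having any constantly changing fields in the header.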
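Very roughly, the shared mailbox layout could be modelled like this. The class and method names are hypothetical and everything is kept in memory; it only shows how one shared cache plus per-user indexes gives per-user flags and per-user expunges.

```python
class SharedMailbox:
    """Hypothetical model: one shared cache, one index per user."""

    def __init__(self):
        self.cache = {}    # uid -> cached message data, shared by all users
        self.indexes = {}  # user -> {uid: set of that user's flags}

    def add_user(self, user):
        # A new user's index starts out listing every message with no flags set.
        self.indexes[user] = {uid: set() for uid in self.cache}

    def append(self, uid, cached_data):
        self.cache[uid] = cached_data
        for index in self.indexes.values():
            index[uid] = set()

    def set_flag(self, user, uid, flag):
        self.indexes[user][uid].add(flag)

    def expunge(self, user, uid):
        # Only hides the message from this user; the shared cache is untouched.
        del self.indexes[user][uid]

    def visible(self, user):
        return sorted(self.indexes[user])
```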