Again only mbox fixes. I found some more bugs which could also have caused some of the mbox problems that people reported.
I found them today while doing a bit more testing with my favourite 1.4GB mbox. Then I thought I might as well see how Dovecot compares against UW-IMAP. First, a bit of explanation of how they work internally:
Dovecot 1.0-test24
Dovecot works by trying to do everything in one "sync" function. It reads new mails, inserts missing headers, writes header modifications and expunges messages.
Dovecot leaves 100 bytes of padding in every mail's headers, which it can use to avoid moving the rest of the file forward when it needs to insert space. However, if there's not enough space (and with new mails there isn't), it needs to do the moving.
Moving is done by first reading messages and counting how many bytes we're short. Once we've seen enough padding, we rewrite those messages. If we read all the way to the end of the file, we grow the file and rewrite the messages. Rewriting is done backwards, so we don't have to do much buffering, and if the rewriting gets interrupted (crash, power loss, etc.) the data loss is very small, a few kilobytes at most.
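The two steps above can be sketched roughly as follows. This is only an illustration in Python, not Dovecot's actual code: the names find_rewrite_span and shift_forward, the (needed, padding) message tuples and the chunk size are all made up for the example.

```python
import io

def find_rewrite_span(messages):
    """messages: (needed, padding) pairs in file order, starting at the
    first message that needs extra space (hypothetical representation).
    Returns (count, short): how many messages must be rewritten, and how
    many bytes we are still short (0 if the padding seen was enough,
    >0 if we hit end of file and the file has to grow)."""
    short = 0
    for i, (needed, padding) in enumerate(messages):
        short += needed - padding
        if short <= 0:
            return i + 1, 0
    return len(messages), max(short, 0)

def shift_forward(f, start, end, shift, chunk=4096):
    """Move bytes [start, end) of f forward by `shift` bytes, copying the
    last chunk first so unread data is never overwritten and at most
    `chunk` bytes are held in memory at a time."""
    pos = end
    while pos > start:
        n = min(chunk, pos - start)
        pos -= n
        f.seek(pos)
        data = f.read(n)
        f.seek(pos + shift)
        f.write(data)

# A new mail needs 150 bytes but has 100 bytes of padding; the next
# messages each contribute some of their own padding.
print(find_rewrite_span([(150, 100), (0, 30), (0, 30), (0, 100)]))

# Open a 3-byte gap at offset 4; the old bytes left in the gap would be
# overwritten by the new headers afterwards.
f = io.BytesIO(b"AAAABBBBCCCC")
shift_forward(f, 4, 12, 3, chunk=2)
print(f.getvalue())
```

Because the copy runs from the end towards the start, an interruption leaves at most one chunk in flight, which is the property described above.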
UW-IMAP 2004
UW-IMAP only reads the mbox at SELECT; a rewrite is triggered only by the LOGOUT, EXPUNGE and CHECK commands. UW-IMAP also inserts padding, but less than Dovecot. It tries to fit the added headers into 50 bytes, and whatever is not used is left as padding. Normally this seems to leave around 15-20 bytes of padding. Perhaps I should shrink Dovecot's padding too.
Rewriting works by reading the file forward into a buffer and writing the changes as needed. This is quite fast, but it means the buffer can grow large, and if the rewrite gets interrupted everything in the buffer is lost. In my test mbox this would have been 23MB of lost data, but normally much less.
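To see why a forward rewrite needs a growing buffer, here is a small Python sketch. It is not UW-IMAP's code; rewrite_forward and the (size, extra) message tuples are hypothetical. The point is that data can only be written up to the position already read, so everything inserted so far has to be carried along in memory.

```python
import io

def rewrite_forward(f, messages):
    """messages: (size, extra) pairs in file order, where `extra` is the
    bytes to insert in front of that message (hypothetical layout).
    Rewrites f in place, front to back, and returns the peak number of
    bytes that had to be held in memory."""
    read_pos = write_pos = 0
    pending = bytearray()
    peak = 0
    for size, extra in messages:
        f.seek(read_pos)
        body = f.read(size)
        read_pos += size
        pending += extra + body
        # Writing past read_pos would clobber data not yet read.
        writable = min(read_pos - write_pos, len(pending))
        f.seek(write_pos)
        f.write(bytes(pending[:writable]))
        write_pos += writable
        del pending[:writable]
        peak = max(peak, len(pending))
    # Whatever is left extends the file; this is what a crash would lose.
    f.seek(write_pos)
    f.write(bytes(pending))
    return peak

f = io.BytesIO(b"aaaabbbbcccc")
peak = rewrite_forward(f, [(4, b"X"), (4, b"YY"), (4, b"")])
print(f.getvalue(), peak)
```

In this sketch the buffer ends up as large as the total number of bytes inserted. That would fit the 23MB figure above, since the mbox UW-IMAP rewrote in the benchmark below is 1444411190 - 1420611590 = 23799600 bytes larger than the original, but that reading is my inference from the numbers.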
Benchmarks
I simply rewrote a 1.4GB mbox containing 361052 mails, Linux kernel mailing list archives from the years 1996-2002. The computer is an Athlon XP 2700+ with 1GB of memory.
Reads and writes were counted using Linux's iostat command. Nothing else was using that partition, so the numbers should be accurate, except that UW-IMAP's read count is a few blocks too high because I got tired of waiting for it and started looking into the mbox to see how far it had gotten.
Read counts could also be somewhat wrong if some of the mbox was already in the buffer cache. I tried to flush it anyway by catting 2x4GB of data into /dev/null. The kernel used around 900MB of memory for caching.
Total CPU times may also be a bit off, as the computer was being used at the same time.
Dovecot 1.0-test24
reads : 4007432 blocks = 1956 MB
writes: 2947381 blocks = 1439 MB

original mbox : 1420611590 B = 2774632 blocks = 1354 MB
rewritten mbox: 1472684487 B = 2876336 blocks = 1404 MB
indexes       :   14452732 B =   28227 blocks =   14 MB

7221164 dovecot.index
  10436 dovecot.index.cache
7221132 dovecot.index.log (this will be truncated after a while)
- 16064 VSZ, 8012 RSS after SELECT completed
- 14MB is mmap()ed index files
- 216kB of heap left, of which 70kB is actually in use
- heap usage was 25MB VSZ/RSS constantly while syncing, but allocations were so large that libc used anonymous mmap()s so they got dropped after sync
- VSZ peaked at 32MB, most likely because index file(s) were temporarily being mmap()ed more than once
- 63.37s user
- 16.56s system
- 24% cpu
- 5:22.75 total
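For anyone rechecking the I/O numbers: the block and MB figures above work out if a block is 512 bytes and 1 MB is 2^20 bytes, rounded down. The block size is not stated anywhere here, so treat it as an assumption inferred from the figures. A quick Python check:

```python
BLOCK_SIZE = 512  # assumption: inferred from the figures, not stated in the post

def bytes_to_blocks(size):
    """Whole 512-byte blocks in `size` bytes (rounded down)."""
    return size // BLOCK_SIZE

def blocks_to_mb(blocks):
    """Blocks converted to MB (2**20 bytes), rounded down."""
    return blocks * BLOCK_SIZE // (1024 * 1024)

print(bytes_to_blocks(1420611590))  # original mbox
print(bytes_to_blocks(1472684487))  # rewritten mbox
print(blocks_to_mb(4007432))        # reads
print(blocks_to_mb(2947381))        # writes
```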
UW-IMAP 2004
Reading:
- memory: 70044 VSZ, 67724 RSS
- CPU: 7s user, 53s total
In total:
reads : 5549640 blocks = 2709 MB
writes: 2875581 blocks = 1404 MB

original mbox : 1420611590 B = 2774632 blocks = 1354 MB
rewritten mbox: 1444411190 B = 2821115 blocks = 1377 MB
- 93416 VSZ, 91120 RSS
- 2295.81s user
- 23.99s system
- 96% cpu
- 39:57.90 total
Notes
Not counting Dovecot's indexes, UW-IMAP wrote 21MB less. But Dovecot wrote 27MB more padding, so just by shrinking the padding Dovecot would have written 6MB less data than UW-IMAP. Also, because Dovecot writes the file backwards, it needs to do some extra jumping around and overlapping writes, but I guess the OS merged them nicely.
I'm not exactly sure why UW-IMAP uses so much CPU for rewriting, but it does, so Dovecot is over 7x faster in total wall-clock time (with about 36x less user CPU time).
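The comparison figures can be rechecked from the benchmark numbers above. The Python snippet below only redoes that arithmetic; the variable names are mine. Note that the 36x figure matches user CPU time alone; with system time included the ratio comes out closer to 29x.

```python
def to_seconds(t):
    """Convert a 'M:SS.ss' wall-clock time to seconds."""
    minutes, seconds = t.split(":")
    return int(minutes) * 60 + float(seconds)

dovecot = {"user": 63.37, "system": 16.56, "total": to_seconds("5:22.75")}
uw_imap = {"user": 2295.81, "system": 23.99, "total": to_seconds("39:57.90")}

wall_ratio = uw_imap["total"] / dovecot["total"]
user_ratio = uw_imap["user"] / dovecot["user"]
cpu_ratio = (uw_imap["user"] + uw_imap["system"]) / (
    dovecot["user"] + dovecot["system"])

# Written data in MB, taken from the tables above.
extra_writes = (1439 - 14) - 1404   # Dovecot minus indexes, minus UW-IMAP
extra_padding = 1404 - 1377         # difference in rewritten mbox sizes

print(f"wall clock: {wall_ratio:.1f}x, user CPU: {user_ratio:.1f}x, "
      f"user+system CPU: {cpu_ratio:.1f}x")
print(f"extra writes: {extra_writes} MB, extra padding: {extra_padding} MB, "
      f"net: {extra_writes - extra_padding} MB")
```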
Dovecot should also support delaying the rewrite. This is mostly useful for POP3 clients which delete all the mail at logout, so they won't need the rewriting at all. Dovecot also writes all flag changes to disk immediately, while UW-IMAP defers them so it can do more at once, which results in less total I/O as well.