[Dovecot] New mailbox format
I've been doing some thinking on a new better mailbox format that'd be faster than mbox and maildir for most common uses. Comments welcome.
It won't come to Dovecot v1.0, but I'll probably create another branch to Dovecot CVS where I'll start implementing it and some other possibly destabilizing changes.
Generic
The point is to have a mailbox format where the mailbox can consist of one or more files. Grouping multiple files in a single file makes it faster to read, but it's slower to expunge mails from the beginning of the file. So this format would allow sysadmin to specify rules on how large the files would be allowed to grow.
This format is mostly designed to work nicely with separate index files such as Dovecot has. The flags are stored in the mailbox files mostly as a backup in case the index gets lost or corrupted.
Index File
Named "index".
Index file gives a mapping between UIDs <-> filenames:
Header: <uidvalidity> <last-used-uid> Record: <uid-first> <uid-last> <filename> Record: ..
eg.:
1127468129 5 1 3 msg.1 4 4 msg.4
All of the UIDs don't have to actually exist in the file if they have been expunged. The record should be removed once all of the UIDs have been removed from it (and the file itself been deleted).
The same file may appear multiple times in the index. For example:
1 2 msg.1 3 3 msg.3 4 5 msg.1
This means that UIDs 1, 2, 4 and 5 are in file msg.1, but UID 3 is in file msg.3.
The filename doesn't matter as long as it's unique, but probably wouldn't hurt to include its first UID in the filename so humans can find mails more easily. The filename however must not end with ".lock".
The file is modified by creating first exclusively owned "index.lock" file, updating it and then rename()ing it over the index file. This lock file shouldn't be held for a long time, so if it exists, and it doesn't get modified at all within 30 seconds, it can be overwritten. Before rename()ing over the index file, you should check with stat() that the index.lock file is still the same as you expect it to be (in case your process was temporarily stopped for over 30 secs).
The index file's mtime timestamp must grow after each modification. This makes it easy to check if new mail has been added to the mailbox.
Mail File
File header:
Bytes Format Description 8 hex Header size 16 hex Append offset 4 hex Mail header size 4 hex Mail header padding 4 hex Keyword count in mail headers
Mail header:
Bytes Format Description 8 hex UID 16 hex Mail size 1 1/0 Answered flag 1 1/0 Flagged flag 1 1/0 Deleted flag 1 1/0 Seen flag 1 1/0 Draft flag 1 1/0 Expunged flag n 1/0 Keywords
The file layout then goes:
<File header>
- possible padding due to Header size <Mail header #1>
- possible padding due to Mail header size <Mail data #1>
- possible padding due to Mail header padding (ie. next Mail header's offset is divisible with Mail header padding) <Mail header #2> etc..
The UIDs must be ascending within the file (but as said above, some UIDs between them may exist in other files).
When the file is opened, it must be shared locked with flock(). Expunging mails requires exclusive flock(), so expunges can't happen while someone is still reading the file. Mails can however be marked with expunged-flag, and such mails should be treated as if they are already expunged. Expunged flag must not be removed. If the mail for some reason is wanted to be un-expunged, it must be copied as a new mail with new UID to the mailbox.
Modifying flags doesn't require any locks. You'll simply write over those flags you wish to change. That's why each flag takes one byte.
Safest way to expunge is to copy all non-expunged mails after the first expunged one to new file. Then update the index file with the new location for the UIDs, and at last ftruncate() at the position of first expunged mail in the original file.
Alternative way to expunge is to just move the data over the expunged mails and fruncate() at the end. No index file updates necessary. This is more filesystem quota friendly, but if the expunging gets aborted in the middle it will corrupt the mails.
Appending mails can be done by either locking existing file with a dotlock and appending to it, or creating a new file. If appending to existing file, also flock() it so expunging won't happen in the middle of append. If appending to new file, create it with .lock suffix and then check that the file without .lock wasn't already just created.
When appending to existing file, check "Append offset" from file's header. If that offset is at the end of file, just append there. Otherwise read the mail's header from there. If it has a non-zero UID and index file says that the same UID exists in this file, jump over the mail and do the same checks for all the rest of the mails in the file. Once the check doesn't match or you reach EOF, update the append offset in the header to that offset and write your mail there.
Initially write the UID in mail's header as zeroes ('0', not NUL). Once the append is completely finished, you'll have to update it with the new UID. You'll do this by locking index file, reading the latest last-used-uid and using the next value for your UID. Once you've written it, rewrite the index file with updated last-used-uid field and your UID added to it. After this, update the append offset in the mail file and unlock it.
Keywords file
Named "index.keywords". This contains mappings between keyword numbers and names. TODO: may be problematic with lockless flag writes..
Recent file
"index.recent.<uid>" file tells what the last non-recent UID was. The file doesn't have any data inside it.
When opening the mailbox you should find and rename() the existing index.recent.<uid> file to the new index.recent.<last-uid>. A successful rename() means the mails between <uid>+1 .. <last-uid> are recent for you. A failed rename() means that someone else managed to do it. You should look for the file again and if it was renamed to less than <last-uid>, you should try again.
Nightly cleanups
Because expunged mails may be left lying around for a while, there probably should be some nightly check that cleans them up. Other cleanups that may be necessary after crashes are:
- .lock files
- expunged data at the end of files (can be checked with same rules as how appending determines the offset where to append)
One problem with nightly expunging is how to expunge mails if the file is still locked by some process. Forced shutdowns may be annoying to users and can be difficult to implement..
Alternative locking strategy
The locking strategy as described above is optimized for high concurrency: Only if mail file is in the middle of being expunged, we're waiting on locks. With appending we may be waiting for existing appends to finish if we don't wish to create a new file for some reason. Other than those, reading and flag updating are completely lockless.
The two downsides to this are:
flock() isn't available everywhere. It could however be replaced with fcntl() if that's available. NFS may not have either.
Expunges can't happen while multiple processes are accessing the mailbox at a same time. This may be a problem if the mailbox never has less than two processes accessing it.
The alternative to flock()ing would be to simply handle unexpected end-of-file reads as a sign of expunge, which would require re-reading index file to see where the mails went (or if they were really expunged). This has two problems however. One is that flag writes couldn't catch expunges without locking, and another is that if you've already sent half a mail to IMAP client, you have no other choice than to disconnect the client if you can't send the rest of the mail.
It might be worth it optionally. Especially for NFS people. Then you'd basically have the dotlock file that locks appends, expunges and flag updates. Reading would still be lockless.
Also for the best performance the mail flags could be stored only in Dovecot's index files. In that case there's no need to update flags and the only downside with this locking strategy would be the potential disconnections, which probably aren't all that bad.
Compatibility
If needed, it would be possible to create new/ and tmp/ directories under the mailbox to allow mail deliveries to be in maildir format. The maildir files would be then either just moved or maybe be appended to existing mail files.
Timo Sirainen wrote:
The point is to have a mailbox format where the mailbox can consist of one or more files. Grouping multiple files in a single file makes it faster to read, but it's slower to expunge mails from the beginning of the file. So this format would allow sysadmin to specify rules on how large the files would be allowed to grow.
This seems like a lot of complexity for an unknown amount of performance. Sure, it is going to be loads faster than multi-megabyte mbox mailboxes, but you can color me unconvinced that this will be a significant win over maildir. The primary advantage to maildir is the utter simplicity of all operations; at no time do you need to completely rewrite any files and all operations are 100% atomic. The index format under maildir is also very simple, since you only need to keep track of the filename (and flags) rather than filename and offset and flags. And with modern filesystems, disk access is intelligently cached.
If you are trying to tune for where there are significant numbers of very small (< 2k) files (well smaller than the typical block size in the underlying filesystem), you may be aiming too small. It looks like the median file size in my maildir folders is about 3100 bytes. What sizes were you thinking the typical admin would set as the limit?
Personally, I think your time would be better spent integrating a database message store and let the database engine deal with storage and indexing issues. YMMV. ;-)
John
-- John Peacock Director of Information Research and Technology Rowman & Littlefield Publishing Group 4501 Forbes Boulevard Suite H Lanham, MD 20706 301-459-3366 x.5010 fax 301-429-5748
On Fri, 2005-09-23 at 10:01 -0400, John Peacock wrote:
Timo Sirainen wrote:
The point is to have a mailbox format where the mailbox can consist of one or more files. Grouping multiple files in a single file makes it faster to read, but it's slower to expunge mails from the beginning of the file. So this format would allow sysadmin to specify rules on how large the files would be allowed to grow.
This seems like a lot of complexity for an unknown amount of performance. Sure, it is going to be loads faster than multi-megabyte mbox mailboxes, but you can color me unconvinced that this will be a significant win over maildir. The primary advantage to maildir is the utter simplicity of all operations; at no time do you need to completely rewrite any files and all operations are 100% atomic. The index format under maildir is also very simple, since you only need to keep track of the filename (and flags) rather than filename and offset and flags. And with modern filesystems, disk access is intelligently cached.
I think it'd need some benchmarking :) It depends quite a lot on filesystem, but opening and reading a single file is still a lot faster in all filesystems compared to opening and reading thousands of files. That's probably not a common operation for IMAP clients, but with POP3 the behavior is often to just read all new mails and delete the existing ones.
Besides just raw throughput the new format would allow higher concurrency with multiple clients reading/writing the mailbox. While maildir theoretically doesn't have any locks, in practise it needs them with all existing filesystems or mails gets temporarily lost and Dovecot starts giving errors.
While the maildir format itself is simple, it's actually really difficult to handle correctly when the maildir is changing under us. Files can be renamed at any time so you'll have to be prepared to look for the file's new name at any time. Filesystems also don't work the way maildir assumes they do, so you have to work around their limitations too.
There probably are also other reasons why people don't like maildir, which I don't really remember now.
If you are trying to tune for where there are significant numbers of very small (< 2k) files (well smaller than the typical block size in the underlying filesystem), you may be aiming too small. It looks like the median file size in my maildir folders is about 3100 bytes. What sizes were you thinking the typical admin would set as the limit?
I thought a few megabytes per file might be good. Large enough for full mailbox reads to be fast but not so large that expunging messages from the middle would cause too much I/O.
This could possibly be also automatically set per mailbox. If user always expunges all mails at a time (POP3) or never expunges (mailing list archives), there would be only a single file.
Personally, I think your time would be better spent integrating a database message store and let the database engine deal with storage and indexing issues. YMMV. ;-)
I think SQL database as a mail store would have much worse performance than with any filesystem based mail store.
Anyway, I wouldn't mind having Dovecot support SQL databases (or other kind of databases). If you really want it, you can always pay for it to get implemented on my work time, which is what's happening with this mailbox format :)
Timo Sirainen wrote:
While the maildir format itself is simple, it's actually really difficult to handle correctly when the maildir is changing under us. Files can be renamed at any time so you'll have to be prepared to look for the file's new name at any time. Filesystems also don't work the way maildir assumes they do, so you have to work around their limitations too.
I thought that was the point of dovecot-lda, so that you know that dovecot is the only process touching the message files and can update the indexes as things change, rather than after the fact. Too bad you cannot write a portable FS hook that would fire when the directory changed.
This could possibly be also automatically set per mailbox. If user always expunges all mails at a time (POP3) or never expunges (mailing list archives), there would be only a single file.
If you are thinking that way, it might be useful to mark mailboxes as either:
a) sequential access (most POP3 and archives);
b) random access (IMAP).
and provide different tuning based on usage expectations, rather than trying to actively tune. Then the system admin could choose which fits their needs best; we only have a few people using POP3 and everyone else is IMAP, and I frankly don't care what kind of performance the POP3 people have.
I think SQL database as a mail store would have much worse performance than with any filesystem based mail store.
Except that with clustering, you could have multiple DB servers getting hit by multiple dovecot servers. SQL could potentially scale to much larger environments than filesystem support. Plus, think of all the neat fulltext indexing possibilities.
I suspect GMail is based on SQL... ;-)
John
-- John Peacock Director of Information Research and Technology Rowman & Littlefield Publishing Group 4501 Forbes Boulevard Suite H Lanham, MD 20706 301-459-3366 x.5010 fax 301-429-5748
On 23.9.2005, at 18:04, John Peacock wrote:
Timo Sirainen wrote:
While the maildir format itself is simple, it's actually really difficult to handle correctly when the maildir is changing under us. Files can be renamed at any time so you'll have to be prepared to look for the file's new name at any time. Filesystems also don't work the way maildir assumes they do, so you have to work around their limitations too.
I thought that was the point of dovecot-lda, so that you know that dovecot is the only process touching the message files and can update the indexes as things change, rather than after the fact.
Dovecot still can't know that it's the only one, but I had been thinking about adding an option to make Dovecot assume it's the only program modifying the maildir. That should avoid scanning the cur/ directory constantly. Except that probably still requires some changes since up-to-date maildir filenames aren't stored anywhere, but maybe they could be guessed by looking at the base filename and mail's current flags from indexes and generating the most-likely filename based on them (or did I already have that code..? :)
This could possibly be also automatically set per mailbox. If user always expunges all mails at a time (POP3) or never expunges (mailing list archives), there would be only a single file.
If you are thinking that way, it might be useful to mark mailboxes as either:
a) sequential access (most POP3 and archives); b) random access (IMAP).
and provide different tuning based on usage expectations, rather than trying to actively tune. Then the system admin could choose which fits their needs best; we only have a few people using POP3 and everyone else is IMAP, and I frankly don't care what kind of performance the POP3 people have.
I haven't really thought about this yet, but a global setting wouldn't usually be enough for all users and I'm not sure how it can be set per-user or per-mailbox in any useful way.
I think SQL database as a mail store would have much worse performance than with any filesystem based mail store.
Except that with clustering, you could have multiple DB servers getting hit by multiple dovecot servers. SQL could potentially scale to much larger environments than filesystem support.
How did you think it would work?
One way to do it would be to distribute users to different SQL servers. That's just as possible with filesystems.
If you thought about some kind of database that is distributed across multiple computers.. Well, that would probably work, but I haven't heard much of them being in any actual use. Probably because they all cost too much. I don't think PostgreSQL or MySQL had support for these kind of clusters yet?
There are also clustered filesystems that are distributed across multiple computers. Dovecot already works with them.
Plus, think of all the neat fulltext indexing possibilities.
Unfortunately that doesn't help with IMAP at all, and might make it actually more difficult and slower to implement the kind of indexing that would benefit IMAP's SEARCH command.
I don't think PostgreSQL or MySQL had support for these kind of clusters yet?
I don't run mysql clusters myself, however:
http://www.mysql.com/products/database/cluster
MySQL Cluster enables you to:
* Cost-effectively deliver 5 nines availability using parallel
server architecture with no single point of failure.
* Deliver the performance and high throughput required to meet the
most demanding enterprise applications.
* Incrementally scale your applications in a linear fashion as
your needs grow without having to invest in expensive hardware.
Brian
On 23.9.2005, at 23:46, Brian Bartholomew wrote:
I don't think PostgreSQL or MySQL had support for these kind of clusters yet?
I don't run mysql clusters myself, however:
Doesn't look like it's usable for storing emails yet. From FAQ:
How much RAM do I need? Is it possible to use disk memory at all?
Currently, Cluster is in-memory only. This means that all table data (including indexes) is stored in RAM. Therefore, if your data takes up 1 gigabyte of space and you wish to replicate it once in the cluster, you need 2 gigabytes of memory to do so. This in addition to the memory required by the operating system and any applications running on the cluster computers.
On Fri, 2005-09-23 at 19:18 +0300, Timo Sirainen wrote:
I thought that was the point of dovecot-lda, so that you know that dovecot is the only process touching the message files and can update the indexes as things change, rather than after the fact.
Dovecot still can't know that it's the only one, but I had been thinking about adding an option to make Dovecot assume it's the only program modifying the maildir. That should avoid scanning the cur/ directory constantly. Except that probably still requires some changes since up-to-date maildir filenames aren't stored anywhere, but maybe they could be guessed by looking at the base filename and mail's current flags from indexes and generating the most-likely filename based on them (or did I already have that code..? :)
If dovecot uses both ends of delivery (in and out) of the message store, you should modify it.
Maildir is a great system that solves very real problems (that most of its opponents seem to believe don't exist).
That doesn't change the fact that it's not terribly easy to model IMAP on it.
I'd recommend a slightly modified maildir- incompatible, but easy to mutate to/from a compatible maildir as needs change:
DMaildir/tmp DMaildir/use DMaildir/fl
that's it.
use contains the message IDs and flags and serves the same purpose as cur, but there is no new.
To convert to DMaildir, one generates a maildir: marks the home directory as sticky to halt delivery [wait] moves messages from new/ to cur/ renames cur/ to use/ mkdir fl/ unsticks mail delivery.
To convert back: mkdir new
When opening a mailbox:
If new/ exists, use it.
If use/ exists use that.
if cur/ doesn't exist but use/ exists and new/ exists, rename use to
cur.
flags should NOT be encoded into the message (so that messages are never rename()d within use/)
fl/ contains a set of symbolic links. each message in use/ MAY have a symlink with the same name in fl/ containing all of the metadata as the link target (systems without symlinks can use actual files I suppose, but do any of those still exist?)
readlink is cheaper than open/read/close, and we already know the message "identifier".
Really, it might be just as slow as returning stat() to the equation, but because you'd be putting other metadata available to the DMaildir, you could make up for it by making the information ACTUALLY NEEDED more accessible.
This could possibly be also automatically set per mailbox. If user always expunges all mails at a time (POP3) or never expunges (mailing list archives), there would be only a single file.
If you are thinking that way, it might be useful to mark mailboxes as either:
a) sequential access (most POP3 and archives); b) random access (IMAP).
and provide different tuning based on usage expectations, rather than trying to actively tune. Then the system admin could choose which fits their needs best; we only have a few people using POP3 and everyone else is IMAP, and I frankly don't care what kind of performance the POP3 people have.
I haven't really thought about this yet, but a global setting wouldn't usually be enough for all users and I'm not sure how it can be set per-user or per-mailbox in any useful way.
Worse still, it may be per-time. My access behaviors of new messages is largely delete, but once it's a few days old, I don't delete anything anymore.
Many people don't move things out of INBOX... ever.
I think SQL database as a mail store would have much worse performance than with any filesystem based mail store.
Except that with clustering, you could have multiple DB servers getting hit by multiple dovecot servers. SQL could potentially scale to much larger environments than filesystem support.
No it cannot.
The AMOEBA filesystem (bullet) exhibits nearly perfect (linear) scalability.
Transactions require locks, and locks are slow. Better to look for a lock-free distributed mailbox.
Even UID generation can be performed without locks, although only in a very IMAP-specific way.
[[this was covered on the dbmail mailing list a while back]]
I can pull up the references if anyone's interested.
If you thought about some kind of database that is distributed across multiple computers.. Well, that would probably work, but I haven't heard much of them being in any actual use. Probably because they all cost too much. I don't think PostgreSQL or MySQL had support for these kind of clusters yet?
SQL is never faster than domain-specific data structures, and distributed SQL is very slow.
Email is constant. Replicating constant blobs is downright trivial (consider running rsync twice) provided you name them uniquely (easy enough).
Flags and message location however, is mutable, and distributing a directory is hard work.
Nevertheless, I'm a big proponent of split filesystems; make storage one service and metadata (directory) another.
I presently am using Fedora directory server with it's multimaster replication system as a means to have per-site tolerance. Replication is fairly sluggish, but an IMAP user wouldn't need to notice flag/keyword replication on the order of seconds (compare with SQL replication which has been measured in minutes).
There's absolutely no reason LDAP couldn't be used (initially) as a directory for flags and deletion status.
Furthermore, LDAP's search model is compatible with the IMAP search model.
At which point, profiling the beast would answer exactly what could be sped up.
There are also clustered filesystems that are distributed across multiple computers. Dovecot already works with them.
Almost all distributed clustered filesystems exhibit lazy replication. Performance is completely unacceptable if you need UNIX filesystem semantics. These things are even less forgiving than NFS.
The few that don't (GFS) require specialized hardware or infrastructure, and really, are more suitable to sharing a disk array with multiple machines, instead of distributing the content (i.e. local area only).
Plus, think of all the neat fulltext indexing possibilities.
Unfortunately that doesn't help with IMAP at all, and might make it actually more difficult and slower to implement the kind of indexing that would benefit IMAP's SEARCH command.
Agreed.
-- Internet Connection High Quality Web Hosting http://www.internetconnection.net/
On Tue, 2005-10-04 at 10:38 -0400, Geo Carncross wrote:
Even UID generation can be performed without locks, although only in a very IMAP-specific way.
[[this was covered on the dbmail mailing list a while back]]
I can pull up the references if anyone's interested.
Oh? I see this thread: http://mailman.fastxs.net/pipermail/dbmail-dev/2005-May/thread.html#6990 and there you're saying it can't be done? (although I skipped most of the mails in the middle)
On Tue, 2005-12-06 at 18:06 +0200, Timo Sirainen wrote:
On Tue, 2005-10-04 at 10:38 -0400, Geo Carncross wrote:
Even UID generation can be performed without locks, although only in a very IMAP-specific way.
[[this was covered on the dbmail mailing list a while back]]
I can pull up the references if anyone's interested.
Oh? I see this thread: http://mailman.fastxs.net/pipermail/dbmail-dev/2005-May/thread.html#6990 and there you're saying it can't be done? (although I skipped most of the mails in the middle)
It turns out I was corrected later on:
http://mailman.fastxs.net/pipermail/dbmail-dev/2005-July/007204.html
The basic idea works when clients can't connect to both servers at the same time _IF_ the servers cannot connect to each other.
Of course, if a particular CAN connect to each server without them being able to contact each other _OR_ shares it's UID database with another client that does using another channel, then messages could be missing for a little but- but later on, they'll be duplicated.
Here's where I finally started to get it :)
http://mailman.fastxs.net/pipermail/dbmail-dev/2005-July/007231.html
It went on for a little bit while longer while some of the issues got hashed out, collectively there's a coherent protocol in there:
Servers make a distinction between which UIDs they generated, which ones they got from someplace else, and which ones a client has seen.
Each server has to step their UID generation (or get a pool ahead of time) to avoid colliding with eachother- details about this step weren't discussed much- suffice it to say, with a two node cluster, one takes even UIDs and the other odd UIDs.
UIDs are "fixed up" if they can appear out of order after consulting the other nodes in the cluster: If a client has seen a fixed-up UID, it's duplicated to a new (unseed) UID.
Most of the time, servers are interconnected and can verify UID numbers before assigning them to messages- only when that step fails does anything "special" need to be done- and only when that step fails AND SOME clients are still able to reach both nodes (via SMTP or via IMAP) is it possible to need a fixup stage.
-- Internet Connection High Quality Web Hosting http://www.internetconnection.net/
John Peacock wrote:
Timo Sirainen wrote:
I think SQL database as a mail store would have much worse performance than with any filesystem based mail store.
Except that with clustering, you could have multiple DB servers getting hit by multiple dovecot servers. SQL could potentially scale to much larger environments than filesystem support. Plus, think of all the neat fulltext indexing possibilities.
When talking about clustered SQL servers for serious use Oracle RAC is only thing to come on my mind, and to begin with it you need shared disk from SAN with clustering capable filesystem. One could argue that having Oracle as storage instead of plain journaled cluster fs adds cost and complexity for an unknown amount of performance ..
This new mailbox format is going to be on its best in very few and special environments but in those it should be _much_ faster than maildir and more flexible than mbox.
-- Tomi Hakala
On Fri, 23 Sep 2005 11:04:32 -0400, John Peacock <jpeacock@rowman.com> writes:
I suspect GMail is based on SQL... ;-)
As I understand, it is built on GoogleFS and Google's map/reduce algorithm for eliminating duplicated storage. There are also some interesting FOSS hashing implementations optimized for things that smells like message headers.
"Making digital data less copyable is like making water less wet." -- Cory Doctorow
On Fri, 23 Sep 2005, John Peacock wrote:
Timo Sirainen wrote:
The point is to have a mailbox format where the mailbox can consist of one or more files. Grouping multiple files in a single file makes it faster to read, but it's slower to expunge mails from the beginning of the file. So this format would allow sysadmin to specify rules on how large the files would be allowed to grow.
This seems like a lot of complexity for an unknown amount of performance. Sure, it is going to be loads faster than multi-megabyte mbox mailboxes, but you can color me unconvinced that this will be a significant win over maildir. The primary advantage to maildir is the utter simplicity of all operations; at no time do you need to completely rewrite any files and all operations are 100% atomic. The index format under maildir is also very simple, since you only need to keep track of the filename (and flags) rather than filename and offset and flags. And with modern filesystems, disk access is intelligently cached.
the problem with maildir is that it's fine for a small system but it sucks terrible for large systems in several ways.
the vast number of inodes required is a kernel memory hog. they blow away most backup solutions, and they increase the number of disk seeks required to do anything on the mailbox.
a maildir delivery involves generally at least 4 synchronous writes (and more if your filesystem forces the tmp/ directory changes to be synchronous), compared to a minimum of 2 for mbox.
unless NFS is involved i think maildir is a bad idea... (and i think NFS is a bad idea too, so draw your own conclusions :)
If you are trying to tune for where there are significant numbers of very small (< 2k) files (well smaller than the typical block size in the underlying filesystem), you may be aiming too small. It looks like the median file size in my maildir folders is about 3100 bytes. What sizes were you thinking the typical admin would set as the limit?
the average tends to be even higher when your system includes lots of general users with lots of word/excel/etc. documents attached to their messages. istr a number in the 10KB range when i was working at a place with ~10M mailboxes.
-dean
If you are trying to tune for where there are significant numbers of very small (< 2k) files (well smaller than the typical block size in the underlying filesystem), you may be aiming too small. It looks like the median file size in my maildir folders is about 3100 bytes. What sizes were you thinking the typical admin would set as the limit?
Personally, I think your time would be better spent integrating a database message store and let the database engine deal with storage and indexing issues. YMMV. ;-)
Reiserfs is designed to handle lots of very small files, and my spam-quarantine dir with 26K files performs fine, too. Maildir on Reiser is a great combination for me. Brian
John Peacock wrote:
This seems like a lot of complexity for an unknown amount of performance. Sure, it is going to be loads faster than multi-megabyte mbox mailboxes, but you can color me unconvinced that this will be a significant win over maildir.
It is big win on clustered file systems where Maildir is pretty much worst case scenario.
-- Tomi Hakala
Timo Sirainen wrote:
I've been doing some thinking on a new better mailbox format that'd be faster than mbox and maildir for most common uses. Comments welcome.
Didn't read the details, but sounds good. Only thing is: One reason I like dovecot is the support for an already good format that's support by a lot of software: maildir. The main reason I switched from Courier-IMAP was speed. Dovecot offers a good compromise between speed and standard-compliance (if you'd call maildir a standard). If speed was the most important thing for me, I'd use Cyrus. Other people may like things like dbmail. As long as Dovecot continues supporting the current formats, there's nothing bad with introducing YAMF (TM) (yet another mailbox format), of course. Only that the old ones will probably get less attention. And I'm not sure if a new format will solve all maildir's problems, especially since there are not much in practice (for me: none).
Timo Sirainen wrote:
I've been doing some thinking on a new better mailbox format that'd be faster than mbox and maildir for most common uses. Comments welcome.
Actually an idea i've had in mind for a while is something more akin to PAM, eg an end user definable plugin system for mailbox handling. dovecot etc would just use API calls to plugins. People can write whatever backend they want -- maildir, sql, even gdbm, csv, xml or WEBDAV if they feel like it. And anything could use the modules -- sendmail, postfix, procmail, etc.
-Dan
--On Friday, September 23, 2005 2:29 PM -0700 Dan Hollis <test3943395@anime.net> wrote:
Actually an idea i've had in mind for a while is something more akin to PAM, eg an end user definable plugin system for mailbox handling. dovecot etc would just use API calls to plugins. People can write whatever backend they want -- maildir, sql, even gdbm, csv, xml or WEBDAV if they feel like it. And anything could use the modules -- sendmail, postfix, procmail, etc.
Agreed. Defining an API would be more useful than defining a format. Then one could plug in delivery and access modules with different storage modules.
On 24.9.2005, at 01:18, Kenneth Porter wrote:
--On Friday, September 23, 2005 2:29 PM -0700 Dan Hollis <test3943395@anime.net> wrote:
Actually an idea i've had in mind for a while is something more akin to PAM, eg an end user definable plugin system for mailbox handling. dovecot etc would just use API calls to plugins. People can write whatever backend they want -- maildir, sql, even gdbm, csv, xml or WEBDAV if they feel like it. And anything could use the modules -- sendmail, postfix, procmail, etc.
Agreed. Defining an API would be more useful than defining a format. Then one could plug in delivery and access modules with different storage modules.
The API already exists, that's how mbox and maildir code is implemented separately (src/lib-storage/mail-storage.h). It could be simpler though and mailbox listing should be separated from mailbox accessing.
That's also how I was planning on implementing ACLs, as a plugin that's called before the real storage. That's also how quota plugin already works.
On Sat, 24 Sep 2005, Timo Sirainen wrote:
On 24.9.2005, at 01:18, Kenneth Porter wrote:
--On Friday, September 23, 2005 2:29 PM -0700 Dan Hollis <test3943395@anime.net> wrote:
Actually an idea i've had in mind for a while is something more akin to PAM, eg an end user definable plugin system for mailbox handling. dovecot etc would just use API calls to plugins. People can write whatever backend they want -- maildir, sql, even gdbm, csv, xml or WEBDAV if they feel like it. And anything could use the modules -- sendmail, postfix, procmail, etc. Agreed. Defining an API would be more useful than defining a format. Then one could plug in delivery and access modules with different storage modules. The API already exists, that's how mbox and maildir code is implemented separately (src/lib-storage/mail-storage.h). It could be simpler though and mailbox listing should be separated from mailbox accessing. That's also how I was planning on implementing ACLs, as a plugin that's called before the real storage. That's also how quota plugin already works.
I was suggesting implementing it in a global system-wide generic way, the way PAM is done. So it would become a project in and of itself, independent of dovecot. PAM is independent of anything else, but is used by dovecot, apache, radius, samba and a zillion other applications. It would be nice if a generic mailbox plugin system could be created in the same way.
PMM - pluggable mail modules? /etc/pmm.d/ and /lib/mail/ anyone? :)
-Dan
Dan Hollis wrote:
Actually an idea i've had in mind for a while is something more akin to PAM, eg an end user definable plugin system for mailbox handling.
I was suggesting implementing it in a global system-wide generic way, the way PAM is done. So it would become a project in and of itself, independent of dovecot.
I don't know if it gained much usage in projects outside the University of Washington, but I believe both Pine (MUA) and the UWash IMAP server made use of the c-client library for storage. The c-client library was treated as a somewhat independent project (it had its own mailing list; archives at: ftp://ftp.cac.washington.edu/mail/).
-Tom
On Fri, 2005-09-23 at 15:37 -0700, Dan Hollis wrote:
The API already exists, that's how mbox and maildir code is implemented separately (src/lib-storage/mail-storage.h). It could be simpler though and mailbox listing should be separated from mailbox accessing. That's also how I was planning on implementing ACLs, as a plugin that's called before the real storage. That's also how quota plugin already works.
I was suggesting implementing it in a global system-wide generic way, the way PAM is done. So it would become a project in and of itself, independent of dovecot.
Well .. I'm pretty sceptic that it would become commonly used. Everyone seems to want to handle mailboxes in different ways.
PAM is independent of anything else, but is used by dovecot, apache, radius, samba and a zillion other applications. It would be nice if a generic mailbox plugin system could be created in the same way.
PMM - pluggable mail modules? /etc/pmm.d/ and /lib/mail/ anyone? :)
PAM is really simple though. PMM would most likely require somewhat large API and also a separate library using that API to actually do things.
I wouldn't mind seeing suggestions of such an API, but I don't really want to waste time on it myself :)
On Fri, 23 Sep 2005, Timo Sirainen wrote:
The point is to have a mailbox format where the mailbox can consist of one or more files. Grouping multiple files in a single file makes it faster to read, but it's slower to expunge mails from the beginning of the file. So this format would allow sysadmin to specify rules on how large the files would be allowed to grow.
for about a decade now i've set up all my inbound mail to deliver to two mboxes -- one is an "inbox", the other is an "archive". the inbox is what i look at with my mta, and i delete things from it as soon as i'm done reading or dealing with them. the archive is there so i never have to think about whether i want to save something (and the disk space is totally manageable)... similar to how gmail works.
my "archive" is a collection of mbox files. one named "current" which is where new deliveries occur, and the others named YYYYMMDD.bz2, which are compressed/read-only archived mboxes. the rotation/compression occurs in a cronjob depending on the size of the current file.
it's a bit of a kludge, because the file boundaries are very obvious if you need to find a thread that's spread across a few of them. but it's all just mbox so it's easy to grep and concat a few files into a temporary mbox and extract a thread with any MUA.
i've wanted to turn this into a "real" format supported by dovecot for a while but i never seem to get to it... it sounds like you're headed in a similar direction.
This format is mostly designed to work nicely with separate index files such as Dovecot has. The flags are stored in the mailbox files mostly as a backup in case the index gets lost or corrupted.
here's one point where my thinking has differed -- i'd treat the mailbox files as read-only (plus one file which is append-only) and include an append-only modification log for recovery purposes... read-only mailbox files permit compression, and don't have the nightly cleanup lockout problems you mention later.
the log is affordable in terms of disk space. the mailbox files plus the log are sufficient to recover the current state of the mailbox.
in my case i'd probably grow the log forever... because i'd also never be doing expunges... but the process of doing an expunge can also update any info embedded into the (compressed|uncompressed) mailbox files.
the cost of a log in terms of disk writes for updates is probably better than updates to the mailbox files themselves -- assuming a log entry is compact enough that dozens fit in a 4KiB filesystem block you can amortize several updates into one synchronous disk write rather than having dozens of synchronous writes to separate blocks.
When the file is opened, it must be shared locked with flock(). Expunging mails requires exclusive flock(), so expunges can't happen while someone is still reading the file.
with read-only mailbox files there's no sharing restriction -- an expunge can create a new mbox file, update the index, and unlink the old one. (minor easy to solve race if a reader gets a filename from the index which is renamed before it's opened... just loop... expunges should be infrequent enough this has no livelock potential.)
but i guess that's not very quota friendly... oops.
Compatibility
If needed, it would be possible to create new/ and tmp/ directories under the mailbox to allow mail deliveries to be in maildir format. The maildir files would be then either just moved or maybe be appended to existing mail files.
yeah i've definitely wanted delivery to require no changes -- in my case it just happens to the mbox file named "current".
i think it's best not to deal with indices and other fancy things during delivery because there's no opportunity to amortize synchronized disk writes... and i know in the case of large ISP mail sites there tend to be a lot of users who never read their mail. you'll support more users on less hardware if the synchronous disk writes are at a minimum.
-dean
oh also... somethign else i've thought about a lot in relation to the
"archive format" i described are "virtual" folders: a virtual folder is a
subset of the index which is presented via IMAP as a "real" folder.
(there was a DEC research mail system which did this ages ago, i don't
remember its name now.)
there are many uses for this:
a tool could perform a sophisticated search that's not supported by IMAP and present it as a virtual folder so that any IMAP MUA can see the results.
for an archive containing 100Ks of messages you could create date subrange views again so that any IMAP MUA can read it without performance difficulties. consider the application of this to public mailing list archives: an archive mailbox presented via anon read-only IMAP so that you can browse it directly with your favourite MUA (inluding webmails which talk IMAP for a google indexable web interface).
there's technically only the need for a single index and collection of mailboxes (for each user) -- delivery filtering tools (i.e. procmail/maildrop) could place the message into one of several virtual folders which the user treats as inboxes.
-dean
On Fri, 23 Sep 2005, dean gaudet wrote:
here's one point where my thinking has differed -- i'd treat the mailbox files as read-only (plus one file which is append-only) and include an append-only modification log for recovery purposes... read-only mailbox files permit compression,
Though they require sequential reading order for parsing, so think about reading a bunch of messages from the end of the mbox: one full decompression for indexing, then very close to full decompression for every message retrieval in the batch. You'd think that retrieving a sequential block via IMAP might help, but a lot of MUAs prefer single message random access.
To address the situation you want, readonly archival, my vision would be a compressed maildir (or equivalent), using each mail as a separately compressed entry. Zip is a pretty good format for this purpose. Though per-file compression is typically only about half as efficient as whole-mbox compression, you'd have much faster search and retrieval if the file entries had their own compression dictionaries.
-- -- Todd Vierling <tv@duh.org> <tv@pobox.com> <todd@vierling.name>
On Fri, 23 Sep 2005, Todd Vierling wrote:
On Fri, 23 Sep 2005, dean gaudet wrote:
here's one point where my thinking has differed -- i'd treat the mailbox files as read-only (plus one file which is append-only) and include an append-only modification log for recovery purposes... read-only mailbox files permit compression,
Though they require sequential reading order for parsing, so think about reading a bunch of messages from the end of the mbox: one full decompression for indexing, then very close to full decompression for every message retrieval in the batch. You'd think that retrieving a sequential block via IMAP might help, but a lot of MUAs prefer single message random access.
the amount of data per compressed file is completely tunable -- in my case my cron job only compresses when the "current" mbox hits 16MiB.
To address the situation you want, readonly archival, my vision would be a compressed maildir (or equivalent), using each mail as a separately compressed entry. Zip is a pretty good format for this purpose. Though per-file compression is typically only about half as efficient as whole-mbox compression, you'd have much faster search and retrieval if the file entries had their own compression dictionaries.
so basically it's a classic tradeoff: speed vs. space... if you design with mbox instead of maildir then you get to decide where to set the tradeoff... whereas if you use compressed maildir you've given up on space immediately. (and i'm not so convinced you get any speed, because i find maildirs with 100000 entries to be a total dog.)
-dean
dean gaudet wrote:
On Fri, 23 Sep 2005, Todd Vierling wrote:
On Fri, 23 Sep 2005, dean gaudet wrote:
here's one point where my thinking has differed -- i'd treat the mailbox files as read-only (plus one file which is append-only) and include an append-only modification log for recovery purposes... read-only mailbox files permit compression,
Though they require sequential reading order for parsing, so think about reading a bunch of messages from the end of the mbox: one full decompression for indexing, then very close to full decompression for every message retrieval in the batch. You'd think that retrieving a sequential block via IMAP might help, but a lot of MUAs prefer single message random access.
the amount of data per compressed file is completely tunable -- in my case my cron job only compresses when the "current" mbox hits 16MiB.
To address the situation you want, readonly archival, my vision would be a compressed maildir (or equivalent), using each mail as a separately compressed entry. Zip is a pretty good format for this purpose. Though per-file compression is typically only about half as efficient as whole-mbox compression, you'd have much faster search and retrieval if the file entries had their own compression dictionaries.
so basically it's a classic tradeoff: speed vs. space... if you design with mbox instead of maildir then you get to decide where to set the tradeoff... whereas if you use compressed maildir you've given up on space immediately. (and i'm not so convinced you get any speed, because i find maildirs with 100000 entries to be a total dog.)
-dean
Just curious - have any of you tried the ReiserFS file system with Maildir? I would think that it should make maildir a lot faster and more efficient.
-- Marc Perkel - marc@perkel.com
Spam Filter: http://www.junkemailfilter.com My Blog: http://marc.perkel.com
On Fri, 23 Sep 2005, Marc Perkel wrote:
Just curious - have any of you tried the ReiserFS file system with Maildir? I would think that it should make maildir a lot faster and more efficient.
i think reiser has a lot of cool ideas, but i haven't tried it.
even with reiser you need to move away from the POSIX API to make a huge
dent in the scalability of maildirs... you need to start amortizing
multiple file operations per system call for one thing. (and there is a
sys_reiser or something proposed for this, but i haven't looked at it).
but that still doesn't help all the dcache and inode cache overhead in the
kernel... and backup systems will still continue to blow chunks when
presented with 100Ms of inodes...
-dean
dean gaudet wrote:
On Fri, 23 Sep 2005, Marc Perkel wrote:
Just curious - have any of you tried the ReiserFS file system with Maildir? I would think that it should make maildir a lot faster and more efficient.
i think reiser has a lot of cool ideas, but i haven't tried it.
even with reiser you need to move away from the POSIX API to make a huge dent in the scalability of maildirs... you need to start amortizing multiple file operations per system call for one thing. (and there is a sys_reiser or something proposed for this, but i haven't looked at it).
but that still doesn't help all the dcache and inode cache overhead in the kernel... and backup systems will still continue to blow chunks when presented with 100Ms of inodes...-dean
Reiser doesn't use INodes and appears to have infinite inodes. I've never used maildir but if I did I'd want to use Reiser on it because that's where Reiser is supposed to really gain.
-- Marc Perkel - marc@perkel.com
Spam Filter: http://www.junkemailfilter.com My Blog: http://marc.perkel.com
On Fri, 23 Sep 2005, Marc Perkel wrote:
Reiser doesn't use INodes and appears to have infinite inodes. I've never used maildir but if I did I'd want to use Reiser on it because that's where Reiser is supposed to really gain.
i'm referring to the inodes in the kernel's memory -- not on disk... even reiser has an inode in memory otherwise it can't present the POSIX file api.
-dean
dean gaudet wrote:
On Fri, 23 Sep 2005, Marc Perkel wrote:
Reiser doesn't use INodes and appears to have infinite inodes. I've never used maildir but if I did I'd want to use Reiser on it because that's where Reiser is supposed to really gain.
i'm referring to the inodes in the kernel's memory -- not on disk... even reiser has an inode in memory otherwise it can't present the POSIX file api.
-dean
But are you sure Reiser processes (fake) inodes the same way?
-- Marc Perkel - marc@perkel.com
Spam Filter: http://www.junkemailfilter.com My Blog: http://marc.perkel.com
On Fri, 23 Sep 2005, dean gaudet wrote:
On Fri, 23 Sep 2005, Marc Perkel wrote:
Reiser doesn't use INodes and appears to have infinite inodes. I've never used maildir but if I did I'd want to use Reiser on it because that's where Reiser is supposed to really gain. i'm referring to the inodes in the kernel's memory -- not on disk... even reiser has an inode in memory otherwise it can't present the POSIX file api.
inodes only use kernel memory when the file is accessed. in this respect there is no difference in the memory usage between reiserfs or ext3 or any other filesystem. think about the case of network filesystems for example.
-Dan
On Fri, 23 Sep 2005, Dan Hollis wrote:
On Fri, 23 Sep 2005, dean gaudet wrote:
On Fri, 23 Sep 2005, Marc Perkel wrote:
Reiser doesn't use INodes and appears to have infinite inodes. I've never used maildir but if I did I'd want to use Reiser on it because that's where Reiser is supposed to really gain. i'm referring to the inodes in the kernel's memory -- not on disk... even reiser has an inode in memory otherwise it can't present the POSIX file api.
inodes only use kernel memory when the file is accessed. in this respect there is no difference in the memory usage between reiserfs or ext3 or any other filesystem. think about the case of network filesystems for example.
i think we're agreeing aren't we?
the only reason i care about kernel memory usage is that once you start scaling to 1000s of simultaneous users kernel memory starts becoming scarce... with maildir it happens a lot faster than with mailbox.
for small systems it just doesn't matter :)
-dean
On Fri, 23 Sep 2005, dean gaudet wrote:
inodes only use kernel memory when the file is accessed. in this respect there is no difference in the memory usage between reiserfs or ext3 or any other filesystem. think about the case of network filesystems for example. i think we're agreeing aren't we?
On Fri, 23 Sep 2005, Dan Hollis wrote: the only reason i care about kernel memory usage is that once you start scaling to 1000s of simultaneous users kernel memory starts becoming scarce... with maildir it happens a lot faster than with mailbox.
Not really. I dont think the kernel memory used by open inodes is anywhere close to a concern compared to the other kernel resources that processes use. So I would suggest that it doesnt happen a lot faster with maildir than it does with mailbox. With 1000s of simultaneous users, the kernel resources used by open inodes would be the least of your worries and would be lost in the background noise anyway.
-Dan
inodes only use kernel memory when the file is accessed. in this respect there is no difference in the memory usage between reiserfs or ext3 or any other filesystem. think about the case of network filesystems for example.
i think we're agreeing aren't we?
the only reason i care about kernel memory usage is that once you start scaling to 1000s of simultaneous users kernel memory starts becoming scarce... with maildir it happens a lot faster than with mailbox.
Anyone that scales to thousands of simultaneous users should be able to afford a few extra servers and has a server architecture where adding servers is easy.
Cor
On Sat, 24 Sep 2005, Cor Bosman wrote:
inodes only use kernel memory when the file is accessed. in this respect there is no difference in the memory usage between reiserfs or ext3 or any other filesystem. think about the case of network filesystems for example.
i think we're agreeing aren't we?
the only reason i care about kernel memory usage is that once you start scaling to 1000s of simultaneous users kernel memory starts becoming scarce... with maildir it happens a lot faster than with mailbox.
Anyone that scales to thousands of simultaneous users should be able to afford a few extra servers and has a server architecture where adding servers is easy.
in practice it's more than a "few extra".
-dean
Reiser doesn't use INodes and appears to have infinite inodes. I've never used maildir but if I did I'd want to use Reiser on it because that's where Reiser is supposed to really gain.
My Sun 420R bought in 2001 just took six seconds for find | wc to tell me I had half a million files and dirs in all my Maildirs.
| backup systems will still continue to blow chunks when presented | with 100Ms of inodes...
We use IBM's Tivoli Storage Manager (TSM). It's probably only cost-effective for large shops, but it works great.
Brian
On Fri, 23 Sep 2005, dean gaudet wrote:
On Fri, 23 Sep 2005, Marc Perkel wrote:
Just curious - have any of you tried the ReiserFS file system with Maildir? I would think that it should make maildir a lot faster and more efficient. i think reiser has a lot of cool ideas, but i haven't tried it. even with reiser you need to move away from the POSIX API to make a huge dent in the scalability of maildirs... you need to start amortizing multiple file operations per system call for one thing.
sql?
-Dan
Marc Perkel wrote:
Just curious - have any of you tried the ReiserFS file system with Maildir? I would think that it should make maildir a lot faster and more efficient.
The proposed mailbox format does seem overly complicated and like a problem that would be better solved at the file system layer.
Lots of performance problems come down to making a decision about how best to use some underlying resource, and often the answer ends up being added complication in the application layer, but usually that's because there isn't a better solution available in a lower layer.
In this case, I'd recommend the client funding this development first give file system optimization a shot before paying for the development of a bunch of code that only serves the purpose of compensating for file system deficiencies.
-Tom
Marc Perkel wrote:
Just curious - have any of you tried the ReiserFS file system with Maildir? I would think that it should make maildir a lot faster and more efficient.
There was a discussion about this some months back, which also covered XFS, and the HASH/B-Tree options of EXT3. Probably worth searching the archives for.
-- Curtis Maloney cmaloney@cardgate.net
On Fri, 23 Sep 2005, dean gaudet wrote:
To address the situation you want, readonly archival, my vision would be a compressed maildir (or equivalent), using each mail as a separately compressed entry. Zip is a pretty good format for this purpose. Though per-file compression is typically only about half as efficient as whole-mbox compression, you'd have much faster search and retrieval if the file entries had their own compression dictionaries.
so basically it's a classic tradeoff: speed vs. space... if you design with mbox instead of maildir then you get to decide where to set the tradeoff... whereas if you use compressed maildir you've given up on space immediately. (and i'm not so convinced you get any speed, because i find maildirs with 100000 entries to be a total dog.)
By a Zip-compressed maildir, I'm referring to a zipfile with each message as a *zip entry*, not a physical file. Think "zip up a maildir into a single zipfile" and you'll get very close to the concept I'm trying to describe.
That sort of thing is much faster at entry retrieval than a monolithic stream-compressed container.
-- -- Todd Vierling <tv@duh.org> <tv@pobox.com> <todd@vierling.name>
I've been doing some thinking on a new better mailbox format that'd be faster than mbox and maildir for most common uses. Comments welcome.
[snip]
Apologies for being rude, but I've not heard complaints about maildir being slow. Is it? It would be nice to see benchmarks.
It seems like some kind of cache manager would solve the problem, if there was one, but then some might say we'd be doing the job of the kernel. Are there kernel parameters which can be tweaked to address the slowness?
On Sat, 24 Sep 2005, nodata wrote:
Apologies for being rude, but I've not heard complaints about maildir being slow. Is it? It would be nice to see benchmarks.
It seems like some kind of cache manager would solve the problem, if there was one, but then some might say we'd be doing the job of the kernel. Are there kernel parameters which can be tweaked to address the slowness?
It mostly has to do with the filesystem being used. The System V based extended filesystems (which most people know as "ext2" and "ext3") have the same history as the BSD filesystem ("ufs" or "ffs"), and neither is particularly efficient at handling a directory of more than 10k files or so. And it's not usually the lookup of a single file that's a problem; it's listing the directory (when doing index synchronization) that will be slow.
Other options now exist that improve this quite a bit. On Linux, there's reiserfs; on BSD, there's lfs, and some enhancements to ffs as part of the "FFSv2" project.
-- -- Todd Vierling <tv@duh.org> <tv@pobox.com> <todd@vierling.name>
Todd Vierling wrote:
It mostly has to do with the filesystem being used. The System V based extended filesystems (which most people know as "ext2" and "ext3") have the same history as the BSD filesystem ("ufs" or "ffs"), and neither is particularly efficient at handling a directory of more than 10k files or so. And it's not usually the lookup of a single file that's a problem; it's listing the directory (when doing index synchronization) that will be slow.
ext3 directories are hash databases, much faster than linear searches/listings.
Other options now exist that improve this quite a bit. On Linux, there's reiserfs; on BSD, there's lfs, and some enhancements to ffs as part of the "FFSv2" project.
Last I heard Reiser was dropped by RedHat. It's still an option on some Linux distros, e.g. CentOS.
On Fri, 23 Sep 2005, Jack Bailey wrote:
ext3 directories are hash databases, much faster than linear searches/listings.
that's true if your tools and kernel are recent enough, and your filesystem has dir_index enabled... see tune2fs -O dir_index, and e2fsck -D ... if you've done upgrade-in-place on your system a few times it's pretty likely your /home is old enough to be unindexed. (make sure filetype is enabled too while you're in there.)
-dean
On Fri, 23 Sep 2005, Jack Bailey wrote:
Todd Vierling wrote:
It mostly has to do with the filesystem being used. The System V based extended filesystems (which most people know as "ext2" and "ext3") have the same history as the BSD filesystem ("ufs" or "ffs"), and neither is particularly efficient at handling a directory of more than 10k files or so. And it's not usually the lookup of a single file that's a problem; it's listing the directory (when doing index synchronization) that will be slow. ext3 directories are hash databases, much faster than linear searches/listings.
not by default. it's a currently very-beta extension to ext3.
Other options now exist that improve this quite a bit. On Linux, there's reiserfs; on BSD, there's lfs, and some enhancements to ffs as part of the "FFSv2" project. Last I heard Reiser was dropped by RedHat. It's still an option on some Linux distros, e.g. CentOS.
reiser was never supported by redhat. can't drop something you never had. redhat bankrolls ext3 development so of course reiserfs is competition. every other distro on the entire planet basically supports reiserfs though. with mandrake it's default for example. redhat doesnt support _anything_ except ext3 actually. xfs and jfs and everything else is "unsupported".
-Dan
It mostly has to do with the filesystem being used. The System V based extended filesystems (which most people know as "ext2" and "ext3") have the same history as the BSD filesystem ("ufs" or "ffs"), and neither is particularly efficient at handling a directory of more than 10k files or so. And it's not usually the lookup of a single file that's a problem; it's listing the directory (when doing index synchronization) that will be slow.
ext3 directories are hash databases, much faster than linear searches/listings.
not by default. it's a currently very-beta extension to ext3.
My understanding is that it's enabled by default for kernels 2.6 and up. I'm running 2.6.9-11.ELsmp, it's ext3, the directories are hash trees, and I did nothing special during the install.
To Dean's point about older stuff, true. There are probably a lot of 2.4 kernels out there.
It mostly has to do with the filesystem being used. The System V based extended filesystems (which most people know as "ext2" and "ext3") have the same history as the BSD filesystem ("ufs" or "ffs"), and neither is particularly efficient at handling a directory of more than 10k files or so.
ufs on FreeBSD is hashed. "options UFS_DIRHASH"
Just get a NetApp and be done with it :)
Cor
Timo Sirainen wrote:
Index File
[...] The file is modified by creating first exclusively owned "index.lock" file, updating it and then rename()ing it over the index file. This lock file shouldn't be held for a long time, so if it exists, and it doesn't get modified at all within 30 seconds, it can be overwritten. Before rename()ing over the index file, you should check with stat() that the index.lock file is still the same as you expect it to be (in case your process was temporarily stopped for over 30 secs).
The locking and handling of the index strikes me as a regression back to the mbox problems that Maildir tried to solve.
Have you considered other approaches, such as having the index be under control of a daemon, and use IPC to communicate events to that daemon, which could then exclusively handle modifying the file?
Or maybe a fixed size record structure, which would permit more of a random record access and modification with less chance of inter process conflicts?
Any way you slice it, though, these are just approximations of a database server. Maybe embedding SQLite (just for indexes) is the answer?
-Tom
Dan Hollis wrote:
Tom Metro wrote:
Tom Metro wrote:
Any way you slice it, though, these are just approximations of a database server. Maybe embedding SQLite (just for indexes) is the answer?
/etc/pmm.d/dovecot /lib/mail/pmm_sql.so ?
What is pmm?
a proposal.
Ah. I wondered if that might have been the case, but it wasn't clear, particularly from the text you quoted.
the idea is a system of pluggable mail modules for mail programs exactly the same way ssh/apache/dovecot/su/login/etc handle authentication via pam.
Sure.
More likely something like /usr/lib/dovecot/*.so initially, until you can convince other projects to buy-in to the API.
-Tom
On Fri, 2005-09-23 at 21:48 -0400, Tom Metro wrote:
Timo Sirainen wrote:
Index File
[...] The file is modified by creating first exclusively owned "index.lock" file, updating it and then rename()ing it over the index file. This lock file shouldn't be held for a long time, so if it exists, and it doesn't get modified at all within 30 seconds, it can be overwritten. Before rename()ing over the index file, you should check with stat() that the index.lock file is still the same as you expect it to be (in case your process was temporarily stopped for over 30 secs).
The locking and handling of the index strikes me as a regression back to the mbox problems that Maildir tried to solve.
Replying a bit late, but anyway..
I don't think the above index.lock is a problem in any way. Actually Dovecot already uses similar method with Maildir's dovecot-uidlist file.
Maildir is lockless only in theory. Unless the maildir is globally locked while checking its contents, files may get temporarily lost with all of the filesystems that I know of.
I think the locking is only a problem if you're holding the lock for a long time. For example if you need to keep the mailbox locked as long as some IMAP client is reading/writing messages, that's bad. That's a problem with mbox, but not with maildir/dbox.
dbox's index.lock file needs to exist only in two situations:
While new message UIDs need to be allocated (as the final step of saving new mail(s) to mailbox). A global lock is needed for this with any kind of mail storage with IMAP, since UIDs must exist and they must always grow.
Writing message flag changes / expunging mails. These changes should go pretty quickly as well.
Have you considered other approaches, such as having the index be under control of a daemon, and use IPC to communicate events to that daemon, which could then exclusively handle modifying the file?
One of the biggest reasons for dbox's existance is that it needs to work well with clustered filesystems. And relying on a daemon running in only one computer kind of defeats cluster's purpose then..
Anyway I'm not sure how that would actually benefit anything. A single process could be a bottleneck if it handled all users' indexes, so it should be able to scale to multiple processes. And to avoid locking in those cases, each process should handle only a group of specific users. And if we're going there, it might as well be the imap process itself that does it all.
Redirecting all imap connections for one user to same imap process wouldn't be too difficult to implement (and it's been in my mind for a while already), but having pop3 and dovecot-lda also in the same process could get more tricky.
But does it really matter if the locking is handled by serialization (by making everything go through a single process) or actual locking? If there's only a single writer, the locking succeeds easily always. If there are multiple writers, you'll need to wait in both cases. Although I suppose serialization provides more fair scheduling.
And even if there was only a single process updating the index file, I'd probably still make it update the index using the exact same rename(index.lock, index) to make sure the file doesn't get corrupted in case of crashes :)
Any way you slice it, though, these are just approximations of a database server. Maybe embedding SQLite (just for indexes) is the answer?
Doesn't look like SQLite's locking (or writing in general) is in any way cheaper than what I'm currently doing:
http://www.sqlite.org/lockingv3.html
Actually it looks like it may even block on getting a shared lock if there are writes happening at the same time. I think many other databases don't block there but instead use rollback file to get the old data. Dovecot's index files and dbox's index files aren't needed to be read-locked at all, so they never block on reading.
SQLite could be useful for storing messages' and mailboxes' metadata (ANNOTATE, ANNOTATEMORE extension) since those require almost a full database to work properly, but I don't think it's a good idea for what Dovecot/dbox currently uses index files for. SQL in general isn't very well suited for them.
Hi Timo, I think it should be a database like format. (keep reading) Since a couple of week ago, I was researching which solution was better, mbox or Maildir. And asking to myself, "is there a better way?"
My ideal solution would be a format with an API so that it is easy to implement in other aplications like procmail, sendmail, etc, etc.
As I see,
- mbox is good because you don't get a fragmented "Folder", is easy to backup and does not eat inodes. But, it is a trouble to find, delete or insert and email in one big file.
- Maildir is faster to access because it have and "index", in the filesystem. But it uses a lot of inodes and reading all emails in a big folder means reading too much files.
So, an ideal solution should be: "an mbox like format, with an index and DB like allocation". The features should be: parts. For example,
- mbox like format: just one big file or a folder separated in x bytes
create a new file every 500MB. This will save inodes and will be easy to backup particular email in
- with an index: it will make really fast to search, and access a
that big file. No need to read 500MB just to get the last email.
- DB like allocation: you don't to worry about deleting the first email in a 2GB mbox file. The space inside this big file will be administered like in a DB. You use constant size blocks inside this big file and your index help you get to your particular email easy. No need for a cronological storage of email in a big file.
Final notes, this is an ideal setup. You could reinvent (a better) the wheel or use tools available like mysql, sleepycat, pgsql, etc. The success of this new mail format will be the easy implementation of the API. It must have a portable library so that you can easilly support this new format in a transparent way in old software like sendmail, procmail, qmail, postfix, etc
Is just an idea I was having, not final, but is more or less what I was thinking.
HTH and thanks for reading, Oliver
Timo Sirainen wrote:
I've been doing some thinking on a new better mailbox format that'd be faster than mbox and maildir for most common uses. Comments welcome.
It won't come to Dovecot v1.0, but I'll probably create another branch to Dovecot CVS where I'll start implementing it and some other possibly destabilizing changes.
Generic
The point is to have a mailbox format where the mailbox can consist of one or more files. Grouping multiple files in a single file makes it faster to read, but it's slower to expunge mails from the beginning of the file. So this format would allow sysadmin to specify rules on how large the files would be allowed to grow.
This format is mostly designed to work nicely with separate index files such as Dovecot has. The flags are stored in the mailbox files mostly as a backup in case the index gets lost or corrupted.
Index File
Named "index".
Index file gives a mapping between UIDs <-> filenames:
Header: <uidvalidity> <last-used-uid> Record: <uid-first> <uid-last> <filename> Record: ..
eg.:
1127468129 5 1 3 msg.1 4 4 msg.4
All of the UIDs don't have to actually exist in the file if they have been expunged. The record should be removed once all of the UIDs have been removed from it (and the file itself been deleted).
The same file may appear multiple times in the index. For example:
1 2 msg.1 3 3 msg.3 4 5 msg.1
This means that UIDs 1, 2, 4 and 5 are in file msg.1, but UID 3 is in file msg.3.
The filename doesn't matter as long as it's unique, but probably wouldn't hurt to include its first UID in the filename so humans can find mails more easily. The filename however must not end with ".lock".
The file is modified by creating first exclusively owned "index.lock" file, updating it and then rename()ing it over the index file. This lock file shouldn't be held for a long time, so if it exists, and it doesn't get modified at all within 30 seconds, it can be overwritten. Before rename()ing over the index file, you should check with stat() that the index.lock file is still the same as you expect it to be (in case your process was temporarily stopped for over 30 secs).
The index file's mtime timestamp must grow after each modification. This makes it easy to check if new mail has been added to the mailbox.
Mail File
File header:
Bytes Format Description 8 hex Header size 16 hex Append offset 4 hex Mail header size 4 hex Mail header padding 4 hex Keyword count in mail headers
Mail header:
Bytes Format Description 8 hex UID 16 hex Mail size 1 1/0 Answered flag 1 1/0 Flagged flag 1 1/0 Deleted flag 1 1/0 Seen flag 1 1/0 Draft flag 1 1/0 Expunged flag n 1/0 Keywords
The file layout then goes:
<File header>
- possible padding due to Header size <Mail header #1>
- possible padding due to Mail header size <Mail data #1>
- possible padding due to Mail header padding (ie. next Mail header's offset is divisible with Mail header padding) <Mail header #2> etc..
The UIDs must be ascending within the file (but as said above, some UIDs between them may exist in other files).
When the file is opened, it must be shared locked with flock(). Expunging mails requires exclusive flock(), so expunges can't happen while someone is still reading the file. Mails can however be marked with expunged-flag, and such mails should be treated as if they are already expunged. Expunged flag must not be removed. If the mail for some reason is wanted to be un-expunged, it must be copied as a new mail with new UID to the mailbox.
Modifying flags doesn't require any locks. You'll simply write over those flags you wish to change. That's why each flag takes one byte.
Safest way to expunge is to copy all non-expunged mails after the first expunged one to new file. Then update the index file with the new location for the UIDs, and at last ftruncate() at the position of first expunged mail in the original file.
Alternative way to expunge is to just move the data over the expunged mails and fruncate() at the end. No index file updates necessary. This is more filesystem quota friendly, but if the expunging gets aborted in the middle it will corrupt the mails.
Appending mails can be done by either locking existing file with a dotlock and appending to it, or creating a new file. If appending to existing file, also flock() it so expunging won't happen in the middle of append. If appending to new file, create it with .lock suffix and then check that the file without .lock wasn't already just created.
When appending to existing file, check "Append offset" from file's header. If that offset is at the end of file, just append there. Otherwise read the mail's header from there. If it has a non-zero UID and index file says that the same UID exists in this file, jump over the mail and do the same checks for all the rest of the mails in the file. Once the check doesn't match or you reach EOF, update the append offset in the header to that offset and write your mail there.
Initially write the UID in mail's header as zeroes ('0', not NUL). Once the append is completely finished, you'll have to update it with the new UID. You'll do this by locking index file, reading the latest last-used-uid and using the next value for your UID. Once you've written it, rewrite the index file with updated last-used-uid field and your UID added to it. After this, update the append offset in the mail file and unlock it.
Keywords file
Named "index.keywords". This contains mappings between keyword numbers and names. TODO: may be problematic with lockless flag writes..
Recent file
"index.recent.<uid>" file tells what the last non-recent UID was. The file doesn't have any data inside it.
When opening the mailbox you should find and rename() the existing index.recent.<uid> file to the new index.recent.<last-uid>. A successful rename() means the mails between <uid>+1 .. <last-uid> are recent for you. A failed rename() means that someone else managed to do it. You should look for the file again and if it was renamed to less than <last-uid>, you should try again.
Nightly cleanups
Because expunged mails may be left lying around for a while, there probably should be some nightly check that cleans them up. Other cleanups that may be necessary after crashes are:
- .lock files
- expunged data at the end of files (can be checked with same rules as how appending determines the offset where to append)
One problem with nightly expunging is how to expunge mails if the file is still locked by some process. Forced shutdowns may be annoying to users and can be difficult to implement..
Alternative locking strategy
The locking strategy as described above is optimized for high concurrency: Only if mail file is in the middle of being expunged, we're waiting on locks. With appending we may be waiting for existing appends to finish if we don't wish to create a new file for some reason. Other than those, reading and flag updating are completely lockless.
The two downsides to this are:
flock() isn't available everywhere. It could however be replaced with fcntl() if that's available. NFS may not have either.
Expunges can't happen while multiple processes are accessing the mailbox at a same time. This may be a problem if the mailbox never has less than two processes accessing it.
The alternative to flock()ing would be to simply handle unexpected end-of-file reads as a sign of expunge, which would require re-reading index file to see where the mails went (or if they were really expunged). This has two problems however. One is that flag writes couldn't catch expunges without locking, and another is that if you've already sent half a mail to IMAP client, you have no other choice than to disconnect the client if you can't send the rest of the mail.
It might be worth it optionally. Especially for NFS people. Then you'd basically have the dotlock file that locks appends, expunges and flag updates. Reading would still be lockless.
Also for the best performance the mail flags could be stored only in Dovecot's index files. In that case there's no need to update flags and the only downside with this locking strategy would be the potential disconnections, which probably aren't all that bad.
Compatibility
If needed, it would be possible to create new/ and tmp/ directories under the mailbox to allow mail deliveries to be in maildir format. The maildir files would be then either just moved or maybe be appended to existing mail files.
-- Oliver Schulze L. <oliver@samera.com.py>
participants (19)
-
Brian Bartholomew
-
Cor Bosman
-
Curtis Maloney
-
Dan Hollis
-
dean gaudet
-
Geo Carncross
-
Jack Bailey
-
Jakob Hirsch
-
John Peacock
-
Kenneth Porter
-
Marc Perkel
-
nodata
-
Oliver Schulze L.
-
Quest
-
Timo Sirainen
-
Todd Vierling
-
Tom Metro
-
Tom Metro
-
Tomi Hakala