[Dovecot] mbox vs. maildir storage block waste
Hi.
I recently mentioned in several posts that I tended to use mbox rather than maildir, because you don't lose so much space (maildir always allocates full blocks per file, and thus per mail).
I ran some tests on my archive, which consists of some 3.4 million mails totalling about 42GB. Most of these mails are probably normal sized, but there are also some with bigger attachments.
For those who are interested here are the results:
I used a 53687091200 B image file (via loop device) and tested ext4 only. btrfs is IMHO not yet ready, I have often had issues with XFS (corruption), reiser4 is more or less dead, and reiser3 is said to have issues (see e.g. its Wikipedia article), even though it has a mode for small files which would fit nicely.
As you will see, the number of mails increased a bit because I tested over several days, but the increase is very small, so it shouldn't change the numbers much.
- Original mbox archives (right now in Evolution)
  mbox exact space: 38122676224 (does not include meta-data)
  mbox guess space: 44625670144 (includes Evolution meta-data, which is several GB)
  mbox num mails:   3412999 (occurrences of From_ lines)
In the following:
- image file, 1B-blocks, Used_begin, Used_end, Available_begin and Available_end are taken from the output of df -B 1
- mdir exact used space is the sum of du -B 1 for each regular file (i.e. each mdir file)
- mdir guess used space is du -B 1 on the root dir of the filesystem
- mdir num mails is find . -type f | wc -l on the root dir of the filesystem
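For anyone who wants to repeat the per-file measurement without du and find, here is a rough Python sketch. It is not the script used for the numbers in this mail; the function name maildir_usage is made up for illustration, and it assumes Linux, where st_blocks is counted in 512-byte units.

```python
import os
import stat


def maildir_usage(root):
    """Walk root and return (num_files, apparent_bytes, allocated_bytes)."""
    num_files = apparent = allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            # Count regular files only, like `find . -type f`.
            if not stat.S_ISREG(st.st_mode):
                continue
            num_files += 1
            apparent += st.st_size
            # Assumption: st_blocks is in 512-byte units (true on Linux),
            # independent of the filesystem block size.
            allocated += st.st_blocks * 512
    return num_files, apparent, allocated
```

With the three values in hand, (allocated - apparent) / num_files gives a rough slack per mail for the filesystem under test.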
- EXT4 with 4096 blocks:
  image file:             53687091200
  1B-blocks:              52844687360
  Used_begin:             188555264
  Used_end:               45198778368
  Available_begin:        49971777536
  Available_end:          2444972032
  mdir exact used space:  44810866688
  mdir guess used space:  45010243584
  mdir num mails:         3423296
  delta:                  6.688190464 G
  delta / mail:           1953 B
- EXT4 with 2048 blocks:
  image file:             53687091200
  1B-blocks:              50324295680
  Used_begin:             82857984
  Used_end:               41598846976
  Available_begin:        47557083136
  Available_end:          6041094144
  mdir exact used space:  41323991040
  mdir guess used space:  41516007424
  mdir num mails:         3425033
  delta:                  3.201314816 G
  delta / mail:           934 B
- EXT4 with 1024 blocks:
  image file:             53687091200
  1B-blocks:              50314834944
  Used_begin:             38287360
  Used_end:               39909360640
  Available_begin:        47592193024
  Available_end:          7721119744
  mdir exact used space:  39683908608
  mdir guess used space:  39871086592
  mdir num mails:         3425033
  delta:                  1.561232384 G
  delta / mail:           455 B
As you can see, the delta per mail is rather close to the statistically expected values of 2048B, 1024B and 512B, i.e. half a block wasted per mail on average.
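The delta figures can be re-derived from the numbers in this mail. The sketch below assumes that delta is "mdir exact used space" minus "mbox exact space", which matches the posted values; the names MBOX_EXACT, RUNS and slack_per_mail are only illustrative.

```python
MBOX_EXACT = 38122676224  # bytes, "mbox exact space" from this mail

# block size -> (mdir exact used space, mdir num mails), copied from above
RUNS = {
    4096: (44810866688, 3423296),
    2048: (41323991040, 3425033),
    1024: (39683908608, 3425033),
}


def slack_per_mail(block_size):
    """Return (delta in bytes, delta per mail in bytes) for one test run."""
    used, mails = RUNS[block_size]
    delta = used - MBOX_EXACT
    return delta, delta / mails


for block_size in RUNS:
    delta, per_mail = slack_per_mail(block_size)
    # Half a block is the naive expectation for the tail of each file.
    print(block_size, delta, int(per_mail), block_size // 2)
```

The per-mail values land a little below half a block in every run, which is consistent with the comparison made above.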
In the end I have probably changed my opinion. ~7GB of wasted block space for all my mails is actually quite a lot, but in days of cheap disk space it's acceptable. And with mbox one has IMHO the major disadvantage that mail servers (including Dovecot) store some meta-data _in_ it (i.e. in the mails themselves), which I don't like a lot. I still think about the reports that mbox is much faster with full text search (which sounds reasonable)... but for that one probably needs a database backend anyway.
HTH, Chris.
On 29.10.2012, at 22.54, Christoph Anton Mitterer wrote:
> I recently mentioned in several posts that I tended to use mbox rather than maildir, because you don't lose so much space (maildir always allocates full blocks per file, and thus per mail).
> ..
> In the end I have probably changed my opinion. ~7GB of wasted block space for all my mails is actually quite a lot, but in days of cheap disk space it's acceptable. And with mbox one has IMHO the major disadvantage that mail servers (including Dovecot) store some meta-data _in_ it (i.e. in the mails themselves), which I don't like a lot. I still think about the reports that mbox is much faster with full text search (which sounds reasonable)... but for that one probably needs a database backend anyway.
There is of course mdbox also, which gives the best of both mbox and maildir (and some of its own new annoyances).
On Mon, 2012-10-29 at 23:06 +0200, Timo Sirainen wrote:
> There is of course mdbox also, which gives the best of both mbox and maildir (and some of its own new annoyances).
Thanks, Timo,... I forgot to mention that.
For me _personally_ two things speak against using it:
a) To be honest, "you must not lose the dbox index files, they can't be regenerated without data loss"[0] made me a bit scared ;-)
b) ext* has no integrity checking (by hash sums), so I use my own scheme that puts SHA512 hashes into the inodes of files (as USER_XATTRS). This, of course, only works with a storage format where files don't change anymore once written, so it can't work with formats that keep multiple mails per file.
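To make point b) concrete, a scheme of this kind could look roughly like the following Python sketch. It is not my actual tool; the attribute name user.sha512 and the function names are assumptions for illustration, and os.setxattr/os.getxattr are Linux-only and need a filesystem mounted with user xattr support.

```python
import hashlib
import os

XATTR_NAME = "user.sha512"  # hypothetical attribute name


def file_sha512(path, chunk_size=1 << 20):
    """Return the hex SHA512 digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_digest(path):
    """Store the current digest in a user xattr of the file."""
    os.setxattr(path, XATTR_NAME, file_sha512(path).encode("ascii"))


def verify_digest(path):
    """Return True if the stored digest still matches the file content."""
    stored = os.getxattr(path, XATTR_NAME).decode("ascii")
    return stored == file_sha512(path)
```

This only makes sense for write-once files such as one-mail-per-file formats; any format that appends to or rewrites a file would invalidate the stored digest on every change.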
Thanks, Chris.
btw: What are the actual advantages of sdbox over maildir?
On 29.10.2012, at 23.15, Christoph Anton Mitterer wrote:
> btw: What are the actual advantages of sdbox over maildir?
- Not moving files from new/ to cur/ directory
- Not renaming files when changing message flags
- Not readdir()ing directories (although maildir_very_dirty_syncs=yes helps a lot with this)
Basically less disk I/O and making it possible to have mailboxes with a huge number of messages without everything slowing down horribly.
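For readers less familiar with maildir: the renaming happens because maildir keeps the message flags in the file name, after a ":2," suffix, in ASCII order. A minimal Python sketch of what a flag change amounts to (the function names and the example file name are made up; this is not Dovecot code):

```python
import os


def add_flag(filename, flag):
    """Return the maildir file name with one more flag, e.g. 'S' for seen."""
    base, _sep, flags = filename.partition(":2,")
    # Flags are single letters kept in ASCII order after ':2,'.
    return base + ":2," + "".join(sorted(set(flags) | {flag}))


def mark_seen(path):
    """Mark a message as seen, which in maildir means renaming the file."""
    directory, name = os.path.split(path)
    new_path = os.path.join(directory, add_flag(name, "S"))
    if new_path != path:
        os.rename(path, new_path)
    return new_path
```

For example, add_flag("1351547640.M1P2.host:2,", "S") yields "1351547640.M1P2.host:2,S". Every such change is a rename in the directory, which is the kind of I/O the list above says sdbox avoids.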
On Mon, 2012-10-29 at 23:42 +0200, Timo Sirainen wrote:
> > btw: What are the actual advantages of sdbox over maildir?
> - Not moving files from new/ to cur/ directory
> - Not renaming files when changing message flags
> - Not readdir()ing directories (although maildir_very_dirty_syncs=yes helps a lot with this)
> Basically less disk I/O and making it possible to have mailboxes with a huge number of messages without everything slowing down horribly.
Oh that's quite some advantage...
And I guess the interior of the files is the same? I.e. just the plain mail without any changes or quoting?
For sdbox, does that part about "losing the indexes means game over" ;) apply, too?
Thanks, Chris
On 29.10.2012, at 23.54, Christoph Anton Mitterer wrote:
> On Mon, 2012-10-29 at 23:42 +0200, Timo Sirainen wrote:
> > > btw: What are the actual advantages of sdbox over maildir?
> > - Not moving files from new/ to cur/ directory
> > - Not renaming files when changing message flags
> > - Not readdir()ing directories (although maildir_very_dirty_syncs=yes helps a lot with this)
> > Basically less disk I/O and making it possible to have mailboxes with a huge number of messages without everything slowing down horribly.
> Oh that's quite some advantage...
> And I guess the interior of the files is the same? I.e. just the plain mail without any changes or quoting?
Yes, but it's in dbox format, so it also contains some extra metadata (not in the mail headers).
> For sdbox, does that part about "losing the indexes means game over" ;) apply, too?
You'll lose message flags then. Both sdbox and mdbox keep dovecot.index.backup files and repairing tries very hard to preserve everything from the indexes it sees, so I don't think it's a big concern as long as the system behaves properly.
On Tue, 2012-10-30 at 00:05 +0200, Timo Sirainen wrote:
> > And I guess the interior of the files is the same? I.e. just the plain mail without any changes or quoting?
> Yes, but it's in dbox format so it contains also some extra metadata (not in the mail headers).
Yeah of course... but the important point here is the "not in the mail headers" part :)
So I've added the following changes, please double check :)
http://master.wiki2.dovecot.org/MailboxFormat/dbox?action=diff&rev2=30&rev1=29
> > For sdbox, does that part about "losing the indexes means game over" ;) apply, too?
> You'll lose message flags then. Both sdbox and mdbox keep dovecot.index.backup files and repairing tries very hard to preserve everything from the indexes it sees, so I don't think it's a big concern as long as the system behaves properly.
Yeah... sounds not too bad... :)
Off topic:
Have you ever thought about adding a "real" DB backend? Nothing against dbox... :) ... and I have no performance comparison of dbox with what could be done with a DBMS... but the advantage of the latter would be that you get all the fancy features of database systems for free... like fast indexing, online replication, etc.
One might even reuse something like AOX for this.
Cheers, Chris.
On 30.10.2012, at 2.16, Christoph Anton Mitterer wrote:
> Have you ever thought about adding a "real" DB backend? Nothing against dbox... :) ... and I have no performance comparison of dbox with what could be done with a DBMS... but the advantage of the latter would be that you get all the fancy features of database systems for free... like fast indexing, online replication, etc.
> One might even reuse something like AOX for this.
SQL indexes aren't very helpful for IMAP-like data. It would be fun to have an SQL backend in Dovecot some day (there already is a read-only, INBOX-only SQL backend), but I don't expect it to have very good performance.
On Wed, 2012-11-07 at 17:30 +0200, Timo Sirainen wrote:
> On 30.10.2012, at 2.16, Christoph Anton Mitterer wrote:
> > Have you ever thought about adding a "real" DB backend? Nothing against dbox... :) ... and I have no performance comparison of dbox with what could be done with a DBMS... but the advantage of the latter would be that you get all the fancy features of database systems for free... like fast indexing, online replication, etc.
> > One might even reuse something like AOX for this.
> SQL indexes aren't very helpful for IMAP-like data. It would be fun to have an SQL backend in Dovecot some day (there already is a read-only, INBOX-only SQL backend), but I don't expect it to have very good performance.
I see... well I haven't tested AOX or dbmail so far (especially as they're not in Debian and I was too lazy till now to compile them)...
At least I had the impression that performance (especially in searches) was one of the major things these people were proud of.
I'll stay tuned, whether we ever see a fully usable SQL backend for Dovecot :)
Cheers, Chris.
Obvious caveats and qualifications apply here throughout this email.
Christoph Anton Mitterer calestyo@scientia.net wrote:
> I see... well I haven't tested AOX or dbmail so far (especially as they're not in Debian and I was too lazy till now to compile them)...
> At least I had the impression that performance (especially in searches) was one of the major things these people were proud of.
> I'll stay tuned, whether we ever see a fully usable SQL backend for Dovecot :)
I wouldn't hold your breath.
It's a recurringly seductive "meme" in email circles, but the reality is that email is mostly unstructured data with a few fields of reasonably structured data (dates, from, to, maybe attachment types + filenames). The bulk of the emails, and the part of the emails that people really want to search quickly: the body, is unstructured, and doesn't perform quickly with the stock "full text search" modules in the main SQL engines.
I'd given dbmail2 a try with MySQL 5, 5.5, and Postgres 8.4 and 9.1 branches. I've dedicated 16GB of DDR3-1800/3.4GHz 6-core AMD 1090T with hardware RAID local storage (12 x Seagate ES 7200RPM spindles). (64 bit Slackware 13.37 running Linux 3.2 kernels built for the platform.)
The performance is surprisingly bad ... doing almost everything. Searches through IMAP, bulk importation of mail folders, large numbers of simultaneous mail deliveries, you name it. There wasn't a task that the dbmail setup performed faster than Dovecot, in either low or high load situations. When I tossed a test load that introduced lots of mail deliveries as well as searches and full folder pulls, things got really pear-shaped. Even putting dovecot's mailstore on NFS (GigE) didn't really slow Dovecot down enough to make dbmail competitive.
When pressed on this lack of performance, I was instructed to "add more RAM" to the DB machine, and that for ideal performance I should have more RAM than my mailbox sizes. *sigh* This sounds great for a very small installation, but this clearly is not something that scales.
I think the final humiliation was comparing the body + header searching performance using Timo's practically obsolete fts_squat plugin against dbmail's. Wow. Squat was multiple orders of magnitude faster. Lucene and Solr are even more so when fed large datasets (mail folder hives of about 100GB). The SQL setups hit the obvious performance shelf once they were unable to maintain everything in RAM or cache.
The dbmail folk are earnest and hard-working, and I don't mean to cast the slightest bit of negativity on their project. I think the assumptions about what SQL servers can do well often don't square with the reality of many applications that people try to fit them into.
In my first round of tests, I imported 24,000 emails comprising a mere 560MB of space. Just about all of the non-SQL IMAP servers handled the importation (basically IMAP APPENDs) within 6 minutes. dbmail2 required hours (using MySQL), and a somewhat shorter time (but still hours) with Postgres.
From an old email:
Searching INBOX #msgs = 24714
[NOFIND] Time=2.072423, matches=24714 <--- this should be zero *BUG*
[date] Time=2.07519, matches=24714 <--- this is correct
[here] Time=2.072075, matches=24714 <--- this should be about 30% of total # of msgs *BUG*
Does dbmail break IMAP SEARCH TEXT (i.e., search both body + headers)? Is this a result of relying on MySQL's search algorithms in text-like fields? I'm still puzzled, because I can't believe that 'here' appears in EVERY email. It looks like dbmail's returning EVERY email on a SEARCH TEXT. This is not correct operation.
When I alter the search to use "FROM" as the key instead of "TEXT", the results are more discriminating and meet expectations.
Searching INBOX #msgs = 24714
[NOFIND] Time=2.161049, matches=0
[james] Time=2.273255, matches=1049
[here] Time=2.165406, matches=2
Not that it matters, but it's much slower than Dovecot's fts_squat for substring searches.
Dovecot's fts_squat IMAP SEARCH TEXT results are:
Searching INBOX #msgs = 55731
[Updating Index] Time=78.184637 (66% of the mailbox unindexed at start)
[NOFIND] Time=0.045654, matches=0
[date] Time=0.13364, matches=55731
[here] Time=0.069091, matches=24663
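Timings of this kind can be collected with a few lines around an IMAP client library. The sketch below is not the tool that produced the numbers above; it uses Python's imaplib, the helper name timed_search is made up, and the connection is assumed to be logged in with a mailbox already selected.

```python
import time


def timed_search(conn, key, term):
    """Run one IMAP SEARCH and return (seconds, number_of_matches).

    conn is assumed to behave like imaplib.IMAP4: search() returns
    ('OK', [b'1 2 3']) with space-separated message numbers.
    """
    start = time.perf_counter()
    typ, data = conn.search(None, key, '"%s"' % term)
    elapsed = time.perf_counter() - start
    if typ != "OK":
        raise RuntimeError("SEARCH %s failed: %r" % (key, data))
    matches = data[0].split() if data and data[0] else []
    return elapsed, len(matches)
```

Against a real server this would be used after imaplib.IMAP4_SSL(...), login() and select("INBOX"), for example timed_search(conn, "TEXT", "here") or timed_search(conn, "FROM", "james"). A term that cannot occur in any mail is a handy check that the server really filters.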
FWIW, I found Postgres to be faster than MySQL where lots of write-commits were involved on the same exact setup (I tried MySQL 5 and 5.5; 5.5 with a hand-rolled config file using metrics supplied by a dbmail/MySQL guru helped a great deal for size(data_set) < size(PHYSICAL MEMORY) cases). MySQL "got close" to PSQL's performance when I did crazy things like removing filesystem journaling, write barriers, etc. on the mail db mountpoint. Obviously, this is desperation talking.
I concede that the motivations behind SQLising mail storage extend to administration/replication and other non-performance/scalability aspects. I suspect that what constitutes "good enough" performance, when weighed against those other considerations, may make a SQL approach attractive enough for some people to use it.
I suspect a "NoSQL" key-value store type of database would offer much better performance than SQL RDBs, since most of the assumptions behind the storage and access patterns of email don't really fit into the SQL RDB model very efficiently.
dbmail's author and a couple of key dbmail users are very active and responsive on their mailing list, and bend over backwards to try to help new users with tuning and performance related problems.
I simply don't have enough of a budget for populating my DB machines with TBs of RAM to make it work as quickly as I need it to for my midrange mail store (10TB).
Good luck!
=R=
robin - what a great write up! thanks!
On Fri, Nov 9, 2012 at 8:54 AM, Robin dovecot@r.paypc.com wrote:
On 09.11.2012 02:54, Robin wrote:
> I'll stay tuned, whether we ever see a fully usable SQL backend for Dovecot :)
That's not a new idea, but there is still tons of stuff that has to be coded with higher priority. As Dovecot works nicely with the other existing file storage backends, there isn't much pressure for SQL storage. But feel free to code your own, you're welcome.
Best Regards, Robert Schetterer
-- [*] sys4 AG
http://sys4.de, +49 (89) 30 90 46 64 Franziskanerstraße 15, 81669 München
Registered office: München, District Court München: HRB 199263. Executive board: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer. Chairman of the supervisory board: Joerg Heidrich
On Thu, 2012-11-08 at 17:54 -0800, Robin wrote:
> The performance is surprisingly bad ... doing almost everything. Searches through IMAP, bulk importation of mail folders, large numbers of simultaneous mail deliveries, you name it.
Have you made systematic tests? I.e. compared times for all of these with those from the different dovecot backends.
> There wasn't a task that the dbmail setup performed faster than Dovecot, in either low or high load situations.
Which backend did you use?
> When pressed on this lack of performance, I was instructed to "add more RAM" to the DB machine, and that for ideal performance I should have more RAM than my mailbox sizes. *sigh* This sounds great for a very small installation, but this clearly is not something that scales.
Yeah... that’s truly disappointing...
Do you have detailed numbers?
I guess you’ve "only" tried dbmail?
> The dbmail folk are earnest and hard-working, and I don't mean to cast the slightest bit of negativity on their project. I think the assumptions about what SQL servers can do well often doesn't square with the reality of many applications that people try to fit them into.
hmm...
> remove filesystem journaling, write barriers, etc on the mail db mountpoint.
All something I wouldn’t want to do on my production systems ;)
Thanks for your detailed information :)
Cheers, Chris.
On 11/11/2012 5:26 PM, Christoph Anton Mitterer wrote:
> Have you made systematic tests? I.e. compared times for all of these with those from the different dovecot backends.
The choice of Dovecot backends made no substantial difference. I used maildir, sdbox, and mdbox. I also added SiS (with mdbox). Initial tests were on local multi-spindle RAID5 storage, but to handicap Dovecot, I pushed it over NFS (also Linux 3.2 on a local GigE segment). It wasn't slow enough to make dbmail competitive, even though you have to start turning off performance optimisation features in Dovecot to avoid NFS bugs.
> > There wasn't a task that the dbmail setup performed faster than Dovecot, in either low or high load situations.
> Which backend did you use?
Backend for dbmail? Two MySQL versions (5.0 and 5.5) - InnoDB is required for dbmail, by the way. Postgres 8.4 and 9.1 backends, using its default storage engine. I tried the tests with both a separate DB machine, as well as a cohosted one with the dbmail connector using local sockets instead of TCP/IP, but that didn't significantly alter the performance.
I've found my first notes from the tests. It was the second round of tests with the latest MySQL 5.0 server given some tuning to more aggressively use system memory. You will note the puny size of the mail folder hive in this round.
The mysqld process has consumed nearly an hour of CPU time during this process. dbmail is configured to use local sockets rather than network I/O.
I'm using the PERL MailTools http://search.cpan.org/dist/MailTools/ to import about 10 folders' worth of email, totaling about 560MB in raw size, constituting about 23,000 emails. The script basically creates the folders, and does an APPEND for each email. It's bog simple.
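For illustration, the same "create the folder, APPEND every mail" loop could be sketched in Python instead of Perl. This is not the script used in the test; import_mbox is a made-up name, the source is assumed to be one mbox file per folder, and conn is assumed to be a logged-in imaplib.IMAP4-style connection.

```python
import mailbox


def import_mbox(conn, mbox_path, folder):
    """APPEND every message of one mbox file to an IMAP folder.

    Returns the number of messages appended.
    """
    # The result of CREATE is ignored here: the folder may already exist.
    conn.create(folder)
    box = mailbox.mbox(mbox_path, create=False)
    count = 0
    try:
        for msg in box:
            # No flags and no date given: the server picks its defaults.
            typ, data = conn.append(folder, None, None, msg.as_bytes())
            if typ != "OK":
                raise RuntimeError("APPEND to %s failed: %r" % (folder, data))
            count += 1
    finally:
        box.close()
    return count
```

Wrapping the call in a timer and repeating it against each server then gives import times that can be compared directly.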
I DROP the database, recreate it, add the one user, verify DBMail accepts authentication for the newly created mailbox, and then do the import. The MySQL files live on a freshly formatted ext4 filesystem.
The import takes Dovecot (Maildir or mdbox format), or Panda IMAP (mix) about six minutes to complete.
DBMail 3 took 4h 23m. Casual inspection of the system showed modestly high CPU usage in mysqld and dbmail-imapd (as well as the import perl command on occasion), but the Load Average didn't get too close to 1.0, let alone 2.0, which makes me worry that I might have hit some kind of "busy wait" pathology.
To clarify the above: To streamline iterative testing, I made a script to deactivate the currently running SQL server, unmount, re-format, re-mount, and re-populate the skeletal DB directories and restart the DB engine. So between each test, no matter the imapd or DB back-end, the mailstore was presented with a freshly formatted volume on dedicated spindles. The filesystem was ext4, formatted with:
lazy_itable_init=0,lazy_journal_init=0,dir_index=1,extents=1,uninit_bg=0,flex_bg=0,has_journal=0,inode_size=256,dir_index=1,
> Do you have detailed numbers?
Not really. Once it was clear that I wasn't going to get comparable performance, or even performance within the same order of magnitude, I stopped testing it. I included the IMAP SEARCH performance comparison against fts_squat in my original mail to this list. In addition to huge performance deficiencies, it also has/had fatal operational bugs.
> I guess you’ve "only" tried dbmail?
I did try Manitou, but the lack of a proper IMAP service for it made extensive "like for like" testing very difficult. Manitou is still in the very early days, alas. It also relies on the SQL DB's underlying authentication systems which is rather ... alarming. It performs quite a bit better than dbmail, but still it's not close to Dovecot. At the time I tested it, only custom-rolled clients could talk to it, i.e., no imap4/pop3 "gateways" to it.
I think I was most alarmed to see that the widely assumed benefits of putting mail on a SQL DB, i.e., fast searching/sorting, didn't actually happen in reality.
As others have mentioned, I also shudder to think of backup/restore issues, especially on a single user level. The mechanisms of backing up and restoring maildirs and even mdboxes, i.e., simple files, are not only well understood, the failure modes are generally fully recoverable. SQL-DB file blobs, especially with MySQL, remind me too much of the "PST Hell" that Exchange administrators face. But maybe that's just my ignorance talking.
> All something I wouldn’t want to do on my production systems ;)
Neither would I. But as I said, I was "desperate" to get this close to Dovecot's performance. I had about 2-3 weeks to pre-qualify mail storage back-ends with an eye towards 4 or 5 digits of usercount, and maybe tens to hundreds of TBs' scale of mail storage. Running across such poor performance with such relatively small loads disqualified the DB-based mail products very very quickly, for ME, anyway.
If you want to run your own tests, my suggestion is to start with Postgres, put as much RAM into your DB machine as you can afford, and maybe populate your DB machine exclusively with SSDs.
=R=
On 13.11.2012, at 0.44, Robin wrote:
> On 11/11/2012 5:26 PM, Christoph Anton Mitterer wrote:
> > Have you made systematic tests? I.e. compared times for all of these with those from the different dovecot backends.
> The choice of Dovecot backends made no substantial difference. I used maildir, sdbox, and mdbox. I also added SiS (with mdbox). Initial tests were on local multi-spindle RAID5 storage,
With local disks the tests often measure only the local RAM/CPU speed, unless you're testing thousands of users.
> but to handicap Dovecot, I pushed it over NFS (also Linux 3.2 on a local GigE segment). It wasn't slow enough to make dbmail competitive, even though you have to start turning off performance optimisation features in Dovecot to avoid NFS bugs.
NFS makes a better test case if you're measuring single user performance. Much of it is probably due to the index file access latency, although not all. In some cases Dovecot's prefetching mails can help (maildir, sdbox backends with local disks currently, nothing preventing it from working in other use cases though, even with Dovecot-SQL backend).
> > I guess you’ve "only" tried dbmail?
> I did try Manitou, but the lack of a proper IMAP service for it made extensive "like for like" testing very difficult. Manitou is still in the very early days, alas. It also relies on the SQL DB's underlying authentication systems which is rather ... alarming. It performs quite a bit better than dbmail, but still it's not close to Dovecot. At the time I tested it, only custom-rolled clients could talk to it, i.e., no imap4/pop3 "gateways" to it.
Manitou seems to advertise itself as being an email client .. although it then also seems to say SQL is faster than IMAP (which doesn't make much sense in itself).
> I think I was most alarmed to see that the widely assumed benefits of putting mail on a SQL DB, i.e., fast searching/sorting, didn't actually happen in reality.
SQL has nothing that makes any type of email access even potentially efficient. SQL indexes are mostly about binary trees, and there are about zero things in IMAP where I have thought of binary tree being even potentially useful. (Okay, potentially for expunging old mails when you have >1M mails in one folder. Not something you normally optimize for.)
With most of Dovecot's optimized lookups, latency is the most important thing. SQL is bad for latency. With remote systems it's usually much faster to just download a 1 MB blob and parse it than to fetch a couple of 100-byte blocks.
> As others have mentioned, I also shudder to think of backup/restore issues, especially on a single user level. The mechanisms of backing up and restoring maildirs and even mdboxes, i.e., simple files, are not only well understood, the failure modes are generally fully recoverable. SQL-DB file blobs, especially with MySQL, remind me too much of the "PST Hell" that Exchange administrators face. But maybe that's just my ignorance talking.
I'd think everyone would use the human-readable SQL dumps for database backups. At least with MySQL/PostgreSQL I wouldn't really trust anything else.
Uh..
On 13.11.2012, at 1.02, Timo Sirainen wrote:
> On 13.11.2012, at 0.44, Robin wrote:
> > On 11/11/2012 5:26 PM, Christoph Anton Mitterer wrote:
> > > Have you made systematic tests? I.e. compared times for all of these with those from the different dovecot backends.
> > The choice of Dovecot backends made no substantial difference. I used maildir, sdbox, and mdbox. I also added SiS (with mdbox). Initial tests were on local multi-spindle RAID5 storage,
> With local disks the tests often measure only the local RAM/CPU speed, unless you're testing thousands of users.
..measuring disk I/O most importantly.
> > but to handicap Dovecot, I pushed it over NFS (also Linux 3.2 on a local GigE segment). It wasn't slow enough to make dbmail competitive, even though you have to start turning off performance optimisation features in Dovecot to avoid NFS bugs.
> NFS makes a better test case if you're measuring single user performance. Much of it is probably due to the index file access latency, although not all. In some cases Dovecot's prefetching mails can help (maildir, sdbox backends with local disks currently, nothing preventing it from working in other use cases though, even with Dovecot-SQL backend).
Prefetching is done only when the mail_prefetch_count setting is used. Someone on blog.dovecot.org mentioned that it was bad for performance with local disk+maildir. Linux apparently doesn't do this with NFS. It would of course be possible to just have the prefetching create a new thread/process to download the mail locally and read it (similar to what the object storage plugin does).
Christoph Anton Mitterer wrote:
On Wed, 2012-11-07 at 17:30 +0200, Timo Sirainen wrote:
On 30.10.2012, at 2.16, Christoph Anton Mitterer wrote:
Have you ever thought about adding a "real" DB backend? Nothing against dbox... :) ... and I have no performance comparison of dbox with what could be done with a DBMS... but the advantage of the latter would be that you get all fancy features from database systems for free... like fast indexing, online replication, etc. One might even reuse something like AOX for this.
SQL indexes aren't very helpful for IMAP-like data. It would be fun to some day have SQL backend in Dovecot (there already is read-only INBOX-only SQL backend), but I don't expect it to have very good performance.
I see... well I haven't tested AOX or dbmail so far (especially as they're not in Debian and I was too lazy till now to compile them)...
Bad performance experiences with dbmail 2.x were the main reason why we migrated to dovecot. If you've got a MySQL database with 80 GB of binary chunks then things are getting ugly, especially when it comes to efficient backup and restore of whole mailboxes or single e-mails. The SQL backend (and the IMAP user experience) becomes very slow if the database does not fit completely into RAM.
There are many performance improvements and bug fixes in dbmail 3.x, but instead of evaluating them, we decided to migrate to Dovecot.
One should think twice, or even three times about how to design an efficient SQL backend for a good user experience.
Regards Daniel
On 2012-10-29 5:42 PM, Timo Sirainen tss@iki.fi wrote:
On 29.10.2012, at 23.15, Christoph Anton Mitterer wrote:
btw: What are the actual advantages of sdbox over maildir?
- Not moving files from new/ to cur/ directory
- Not renaming files when changing message flags
- Not readdir()ing directories (although maildir_very_dirty_syncs=yes helps a lot with this)
Basically less disk I/O and making it possible to have mailboxes with a huge number of messages without everything slowing down horribly.
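The flag-renaming point follows from how Maildir encodes flags: they are part of the file name, in the form `<unique>:2,<flags>` with the flag letters in ASCII order, so every flag change is a rename on disk. The following is a minimal sketch of that naming rule; `set_flag` and the example file name are hypothetical and not taken from Dovecot.

```python
# Sketch of why a Maildir flag change is a rename: flags live in the file
# name ("<unique>:2,<flags>", letters in ASCII order).
# set_flag() is a hypothetical helper, not Dovecot code.

def set_flag(filename: str, flag: str) -> str:
    """Return the Maildir file name after adding one flag letter."""
    base, _sep, flags = filename.partition(":2,")
    merged = "".join(sorted(set(flags) | {flag}))
    return f"{base}:2,{merged}"

old = "1351548000.M1P2.host:2,S"   # hypothetical message, already Seen
new = set_flag(old, "F")           # the user marks it Flagged
print(old, "->", new)

# A Maildir-based server would then have to rename the file, e.g.
# os.rename("cur/" + old, "cur/" + new), and later rescan the directory.
```

A format that keeps flags in a separate index, as sdbox does, can update the index instead and leave the message file untouched.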
I had been wanting to ask about this too...
So... what are the disadvantages?
--
Best regards,
Charles
On Tue, 2012-10-30 at 07:00 -0400, Charles Marcus wrote:
So... what are the disadvantages?
I (but I'm no expert) would guess that it's a dovecot-only format. No support from most other tools...
I'd guess you cannot use e.g. maildrop with it, or can you?
I personally was always a bit worried when meta-data is put in the mail... now AFAIU dbox does _not_ do this... and you can cleanly extract each unmodified mail from the dbox file (single or multi), right?
Cheers, Chris.
On 30.10.2012, at 13.00, Charles Marcus wrote:
So... what are the disadvantages?
Message flags are stored only in dovecot.index files, and those files get corrupted somewhat more easily than the whole filesystem. Having a separate dovecot.index.backup file helps with this though. Another disadvantage is that you can't easily switch away from Maildir if you're using some non-Dovecot tools to access it.
On 2012-10-29 4:54 PM, Christoph Anton Mitterer calestyo@scientia.net wrote:
In the end I probably changed my opinion. ~7GB of wasted block space for all my mails is actually quite a lot, but in days of cheap disk space it's acceptable. And with mbox one has IMHO the major disadvantage that mailservers (including dovecot) store some meta-data _in_ it (i.e. in the mails themselves), which I don't like a lot. I still think about reports that mbox is much faster with full text search (which sounds reasonable)... but for that one probably needs a database backend anyway.
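The wasted space comes from rounding every file up to a whole number of filesystem blocks. The sketch below models only that rounding; it ignores inodes, directories and other filesystem metadata. The sample mail sizes are made up, and the "half a block per file" estimate assumes sizes are spread evenly within a block.

```python
# Rough model of per-file block rounding. Ignores inodes, directories and
# other filesystem metadata. The sample sizes are made-up examples.

def allocated(size: int, block: int) -> int:
    """Bytes a file of `size` bytes occupies when rounded up to whole blocks."""
    return -(-size // block) * block   # ceiling division

def waste(size: int, block: int) -> int:
    """Bytes lost to rounding for one file."""
    return allocated(size, block) - size

sample_sizes = [900, 3500, 4097, 12000, 250000]   # hypothetical mail sizes

for block in (4096, 2048, 1024):
    total = sum(waste(s, block) for s in sample_sizes)
    print(f"block {block}: {total} bytes wasted over {len(sample_sizes)} files")

# Assumption: sizes are spread evenly within a block, so the expected waste
# is about half a block per file. Mail count taken from the test above.
mails = 3_423_296
estimate = mails * 4096 // 2
print(f"estimated waste at 4096-byte blocks: {estimate / 1e9:.1f} GB")
```

With 4096-byte blocks this simple estimate lands in the same range as the roughly 7 GB and the 1953 B per mail reported earlier in the thread, which suggests block rounding accounts for most of the difference.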
What makes the most sense for me is to use mbox (or mdbox) for longer term storage that you may be offloading to slower storage systems, and use maildir (or sdbox) for the new mails...
Would work great as long as you have a reliable method for archiving older mails out to your slower storage.
This is what I plan on doing someday...
--
Best regards,
Charles
On Tue, 2012-10-30 at 07:03 -0400, Charles Marcus wrote:
What makes the most sense for me is to use mbox (or mdbox) for longer term storage that you may be offloading to slower storage systems, and use maildir (or sdbox) for the new mails...
Was also something I thought about... still the more I think about it, the more I hate that with mbox meta-data is stored in the mails.
Would work great as long as you have a reliable method for archiving older mails out to your slower storage.
I still hope for some DB backend ;)
Chris.
participants (7)
- cc "maco" young
- Charles Marcus
- Christoph Anton Mitterer
- Daniel Parthey
- Robert Schetterer
- Robin
- Timo Sirainen