[Dovecot] mbox vs. maildir storage block waste

Fri Nov 9 03:54:11 EET 2012

Obvious caveats and qualifications apply throughout this email.

Christoph Anton Mitterer <calestyo at scientia.net> wrote:
> I see... well I haven't tested AOX or dbmail so far (especially as
> they're not in Debian and I was too lazy till now to compile them)...
> 
> At least I had the impression that performance (especially in searches)
> was one of the major things these people were proud of.
> 
> 
> I'll stay tuned, whether we ever see a fully usable SQL backend for
> Dovecot :)

I wouldn't hold your breath.

It's a recurringly seductive "meme" in email circles, but the reality is that email is mostly unstructured data with a few fields of reasonably structured data (dates, from, to, maybe attachment types + filenames).  The bulk of each email, and the part that people really want to search quickly, is the body.  It is unstructured, and searching it doesn't perform quickly with the stock "full text search" modules in the main SQL engines.
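
To make the structured/unstructured split concrete, here is a minimal sketch using Python's standard email package.  The function name split_message and the choice of fields are assumptions for illustration only; this is not how dbmail or Dovecot store mail.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def split_message(raw):
    """Split a raw RFC 822 message into (structured_fields, body_text).

    Hypothetical helper: the field selection mirrors the handful of
    "reasonably structured" items named above, nothing more.
    """
    msg = message_from_string(raw)
    structured = {
        # Fields that map naturally onto typed SQL columns
        "date": parsedate_to_datetime(msg["Date"]) if msg["Date"] else None,
        "from": getaddresses(msg.get_all("From", [])),
        "to": getaddresses(msg.get_all("To", [])),
        "attachments": [
            (part.get_content_type(), part.get_filename())
            for part in msg.walk()
            if part.get_filename()
        ],
    }
    # Everything else is free-form text: the part people want to search
    body = "\n".join(
        part.get_payload()
        for part in msg.walk()
        if part.get_content_type() == "text/plain" and not part.get_filename()
    )
    return structured, body
```

The structured dictionary is the small part that fits a relational schema; the body string is the bulk that ends up in a text column and has to be handled by a full text index of some kind.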

I gave dbmail2 a try with MySQL 5 and 5.5, and with the Postgres 8.4 and 9.1 branches.  I dedicated a box to it: 16GB of DDR3-1800, a 3.4GHz 6-core AMD 1090T, and hardware RAID local storage (12 x Seagate ES 7200RPM spindles), running 64-bit Slackware 13.37 with Linux 3.2 kernels built for the platform.

The performance was surprisingly bad ... doing almost everything.  Searches through IMAP, bulk importation of mail folders, large numbers of simultaneous mail deliveries, you name it.  There wasn't a single task that the dbmail setup performed faster than Dovecot, in either low or high load situations.  When I tossed in a test load that mixed lots of mail deliveries with searches and full folder pulls, things got really pear-shaped.  Even putting Dovecot's mailstore on NFS (GigE) didn't slow Dovecot down enough to make dbmail competitive.

When pressed on this lack of performance, I was instructed to "add more RAM" to the DB machine, and that for ideal performance I should have more RAM than my mailbox sizes.  *sigh*  This sounds great for a very small installation, but this clearly is not something that scales.

I think the final humiliation was comparing the body + header searching performance of Timo's practically obsolete fts_squat plugin against dbmail's.  Wow.  Squat was multiple orders of magnitude faster.  Lucene and Solr are even more so when fed large datasets (mail folder hives of about 100GB).  The SQL setups hit the obvious performance shelf once they were unable to keep everything in RAM or cache.

The dbmail folk are earnest and hard-working, and I don't mean to cast the slightest bit of negativity on their project.  I think the assumptions about what SQL servers can do well often don't square with the reality of many applications that people try to fit them into.

In my first round of tests, I imported 24,000 emails comprising a mere 560MB of space.  Just about all of the non-SQL IMAP servers handled the importation (basically IMAP APPENDs) within 6 minutes.  dbmail2 required hours using MySQL, and a somewhat shorter time (but still hours) with Postgres.
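
A minimal sketch of how such an import test could be driven, assuming Python's standard imaplib and mailbox modules.  It is not the harness behind the numbers above; the names bulk_append and mbox_messages are hypothetical, and the connection object is anything with an imaplib-style append() method.

```python
import mailbox
import time


def mbox_messages(path):
    """Yield the raw bytes of each message in an existing mbox file."""
    for msg in mailbox.mbox(path, create=False):
        yield msg.as_bytes()


def bulk_append(conn, folder, messages):
    """APPEND every message to `folder`; return (count, elapsed_seconds).

    `conn` is assumed to behave like imaplib.IMAP4: append() takes
    (mailbox, flags, date_time, message) and returns (status, data).
    """
    start = time.perf_counter()
    count = 0
    for raw in messages:
        # No flags, no explicit date: the server assigns the internal date
        status, _ = conn.append(folder, None, None, raw)
        if status != "OK":
            raise RuntimeError(f"APPEND failed after {count} messages")
        count += 1
    return count, time.perf_counter() - start
```

Against a real server this would be driven by something like conn = imaplib.IMAP4_SSL("imap.example.org"), conn.login(user, password), then bulk_append(conn, "INBOX", mbox_messages("test.mbox")), where the host and file name are placeholders.  For scale, 24,000 messages in 6 minutes works out to roughly 67 APPENDs per second.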

From an old email:

> Searching INBOX #msgs = 24714
>  [NOFIND] Time=2.072423, matches=24714 <--- this should be zero *BUG*
>  [date] Time=2.07519, matches=24714 <--- this is correct
>  [here] Time=2.072075, matches=24714 <--- this should be about 30% of total # of msgs *BUG*
> 
> Does dbmail break IMAP SEARCH TEXT (i.e., search both body + headers)?  Is this a result of relying on MySQL's search algorithms in text-like fields? I'm still puzzled, because I can't believe that 'here' appears in EVERY email.  It looks like dbmail's returning EVERY email on a SEARCH TEXT.  This is not correct operation.
> 
> When I alter the search to use "FROM" as the key instead of "TEXT", the results are more discriminating and meet expectations.
> 
> Searching INBOX #msgs = 24714
>  [NOFIND] Time=2.161049, matches=0
>  [james] Time=2.273255, matches=1049
>  [here] Time=2.165406, matches=2
> 
> Not that it matters, but it's much slower than Dovecot's fts_squat for substring searches.
> 
> Dovecot's fts_squat IMAP SEARCH TEXT results are:
> 
> Searching INBOX #msgs = 55731
>  [Updating Index] Time=78.184637 (66% of the mailbox unindexed at start)
>  [NOFIND] Time=0.045654, matches=0
>  [date] Time=0.13364, matches=55731
>  [here] Time=0.069091, matches=24663
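
Timings in the format quoted above could be collected with something like the following sketch, assuming Python's standard imaplib.  It is not the script that produced those numbers; time_search and report are hypothetical names, and the naive quoting assumes search terms contain no double quotes.

```python
import time


def time_search(conn, term, key="TEXT"):
    """Run one IMAP SEARCH and return (elapsed_seconds, match_count).

    `conn` is assumed to behave like imaplib.IMAP4 with a folder
    already selected.
    """
    start = time.perf_counter()
    status, data = conn.search(None, key, f'"{term}"')
    elapsed = time.perf_counter() - start
    if status != "OK":
        raise RuntimeError(f"SEARCH {key} {term!r} failed")
    # data[0] is a space-separated list of matching message numbers
    matches = len(data[0].split()) if data and data[0] else 0
    return elapsed, matches


def report(conn, folder, terms, key="TEXT"):
    """Select `folder` read-only and return one result line per term."""
    status, data = conn.select(folder, readonly=True)
    if status != "OK":
        raise RuntimeError(f"SELECT {folder!r} failed")
    lines = [f"Searching {folder} #msgs = {int(data[0])}"]
    for term in terms:
        secs, count = time_search(conn, term, key)
        lines.append(f" [{term}] Time={secs:.6f}, matches={count}")
    return lines
```

Passing key="FROM" instead of the default "TEXT" gives the header-only comparison from the quoted email.  A term that cannot occur in any message (the NOFIND case) is a cheap sanity check that the server is really searching rather than returning every message.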

FWIW, I found Postgres to be faster than MySQL (5 and 5.5) where lots of write-commits were involved on the same exact setup, though MySQL 5.5 with a hand-rolled config file using metrics supplied by a dbmail/MySQL guru helped a great deal for size(data_set) < size(PHYSICAL MEMORY) cases.  MySQL "got close" to PSQL's performance when I did crazy things like remove filesystem journaling, write barriers, etc. on the mail db mountpoint.  Obviously, this is desperation talking.

I concede that the motivations behind SQLising mail storage extend to administration/replication and other non-performance/scalability aspects.  I suspect that what constitutes "good enough" performance, when weighed against those other considerations, may raise a SQL approach high enough for some people to use it.

I suspect a "NoSQL" key-value store type of database would offer much better performance than SQL RDBs, since most of the assumptions behind the storage and access patterns of email don't fit into the SQL RDB model very efficiently.

dbmail's author and a couple of key dbmail users are very active and responsive on their mailing list, and bend over backwards to try to help new users with tuning and performance related problems.

I simply don't have enough of a budget for populating my DB machines with TBs of RAM to make it work as quickly as I need it to for my midrange mail store (10TB).

Good luck!

=R=