[Dovecot] Using MySQL to store email?
So Timo,
Have you considered the idea of storing all the email in a MySQL database? Seems to me that MySQL could simplify all the backend stuff that everyone struggles with and with replication one could create very massive and reliable systems. What would it take to use MySQL that way?
On Tue, Jun 06, 2006 at 05:38:41AM -0700, Marc Perkel wrote:
So Timo,
Have you considered the idea of storing all the email in a MySQL database? Seems to me that MySQL could simplify all the backend stuff that everyone struggles with and with replication one could create very massive and reliable systems. What would it take to use MySQL that way?
PowerMail does that, but unlike PowerDNS there doesn't seem to be loads of users / testimonials / active mailing lists. It's open source though, so you can download and try it out :-) PowerDNS by the same author is definitely best-of-breed, a must-have if you'd like your DNS in SQL.
* On 06/06/06 05:38 -0700, Marc Perkel wrote:
| So Timo,
|
| Have you considered the idea of storing all the email in a MySQL
| database? Seems to me that MySQL could simplify all the backend stuff
| that everyone struggles with and with replication one could create very
| massive and reliable systems. What would it take to use MySQL that way?
Hi Marc,
Did I miss something? I thought the mail is delivered by another daemon
different than dovecot ;)
-Wash
http://www.netmeister.org/news/learn2quote.html
DISCLAIMER: See http://www.wananchi.com/bms/terms.php
--
+======================================================================+
|\ _,,,---,,_ | Odhiambo Washington
Odhiambo WASHINGTON wrote:
* On 06/06/06 05:38 -0700, Marc Perkel wrote:
| So Timo,
|
| Have you considered the idea of storing all the email in a MySQL
| database? Seems to me that MySQL could simplify all the backend stuff
| that everyone struggles with and with replication one could create very
| massive and reliable systems. What would it take to use MySQL that way?
Hi Marc,
Did I miss something? I thought the mail is delivered by another daemon different than dovecot ;)
That's true. I would assume that a local delivery agent would also have to be written to pipe in email from Exim or other MTA to get the email into the database.
* On 06/06/06 06:24 -0700, Marc Perkel wrote:
| That's true. I would assume that a local delivery agent would also have
| to be written to pipe in email from Exim or other MTA to get the email
| into the database.
DBMail already does this, and is configurable with other MTAs ;-)
-Wash
Odhiambo WASHINGTON wrote:
DBMail already does this, and is configurable with other MTAs ;-)
Don't know how secure it is. I was researching it but decided to stay with dovecot + Maildir.
Oliver
-- Oliver Schulze L. oliver@samera.com.py
* On 06/06/06 20:20 -0400, Oliver Schulze L. wrote:
| Odhiambo WASHINGTON wrote:
| >DBMail already does this, and is configurable with other MTAs ;-)
| >
| yep,
| http://www.dbmail.org/
|
| Don't know how secure it is. I was researching it but decided to stay
| with dovecot + Maildir.
Another commercial software, @Mail (atmail.com), used to store messages
in the DB by default, but later changed to Maildir.
-Wash
So Timo,
Have you considered the idea of storing all the email in a MySQL database?
dovecot already writes to a storage backend optimised for this task - the filesystem.
Seems to me that MySQL could simplify all the backend stuff that everyone struggles
Who is "everyone"?
What problems are you having?
with and with replication one could create very massive and reliable systems. What would it take to use MySQL that way?
Massive and reliable? What examples can you give where a MySQL backend for a mail system would improve things? How would it improve them?
Putting everything in a database would provide one benefit:
- Less storage space needed due to duplicated e-mail
NOT putting everything in a database plays to Linux's strengths: everything is a file, meaning we can use all of the standard file-focused text processing tools. If everything is a file, backups and restores are a piece of cake.
On 06-06-2006 15:48:20 +0200, nodata wrote:
Putting everything in a database would provide one benefit:
- Less storage space needed due to duplicated e-mail
... assuming that the mails are completely identical (i.e. have no slightly different headers) and that the database backend does expensive duplicate elimination on its string heaps.
So the real benefit you get I think is:
- full text searching capabilities based on the IR-like indices provided/maintained by the database backend, such as for instance: http://dev.mysql.com/doc/refman/5.0/en/fulltext-query-expansion.html
-- Fabian Groffen Gentoo for Mac OS X Project
On Tue, 6 Jun 2006, nodata wrote:
Seems to me that MySQL could simplify all the backend stuff
There are different SQL-Servers out there, e.g. Postgres.
Putting everything in a database would provide one benefit:
- Less storage space needed due to duplicated e-mail
I don't know if such stuff will work out of the box, but if the mail is broken into headers and bodies (for the message as a whole and each MIME part), it can be very beneficial.
NOT putting everything in a database plays to Linux's strengths: everything is a file, meaning we can use all of the standard file-focused
And limits. Why does Dovecot have so many options for locking? ;-) What are the days-old files in the tmp/ folders?
text processing tools. If everything is a file, backups and restores are a piece of cake.
Well, the problem with a file-based database (Dovecot's indexes etc. are in fact a database) is that you must use the same locking and/or terminate / suspend the service, otherwise there is the possibility that the data and the indexes are out of sync. The transactions of a full-featured DB will make backups and restores reliable; however, partial restores won't be possible without a good toolset. Do you know _for_ _sure_ the point in time at which Dovecot has committed all internally cached data into the index or control files?
Moving away from the filesystem will introduce other problems, e.g. one needs to introduce a full ACL implementation. And no operation can be performed in a DB manually as easily as you can with files. However, for the Dovecot files (e.g. keywords of a message) you need a toolset even today. For an SQL server, you need to know the structure, and then you can build a toolset with any language you like.
I'm advocating neither filesystem- nor DB-based mail storage; both have their strengths and weaknesses. But I'd expect that you can rely on the DB's strength for indexing and storing data well, and you can focus on the processing of mails.
BTW: Cyrus uses a DB, doesn't it?
Bye,
-- Steffen Kaiser
Steffen Kaiser wrote:
I'm advocating neither filesystem- nor DB-based mail storage; both have their strengths and weaknesses. But I'd expect that you can rely on the DB's strength for indexing and storing data well, and you can focus on the processing of mails.
BTW: Cyrus uses a DB, doesn't it?
I'm suggesting it in addition to MBOX and MAILDIR. And of course if there's a MySQL version then other databases will follow. Just seems to me that if I were running a really BIG email operation that MySQL could have some serious benefits.
On Tuesday 06 Jun 2006 16:25, Marc Perkel wrote:
Steffen Kaiser wrote:
On Tue, 6 Jun 2006, nodata wrote:
Seems to me that MySQL could simplify all the backend stuff
There are different SQL-Servers out there, e.g. Postgres.
Putting everything in a database would provide one benefit:
- Less storage space needed due to duplicated e-mail
Unlikely -- that would depend on the delivery agent (and MTA) handling things appropriately before it gets to dovecot. This is a difficult problem in SMTP, although some systems that combine MTA and POP have solved it (kind of).
And limits. Why does Dovecot have so many options for locking? ;-)
Coping with a plethora of other products that already exist to do the email delivery. No sane person deploys email without at least maildir.
What are the days-old files in the tmp/ folders?
All empty here.
Well, the problem with a file-based database (Dovecot's indexes etc. are in fact a database) is that you must use the same locking and/or terminate / suspend the service, otherwise there is the possibility that the data and the indexes are out-of-sync.
Yes, but indexes are cheap to rebuild and expensive to maintain, so you might find this cuts the wrong way.
I'm quite a fan of the idea of putting email in databases, I can see the upside. But those who think it will save any resource at all haven't spent enough time with big database systems. It will be a lot slower, except where you can utilise indexes to speed operations, which will be rarely if at all.
Just consider the number of blocking writes needed to commit an email to maildir (remember it uses a lot of rename); now consider the kind of indexes you want to maintain on the database that'll be updated when an email is delivered (and possibly when it is read, filed, etc.).
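To make the maildir side of that comparison concrete, here is a minimal sketch of a maildir-style delivery in Python: write into tmp/, sync, then rename into new/. The function name `deliver` and the file-name format are assumptions made for illustration; this is not Dovecot's or any particular MTA's code.

```python
import os
import socket
import tempfile
import time


def deliver(maildir: str, message: bytes) -> str:
    """Deliver one message maildir-style: write to tmp/, then rename into new/."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)

    # Hypothetical unique name built from time, pid and host name;
    # real delivery agents usually add more entropy than this.
    name = f"{time.time_ns()}.{os.getpid()}.{socket.gethostname()}"
    tmp_path = os.path.join(maildir, "tmp", name)
    new_path = os.path.join(maildir, "new", name)

    # Blocking write 1: the message itself, flushed to disk before it is visible.
    with open(tmp_path, "xb") as f:
        f.write(message)
        f.flush()
        os.fsync(f.fileno())

    # Blocking write 2: the rename that makes the message appear in new/.
    os.rename(tmp_path, new_path)
    return new_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as box:
        print(deliver(box, b"Subject: test\r\n\r\nhello\r\n"))
```

Note that nothing here touches an index: any per-sender or per-host lookup structure would be extra writes on top of these.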
I got into pondering mail in databases from the issues pertaining to consistency of reads of directories in Unix filesystems. It is easy to guarantee the consistency of a read from an ACID-style database (unlike reading directories in a big maildir folder). Of course when I asked Hans Reiser he said it sounds like the kind of modular functionality that modern filesystems ought to provide and offered to write a filesystem plugin for ReiserFS that guarantees the consistency of directory reads for maildir use. Of course there is a performance (or resource) penalty in doing a consistent read of a directory.
Maybe more than one way to solve a problem, just need to make sure you know precisely which problems you are trying to solve.
Simon, who'll continue moving systems to maildir, till something better arrives.
On Tue, 2006-06-06 at 17:02 +0100, Simon Waters wrote:
Well, the problem with a file-based database (Dovecot's indexes etc. are in fact a database) is that you must use the same locking and/or terminate / suspend the service, otherwise there is the possibility that the data and the indexes are out-of-sync.
Yes, but indexes are cheap to rebuild and expensive to maintain, so you might find this cuts the wrong way.
I'm quite a fan of the idea of putting email in databases, I can see the upside. But those who think it will save any resource at all haven't spent enough time with big database systems. It will be a lot slower, except where you can utilise indexes to speed operations, which will be rarely if at all.
Just consider the number of blocking writes needed to commit an email to maildir (remember it uses a lot of rename); now consider the kind of indexes you want to maintain on the database that'll be updated when an email is delivered (and possibly when it is read, filed, etc.).
I think the people who expect an improvement from databases over maildir are used to unix filesystems that degrade badly as the number of files in a directory increases. These days many, like Reiserfs and XFS, are much better. My theory is that if your filesystem isn't a good place to store things you should fix that before thinking about databases.
I got into pondering mail in databases from the issues pertaining to consistency of reads of directories in Unix filesystems. It is easy to guarantee the consistency of a read from an ACID-style database (unlike reading directories in a big maildir folder). Of course when I asked Hans Reiser he said it sounds like the kind of modular functionality that modern filesystems ought to provide and offered to write a filesystem plugin for ReiserFS that guarantees the consistency of directory reads for maildir use. Of course there is a performance (or resource) penalty in doing a consistent read of a directory.
The issue is the same in both places: you either speed things up by allowing dirty reads or you take the performance hit by locking for the duration of all writes. When you create a new file you must atomically determine whether or not the name currently exists. Even ReiserFS can't cheat on that without ending up corrupted.
Maybe more than one way to solve a problem, just need to make sure you know precisely which problems you are trying to solve.
Simon, who'll continue moving systems to maildir, till something better arrives.
An extended maildir might make sense where additional subdirectories are used transparently to limit the number of files in any single directory - so it would end up looking something like a squid cache which solves a very similar problem.
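A rough sketch of that idea in Python, assuming a hash of the message file name picks the subdirectories (squid uses its own numbering scheme, so this only mirrors the shape of the layout). `hashed_path` and the two-level, 256-way fan-out are hypothetical choices, not an existing maildir extension.

```python
import hashlib
import os


def hashed_path(maildir: str, filename: str, levels: int = 2) -> str:
    """Map a message file name to a nested subdirectory, cache-directory style."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    # One directory level per pair of hex digits: 256 buckets per level.
    parts = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return os.path.join(maildir, *parts, filename)


if __name__ == "__main__":
    print(hashed_path("Maildir/cur", "1149600000.12345.host:2,S"))
```

With two levels, a folder of a million messages averages roughly 15 files per leaf directory, at the price of no longer being readable by plain maildir clients.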
-- Les Mikesell lesmikesell@gmail.com
Les Mikesell wrote:
I think the people who expect an improvement from databases over maildir are used to unix filesystems that degrade badly as the number of files in a directory increases. These days many, like Reiserfs and XFS, are much better. My theory is that if your filesystem isn't a good place to store things you should fix that before thinking about databases.
Suppose you have a total of one million messages stored for 5000 users across 800 domains and you want to delete all messages that were sent from a specific host. With MySQL it's a one-line command and would take only a few seconds to execute. Using Maildir it would take hours because you would have to search every message. That's the power of MySQL.
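As a sketch of what that one-line command might look like, here is a small Python example. It uses the standard-library sqlite3 module as a stand-in for MySQL so that it runs anywhere, and the `messages` table, its columns, and the function names are all hypothetical; no real mail store's schema is implied.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """Create an in-memory store with a hypothetical one-table schema."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE messages ("
        " id INTEGER PRIMARY KEY,"
        " mailbox TEXT NOT NULL,"
        " sender_host TEXT NOT NULL,"
        " body BLOB NOT NULL)"
    )
    # The delete is only quick if this index exists, and the index
    # has to be updated on every single delivery.
    db.execute("CREATE INDEX idx_sender_host ON messages (sender_host)")
    return db


def delete_from_host(db: sqlite3.Connection, host: str) -> int:
    """Delete every message received from `host`; return how many were removed."""
    cur = db.execute("DELETE FROM messages WHERE sender_host = ?", (host,))
    db.commit()
    return cur.rowcount


if __name__ == "__main__":
    db = make_store()
    db.executemany(
        "INSERT INTO messages (mailbox, sender_host, body) VALUES (?, ?, ?)",
        [
            ("alice@example.org", "bad.example.net", b"spam"),
            ("bob@example.com", "bad.example.net", b"spam"),
            ("bob@example.com", "good.example.net", b"hello"),
        ],
    )
    print(delete_from_host(db, "bad.example.net"))
```

The statement itself is indeed one line; the cost sits in having parsed out and indexed the sending host for every message at delivery time.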
Maildir is only fast for indexing file names. But if you are indexing across users and domains by host, or headers, or senders, or whatever, then only a database can support these multiple indexes. There are things you can do with databases that are way beyond what you can imagine, especially if you are integrating it with a spam system.
For example, a new message comes in and you find that the sender matches email in 100 people's spam folders and none in any other folder. It can be classified as spam. If, however, the from address matches ham in people's folders and no spam, then you can probably deliver it without spam scanning.
This issue isn't performance, it's power. If you compare it with what people are doing now then MySQL would probably be a little slower. But if you compare it to what is possible that you can't do now then MySQL wins.
Marc, I don't think anybody would dislike this option to be available, but at the same time it seems that all your examples of why this would be useful are corner cases, and most people would rather have Timo work on other things. So, given that Timo is nothing more than a slave for the majority :) please stop arguing your case and go and make your solution to your problems. If it ends up being useful, I'm sure it can find its way into Dovecot's codebase.
One suggestion. Because you keep talking about how this is not about performance but about power, if I were you I'd stop focusing so much on MySQL, and start looking at more powerful implementations of SQL.
Ben wrote:
Marc, I don't think anybody would dislike this option to be available, but at the same time it seems that all your examples of why this would be useful are corner cases, and most people would rather have Timo work on other things. So, given that Timo is nothing more than a slave for the majority :) please stop arguing your case and go and make your solution to your problems. If it ends up being useful, I'm sure it can find its way into Dovecot's codebase.
One suggestion. Because you keep talking about how this is not about performance but about power, if I were you I'd stop focusing so much on MySQL, and start looking at more powerful implementations of SQL.
Ben,
I'm not saying that Timo should do it now, and I'm not saying it should be limited to just MySQL. I'm just planting the idea for the future. I'm just at the stage of selling the idea. I think Timo needs to do the 1.0 first as well. I'm just laying out ideas for down the road.
Marc Perkel wrote:
For example, a new message comes in and you find that the sender matches email in 100 people's spam folders and none in any other folder. It can be classified as spam. If, however, the from address matches ham in people's folders and no spam, then you can probably deliver it without spam scanning.
It's called auto-whitelisting and smart spam scanners should do that.
Cheers, -jkt
-- cd /local/pub && more beer > /dev/mouth
Jan Kundrát wrote:
Marc Perkel wrote:
For example, a new message comes in and you find that the sender matches email in 100 people's spam folders and none in any other folder. It can be classified as spam. If, however, the from address matches ham in people's folders and no spam, then you can probably deliver it without spam scanning.
And maybe with 500 monkeys filtering email for a million years, this might actually happen.
The cases where it almost happens and is useful information are not easy to distinguish from the cases where it almost happens by mistake. Distinguishing the difference between the cases cannot be done with sql.
Ken A Pacific.Net
Cheers, -jkt
Jan Kundrát wrote:
Marc Perkel wrote:
For example, a new message comes in and you find that the sender matches email in 100 people's spam folders and none in any other folder. It can be classified as spam. If, however, the from address matches ham in people's folders and no spam, then you can probably deliver it without spam scanning.
It's called auto-whitelisting and smart spam scanners should do that.
Actually, auto white listing is any one of a number of techniques used to eliminate false positives from "known parties". I use one in camram where anyone you send e-mail to is automatically white listed. To distinguish that from the often confusing auto white listing terminology, I call it a "friends list". It works exceedingly well and I haven't had any significant problems even when the site has been infected with zombies. With any automatic white listing tool, you need the human feedback which says "this is spam". The human feedback enables automatic elimination of the entry from the auto white list, and blacklisting of the IP address the message came from (you did preserve the source IP address as a new header in the message, didn't you?).
The analysis technique suggested originally is classically naïve. A technique I'm playing with that appears to work much better is to use the output of the content filter to predict whether a message is good or bad. All of the bad messages are placed into a dumpster and expired after five days. If a message is left in the dumpster, the IP address is listed as a "bad source".
Any message that passes the content filter, friends filter, or spam filter is recorded as a "good source". If the ratio of good source to bad source drops below 80%, the site is listed as contaminated and automatically dumped in the spam trap for human analysis. If the ratio drops below 40%, it's listed as spam and all messages are brown listed.
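The thresholds above can be sketched in a few lines of Python. This is only one reading of the description: it assumes "ratio" means the share of good messages among all messages seen from a source, and `classify_source` and its return labels are hypothetical names, not camram's code.

```python
def classify_source(good: int, bad: int) -> str:
    """Classify a sending site from its good/bad message counts."""
    total = good + bad
    if total == 0:
        return "unknown"  # no history yet; this case is not covered in the post

    # Assumption: "ratio" is read as the share of good messages among all seen.
    share_good = good / total
    if share_good < 0.40:
        return "spam"          # all messages get brown listed
    if share_good < 0.80:
        return "contaminated"  # sent to the spam trap for human analysis
    return "good"


if __name__ == "__main__":
    for counts in [(9, 1), (7, 3), (3, 7)]:
        print(counts, classify_source(*counts))
```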
The main downside of this technique is that it does increase the workload for the user (more content in the spam trap), and it does seem to work better if you have multiple sources feeding the good/bad ratio analysis.
my two cents worth.
Hello Marc,
Wednesday, June 7, 2006, 8:29:33 PM, you wrote:
MP> This issue isn't performance, it's power.
Here is one practical problem: indexes (read: database schema) which are good for your tasks will be totally useless for other tasks. So, if somebody other than you writes a MySQL/PgSQL/WhateverSQL storage for Dovecot, there is a GREAT chance that this storage won't allow you to solve your `power' tasks without deep hacking, kludges, etc. And a `very generic schema, good for any imaginable query' means an `abstract spherical horse in vacuum'[*]: not useful for any real-world task. It also means terrible performance even for non-power, everyday tasks.
[*] This is a phrase from a Russian joke (maybe not only Russian, I don't know), which goes like this:
Three scientists win grants to predict the results of trotting races. They are a PhD in biology, a PhD in statistics and a PhD in theoretical physics. After a year of work, the organisation which provided the grants asks for results.
Biologist: I examined many trotters, good ones and bad ones, and found some rules which allow predicting the result of a race from the physical condition of the horses.
Statistician: I examined many results of previous races and found some rules which allow predicting the results of future races from the horses' previous results.
Physicist: I need more time. But now I have a model of a spherical horse in vacuum.
-- Best regards, Lev mailto:lev@serebryakov.spb.ru
Marc Perkel wrote:
Suppose you have a total of one million messages stored for 5000 users across 800 domains and you want to delete all messages that were sent from a specific host. With MySQL it's a one-line command and would take only a few seconds to execute. Using Maildir it would take hours because you would have to search every message. That's the power of MySQL.
And really, here we come to the major advantage of using an SQL-driven database: ad-hoc operations.
I recall a tale my father ( a veteran of 30 years in IBM ) told me of when IBM started rolling out SQL databases. The clients used to wail loudly about how _slow_ they ran. IBM said they'd pull them out, and go back to the old ways.
The clients wouldn't let them. The ability to do arbitrary and ad-hoc queries, and to build customised analyses on the fly, far outweighed the performance hit.
And this seems to be the crux of your argument. With an SQL database, you can create new and exciting queries on your data easily. However, expect to take a hit on the common case operations.
Maildir is only fast for indexing file names. But if you are indexing across users and domains by host, or headers, or senders, or whatever, then only a database can support these multiple indexes. There are things you can do with databases that are way beyond what you can imagine, especially if you are integrating it with a spam system.
Please stop using the word "database" as if it means "SQL DBMS". It's as bad as those marketing twits using the word "broadband" to mean "high data rate".
And again here we see my point. When you want to do the common case, the existing, job specific databases (Maildir and mbox) are good at it. Why? Because they were designed to be.
But when you want to do something else, they're not. Why? It was the trade off made to make the common case fast and easy. I don't spend a lot of time preparing my home for if the Queen comes to visit.... it's not what I'd call a common case, and my time would be better spent elsewhere.
This issue isn't performance, it's power. If you compare it with what people are doing now then MySQL would probably be a little slower. But if you compare it to what is possible that you can't do now then MySQL wins.
In fact, for some of the stuff I do at work, I'm planning to write a tool that will munge some of our job-specific databases into SQLite, so we can more easily do arbitrary analyses on the data when we need to. But it'd be a waste to put it all in an SQL DBMS from the start.
-- Curtis Maloney cmaloney@cardgate.net
On Wed, 2006-06-07 at 09:29 -0700, Marc Perkel wrote:
Les Mikesell wrote:
I think the people who expect an improvement from databases over maildir are used to unix filesystems that degrade badly as the number of files in a directory increases. These days many, like Reiserfs and XFS, are much better. My theory is that if your filesystem isn't a good place to store things you should fix that before thinking about databases.
Suppose you have a total of one million messages stored for 5000 users across 800 domains and you want to delete all messages that were sent from a specific host. With MySQL it's a one-line command and would take only a few seconds to execute. Using Maildir it would take hours because you would have to search every message. That's the power of MySQL.
And with great power comes great frustration: Instead of taking 50msec to fetch a message, it takes 80msec, and instead of taking 25msec to load a message into a mailbox, it now takes 200-500msec!
Note that what you REALLY want to do is create a smart index of everything- it exists, it's called ZOE. It's not fast, and I think in practice, it's not all that much fun to use, but it does EXACTLY what you're talking about- builds a relational model of incoming email.
It's Slow. REALLY slow. And don't fool yourself into thinking it's slow because it's written in Java. It's slow because fulltext loading messages is slow.
Maildir is only fast for indexing file names. But if you are indexing across users and domains by host, or headers, or senders, or whatever, then only a database can support these multiple indexes. There are things you can do with databases that are way beyond what you can imagine, especially if you are integrating it with a spam system.
No. The cost is GREATER for MySQL than it is for Maildir because MySQL has to spend the time to index those things whether they're used or not. Building an index on everything is slow- you spend all your time maintaining indexes. Most of the time they're not needed- and for the cases where they're interesting, a specialized solution is always faster.
For example, a new message comes in and you find that the sender matches email in 100 people's spam folders and none in any other folder. It can be classified as spam. If, however, the from address matches ham in people's folders and no spam, then you can probably deliver it without spam scanning.
Sender-checks are useless- most junk mail in my mailboxes comes from random addresses. Language classifiers don't use MySQL as a backend because MySQL is slow. Sparse fields are much faster, and not at all difficult to implement.
This issue isn't performance, it's power.
Really?
Tue, 06 Jun 2006 05:38:41 -0700 (08:38 EDT)
Seems to me that MySQL could simplify all the backend stuff that everyone struggles with and with replication one could create very massive and reliable systems
Tue, 06 Jun 2006 16:40:27 -0700 (19:40 EDT)
Not true for "always". If you have 100,000 messages in a folder the database will win easy.
It seems to me that this is poorly thought out with no real goals in mind. It seems like a way to "keep your options open" - except you neglect that doing so incurs a great cost. I recommend you take a look at ZOE and think a little more if you want all email to work that way.
If you compare it with what people are doing now then MySQL would probably be a little slower.
No, it will definitely be a lot slower.
But if you compare it to what is possible that you can't do now then MySQL wins.
I'm not interested in doing "what's possible"- and in fact, the vast majority of people aren't either. They're interested in solving problems, or having the problems solved for them.
Please go look at ZOE and DBMail and decide if you still think this is a good idea after using these-- maybe it will help you decide exactly what you want to do- and perhaps if you do decide to "make your own LDA" it'll help you decide what's wrong with them- so you don't repeat their mistakes.
-- Internet Connection High Quality Web Hosting http://www.internetconnection.net/
And with great power comes great frustration: Instead of taking 50msec to fetch a message, it takes 80msec, and instead of taking 25msec to load a message into a mailbox, it now takes 200-500msec!
I think the bottom line is that the original poster is very fond of MySQL - which is actually a great database, among others. But I fear he underestimates the load that a large mailserver would impose on any classical database system - many indexes would have to be regenerated all the time because with every incoming mail the fundamental data changes. I get his idea but fear that storing everything away to SQL is the wrong solution; we would end up putting large SMTP proxies before the server in order to keep it running. This is a very different situation from an SQL-driven HTTP server where even in a shop system most data structures and their indexes remain constant most of the time. Another point is that it took the community years to produce a production-ready version of an HTTP server with a DB backend. I think for the dovecot community this is not feasible at the moment. If there is a group of people willing to write a sort of plugin to support SQL, probably nobody here will object. But in the meantime I would like to see this list coming back to the more pressing thing of getting 1.0 out & running.
Yours, Jakob Curdes
On Thu, 2006-06-08 at 21:32 +0200, Jakob Curdes wrote:
And with great power comes great frustration: Instead of taking 50msec to fetch a message, it takes 80msec, and instead of taking 25msec to load a message into a mailbox, it now takes 200-500msec!
I think the bottom line is that the original poster is very fond of MySQL - which is actually a great database, among others. But I fear he underestimates the load that a large mailserver would impose on any classical database system - many indexes would have to be regenerated all the time because with every incoming mail the fundamental data changes. I get his idea but fear that storing everything away to SQL is the wrong solution; we would end up putting large SMTP proxies before the server in order to keep it running. This is a very different situation from a SQL-driven HTTP server, where even in a shop system most data structures and their indexes remain constant most of the time. Another point is that it took the community years to produce a production-ready version of an HTTP server with a DB backend. I think for the Dovecot community this is not feasible at the moment. If there is a group of people willing to write a sort of plugin to support SQL, probably nobody here will object. But in the meantime I would like to see this list coming back to the more pressing thing of getting 1.0 out & running.
+1
thanks for saying this :-)
-- BestSolution.at EDV Systemhaus GmbH http://www.bestsolution.at
On Thu, 2006-06-08 at 21:32 +0200, Jakob Curdes wrote:
If there is a group of people willing to write a sort of plugin to support SQL, probably nobody here will object.
Here's a plugin that I started a long time ago. Doesn't do much, probably won't compile anymore and I've no plans to update it anytime soon:
On Thu, 2006-06-08 at 22:37 +0300, Timo Sirainen wrote:
On Thu, 2006-06-08 at 21:32 +0200, Jakob Curdes wrote:
If there is a group of people willing to write a sort of plugin to support SQL, probably nobody here will object.
Here's a plugin that I started a long time ago. Doesn't do much, probably won't compile anymore and I've no plans to update it anytime soon:
Oh, and my even older (like 3 years) Oracle storage:
Timo Sirainen wrote:
On Thu, 2006-06-08 at 21:32 +0200, Jakob Curdes wrote:
If there is a group of people willing to write a sort of plugin to support SQL, probably nobody here will object.
Here's a plugin that I started a long time ago. Doesn't do much, probably won't compile anymore and I've no plans to update it anytime soon:
Thanks Timo. I'm not a C programmer and I agree that version 1.0 should come first. But - the future of email is in database storage. You might as well get there first.
Here's an interesting overview of something called DBMail. This is along the lines of what I have in mind, but it could use a Dovecot front end.
Marc Perkel wrote:
Here's an interesting overview of something called DBMail. This is along the lines of what I have in mind, but it could use a Dovecot front end.
Ummm... are you not even reading the responses to this (your own) thread?
A reference to DBMail was among the first responses, and there have been others.
--
Best regards,
Charles
On Saturday, June 10, 2006 10:07 AM -0400 Charles Marcus CMarcus@Media-Brokers.com wrote:
A reference to DBMail was among the first responses, and there have been others.
Has anyone compiled a comparison of Dovecot to DBMail? Why would I choose one over the other?
On Mon, 2006-06-12 at 18:12 -0700, Kenneth Porter wrote:
On Saturday, June 10, 2006 10:07 AM -0400 Charles Marcus CMarcus@Media-Brokers.com wrote:
A reference to DBMail was among the first responses, and there have been others.
Has anyone compiled a comparison of Dovecot to DBMail? Why would I choose one over the other?
I think their goals are quite different. Don't know if any such comparisons would be all that useful.
Or I guess I can give you one difference: Dovecot tries very hard to be secure. DBMail, on the other hand, seems to keep adding SQL injection security holes. I told them about this a few years ago and they fixed them, but when I looked at the code a few months ago they had added more of those.
--On Tuesday, June 13, 2006 9:43 AM +0300 Timo Sirainen tss@iki.fi wrote:
Or I guess I can give you one difference: Dovecot tries very hard to be secure. DBMail, on the other hand, seems to keep adding SQL injection security holes. I told them about this a few years ago and they fixed them, but when I looked at the code a few months ago they had added more of those.
Ouch, that's bad. Thanks for the heads-up.
On Wed, 2006-06-07 at 09:29 -0700, Marc Perkel wrote:
Les Mikesell wrote:
I think the people who expect an improvement from databases over maildir are used to unix filesystems that degrade badly as the number of files in a directory increases. These days many, like Reiserfs and XFS, are much better. My theory is that if your filesystem isn't a good place to store things you should fix that before thinking about databases.
Suppose you have a total of one million messages stored for 5000 users across 800 domains and you want to delete all messages that were sent from a specific host. With MySQL it's a one-line command and would take only a few seconds to execute. Using Maildir it would take hours because you would have to search every message. That's the power of MySQL.
And with great power comes great frustration: Instead of taking 50msec to fetch a message, it takes 80msec, and instead of taking 25msec to load a message into a mailbox, it now takes 200-500msec!
Note that what you REALLY want to do is create a smart index of everything- it exists, it's called ZOE. It's not fast, and I think in practice, it's not all that much fun to use, but it does EXACTLY what you're talking about- builds a relational model of incoming email.
It's Slow. REALLY slow. And don't fool yourself into thinking it's slow because it's written in Java. It's slow because fulltext loading messages is slow.
Maildir only is fast for indexing file names. But if you are indexing across users and domains by host, or headers, or senders, or whatever then only a database can support these multiple indexes. There are things you can do with databases that are way beyond what you can imagine. Especially if you are integrating it with a spam system.
No. The cost is GREATER for MySQL than it is for Maildir because MySQL has to spend the time to index those things whether they're used or not. Building an index on everything is slow- you spend all your time maintaining indexes. Most of the time they're not needed- and for the cases where they're interesting, a specialized solution is always faster.
For example, a new message comes in and you find that the sender matches email in 100 people's spam folders and none in any other folder? It can be classified as spam. If however the from address matches ham in people's folders and no spam then you can probably deliver it without spam scanning.
Sender-checks are useless- most junk mail in my mailboxes comes from random addresses. Language classifiers don't use MySQL as a backend because MySQL is slow. Sparse fields are much faster, and not at all difficult to implement.
This issue isn't performance, it's power.
Really?
Tue, 06 Jun 2006 05:38:41 -0700 (08:38 EDT)
Seems to me that MySQL could simplify all the backend stuff that everyone struggles with and with replication one could create very massive and reliable systems
Tue, 06 Jun 2006 16:40:27 -0700 (19:40 EDT)
Not true for "always". If you have 100,000 messages in a folder the database will win easy.
It seems to me that this is poorly thought out with no real goals in mind. It seems like a way to "keep your options open" - except you neglect that doing so incurs a great cost. I recommend you take a look at ZOE and think a little more if you want all email to work that way.
If you compare it with what people are doing now then MySQL would probably be a little slower.
No, it will definitely be a lot slower.
But if you compare it to what is possible that you can't do now then MySQL wins.
I'm not interested in doing "what's possible"- and in fact, the vast majority of people aren't either. They're interested in solving problems, or having the problems solved for them.
Please go look at ZOE and DBMail and decide if you still think this is a good idea after using these-- maybe it will help you decide exactly what you want to do- and perhaps if you do decide to "make your own LDA" it'll help you decide what's wrong with them- so you don't repeat their mistakes.
-- Internet Connection High Quality Web Hosting http://www.internetconnection.net/
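For readers following the delete-by-host argument above, here is a minimal sketch of both sides of it, using Python's built-in sqlite3 as a stand-in for MySQL. The `messages` table, its columns and the sample rows are hypothetical; they are not taken from DBMail, ZOE or Dovecot. The delete really is a single statement, but the index that makes it quick has to be updated on every delivery, which is the cost being argued about.

```python
import sqlite3

def make_store():
    # In-memory database; this schema is an illustrative assumption only.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE messages ("
        " id INTEGER PRIMARY KEY,"
        " mailbox TEXT NOT NULL,"
        " host TEXT NOT NULL,"
        " body BLOB NOT NULL)"
    )
    # This index makes the delete-by-host fast, and it is maintained on every INSERT.
    db.execute("CREATE INDEX messages_by_host ON messages (host)")
    return db

def deliver(db, mailbox, host, body):
    # Bound parameters keep message data out of the SQL text itself.
    db.execute(
        "INSERT INTO messages (mailbox, host, body) VALUES (?, ?, ?)",
        (mailbox, host, body),
    )

def delete_from_host(db, host):
    # The "one line command": remove every message that arrived from one host.
    cur = db.execute("DELETE FROM messages WHERE host = ?", (host,))
    return cur.rowcount

db = make_store()
deliver(db, "alice@example.org", "spam.example.net", b"buy now")
deliver(db, "bob@example.com", "spam.example.net", b"buy now")
deliver(db, "bob@example.com", "mail.example.org", b"hello")
removed = delete_from_host(db, "spam.example.net")
remaining = db.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
print(removed, remaining)
```

Nothing here measures speed; whether the per-delivery index cost is worth it is exactly what the thread disagrees on.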
Hello Marc,
Tuesday, June 6, 2006, 7:25:25 PM, you wrote:
MP> I'm suggesting it in addition to MBOX and MAILDIR. And of course if there's a MySQL version then other databases will follow. Just seems to me that if I were running a really BIG email operation that MySQL could have some serious benefits.
I don't think that will be true. I've read articles by Vladimir Butenko (author of CommuniGate Pro, a very scalable and high-performance mail solution) saying that "generic SQL" does not give much benefit on average-loaded sites and limits the performance of highly loaded sites. As does any 'generic' solution.
E-mails are like objects, not relations, so you have two approaches, really:
(a) Try to emulate objects on tables, like some OO2RDBMS wrappers do. The database is perfectly normalized, all strings are stored only once in separate tables (like 'headernames' and 'headervalues'), etc. The result is very massive JOINs and poor performance.
(b) Store whole e-mails as BLOBs + some indexes on the main headers. The result is a very high IO load on the RDBMS, which needs to retrieve and store large contiguous objects.
The best solution seems to be to store e-mails as-is, one e-mail per file, and to store some indexes in a simple low-level database, like BerkeleyDB. The filesystem does best at working with "BLOBs", and a simple database engine without a complex query language allows VERY fast indexes. This solution has one additional advantage: all indexes can be rebuilt from the e-mails if the DB was backed up with errors, for example.
-- Best regards, Lev mailto:lev@serebryakov.spb.ru
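A rough sketch of the layout described above (one e-mail per file, plus a small key-value index that can always be rebuilt from the files) might look like the following. Python's standard `dbm` module stands in for BerkeleyDB here, and the file names, the choice of a sender index and the helper names `rebuild_index` and `files_from` are all assumptions made for illustration.

```python
import dbm
import email
import email.utils
import json
import os
import tempfile

def rebuild_index(mail_dir, index_path):
    """Rebuild the sender -> filenames index using nothing but the message files."""
    by_sender = {}
    for name in sorted(os.listdir(mail_dir)):
        with open(os.path.join(mail_dir, name), "rb") as fh:
            msg = email.message_from_binary_file(fh)
        sender = email.utils.parseaddr(msg.get("From", ""))[1].lower()
        if sender:
            by_sender.setdefault(sender, []).append(name)
    # Flag "n" always creates a fresh, empty database, so a damaged index is replaced.
    with dbm.open(index_path, "n") as index:
        for sender, names in by_sender.items():
            index[sender.encode("utf-8")] = json.dumps(names).encode("utf-8")

def files_from(index_path, sender):
    """Look up which message files came from the given sender."""
    with dbm.open(index_path, "r") as index:
        raw = index.get(sender.lower().encode("utf-8"))
    return json.loads(raw) if raw is not None else []

# Demo in a throwaway directory with three tiny messages.
root = tempfile.mkdtemp()
mail_dir = os.path.join(root, "mail")
os.mkdir(mail_dir)
samples = {
    "1.eml": b"From: Alice <alice@example.org>\r\nSubject: one\r\n\r\nbody\r\n",
    "2.eml": b"From: bob@example.com\r\nSubject: two\r\n\r\nbody\r\n",
    "3.eml": b"From: ALICE@example.org\r\nSubject: three\r\n\r\nbody\r\n",
}
for name, raw in samples.items():
    with open(os.path.join(mail_dir, name), "wb") as fh:
        fh.write(raw)

index_path = os.path.join(root, "index")
rebuild_index(mail_dir, index_path)
alice_files = files_from(index_path, "Alice@Example.org")
print(alice_files)
```

The message files stay the single source of truth; losing or corrupting the index only costs the time of another `rebuild_index` run.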
On Tue, 6 Jun 2006, Lev Serebryakov wrote:
The best solution seems to be to store e-mails as-is, one e-mail per file, and to store some indexes in a simple low-level database, like BerkeleyDB. The filesystem does best at working with "BLOBs", and a simple database engine without a complex query language allows VERY fast indexes.
I have nothing but trouble using OpenLDAP's fast BerkeleyDB implementation. I agree that BLOBs in a filesystem are much faster than the overhead required to pull them from a DB, especially because you can send the data right to the client. (I don't know if you can instruct a DB to send BLOBs right into a TCP channel.)
However, when it comes to taking a message apart and storing its parts so that they can be shared, it would be a matter of benchmarking: can you still read a part from the filesystem and pass it on directly?
I also don't believe that SQL is a problem per se just because it is a complex language that offers complex and slow features; a plethora of high- and low-grade programmers, engineers and theorists have built good algorithms and optimization strategies for SQL that nobody can easily re-implement.
I also guess that it would make no sense for Dovecot to use both backends. I mean: if you go the SQL way half-heartedly, it will work badly and nobody will be happy using it, but if you do it properly you will lose focus on the filesystem-based storage, and lots of people don't want to install a DB. There is a discussion about this very same topic in OpenLDAP. Summary (my view :): the SQL backend is a lot quicker, but OpenLDAP's usage of it is bad because the code was written for BerkeleyDB, so the performance is lower; one would need to rewrite too much code, and afterwards the SQL DB would be quick but BDB slow.
This solution has one additional advantage: all indexes can be rebuilt from the e-mails if the DB was backed up with errors, for example.
IMO: This is one thing, that needs to be done for Dovecot, e.g. keywords, and a toolset.
-- Steffen Kaiser
I don't want to comment on the technology, but the idea of indexing
to me is the most interesting.
For a commercial client we are using Zimbra - and with one mail box
containing about 12 GB of mail I can search it for practically any
word, and even limit that to particular headers, or file types, or
content in a file (the list goes on) in sub-1-second. I also have
virtual mail folders that can use that same search and present those
folders as IMAP folders - very cool.
Now that is the killer app.
Mind you - Zimbra is just too big and slow - great for commercial
customers where you have maybe 200 or so email accounts on a box, but
I am sure we are more used to running systems with 10,000 users per
box - Zimbra would not cope (well... java...).
Scott
Here's an example of what I would do with a MySQL backend.
I have an inbox where I get email on a wide variety of subjects. One of those classifications is customers. So I would create a customers folder and have a rule that if an incoming email's from address matches any email address in my customers folder then the mail would be delivered there. That way if I get a new customer then all I have to do is drag the first message into the customers folder and all other mail from that person would go into that folder.
To do this with MySQL would be trivial. To do it without a database isn't possible.
Marc Perkel wrote:
To do this with MySQL would be trivial. To do it without a database isn't possible.
This has nothing to do with the backend and everything to do with the Local Delivery Agent. Whether you are using a database backend or a filesystem backend, the LDA would have to be able to query an index for your functionality to be possible. I can easily write a script to update local user configuration for a Maildrop, procmail, or SIEVE filter to do exactly what you want, without relying on a database at all.
John
-- John Peacock Director of Information Research and Technology Rowman & Littlefield Publishing Group 4501 Forbes Boulevard Suite H Lanham, MD 20706 301-459-3366 x.5010 fax 301-429-5748
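To make the "script that updates a SIEVE filter" idea concrete, here is a small hedged sketch. It assumes something else has already collected the From headers of the messages sitting in the customers folder; the function name `sieve_rule_for_folder` and the folder name are hypothetical, and how the generated script gets installed depends on the LDA in use.

```python
from email.utils import parseaddr

def sieve_quote(text):
    # Sieve quoted strings escape backslash and double quote with a backslash.
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

def sieve_rule_for_folder(folder, from_headers):
    """Generate a Sieve script that files mail from known senders into `folder`."""
    addresses = sorted({parseaddr(h)[1].lower() for h in from_headers} - {""})
    if not addresses:
        return "# no known senders yet\n"
    address_list = ", ".join(sieve_quote(a) for a in addresses)
    return (
        'require ["fileinto"];\n'
        'if address :is "from" [%s] {\n'
        "    fileinto %s;\n"
        "}\n" % (address_list, sieve_quote(folder))
    )

# From headers as they might be read out of the customers folder.
headers = [
    "Carol Customer <carol@example.com>",
    "dave@example.net",
    "CAROL@example.com",
]
script = sieve_rule_for_folder("customers", headers)
print(script)
```

Re-running the generator whenever a message is moved into the folder gives the "drag one message, catch the rest" behaviour without any SQL.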
John Peacock wrote:
Marc Perkel wrote:
To do this with MySQL would be trivial. To do it without a database isn't possible.
This has nothing to do with the backend and everything to do with the Local Delivery Agent. Whether you are using a database backend or a filesystem backend, the LDA would have to be able to query an index for your functionality to be possible. I can easily write a script to update local user configuration for a Maildrop, procmail, or SIEVE filter to do exactly what you want, without relying on a database at all.
John
You can't write a script to do the kinds of things I want to do because what I want hasn't ever been done before and it requires a database to do it.
I would probably create my own LDA using MySQL commands. Feed it from Exim.
Marc Perkel wrote:
You can't write a script to do the kinds of things I want to do because what I want hasn't ever been done before and it requires a database to do it.
You both underestimate my programming skills and overestimate the capabilities of general purpose databases. Feel free to go off somewhere and develop your database backend then come back here and announce it. Your constant naysaying is getting tiresome and is definitely off topic here. I can only speak for myself, but I want Timo to focus his energy on getting 1.0 gold, not on pointless digressions about bluesky schemes...
John
-- John Peacock Director of Information Research and Technology Rowman & Littlefield Publishing Group 4501 Forbes Boulevard Suite H Lanham, MD 20706 301-459-3366 x.5010 fax 301-429-5748
John Peacock wrote:
what I want hasn't ever been done before
this part especially made me laugh.
capabilities of general purpose databases. Feel free to go off somewhere and develop your database backend then come back here and announce it. Your constant naysaying is getting tiresome and is definitely off topic here. I can only speak for myself, but I want Timo to focus his energy on getting 1.0 gold, not on pointless digressions about bluesky schemes...
seconded.
Backends could be implemented as plugins and that'd be surely something very nice, but I think right now, with 1.0 upcoming, is not the time for big feature additions. Software development is not done by coming up with fancy ideas.
Here's an example of what I would do with a MySQL backend.
I have an inbox where I get email on a wide variety of subjects. One of those classifications is customers. So I would create a customers folder and have a rule that if an incoming email's from address matches any email address in my customers folder then the mail would be delivered there. That way if I get a new customer then all I have to do is drag the first message into the customers folder and all other mail from that person would go into that folder.
To do this with MySQL would be trivial. To do it without a database isn't possible.
You can do this with mail rules.
nodata wrote:
Here's an example of what I would do with a MySQL backend.
I have an inbox where I get email on a wide variety of subjects. One of those classifications is customers. So I would create a customers folder and have a rule that if an incoming email's from address matches any email address in my customers folder then the mail would be delivered there. That way if I get a new customer then all I have to do is drag the first message into the customers folder and all other mail from that person would go into that folder.
To do this with MySQL would be trivial. To do it without a database isn't possible.
You can do this with mail rules.
But you would have to add a rule for each new customer. What I'm talking about is that the presence of email in the folder would determine what new email would be delivered into that folder - without writing a rule. Dragging a message into the folder makes all future messages from that sender go to that folder.
On Wed, 2006-06-07 at 08:22 -0700, Marc Perkel wrote:
nodata wrote:
Here's an example of what I would do with a MySQL backend.
I have an inbox where I get email on a wide variety of subjects. One of those classifications is customers. So I would create a customers folder and have a rule that if an incoming email's from address matches any email address in my customers folder then the mail would be delivered there. That way if I get a new customer then all I have to do is drag the first message into the customers folder and all other mail from that person would go into that folder.
To do this with MySQL would be trivial. To do it without a database isn't possible.
You can do this with mail rules.
But you would have to add a rule for each new customer.
I would most certainly not have to add a rule for each new customer.
I have these in some of my .qmail files:
|/var/qmail/bin/maildirmake .maildir/.Archive-`date +%Y-%m` >/dev/null 2>&1;exit 0
|/var/qmail/bin/maildir ./.maildir/.Archive-`date +%Y-%m`
./.maildir/
Are you going to tell me that in order to sort my data by date I have to create a rule for each date that I'm interested in?
Are you going to tell me there's any reason I couldn't have used "$SENDER" instead of `date +%Y-%m`?
What I'm talking about is that the presence of email in the folder would determine what new email would be delivered into that folder - without writing a rule. Dragging a message into the folder makes all future messages from that sender go to that folder.
That's a really poorly thought-out user interface. When people use their IMAP clients and move one message into a folder, they do not expect potentially a thousand more messages to get moved along with it.
In any event, that kind of magic doesn't require SQL. It can be done with some folder hooks, and a little bit of thinking.
My email client allows me to have a kind of view or virtual folder based on some criteria, and automation allows me to create those views automatically.
-- Internet Connection High Quality Web Hosting http://www.internetconnection.net/
Marc Perkel wrote:
To do this with MySQL would be trivial. To do it without a database isn't possible.
My MUA has the ability to do exactly this... In any case, it isn't possible without some server-side (or client-side) program -- MUA or LDA (because Dovecot doesn't have any filtering abilities by itself). I don't see why it is not possible to write such an LDA/pseudo-MUA without a database. The simplest way is to generate a sieve/procmail script...
-- // Lev Serebryakov
On Wednesday 07 Jun 2006 15:32, Lev A. Serebryakov wrote:
Marc Perkel wrote:
To do this with MySQL would be trivial. To do it without a database isn't possible.
My MUA has the ability to do exactly this... In any case, it isn't possible without some server-side (or client-side) program -- MUA or LDA (because Dovecot doesn't have any filtering abilities by itself). I don't see why it is not possible to write such an LDA/pseudo-MUA without a database. The simplest way is to generate a sieve/procmail script...
I think Marc exaggerates when he says "isn't possible", what he means is difficult, messy, and/or inefficient.
Doing filtering like this in the MUA is hideous anyway, as it locks you to a MUA (not to say I don't have a copy of thunderbird sorting my IMAP folders all day long on another machine, but it isn't a robust approach), which is why a standard for server side mail filtering like SIEVE is so desirable.
However I think Marc has given enough examples of things that would likely be easier with a database. As I said, I like the idea of mail in databases; it's just that some people seem to have unrealistic hopes for what it might achieve. I also think building "brains" into the mail system is likely the only long-term solution to spam (and spam-like problems -- ignoring fixing the large base of compromised PCs out there).
Lev A. Serebryakov wrote:
Marc Perkel wrote:
To do this with MySQL would be trivial. To do it without a database isn't possible.
My MUA has the ability to do exactly this... In any case, it isn't possible without some server-side (or client-side) program -- MUA or LDA (because Dovecot doesn't have any filtering abilities by itself). I don't see why it is not possible to write such an LDA/pseudo-MUA without a database. The simplest way is to generate a sieve/procmail script...
The complex delivery logic would be in writing my own LDA to do the fancy stuff. I just need Dovecot to read it once it's in the database.
On Tue, 2006-06-06 at 08:25 -0700, Marc Perkel wrote:
I'm suggesting it in addition to MBOX and MAILDIR. And of course if there's a MySQL version then other databases will follow. Just seems to me that if I were running a really BIG email operation that MySQL could have some serious benefits.
Using a "big ol' SQL database" will ALWAYS give _worse performance_ than a specialized solution (like dbox), and will usually give _worse performance_ than a naive but still specialized solution (like maildir or mbox).
SQL's strengths are mutability, not performance- regardless of what you or others might think. USUALLY the database performance is "good enough" for most applications- but that doesn't mean it's even remotely close to optimal.
Nevertheless, DBMail exists- it's an SQL-backed IMAP server, and it's active- and supports MySQL (in addition to SQLite and PostgreSQL).
It's not as fast as dovecot at many things (as expected), but it's fun to play with, and I've used it to experiment with new mail access methods- i.e. non-IMAP access to messages.
-- Internet Connection High Quality Web Hosting http://www.internetconnection.net/
Geo Carncross wrote:
On Tue, 2006-06-06 at 08:25 -0700, Marc Perkel wrote:
I'm suggesting it in addition to MBOX and MAILDIR. And of course if there's a MySQL version then other databases will follow. Just seems to me that if I were running a really BIG email operation that MySQL could have some serious benefits.
Using a "big ol' SQL database" will ALWAYS give _worse performance_ than a specialized solution (like dbox), and will usually give _worse performance_ than a naive but still specialized solution (like maildir or mbox).
Not true for "always". If you have 100,000 messages in a folder the database will win easy.
SQL's strengths are mutability, not performance- regardless of what you or others might think. USUALLY the database performance is "good enough" for most applications- but that doesn't mean it's even remotely close to optimal.
If you have a large system then what you might want is power and if you want speed you just spend more money for faster hardware.
Nevertheless, DBMail exists- it's an SQL-backed IMAP server, and it's active- and supports MySQL (in addition to SQLite and PostgreSQL).
I'd like to see Dovecot have that option as well.
On Tue, 2006-06-06 at 16:40 -0700, Marc Perkel wrote:
I'm suggesting it in addition to MBOX and MAILDIR. And of course if there's a MySQL version then other databases will follow. Just seems to me that if I were running a really BIG email operation that MySQL could have some serious benefits.
Using a "big ol' SQL database" will ALWAYS give _worse performance_ than a specialized solution (like dbox), and will usually give _worse performance_ than a naive but still specialized solution (like maildir or mbox).
Not true for "always". If you have 100,000 messages in a folder the database will win easy.
On what operations? Suppose you want to delete a message out of the middle and release the file space back to the OS? That's pretty quick in a filesystem like Reiserfs.
-- Les Mikesell lesmikesell@gmail.com
Les Mikesell wrote:
On what operations? Suppose you want to delete a message out of the middle and release the file space back to the OS? That's pretty quick in a filesystem like Reiserfs.
DBs have DB spaces, so there is no need to release the space used by one email. If you have a big mailserver, releasing the space of a single email is not that important.
But let's not discuss performance in SQL; it's just an old, old issue.
Oliver
-- Oliver Schulze L. oliver@samera.com.py
Les Mikesell wrote:
On Tue, 2006-06-06 at 16:40 -0700, Marc Perkel wrote:
On what operations? Suppose you want to delete a message out of the middle and release the file space back to the OS? That's pretty quick in a filesystem like Reiserfs.
Suppose I want to move all messages from a specific host to another folder. UPDATE table SET folder=new-folder WHERE host=value;
With databases you can do things you would have to write complex programs to do with files. Things that you would consider to be impossible are trivial with databases.
This is thinking outside the box.
On Tue, 2006-06-06 at 17:29 -0700, Marc Perkel wrote:
Using a "big ol' SQL database" will ALWAYS give _worse performance_ than a specialized solution (like dbox), and will usually give _worse performance_ than a naive but still specialized solution (like maildir or mbox).
Not true for "always".
Yes true for always.
It might be easier for you to accept if you read that as "a specialized database can always perform better than a generic one"; you'll probably understand it better that way.
If you have 100,000 messages in a folder the database will win easy.
The argument is not against databases, it's against SQL. dbox is a database. maildir is a database.
Les Mikesell wrote:
On Tue, 2006-06-06 at 16:40 -0700, Marc Perkel wrote:
On what operations? Suppose you want to delete a message out of the middle and release the file space back to the OS? That's pretty quick in a filesystem like Reiserfs.
Suppose I want to move all messages from a specific host to another folder. UPDATE table SET folder=new-folder WHERE host=value;
Since we're being completely hypothetical here:
echo new-folder > ~/hostmap/value.tmp &&
mv ~/hostmap/value.tmp ~/hostmap/value
With databases you can do things you would have to write complex programs to do with files.
Change that to SQL, and I agree: Read my email again. See the part about mutability? SQL is good for that.
Things that you would consider to be impossible are trivial with databases.
No. Things that YOU would consider to be impossible, perhaps. But note here that SQL != database. dbox is a kind of database- one that's designed to specialize in email storage.
-- Internet Connection High Quality Web Hosting http://www.internetconnection.net/
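The two-line shell hostmap above (write a temporary file, then rename it over the real one) can be sketched in Python as below. The directory layout, the helper names and the "INBOX" default are assumptions for illustration, and the sketch presumes `host` has already been validated as a plain hostname with no path separators in it.

```python
import os
import tempfile

def set_folder_for_host(hostmap_dir, host, folder):
    """Point all future mail from `host` at `folder`, atomically."""
    tmp_path = os.path.join(hostmap_dir, host + ".tmp")
    with open(tmp_path, "w") as fh:
        fh.write(folder + "\n")
    # Rename over the old entry: readers see the old or the new value, never half a file.
    os.replace(tmp_path, os.path.join(hostmap_dir, host))

def folder_for_host(hostmap_dir, host, default="INBOX"):
    """What a delivery agent would ask before filing a message."""
    try:
        with open(os.path.join(hostmap_dir, host)) as fh:
            return fh.read().strip()
    except FileNotFoundError:
        return default

hostmap_dir = tempfile.mkdtemp()
set_folder_for_host(hostmap_dir, "mail.example.org", "old-folder")
set_folder_for_host(hostmap_dir, "mail.example.org", "new-folder")
print(folder_for_host(hostmap_dir, "mail.example.org"))
print(folder_for_host(hostmap_dir, "unknown.example.net"))
```

Note that this changes where future mail from the host is filed; it says nothing about moving messages that are already stored, which is a separate job in either design.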
With MySQL for storage you can use replicated databases and multiple machines running Dovecot IMAP. The "filesystem" doesn't have the capabilities that MySQL has.
On 06-06-2006 08:19:35 -0700, Marc Perkel wrote:
With MySQL for storage you can use replicated databases and multiple machines running Dovecot IMAP. The "filesystem" doesn't have the capabilities that MySQL has.
RAID? Distributed RAID? (i.e. infiniband?) Also, how much does a centralised approach such as MySQL's replication hierarchy help? IMAP is a lot about writing the data (marking as read, flagging, removing, adding, moving, etc.) so the master server will have to do all the work. How much "delay" do you allow on the slaves? Is there any use for a read-only slave in an IMAP world other than a mailing list archive or fallback? rsync can do wonders for those too. You would need at least a PostgreSQL multi-master structure to manage, for instance, geographically bound data... but do users share mailboxes across the world and use only a certain part (folder?) of them on a regular basis?
The idea is tempting, because of the possible reuse of and benefit from an area that has had serious research since the 1950s. However, I think it is at this stage really questionable whether the overhead of a true DBMS as storage backend is justifiable by any real benefits gained. Unfortunately.
Hehe, my 2 completely off-topic euro cents.
-- Fabian Groffen MonetDB developer - http://monetdb.cwi.nl/
Am Dienstag, den 06.06.2006, 08:19 -0700 schrieb Marc Perkel:
With MySQL for storage you can use replicated databases and multiple machines running dovecot IMAP. The "filesystem" doesn't have the capabilities that MySQL has.
Yes it does, you just have to choose a filesystem designed to be shared. Like NFS, GFS, etc.
MySQL can't scale for writes beyond a single server, unless your application knows which server has which data.
On Tue, 2006-06-06 at 05:38 -0700, Marc Perkel wrote:
Have you considered the idea of storing all the email in a MySQL database? Seems to me that MySQL could simplify all the backend stuff that everyone struggles with and with replication one could create very massive and reliable systems. What would it take to use MySQL that way?
I'm not against Dovecot supporting an SQL backend. I've even started coding it a couple of times. But I don't think it's really all that great an idea, at least from a performance point of view. Dovecot's current indexes are highly optimized for what an IMAP server needs to do, and SQL just doesn't support many such features in any efficient way.
As for faster searching, the problem with that is still the same as I mentioned a month or so ago: A fully compliant IMAP server can't use any normal full text search indexers to implement the SEARCH command. I'm still going to add some search extension which does allow doing this, and Dovecot will have fast full text search support without requiring an SQL database.
An SQL database might be a good idea if your client (eg. webmail) used it directly and the database's schema was specifically designed for the client. But for an IMAP server, which needs to be able to do many things and is restricted by the way the IMAP RFC requires things to be done, SQL just sucks.
Also I think others already mentioned that MySQL's replication allows only read-only slaves, which I don't think helps all that much. As for multi-master replication, what databases really support that? I think with Oracle it supports it only when the filesystem itself is already shared between the computers, and Dovecot already supports running in shared filesystems, so Oracle would be just extra overhead there.
On Wednesday 07 Jun 2006 12:44, Timo Sirainen wrote:
As for multi-master replication, what databases really support that? I think with Oracle it supports it only when the filesystem itself is already shared between the computers, and Dovecot already supports running in shared filesystems, so Oracle would be just extra overhead there.
Oracle (well, versions 7 and 8) supports a multimaster replication strategy; the only requirement was a reasonable network connection between instances. It uses a distributed transaction model, so the masters just queue up transactions which are then pushed by existing Oracle mechanisms to send remote updates to other servers.
For email, which has a relatively simple state model and mostly small updates (excluding "new email"): delete message, mark read, mark replied, change folder attribute, it could be made to work well (at least till people at different sites file the same email in different folders at the same time ;-). It would also be simple to do a read-mostly copy, so that, say, email attachments are mirrored read-only at far-flung sites, removing the multimaster requirement but still speeding up the "read the email with big attachment" case (which for most wide area networks is the main pain -- Lotus Notes anyone?)
Not sure what the point would be of multimaster rep for email, as it is likely to push up the resource requirements. I don't see a scenario where it is likely to help, unless you have a lot of email boxes that are shared at two sites with poor connectivity between them, but which change slowly. Lotus Notes has to be better at something - possibly ;)
We used Oracle replication to push the active master closer to the end user, for a 'follow the sun' helpdesk; it was probably more trouble than it was worth even so. But I did get to do the Oracle Advanced Replication course, which was educational. Mostly I learnt that multimaster replication can be deployed with simple and robust tools, and that from an application design perspective trying to use it is almost always a mistake, since it introduces various subtle constraints on the application's use of the database, and can cause you to have a lot of out-of-sync databases in various far-flung corners of the world, and spend a lot of time and effort getting them all back in sync.
On Wed, 2006-06-07 at 13:23 +0100, Simon Waters wrote:
On Wednesday 07 Jun 2006 12:44, Timo Sirainen wrote:
As for multi-master replication, what databases really support that? I think with Oracle it supports it only when the filesystem itself is already shared between the computers, and Dovecot already supports running in shared filesystems, so Oracle would be just extra overhead there.
Oracle (well, versions 7 and 8) supports a multimaster replication strategy; the only requirement was a reasonable network connection between instances. It uses a distributed transaction model, so the masters just queue up transactions which are then pushed by existing Oracle mechanisms to send remote updates to other servers.
Oh. I guess there are then multiple ways to do it. I remember also seeing the Oracle-on-top-of-OracleFS way.
BTW. I've thought about doing something similar with Dovecot also. Dovecot already writes transaction logs which contain pretty much all the changes that Dovecot sees, except that for new mails they only contain "Appended UID 5 with flags \Seen".
So I've been thinking about the possibility of sending all those transactions also through network to some other computer, and in appends also sending the actual mail contents. That'd make replication somewhat easy I think.
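To make the idea concrete, here is a minimal sketch of replaying shipped log records on a second machine. The record layout (`LogRecord`, its `kind` values) and the `Mailbox` class are hypothetical illustrations, not Dovecot's actual transaction log format; the only point carried over from the message is that append records would have to carry the mail contents while other records carry just the change.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LogRecord:
    """One hypothetical transaction-log entry (not Dovecot's real format)."""
    kind: str                        # "append", "flags" or "expunge"
    uid: int
    flags: frozenset = frozenset()
    body: Optional[bytes] = None     # only shipped together with appends


@dataclass
class Mailbox:
    """Toy mailbox state: uid -> (flags, body)."""
    messages: dict = field(default_factory=dict)

    def apply(self, rec: LogRecord) -> None:
        if rec.kind == "append":
            if rec.body is None:
                raise ValueError("append record must carry the mail contents")
            self.messages[rec.uid] = (rec.flags, rec.body)
        elif rec.kind == "flags":
            _, body = self.messages[rec.uid]
            self.messages[rec.uid] = (rec.flags, body)
        elif rec.kind == "expunge":
            self.messages.pop(rec.uid, None)
        else:
            raise ValueError(f"unknown record kind: {rec.kind}")


def replicate(log, replica: Mailbox) -> None:
    """Replay records on the replica in the order the master wrote them."""
    for rec in log:
        replica.apply(rec)
```

Replaying the same log in the same order on both sides leaves the two mailboxes identical, which is all this sketch tries to show; network transport, acknowledgements and crash recovery are left out.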
For multi-master operation the only difficult part is how to allocate UIDs for messages. They always need to be ascending, so there would have to be a global pool where they're allocated from. In case the global pool isn't available, it gets trickier. I guess the UIDs should somehow be temporarily allocated and when all masters are again connected, the conflicting UIDs would have to be reassigned to new UIDs.
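One possible reading of the "reassign when reconnected" step, as a rough sketch: each master keeps the messages it accepted while disconnected in arrival order, and on reconnect all of them get fresh UIDs above the last UID every master had agreed on. The function name, the per-master dictionary and the tie-break by master name are assumptions made for illustration only.

```python
def merge_after_partition(last_agreed_uid, per_master):
    """Assign final UIDs to messages appended while masters were disconnected.

    per_master maps a master name to the list of message ids it accepted,
    in local arrival order. Masters are visited in sorted name order so that
    every master computes the same result independently.
    Returns {message_id: final_uid}; UIDs are unique and strictly ascending.
    """
    next_uid = last_agreed_uid + 1
    assigned = {}
    for master in sorted(per_master):
        for msg_id in per_master[master]:
            assigned[msg_id] = next_uid
            next_uid += 1
    return assigned


if __name__ == "__main__":
    pending = {"b": ["m3"], "a": ["m1", "m2"]}
    print(merge_after_partition(10, pending))
```

This keeps the "always ascending" rule, but a client that already saw one of the temporary UIDs would presumably have to be told that the message went away and reappeared under its new UID.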
On Wed, Jun 07, 2006 at 04:04:57PM +0300, Timo Sirainen wrote:
For multi-master operation the only difficult part is how to allocate UIDs for messages. They always need to be ascending, so there would have to be a global pool where they're allocated from. In case the global pool isn't available, it gets trickier. I guess the UIDs should somehow be temporarily allocated and when all masters are again connected, the conflicting UIDs would have to be reassigned to new UIDs.
See
http://dev.mysql.com/doc/refman/5.0/en/replication-auto-increment.html
for MySQL's solution to multi-master auto-increment (did someone say MySQL doesn't do multi-master? I've never tested it, but the link is in 5.0 docs, and 5.1 is out).
This deals with non-conflicting UIDs, but I'm not sure how the "ascending" UID restriction would cope with one master being queried when the other is not connected or not up to date.
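A small simulation of that concern, assuming the scheme on the linked page works by giving each master its own offset and a shared step (MySQL's `auto_increment_offset` and `auto_increment_increment` settings). It models only the case where the masters have not yet seen each other's rows; the function names are made up for the example.

```python
import itertools


def id_stream(offset, increment):
    """IDs one master would hand out on its own: offset, offset+increment, ..."""
    return itertools.count(offset, increment)


def simulate(arrivals, increment=2):
    """Return the IDs assigned to messages in true arrival order.

    arrivals is a sequence of 1-based master numbers, one per incoming
    message. Masters are assumed to be out of touch with each other, so
    each simply continues its own series.
    """
    streams = {m: id_stream(m, increment) for m in range(1, increment + 1)}
    return [next(streams[m]) for m in arrivals]


if __name__ == "__main__":
    # Three messages land on master 1, then one on master 2.
    print(simulate([1, 1, 1, 2]))
```

In this model the IDs never collide, but the fourth message gets an ID lower than the three before it, so a client that had already seen the higher IDs would get a "new" message with a smaller number. That is the ascending-UID problem in miniature.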
Timo,
the idea of a MySQL backend isn't because you want to win a speed contest against other storage methods. The idea is because with MySQL you can do things that would be nearly impossible without a database. There are things I'd like to do that are basically unthinkable now. Databases would allow me to do indexing on a whole new level. I could add awesome features to my email system and it would be incredible for spam control.
One example for spam control. Suppose a spam attack comes in delivering the same spam to thousands of users. And suppose that my system didn't realise it was spam until after the spam was delivered. I could delete all email from that host for all users in seconds if I make the sender's host address a key field. You can't block spam retroactively without a database.
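For illustration, a sketch of what that could look like. It uses Python's built-in `sqlite3` rather than MySQL so it runs anywhere, and the `messages` table, its columns and the index name are all invented for the example rather than taken from any real mail store.

```python
import sqlite3


def build_store():
    """Create a toy in-memory mail store with the sender host as a key field."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE messages ("
        " id INTEGER PRIMARY KEY,"
        " owner TEXT NOT NULL,"
        " sender_host TEXT NOT NULL,"
        " body BLOB NOT NULL)"
    )
    # The index is what makes 'everything from this host' a cheap lookup.
    db.execute("CREATE INDEX idx_sender_host ON messages (sender_host)")
    return db


def purge_host(db, host):
    """Delete every message from one sending host, for all users at once."""
    with db:  # commits on success, rolls back on error
        cur = db.execute("DELETE FROM messages WHERE sender_host = ?", (host,))
    return cur.rowcount


if __name__ == "__main__":
    db = build_store()
    rows = [("alice", "192.0.2.1", b"spam"), ("bob", "192.0.2.1", b"spam"),
            ("bob", "198.51.100.7", b"hello")]
    db.executemany(
        "INSERT INTO messages (owner, sender_host, body) VALUES (?, ?, ?)", rows)
    print(purge_host(db, "192.0.2.1"))
```

The delete itself is a single statement; whether an IMAP server sitting on top of such a table could cope with messages vanishing underneath it is a separate question.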
Marc Perkel wrote:
One example for spam control. Suppose a spam attack comes in delivering the same spam to thousands of users. And suppose that my system didn't realise it was spam until after the spam was delivered. I could delete all email from that host for all users in seconds if I make the sender's host address a key field. You can't block spam retroactively without a database.
It's called 'SEARCH HEADER Received "$host" ON $date' and 'STORE $messages +FLAGS.SILENT (\Deleted)' in the IMAP lingo :)
Cheers, -jkt
-- cd /local/pub && more beer > /dev/mouth
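As a rough sketch of the two IMAP commands quoted above, here is how they might be issued from Python's standard `imaplib`. The helper names `imap_date` and `delete_from_host` are made up; `conn` is assumed to be an `imaplib.IMAP4` (or `IMAP4_SSL`) object that is already logged in with a mailbox selected read-write, and the host is assumed to contain no quotes or backslashes.

```python
import datetime

_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def imap_date(day: datetime.date) -> str:
    """Format a date the way IMAP SEARCH expects it, e.g. 06-Jun-2006."""
    return f"{day.day:02d}-{_MONTHS[day.month - 1]}-{day.year}"


def delete_from_host(conn, host: str, day: datetime.date) -> int:
    """Flag as \\Deleted every message in the selected mailbox whose
    Received header mentions host and whose internal date is day.

    Returns the number of messages flagged.
    """
    if '"' in host or "\\" in host:
        raise ValueError("host must not contain quotes or backslashes")
    typ, data = conn.search(None, "HEADER", "Received", f'"{host}"',
                            "ON", imap_date(day))
    if typ != "OK":
        raise RuntimeError(f"SEARCH failed: {typ}")
    nums = data[0].split() if data and data[0] else []
    if not nums:
        return 0
    message_set = b",".join(nums).decode("ascii")
    typ, _ = conn.store(message_set, "+FLAGS.SILENT", r"(\Deleted)")
    if typ != "OK":
        raise RuntimeError(f"STORE failed: {typ}")
    return len(nums)
```

This only covers one mailbox of one user; doing it for thousands of users means repeating it per mailbox, and the flagged messages stay on disk until they are expunged.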
participants (26)
- Ben
- Charles Marcus
- Curtis Maloney
- Eric S. Johansson
- Geo Carncross
- Grobian
- Jakob Curdes
- Jakob Hirsch
- Jan Kundrát
- John Peacock
- Ken A
- Kenneth Porter
- Les Mikesell
- Lev A. Serebryakov
- Lev Serebryakov
- Lorens
- Marc Perkel
- nodata
- nodata
- Odhiambo WASHINGTON
- Oliver Schulze L.
- Scott Penrose
- Simon Waters
- Steffen Kaiser
- Timo Sirainen
- Udo Rader