[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem
This is something I figured out a few months ago, mainly because this one guy at work (hi, Stu) kept telling me my multi-master replication plan sucked and we should use some existing scalable database. (I guess it didn't go exactly like that, but that's the result anyway.)
So, my current plan is based on a couple of observations:
Index files are really more like memory dumps. They're already in an optimal format for keeping them in memory, so they can be just mmap()ed and used. Doing some kind of translation to another format would just make it more complex and slower.
I can change all indexing and dbox code to not require any locks or overwriting files. I just need very few filesystem operations, primarily the ability to atomically append to a file. (A sketch of these two primitives follows these observations.)
Index and mail data is very different. Index data is accessed constantly and it must be very low latency or performance will be horrible. It practically should be in memory in local machine and there shouldn't normally be any network lookups when accessing it.
Mail data on the other hand is just written once and usually read maybe once or a couple of times. Caching mail data in memory probably doesn't help all that much. Latency isn't such a horrible issue as long as multiple mails can be fetched at once / in parallel, so there's only a single latency wait.
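As a concrete illustration of those two primitives, here is a minimal sketch using plain POSIX calls - illustrative only, not Dovecot's actual code (and note that O_APPEND atomicity is not guaranteed on NFS):

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* Lockless append: with O_APPEND the seek-to-end and the write happen
   * as one step in the kernel, so concurrent writers can't overwrite
   * each other's records. */
  static int log_append(const char *path, const void *rec, size_t len)
  {
          int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0600);
          ssize_t ret;

          if (fd == -1)
                  return -1;
          ret = write(fd, rec, len);
          close(fd);
          return ret == (ssize_t)len ? 0 : -1;
  }

  /* "Memory dump" reads: map the index and use it in place, no parsing. */
  static const void *index_map(const char *path, size_t *size_r)
  {
          struct stat st;
          void *map;
          int fd = open(path, O_RDONLY);

          if (fd == -1)
                  return NULL;
          if (fstat(fd, &st) < 0) {
                  close(fd);
                  return NULL;
          }
          map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
          close(fd);      /* the mapping stays valid after close() */
          if (map == MAP_FAILED)
                  return NULL;
          *size_r = st.st_size;
          return map;
  }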
So the high level plan is:
1. Change the index/cache/log file formats in a way that allows lockless writes.
2. Abstract out filesystem accessing in index and dbox code and implement regular POSIX filesystem support (a sketch of what such an API might look like follows this list).
3. Make lib-storage able to access mails in parallel and send multiple "get mail" requests in advance.
(3.5. Implement an async I/O filesystem backend.)
4. Implement a multi-master filesystem backend for index files. The idea would be that all servers accessing the same mailbox must be talking to each other via the network, and every time something is changed, push the change to the other servers. This is actually very similar to my previous multi-master plan. One of the servers accessing the mailbox would still act as a master and handle conflict resolution and writing indexes to disk more or less often.
5. Implement a filesystem backend for dbox and permanent index storage using some scalable distributed database, such as maybe Cassandra. This is the part I've thought the least about, but it's also the part I hope to (mostly) outsource to someone else. I'm not going to write a distributed database from scratch..
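To make step 2 concrete, here is a rough guess at the shape such an abstraction could take: a backend is just a table of operations over keys and byte streams. All names and signatures here are illustrative assumptions, not Dovecot's actual API.

  #include <stddef.h>

  struct fs;

  /* Hypothetical operation table; "posix", "aio", "multi-master" and
   * "distributed-db" backends would each provide one of these. */
  struct fs_vfuncs {
          /* read the whole value stored under a key into memory */
          int (*get)(struct fs *fs, const char *key,
                     void **data_r, size_t *size_r);
          /* atomically append data to the value stored under a key */
          int (*append)(struct fs *fs, const char *key,
                        const void *data, size_t size);
          /* remove a key and its value */
          int (*delete)(struct fs *fs, const char *key);
  };

  struct fs {
          const char *name;               /* e.g. "posix" */
          const struct fs_vfuncs *v;
  };

Index and dbox code would then call fs->v->append() and friends without knowing whether the bytes end up on a local disk, on NFS, or in a distributed database.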
This actually should solve several issues:
Scalability, of course. It'll be as scalable as the distributed database being used to store mails.
NFS reliability! Even if you don't care about any of these alternative databases, this still solves NFS caching problems. You'd keep using the regular POSIX FS API (or the async FS API) but with the in-memory index "cache", so only a single server is writing to a mailbox's indexes at the same time.
Shared mailboxes. The filesystem API is abstracted, so it should be possible to easily add another layer to handle accessing other users' mails from both local and remote servers. This should finally make it possible to easily support shared mailboxes with system users.
Timo Sirainen wrote:
This is something I figured out a few months ago, mainly because this one guy at work (hi, Stu) kept telling me my multi-master replication plan sucked and we should use some existing scalable database. (I guess it didn't go exactly like that, but that's the result anyway.)
Ick, some people (myself included) hate the idea of storing mail in a database versus simple and almost impossible to screw up plain text files of maildir. Cyrus already does the whole mail-in-database thing.
~Seth
On Mon, 2009-08-10 at 14:33 -0700, Seth Mattinen wrote:
Timo Sirainen wrote:
This is something I figured out a few months ago, mainly because this one guy at work (hi, Stu) kept telling me my multi-master replication plan sucked and we should use some existing scalable database. (I guess it didn't go exactly like that, but that's the result anyway.)
Ick, some people (myself included) hate the idea of storing mail in a database versus simple and almost impossible to screw up plain text files of maildir.
Nothing forces you to switch from maildir, if you're happy with it :) But if you want to support millions of users, it's simpler to distribute the storage and disk I/O evenly across hundreds of servers using a database that was designed for it. And by databases I mean here some of those key/value-like databases, not SQL. (What's a good collective name for those dbs anyway? BASE and NoSQL are a couple names I've seen.)
Cyrus already does the whole mail-in-database thing.
No, Cyrus's mail database is very similar to how Dovecot works. Both have somewhat similar index files, both store one mail/file (with dbox/maildir). But Cyrus then also has some additional databases that screw up things..
Timo Sirainen wrote:
On Mon, 2009-08-10 at 14:33 -0700, Seth Mattinen wrote:
Timo Sirainen wrote:
This is something I figured out a few months ago, mainly because this one guy at work (hi, Stu) kept telling me my multi-master replication plan sucked and we should use some existing scalable database. (I guess it didn't go exactly like that, but that's the result anyway.)
Ick, some people (myself included) hate the idea of storing mail in a database versus simple and almost impossible to screw up plain text files of maildir.
Nothing forces you to switch from maildir, if you're happy with it :) But if you want to support millions of users, it's simpler to distribute the storage and disk I/O evenly across hundreds of servers using a database that was designed for it. And by databases I mean here some of those key/value-like databases, not SQL. (What's a good collective name for those dbs anyway? BASE and NoSQL are a couple names I've seen.)
Why is a database a better choice than a clustered filesystem? It seems that you're adding a huge layer of complexity (a database) for something that's already solved (clusters). Queue directories and clusters don't mix well, but a read-heavy maildir/dbox environment shouldn't suffer the same problem.
~Seth
On Aug 11, 2009, at 12:41 AM, Seth Mattinen wrote:
Nothing forces you to switch from maildir, if you're happy with it :) But if you want to support millions of users, it's simpler to distribute the storage and disk I/O evenly across hundreds of servers using a database that was designed for it. And by databases I mean here some of those key/value-like databases, not SQL. (What's a good collective name for those dbs anyway? BASE and NoSQL are a couple names I've seen.)
Why is a database a better choice than a clustered filesystem?
Show me a clustered filesystem that can guarantee that each file is stored in at least 3 different data centers and can scale linearly by simply adding more servers (let's say at least up to thousands).
Clustered filesystems are also complex. They're much more complex than what Dovecot really requires.
Timo Sirainen wrote:
On Aug 11, 2009, at 12:41 AM, Seth Mattinen wrote:
Nothing forces you to switch from maildir, if you're happy with it :) But if you want to support millions of users, it's simpler to distribute the storage and disk I/O evenly across hundreds of servers using a database that was designed for it. And by databases I mean here some of those key/value-like databases, not SQL. (What's a good collective name for those dbs anyway? BASE and NoSQL are a couple names I've seen.)
Why is a database a better choice than a clustered filesystem?
Show me a clustered filesystem that can guarantee that each file is stored in at least 3 different data centers and can scale linearly by simply adding more servers (let's say at least up to thousands).
Easy, AFS. It is known to support tens of thousands of clients [1] and it's not exactly new. Like supporting the quirks of NFS, the quirks of a clustered filesystem could be found and dealt with, too.
Key/value databases are hardly a magic bullet for redundancy. You don't get 3 copies in different datacenters by simply switching to a database-style storage.
[1] http://www-conf.slac.stanford.edu/AFSBestPractices/Slides/MorganStanley.pdf
Clustered filesystems are also complex. They're much more complex than what Dovecot really requires.
I mention it because you stated wanting to outsource the storage portion. The complexity of whatever database engine you choose or supporting a clustered filesystem (like NFS) is a wash since you're not maintaining either one personally.
~Seth
On Aug 11, 2009, at 2:16 AM, Seth Mattinen wrote:
Show me a clustered filesystem that can guarantee that each file is stored in at least 3 different data centers and can scale linearly by simply adding more servers (let's say at least up to thousands).
Easy, AFS. It is known to support tens of thousands of clients [1] and it's not exactly new. Like supporting the quirks of NFS, the quirks of a clustered filesystem could be found and dealt with, too.
I was more thinking about thousands of servers, not clients. Each server should contribute to the amount of storage you have. Buying huge storages is more expensive. Also it would be nice if you could just keep plugging in more servers to get more storage space, disk I/O and CPU and the system would just automatically reconfigure itself to take advantage of those. I can't really see any of that happening easily with AFS.
Key/value databases are hardly a magic bullet for redundancy. You don't get 3 copies in different datacenters by simply switching to a database-style storage.
Some (several?) of them can be somewhat easily configured to support that. (That's what their web pages say, anyway.)
Clustered filesystems are also complex. They're much more complex than what Dovecot really requires.
I mention it because you stated wanting to outsource the storage portion. The complexity of whatever database engine you choose or supporting a clustered filesystem (like NFS) is a wash since you're not maintaining either one personally.
I also want something that's cheap and easy to scale. Sure, people who already have NFS/AFS/etc. systems can keep using Dovecot with the filesystem backends, but I don't think it's the cheapest or easiest choice. There's a reason why e.g. Amazon S3 isn't running on top of them.
Timo Sirainen wrote:
On Aug 11, 2009, at 2:16 AM, Seth Mattinen wrote:
Show me a clustered filesystem that can guarantee that each file is stored in at least 3 different data centers and can scale linearly by simply adding more servers (let's say at least up to thousands).
Easy, AFS. It is known to support tens of thousands of clients [1] and it's not exactly new. Like supporting the quirks of NFS, the quirks of a clustered filesystem could be found and dealt with, too.
I was more thinking about thousands of servers, not clients. Each server should contribute to the amount of storage you have. Buying huge storages is more expensive. Also it would be nice if you could just keep plugging in more servers to get more storage space, disk I/O and CPU and the system would just automatically reconfigure itself to take advantage of those. I can't really see any of that happening easily with AFS.
While that would be fancy, I don't think that level of integration would be compatible with abstracting the filesystem per the original plan, so I didn't consider it. I just considered robust, site independent, scalable storage as you asked for. ;)
OpenAFS is worth a read, at least, to see what it offers and ideas you could incorporate. http://www.dementia.org/twiki/bin/view/AFSLore/GeneralFAQ
It focuses on "users" but you can pretend a user is really "server running Dovecot". AFS also uses Kerberos. That alone would probably disqualify its use for the purposes of simple Dovecot replication. I picked on AFS because it closely matches what you were looking for in scale.
Key/value databases are hardly a magic bullet for redundancy. You don't get 3 copies in different datacenters by simply switching to a database-style storage.
Some (several?) of them can be somewhat easily configured to support that. (That's what their web pages say, anyway.)
Well, so can a global filesystem designed to do precisely that at the block level. No advantage here.
Clustered filesystems are also complex. They're much more complex than what Dovecot really requires.
I mention it because you stated wanting to outsource the storage portion. The complexity of whatever database engine you choose or supporting a clustered filesystem (like NFS) is a wash since you're not maintaining either one personally.
I also want something that's cheap and easy to scale. Sure, people who already have NFS/AFS/etc. systems can keep using Dovecot with the filesystem backends, but I don't think it's the cheapest or easiest choice. There's a reason why e.g. Amazon S3 isn't running on top of them.
S3 isn't really a fair comparison. There's Google FS too, but they're both purpose built systems.
Now, keep in mind, I have not personally used AFS with Dovecot. My point is to not dismiss building on a clustered file system just because it's old and lacks sex appeal compared to the backend that Facebook uses. UUCP is ancient too, but it still blows away stupid SMTP tricks many people see as modern for disconnected endpoints.
~Seth
I was more thinking about thousands of servers, not clients. Each server should contribute to the amount of storage you have. Buying huge storages is more expensive. Also it would be nice if you could just keep plugging in more servers to get more storage space, disk I/O and CPU and the system would just automatically reconfigure itself to take advantage of those. I can't really see any of that happening easily with AFS.
Well, me too. But there are interesting (and working) solutions like e.g. GlusterFS...
I mention it because you stated wanting to outsource the storage portion. The complexity of whatever database engine you choose or supporting a clustered filesystem (like NFS) is a wash since you're not maintaining either one personally.
I also want something that's cheap and easy to scale. Sure, people who already have NFS/AFS/etc. systems can keep using Dovecot with the filesystem backends, but I don't think it's the cheapest or easiest choice. There's a reason why e.g. Amazon S3 isn't running on top of them.
I think the basic idea behind the initial proposal, which I like very much, is to have a choice between redundancy/scalability and ease of running a platform.
In my opinion there isn't a perfect solution which addresses all of the above in the best way; I think that's why there are so many different solutions out there. Anyway, having indexes centralized in either form of a "database" would be a nice solution (and, very important: easy to run in the case of SQL!) for not all, but many installations. If the speed penalty and the coding effort aren't too great, it would be worth implementing solutions like SQL-based index storage, too. And everyone is/would be free to decide which one would be best for his platform/environment.
Huge installations with more than 50 servers will always be a kind of special solution and won't be built out of the box. Dovecot can just help by offering good alternatives for storing all kinds of lock-dependent stuff in different ways (files/memory/databases).
Regards, Sebastian
Timo Sirainen schrieb:
On Aug 11, 2009, at 12:41 AM, Seth Mattinen wrote:
Nothing forces you to switch from maildir, if you're happy with it :) But if you want to support millions of users, it's simpler to distribute the storage and disk I/O evenly across hundreds of servers using a database that was designed for it. And by databases I mean here some of those key/value-like databases, not SQL. (What's a good collective name for those dbs anyway? BASE and NoSQL are a couple names I've seen.)
Why is a database a better choice than a clustered filesystem?
Show me a clustered filesystem that can guarantee that each file is stored in at least 3 different data centers and can scale linearly by simply adding more servers (let's say at least up to thousands).
Clustered filesystems are also complex. They're much more complex than what Dovecot really requires.
I like the idea of SQL-based mail services. Whatever your choice is, the use of cluster filesystems will stay around, but with database-backed setups it should be much easier to have redundant mail stores. I already have all the other stuff - quota, ACLs, etc., including SpamAssassin, greylisting and webmail - in a database; the only thing left is the mail store. It would be great if there were the possibility to have that too, as long as there are no big disadvantages like poor performance.
There is http://www.dbmail.org/ - has somebody ever used it, so it can be compared?
Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
Robert Schetterer wrote:
Timo Sirainen schrieb:
On Aug 11, 2009, at 12:41 AM, Seth Mattinen wrote:
Nothing forces you to switch from maildir, if you're happy with it :) But if you want to support millions of users, it's simpler to distribute the storage and disk I/O evenly across hundreds of servers using a database that was designed for it. And by databases I mean here some of those key/value-like databases, not SQL. (What's a good collective name for those dbs anyway? BASE and NoSQL are a couple names I've seen.)
Why is a database a better choice than a clustered filesystem? Show me a clustered filesystem that can guarantee that each file is stored in at least 3 different data centers and can scale linearly by simply adding more servers (let's say at least up to thousands).
Clustered filesystems are also complex. They're much more complex than what Dovecot really requires.
I like the idea of SQL-based mail services. Whatever your choice is, the use of cluster filesystems will stay around, but with database-backed setups it should be much easier to have redundant mail stores. I already have all the other stuff - quota, ACLs, etc., including SpamAssassin, greylisting and webmail - in a database; the only thing left is the mail store. It would be great if there were the possibility to have that too, as long as there are no big disadvantages like poor performance.
There is http://www.dbmail.org/ - has somebody ever used it, so it can be compared?
It wouldn't be an SQL database - it's not really suitable for this kind of thing at the scale Timo is proposing.
~Seth
Quoting Seth Mattinen <sethm@rollernet.us>:
Queue directories and clusters don't mix well, but a read-heavy maildir/dbox environment shouldn't suffer the same problem.
Why don't queue directories and clusters mix well? Is this a performance issue only, or something worse?
-- Eric Rostetter The Department of Physics The University of Texas at Austin
Eric Jon Rostetter wrote:
Quoting Seth Mattinen <sethm@rollernet.us>:
Queue directories and clusters don't mix well, but a read-heavy maildir/dbox environment shouldn't suffer the same problem.
Why don't queue directories and clusters mix well? Is this a performance issue only, or something worse?
It depends on the locking scheme used by the filesystem. Working queue directories (the ones where stuff comes and goes rapidly) is best suited for a local FS anyway.
~Seth
On Tue, 2009-08-11 at 09:38 -0700, Seth Mattinen wrote:
Why don't queue directories and clusters mix well? Is this a performance issue only, or something worse?
It depends on the locking scheme used by the filesystem. Working queue directories (the ones where stuff comes and goes rapidly) is best suited for a local FS anyway.
And when a server and its disk dies, the emails get lost :(
Quoting Timo Sirainen <tss@iki.fi>:
It depends on the locking scheme used by the filesystem. Working queue directories (the ones where stuff comes and goes rapidly) is best suited for a local FS anyway.
And when a server and its disk dies, the emails get lost :(
It would appear he is not talking about a /var/spool/mail type queue/spool, but the queues where the MTA/AV/Anti-Spam/etc process the mail.
For the most part, on a machine crash, this will always result in the mail being lost or resent (resent if it hasn't confirmed the acceptance of the message yet). If done with battery backup, the risk is less, but since most filesystems (local or remote) cache writes in memory, the chances you will lose the mail are high in any case (if it's still cached in memory).
I agree that for smaller mail systems, the processing queues are best on local fs or in memory (memory for AV/Anti-Spam, local disk for MTA processing). The delivery queues (where the message awaits delivery or is delivered) are best on some other file system (mirrored, distributed, etc).
For a massively scaled system, there may be sufficient performance to put the queues elsewhere. But on a small system, with 90% of the mail being spam/virus/malware, performance will usually dictate local/memory file systems for such queues...
-- Eric Rostetter The Department of Physics The University of Texas at Austin
On Tue, 11 Aug 2009, Eric Jon Rostetter wrote:
For a massively scaled system, there may be sufficient performance to put the queues elsewhere.
Which also allows the queue to easily have multiple machines pushing & popping items.
But on a small system, with 90% of the mail being spam/virus/malware, performance will usually dictate local/memory file systems for such queues...
Well, this discussion reads a bit like "local filesystems are prone to lose data on a crash". Journaling filesystems, RAID 1/5/10 and SANs do their job.
However, I guess that Seth and Timo look at the thing from a different point of view, Timo seems to focus on "one queue - multiple accessees", whereas Seth focuses on temporary working directory.
Bye,
Steffen Kaiser
On Aug 12, 2009, at 2:21 AM, Steffen Kaiser <skdovecot@smail.inf.fh-brs.de> wrote:
On Tue, 11 Aug 2009, Eric Jon Rostetter wrote:
For a massively scaled system, there may be sufficient performance to put the queues elsewhere.
Which also allows the queue to easily have multiple machines pushing & popping items.
Pushing is easy. Popping can be more problematic, depending on various factors.
But on a small system, with 90% of the mail being spam/virus/malware, performance will usually dictate local/memory file systems for such queues...
Well, this discussion reads a bit like "local filesystems are prone to lose data on a crash". Journaling filesystems, RAID 1/5/10 and SANs do their job.
The issue I brought up is OS caching, and it is not really dependent on the backend. The only real solution is redundant storage AND disabling OS caching, which is not cheap and won't give the best performance. Always a tradeoff.
However, I guess that Seth and Timo look at the thing from a different point of view; Timo seems to focus on "one queue - multiple accessees", whereas Seth focuses on a temporary working directory.
Well, Timo looks at it from Dovecot's point of view. I look at it from a mail server's point of view (MTA also, etc.).
Bye,
Steffen Kaiser
On Mon, 2009-08-10 at 14:33 -0700, Seth Mattinen wrote:
Nothing forces you to switch from maildir, if you're happy with it :) But if you want to support millions of users, it's simpler to distribute the storage and disk I/O evenly across hundreds of servers using a database that was designed for it. And by databases I mean here some of those key/value-like databases, not SQL. (What's a good collective name for those dbs anyway? BASE and NoSQL are a couple names I've seen.)
Timo, I've been thinking the exact same thing as you lately. As mail starts to move away from traditional "POP3" users to more online storage in the form of webmail, the scalability of maildir for large multi-gigabyte mailboxes goes out the window; loading "cur" in that type of scenario takes WAY too long. Gmail on Maildir isn't possible. I can't speak for anyone else, but my users are moving into webmail; POP users are becoming rare.
My current thinking is a key/value store as you've proposed. Something like Hadoop components or Project Voldemort. Voldemort might be a better fit from what I've read. The main issue here is that applications such as local delivery as well as POP/IMAP access would need to be rewritten to support this. Obviously creating a Hadoop- or Voldemort-aware local delivery agent means being able to stay away from writing a complete MTA; likewise, if one treats IMAP as the main way of accessing a mailbox (proxies for POP3, for example), then a new local delivery agent and an IMAPd with key/value "smarts" would be all that's needed to create this system.
My current thinking is having the local delivery agent break messages up into their component pieces - headers, from address, to address, spam scores, body etc. - into various key:value relationships. Combine this with the replication support of systems such as Hadoop or Voldemort and you end up with a massively scalable system based on commodity hardware. You get rid of RAID completely, remove NFS servers and replace them with a cluster of "beige boxes" with ~4 drives each. Redundancy is handled by the native replication in the key:value application itself (Voldemort, for example, can replicate up to 3 times) across the machines, so yes, you would store a single message more than once, but if each of your "beige box" storage systems has 4 x 2TB drives, your cost of storage is far less than the cost of hardware from traditional NFS server vendors.
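(A back-of-envelope example under those assumptions, with a made-up box count: one "beige box" with 4 x 2TB drives holds 8TB raw, so 30 such boxes hold 240TB raw; with every message replicated 3 times, that still leaves 240/3 = 80TB of usable, triply-redundant mail storage.)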
Anyways, this is just something that's currently floating in my head...
Paul
paulmon wrote:
My current thinking is having the local delivery agent break messages up into their component pieces - headers, from address, to address, spam scores, body etc. - into various key:value relationships.
Whilst this looks appealing on the surface, I think the details are going to need some benchmarking to see if they stack up. Certainly I hope this new abstraction works out, because I suspect we'll see a bunch of interesting ideas get implemented, such as the one you describe!!
Just to knock your theoretical idea around a bit though: my guess would be that you need to look at the access patterns for this data to make sure you don't over-normalise it. E.g. if it's "normal" to simply open up a mailbox and then ask it for every one of some set of X fields for each message, then over-normalising the header fields will lead to response time being dominated by the access time for each field (especially if each one causes a disk seek, etc.).
At present I think Dovecot's architecture kind of assumes that random access dominates for individual email messages, and it optimises the particular case of header accesses by caching those in a local "database" type structure which holds a certain amount of recently requested header fields. The access times then seem to be bounded by the time to scan the inbox for new unseen messages and update this index with maildir (I'm not sure what bounds mailbox scanning times in general use?). I.e. it's optimising for returning every field X from every message in a folder, or else for returning bits of a given message?
I should imagine that in general this architecture is near optimal for the common case, and the main improvement is just in speeding up the updates after new emails are added/deleted... (done automatically at present if you use deliver; it incurs a speed hit if you update yourself)
I should imagine that once you add a requirement to distribute the data and handle failover, etc then the problems of any cache coherency dominate the design and this could be interesting to play with ideas to solve this.
Anyway, I think the point for anyone who hasn't tried it yet is to first have a look at how your favourite IMAP client implements IMAP and watch the stream of commands being issued... It's usually quite a bit different to what you expect, and to me it's a lot different to what might be optimal if I got to design their algorithm...
The point being that you shouldn't optimise too much for what you hope people will do, so much as have a look at your favourite webmail or desktop client and optimise for whatever stream of idiocy they request you to keep pumping at them...
I for one look forward to these changes - I desperately hope I get some time to then play with some ideas because like you I'm itching to play with my "next greatest idea"!!
My only request to Timo was to consider that a bunch of these ideas from the audience will almost certainly involve splitting up the MIME message into component parts, and that the abstracted interface should try not to throw away any potential speed benefit this might achieve just because the interface can't express what it needs clearly enough.
Good luck
Ed W
On Mon, 2009-09-28 at 17:57 +0100, Ed W wrote:
My only request to Timo was to consider that a bunch of these ideas from the audience will almost certainly involve splitting up the MIME message into component parts, and that the abstracted interface should try not to throw away any potential speed benefit this might achieve just because the interface can't express what it needs clearly enough.
It might become too complex to initially consider how to support split MIME messages and such. I'm not really sure if it even belongs to this filesystem abstraction layer. I was hoping that the FS API would be really really simple and could also be used for other things than just email.
But I'm also hoping to support things like single-instance storage at some point. I'm not really sure if that should just be written into dbox code directly or try to abstract it out..
Timo Sirainen wrote:
On Mon, 2009-09-28 at 17:57 +0100, Ed W wrote:
My only request to Timo was to consider that a bunch of these ideas from the audience will almost certainly involve splitting up the MIME message into component parts, and that the abstracted interface should try not to throw away any potential speed benefit this might achieve just because the interface can't express what it needs clearly enough.
It might become too complex to initially consider how to support split MIME messages and such. I'm not really sure if it even belongs to this filesystem abstraction layer. I was hoping that the FS API would be really really simple and could also be used for other things than just email.
Well, I think if you just implement a wrapper around read(fh, start, count) then it's going to be quite hard to implement some kind of storage which splits out the message in some way?
I guess the API would need to line up with the IMAP commands to retrieve MIME parts. For the most part these are poorly supported by clients, so I guess most mail clients will undo all this cleverness, but I would imagine it will have a low impact on performance since it's just extra seeks on fetching individual messages?
I am starting to see newer clients finally get this right though. I'm using ProfiMail on my N97 and whilst I didn't look at its IMAP stream, it *seems* to be doing everything right from the client's point of view. I even get to choose to download the whole message if its size is under Y, ignore attachments larger than Z, etc. (In theory Thunderbird does this, but at least on my machine it just repeatedly downloads the same message again and again in various ways - it grinds to a halt every time I click on an email with a decent sized attachment, even if I have already read it... grr)
But I'm also hoping to support things like single-instance storage at some point. I'm not really sure if that should just be written into dbox code directly or try to abstract it out..
I agree it should at least initially go into the dbox etc. code. I guess if enough people do the same implementation (in all the new backends which I'm sure will arrive within days of some API coming out....) it could bubble up, etc?
I would have thought that your API will prefer to request message parts where it can (eg header, body, mime part), and just issue a read_bytes, where that's what the client is asking for otherwise. This would allow the storage engine to optimise where it can and sadly for the dumb client we just stream bytes since that's all they asked for...
Perhaps the API should also request specific headers from the storage engine where possible and ask for all headers only where it's necessary? This would allow an sql database to be heavily normalised (I'm sure performance is iffy, but we have to pre-suppose some reason why this design is useful for other reasons)
Does this seem feasible?
Ed W
On Mon, 2009-09-28 at 18:35 +0100, Ed W wrote:
I would have thought that your API will prefer to request message parts where it can (eg header, body, mime part), and just issue a read_bytes, where that's what the client is asking for otherwise. This would allow the storage engine to optimise where it can and sadly for the dumb client we just stream bytes since that's all they asked for...
In my mind this is more about what lib-storage API was supposed to abstract out, whereas my filesystem API would be used simply for binary data storage. The same FS API could be used to store both dbox files and index files.
Perhaps the API should also request specific headers from the storage engine where possible and ask for all headers only where it's necessary? This would allow an sql database to be heavily normalised (I'm sure performance is iffy, but we have to pre-suppose some reason why this design is useful for other reasons)
This is really going towards what lib-storage API is supposed to do already.. It's not even horribly difficult to write a new backend for it. For example in v2.0 the fully functional Cydir backend code looks like:
% wc *[ch]
  152  357  3740 cydir-mail.c
  319  783  8420 cydir-save.c
  402 1087 10806 cydir-storage.c
   35   82  1085 cydir-storage.h
  187  465  4798 cydir-sync.c
   24   54   615 cydir-sync.h
 1119 2828 29464 total
There is still a bit of code duplication between the backends; removing it could cut the line count by maybe 100-200 lines. Anyway, I think the only good way to implement support for a normalized SQL database in Dovecot would be to implement a new lib-storage backend, and it shouldn't be a hugely difficult job.
Timo Sirainen wrote:
On Mon, 2009-09-28 at 18:35 +0100, Ed W wrote:
I would have thought that your API will prefer to request message parts where it can (eg header, body, mime part), and just issue a read_bytes, where that's what the client is asking for otherwise. This would allow the storage engine to optimise where it can and sadly for the dumb client we just stream bytes since that's all they asked for...
In my mind this is more about what lib-storage API was supposed to abstract out, whereas my filesystem API would be used simply for binary data storage. The same FS API could be used to store both dbox files and index files.
I guess in this case it would be interesting to hear the kind of use cases you imagine that the storage API will be used for in practice? I think I might be kind of overthinking the problem?
Seems like it's a very thin shim between a real filesystem and Dovecot, and would be mainly useful for supporting filesystems with non-POSIX protocols, e.g. someone wants to store their mail files on MogileFS or DAV, but it doesn't address anything lower or higher than blocks of data?
I can see how this would be useful in certain scenarios, but kind of interested to hear where you think it will go?
Seems like it would be useful for:
- implementing very specific optimisations for example for NFS
- optimisation for filesystems with unusual strengths/weaknesses, eg GFS or Gluster?
- non-POSIX file system storage (but without trying to leverage particular features of that storage)
Actually, it mainly seems like a way for you to break out the access paths for NFS versus local storage when I write it down like that?
Ed W
On Mon, 2009-09-28 at 19:21 +0100, Ed W wrote:
In my mind this is more about what lib-storage API was supposed to abstract out, whereas my filesystem API would be used simply for binary data storage. The same FS API could be used to store both dbox files and index files.
I guess in this case it would be interesting to hear the kind of use cases you imagine that the storage API will be used for in practice? I think I might be kind of overthinking the problem?
lib-storage API has existed since Dovecot v1.0 and it's used to abstract out access to maildir, mbox, dbox, cydir, etc. SQL would fit right there with those.
Or did you mean FS API? For that my plans are to implement backends for:
- POSIX (just the way it works now)
- Async I/O (once Dovecot can do more things in parallel)
- Some kind of proxying to support shared mailboxes between different servers (or within the same server when users are using different UIDs and don't have a common group)
- Massively distributed database storage for mails
- In-memory cache for index files, which permanently writes them using another storage. This is useful for any kind of multi-master setup like distributed database, NFS, clusterfs.
Seems like it's a very thin shim between a real filesystem and Dovecot, and would be mainly useful for supporting filesystems with non-POSIX protocols, e.g. someone wants to store their mail files on MogileFS or DAV, but it doesn't address anything lower or higher than blocks of data?
Right, path/filename (or "key") -> binary byte stream.
Seems like it would be useful for:
- implementing very specific optimisations for example for NFS
- optimisation for filesystems with unusual strengths/weaknesses, eg GFS or Gluster?
In both of these I think the primary problem is that Dovecot tries to do IPC via filesystem (index files). So accessing the indexes via the in-memory cache that is guaranteed to be always up-to-date would get rid of all these ugly NFS cache flushing attempts etc.
Timo Sirainen wrote:
On Mon, 2009-09-28 at 19:21 +0100, Ed W wrote:
In my mind this is more about what lib-storage API was supposed to abstract out, whereas my filesystem API would be used simply for binary data storage. The same FS API could be used to store both dbox files and index files.
I guess in this case it would be interesting to hear the kind of use cases you imagine that the storage API will be used for in practice? I think I might be kind of overthinking the problem?
lib-storage API has existed since Dovecot v1.0 and it's used to abstract out access to maildir, mbox, dbox, cydir, etc. SQL would fit right there with those.
OK, I thought that was what you were going to be simplifying...
I did have a poke around in there some time back and it did feel "quite complicated" to follow what was going on... I found your sql backend code as a simpler way to poke around, but even there it was pretty quickly going to need some earnest digging to figure out how it was all working...
OK, I guess this can never be an easy middle ground - presumably things are as they are for a reason...
Cheers
Ed W
On Mon, 2009-09-28 at 20:11 +0100, Ed W wrote:
lib-storage API has existed since Dovecot v1.0 and it's used to abstract out access to maildir, mbox, dbox, cydir, etc. SQL would fit right there with those.
OK, I thought that was what you were going to be simplifying...
Nope. It can still be simplified a bit, but only a bit. :) But in every release I am always simplifying it, moving more and more code to common functions and making the API more powerful and cleaner at the same time. :)
I did have a poke around in there some time back and it did feel "quite complicated" to follow what was going on... I found your sql backend code as a simpler way to poke around, but even there it was pretty quickly going to need some earnest digging to figure out how it was all working...
The SQL code was for v1.0 and the lib-storage API has been simplified since then - maybe not hugely, but still quite a bit. Maybe some day I'll see about updating the SQL code for the v2.0 API.
Oh and some documentation about it would probably help a lot too. I guess I should write some, someday. :)
Timo Sirainen wrote:
The SQL code was for v1.0 and the lib-storage API has been simplified since then - maybe not hugely, but still quite a bit. Maybe some day I'll see about updating the SQL code for the v2.0 API.
Oh and some documentation about it would probably help a lot too. I guess I should write some, someday. :)
Some overview docs might be somewhat helpful for sure, but I think at this level you probably mainly need to get your hands dirty
Having an example storage engine which is also a bit simpler (e.g. an updated SQL engine) would actually be quite good for this, I suspect. I quickly stopped looking at the real code in favour of playing with the SQL code, and found it quite a bit simpler for getting an overview.
Thanks and interested to see this progress
Ed W
On Mon, 2009-09-28 at 22:20 +0100, Ed W wrote:
Timo Sirainen wrote:
The SQL code was for v1.0 and the lib-storage API has been simplified since then, maybe not hugely but still pretty much. Maybe some day I'll see about updating the SQL code for v2.0 API.
Oh and some documentation about it would probably help a lot too. I guess I should write some, someday. :)
Some overview docs might be somewhat helpful for sure, but I think at this level you probably mainly need to get your hands dirty
I was thinking something that would describe what kind of groups of functions there exist ("mailbox listing", "mailbox opening", "message saving", etc.) and what functions need to be used to properly call them.
Having an example storage engine which is also a bit simpler (eg an updated sql engine) would actually be quite good for this I suspect.
You mean "a bit different" :) Cydir already is the simplest storage engine there is.
On 9/28/2009, Ed W (lists@wildgooses.com) wrote:
In theory Thunderbird does this, but at least on my machine it just repeatedly downloads the same message again and again in various ways - it grinds to a halt every time I click on an email with a decent sized attachment, even if I have already read it... grr
TB3 has finally fixed this absurd behavior (yay!)...
In fact there are lots of IMAP improvements in v3... I can't wait until all my extensions catch up, and I figure out how to customize the UI the way I want (e.g., how in the world do I get rid of the stupid Tabs??)
--
Best regards,
Charles
On 9/28/2009 11:43 AM, Charles Marcus wrote:
On 9/28/2009, Ed W (lists@wildgooses.com) wrote:
In theory Thunderbird does this, but at least on my machine it just repeatedly downloads the same message again and again in various ways - it grinds to a halt every time I click on an email with a decent sized attachment, even if I have already read it... grr
TB3 has finally fixed this absurd behavior (yay!)...
In fact there are lots of IMAP improvements in v3... I can't wait until all my extensions catch up, and I figure out how to customize the UI the way I want (e.g., how in the world do I get rid of the stupid Tabs??)
You can't get rid of tabs per se, but you can make it so you don't use them. I hate tabs personally also. Go to Options, Advanced, Reading and Display, and select Open Messages In: An Existing Window or A New Window. I use an existing window.
On 9/28/2009 4:24 PM, Jeff Grossman wrote:
In fact there are lots of IMAP improvements in v3... I can't wait until all my extensions catch up, and I figure out how to customize the UI the way I want (e.g., how in the world do I get rid of the stupid Tabs??)
You can't get rid of tabs per se, but you can make it so you don't use them. I hate tabs personally also. Go to Options, Advanced, Reading and Display, and select Open Messages In: An Existing Window or A New Window. I use an existing window.
Yeah, already did that, but it *does* still use the Tab bar, everything is just limited to one tab - the Tab row is still there wasting my screen real estate.
I'll figure out how to kill it... I know I'm not the only one who hates/won't use it...
--
Best regards,
Charles
On 9/28/2009, Charles Marcus (CMarcus@Media-Brokers.com) wrote:
You can't get rid of tabs per se, but you can make it so you don't use them. I hate tabs personally also. Go to Options, Advanced, Reading and Display, and select Open Messages In: An Existing Window or A New Window. I use an existing window.
Yeah, already did that, but it *does* still use the Tab bar, everything is just limited to one tab - the Tab row is still there wasting my screen real estate.
I'll figure out how to kill it... I know I'm not the only one who hates/won't use it...
Ahhh... found it...
about:config > mail.tabs.autohide set to true...
Getting there...
--
Best regards,
Charles
On 9/28/2009 1:44 PM, Charles Marcus wrote:
On 9/28/2009 4:24 PM, Jeff Grossman wrote:
In fact there are lots of IMAP improvements in v3... I can't wait until all my extensions catch up, and I figure out how to customize the UI the way I want (e.g., how in the world do I get rid of the stupid Tabs??)
You can't get rid of tabs per se, but you can make it so you don't use them. I hate tabs personally also. Go to Options, Advanced, Reading and Display, and select Open Messages In: An Existing Window or A New Window. I use an existing window.
Yeah, already did that, but it *does* still use the Tab bar, everything is just limited to one tab - the Tab row is still there wasting my screen real estate.
I'll figure out how to kill it... I know I'm not the only one who hates/won't use it...
You're right. Sorry about that. Not sure how to completely get rid of the tab bar.
On Mon, 2009-09-28 at 09:00 -0700, paulmon wrote:
My current thinking is a key/value store as you've proposed. Something like Hadoop components or Project Voldemort. Voldemort might be a better fit from what I've read.
My understanding of Hadoop is that it's more about distributed computing instead of storage.
My current thinking is having the local delivery agent break messages up into their component pieces - headers, from address, to address, spam scores, body etc. - into various key:value relationships.
I was planning on basically just storing key=username/message-guid, value=message pairs instead of splitting it up. Or perhaps splitting header and body, but I think piecing it smaller than that just makes the performance worse. To get different headers quickly there would still be dovecot.index.cache (which would be in some quick in-memory storage but also stored in the database).
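For illustration only, a tiny sketch of what that key scheme might look like in practice - the names and layout are guesses, not a real Dovecot format - including the optional header/body split:

  #include <stdio.h>

  /* Build keys for a "key = username/message-guid, value = message"
   * scheme; with a header/body split there would simply be two keys
   * per message. */
  static void mail_key(char *buf, size_t len, const char *user,
                       const char *guid, const char *part)
  {
          /* e.g. "jane/3f2a9c40e6b411de8a39/hdr" */
          snprintf(buf, len, "%s/%s/%s", user, guid, part);
  }

  int main(void)
  {
          char key[256];

          mail_key(key, sizeof(key), "jane", "3f2a9c40e6b411de8a39", "body");
          printf("%s\n", key);    /* -> jane/3f2a9c40e6b411de8a39/body */
          return 0;
  }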
Timo Sirainen wrote:
On Mon, 2009-09-28 at 09:00 -0700, paulmon wrote:
My current thinking is a key/value store as you've proposed. Something like Hadoop components or Project Voldemort. Voldemort might be a better fit from what I've read.
My understanding of Hadoop is that it's more about distributed computing instead of storage.
I believe it's possible to use it to ask lots of machines to parse a bit of database and then get the answer back from all of them. eg some people are alleged to be using it to parse huge log files in sensible time by splitting up their log files across lots of machines and asking each of them to do a bit of filtering...
I'm out of my depth at this point - only read the executive summary...
My current thinking is having the local delivery agent break messages up into their component pieces - headers, from address, to address, spam scores, body etc. - into various key:value relationships.
I was planning on basically just storing key=username/message-guid, value=message pairs instead of splitting it up. Or perhaps splitting header and body, but I think piecing it smaller than that just makes the performance worse. To get different headers quickly there would still be dovecot.index.cache (which would be in some quick in-memory storage but also stored in the database).
This can presumably be rephrased as:
- access times are say 10ms
- linear read times are say 60MB/sec
- Therefore don't break up a message into chunks smaller than 0.010s * 60MB/s = 600KB (ish), or your seek times dominate simply doing linear reads and throwing away what you don't need...
- Obviously insert whatever timings you like and re-run the numbers, eg if you have some fancy pants flash drive then insert shorter seek times
However, these numbers and some very limited knowledge of how a small bunch of email clients seem to behave would suggest that the following is also worth optimising to varying degrees (please don't overlook someone wanting to implement some backend to try these ideas):
Theory: attachments larger than K are worth breaking out according to the formula above.
Justification: above a fairly small attachment size it's actually cheaper to do a seek than to linearly scan past the attachment to the next mail message. For some storage designs this might be helpful (mbox-type packing). Additionally, some users have suggested that they want to try to single-instance popular attachments, so "K" might be customisable; better yet, some design might choose to keep a cache of attachment fingerprints and de-dup an attachment when a dup is next seen..
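A hedged sketch of that fingerprinting idea (assuming OpenSSL's SHA1() is available; purely illustrative, not anyone's real implementation): store each attachment under its content hash, so a second copy of a popular attachment becomes a reference-count bump instead of a second store.

  #include <stdio.h>
  #include <openssl/sha.h>

  /* Derive the storage key from the attachment's content, so identical
   * attachments always map to the same key and can be stored once. */
  static void attachment_key(char *key, size_t keylen,
                             const unsigned char *data, size_t len)
  {
          unsigned char md[SHA_DIGEST_LENGTH];
          char hex[SHA_DIGEST_LENGTH * 2 + 1];
          int i;

          SHA1(data, len, md);
          for (i = 0; i < SHA_DIGEST_LENGTH; i++)
                  sprintf(hex + i * 2, "%02x", md[i]);
          snprintf(key, keylen, "attachments/%s", hex);
  }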
Theory: break out (all) headers from bodies.
Justification: scanning headers seems to be a popular task, and Dovecot keeps a local database to optimise the common case. Re-seeks would be slow though, and some storage designs might be able to optimise this and get fast linear scans across all headers (e.g. pack them as per mbox and compress them?)
Theory: break out individual headers.
Justification: err... not got a good case for this one, but some of these fancy key/value databases are optimised for creating views on certain headers across certain messages. I imagine this won't fly in practice, but it seems a shame not to try it... Definitely anyone implementing an SQL database option will want to try it though... (bet it's slow though...)
Theory: pack message bodies together as per mbox.
Justification: mbox seems faster, compresses better and all round seems better than maildir for access speed, except in certain circumstances such as deletes. Dovecot already optimises some corner cases by just marking messages dead without deleting them, so clearly there is tremendous scope for improvement here (dbox going down this route?). Some bright spark might design a backend which uses multiple mbox files to overcome the huge hit when "defragging", and it may well be that by incorporating e.g. splitting out larger attachments, plus light compression, some workloads would see really good performance! (Could be really interesting for archive mailboxes, etc.?)
Just my 2p...
Ed
Seth Mattinen wrote:
Ick, some people (myself included) hate the idea of storing mail in a database versus simple and almost impossible to screw up plain text files of maildir. Cyrus already does the whole mail-in-database thing.
Why do you think 'maildir' isn't a database?
Or to you does 'database' only mean "SQL database"?
"""A database is a collection of information that is organized so that it can easily be accessed, managed, and updated."""
-- Curtis Maloney
Curtis Maloney wrote:
Seth Mattinen wrote:
Ick, some people (myself included) hate the idea of storing mail in a database versus simple and almost impossible to screw up plain text files of maildir. Cyrus already does the whole mail-in-database thing.
Why do you think 'maildir' isn't a database?
Or to you does 'database' only mean "SQL database"?
Please, don't put words in my mouth. I'm not stupid.
~Seth
On Mon, 10 Aug 2009, Timo Sirainen wrote:
- Implement a multi-master filesystem backend for index files. The idea would be that all servers accessing the same mailbox must be talking to each other via the network, and every time something is changed, push the change to the other servers. This is actually very similar to my previous multi-master plan. One of the servers accessing the mailbox would still act as a master and handle conflict resolution and writing indexes to disk more or less often.
What I don't understand here is:
_One_ server is the master, which owns the indexes locally? Oh, 5. means that this particular server is initiating the write, right?
You spoke about thousands of servers; if one of them opens a mailbox, it needs to query all (thousands - 1) servers as to which of them is probably the master of this mailbox. I suppose you need a "home location" server, which other servers connect to, in order to get the server currently locking (aka acting as master for) this mailbox.
GSM has some home location register pointing to the base station currently managing the user info, because the GSM device is in its reach.
There is also another point I'm wondering about: index files are "really more like memory dumps", you wrote. So if you cluster thousands of servers together you'll most probably have different server architectures, say 32-bit vs. 64-bit, CISC vs. RISC, big vs. little endian, ASCII vs. EBCDIC :). Sharing these memory dumps without another abstraction layer wouldn't work.
- Implement filesystem backend for dbox and permanent index storage using some scalable distributed database, such as maybe Cassandra. This is the part I've thought the least about, but it's also the part I hope to (mostly) outsource to someone else. I'm not going to write a distributed database from scratch..
Although I like the "eventually consistent" part, I wonder about the Java-based stuff of Cassandra.
I wonder if the index-backend in 4. and 5. shouldn't be the same.
===
How much work is it to handle the data in the index files? What if any server forwards changes to the master and receives changes from the master to sync its local read-only cache? Then you needn't handle conflicts (except when the network was down) and writes are consistent, originating from this single master server. The actual mail data is accessed via another API.
When the current master no longer needs to access the mailbox, it could hand over the "master" stick to another server currently accessing the mailbox.
Bye,
Steffen Kaiser
On Aug 11, 2009, at 10:32 AM, Steffen Kaiser wrote:
On Mon, 10 Aug 2009, Timo Sirainen wrote:
- Implement a multi-master filesystem backend for index files. The idea would be that all servers accessing the same mailbox must be talking to each other via the network, and every time something is changed, push the change to the other servers. This is actually very similar to my previous multi-master plan. One of the servers accessing the mailbox would still act as a master and handle conflict resolution and writing indexes to disk more or less often.
What I don't understand here is:
_One_ server is the master, which owns the indexes locally? Oh, 5. means that this particular server is initiating the write, right?
Yes, only one would be writing to the shared storage.
You spoke about thousends of servers, if one of them opens a
mailbox, it needs to query all (thousends - 1) servers, which of
them is probably the master of this mailbox. I suppose you need a
"home location" server, which other servers connect to, in order to
get server currently locking (aka acting as master for) this mailbox.
Yeah, keeping track of this information is probably the most difficult
part. But surely it can be done faster than with (thousands-1)
queries :)
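As a rough illustration of that alternative (my sketch, not part of the plan): if every node hashes the user+mailbox identifier onto a small, fixed set of directory servers, GSM-HLR style, then finding the node that knows the current master costs one query instead of a broadcast. All names and the hash choice here are invented.

/* Hypothetical sketch: deterministic "home location" lookup.
 * Every server hashes user+mailbox the same way, so finding the
 * node that tracks the current master costs one query instead of
 * (thousands - 1). */
#include <stdint.h>
#include <stdio.h>

#define N_DIRECTORY_SERVERS 16

/* FNV-1a: a simple, well-known string hash. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    for (; *s != '\0'; s++) {
        h ^= (uint8_t)*s;
        h *= 16777619u;
    }
    return h;
}

/* Returns the index of the directory server responsible for
 * remembering which node is currently master for this mailbox. */
static unsigned int home_location_of(const char *user, const char *mailbox)
{
    char key[256];

    snprintf(key, sizeof(key), "%s\t%s", user, mailbox);
    return fnv1a(key) % N_DIRECTORY_SERVERS;
}

int main(void)
{
    printf("ask server #%u\n", home_location_of("steffen", "INBOX"));
    return 0;
}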
There is also another point I'm wondering about: index files are "really more like memory dumps", you wrote. So if
you cluster thousands of servers together you'll most probably have
different server architectures, say 32bit vs. 64bit, CISC vs. RISC,
big vs. little endian, ASCII vs. EBCDIC :). To share these memory
dumps without another abstraction layer wouldn't work.
Nah, x86 is all there is ;) Dovecot has been fine so far with this
same design. I think only once I've heard that someone wanted to run
both little and big endian machines with shared NFS storage. 32 vs. 64
bit doesn't matter though, indexes have been bitness-independent since
v1.0.rc9.
I tried to make the code use the same endianness everywhere, but
the code quickly became so ugly that I decided to just drop it. But
who knows, maybe some day. :)
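For what it's worth, the bitness-independence mentioned above comes down to defining on-disk records with fixed-width types, so that 32-bit and 64-bit hosts agree on the layout; a minimal illustration (not Dovecot's actual struct):

#include <stdint.h>

/* Fixed-width fields (plus care with padding) give this record the
 * same size and field offsets on 32-bit and 64-bit hosts, so an
 * mmap()ed index can be shared between them. Byte order is still
 * the host's, which is why mixed-endian sharing stays unsupported. */
struct index_record_sketch {
    uint32_t uid;          /* IMAP UID */
    uint32_t flags;        /* message flags bitmask */
    uint64_t cache_offset; /* offset into dovecot.index.cache */
};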
- Implement filesystem backend for dbox and permanent index storage using some scalable distributed database, such as maybe Cassandra.
Although I like the "eventually consistent" part, I wonder about the Java-based stuff of Cassandra.
I'm not yet sure what database exactly to use. I'm not really familiar
with any of them, except the Amazon Dynamo whitepaper that I read, and
that seemed perfect to me. Cassandra still seems to lack some features
that I think are needed.
This is the part I've thought the least about, but it's also the part I hope to (mostly) outsource to someone else. I'm not going to write a distributed database from scratch..
I wonder if the index-backend in 4. and 5. shouldn't be the same.
You mean the permanent index storage? Yes, it probably should be the
same in 4 and 5. 4 just has that in-memory layer in the middle.
How much work is it to handle the data in the index files? What if any server forwards changes to the master and receives changes from the master to sync its local read-only cache? So you needn't handle conflicts (except when the network was down) and writes originate consistently from this single master server. The actual mail data is accessed via another API.
When the current master no longer needs to access the mailbox, it could hand over the "master" stick to another server currently accessing the mailbox.
http://dovecot.org/tmp/replication-plan.txt explains how I previously thought the index replication would work, and I think it'll still work pretty nicely with the index FS backend too. I guess it could mostly work like sending everything to the master, although for some changes it wouldn't really be necessary. I'll need to rethink the plan for this, I guess.
Hi
- Mail data on the other hand is just written once and usually read maybe once or a couple of times. Caching mail data in memory probably doesn't help all that much. Latency isn't such a horrible issue as long as multiple mails can be fetched at once / in parallel, so there's only a single latency wait.
This logically seems correct. Couple of questions then:
Since latency requirements are low, why did performance drop so much previously when you implemented a very simple mysql storage backend? I glanced at the code a few weeks ago and whilst it's surprisingly complicated right now to implement a backend, I was also surprised that a database storage engine "sucked" I think you phrased it? Possibly the code also placed the indexes on the DB? Certainly this could very well kill performance? (Note I'm not arguing sql storage is a good thing, I just want to understand the latency to backend requirements)
I would be thinking that with some care, even very high latency storage would be workable, eg S3/Gluster/MogileFs ? I would love to see a backend using S3 - If nothing else I think it would quickly highlight all the bottlenecks in any design...
- Implement a multi-master filesystem backend for index files. The idea would be that all servers accessing the same mailbox must be talking to each others via network and every time something is changed, push the change to other servers. This is actually very similar to my previous multi-master plan. One of the servers accessing the mailbox would still act as a master and handle conflict resolution and writing indexes to disk more or less often.
Take a look at Mogilefs for some ideas here. I doubt it's a great fit, but they certainly need to solve a lot of the same problems
- Implement filesystem backend for dbox and permanent index storage using some scalable distributed database, such as maybe Cassandra.
CouchDB? It is just the Lotus Notes database after all, and personally I have built some *amazing* applications using that as the backend. (I just love the concept of Notes - the gui is another matter...)
Note that CouchDB is interesting in that it is multi-master with "eventual" synchronisation. This potentially has some interesting issues/benefits for offline use
For the filesystem backend have you looked at the various log structured filesystems appearing? Whenever I watch the debate between Maildir vs Mailbox I always think that a hybrid is the best solution, because you are optimising for a write-once, read-many situation, where you have an increased probability of having good cache localisation on any given read.
To me this ends up looking like log structured storage... (which feels like a hybrid between maildir/mailbox)
- Scalability, of course. It'll be as scalable as the distributed database being used to store mails.
I would be very interested to see a kind of "where the time goes" benchmark of dovecot. Have you measured and found that latency of this part accounts for x% of the response time and CPU bound here is another y%, etc? eg if you deliberately introduce X ms of latency in the index lookups, what does that do to the response time of the system once the cache warms up? What about if the response time to the storage backend changes? I would have thought this would help you determine how to scale this thing?
All in all sounds very interesting. However, couple of thoughts:
What is the goal?
If the goal is performance by allowing a scale-out in quantity of servers then I guess you need to measure it carefully to make sure it actually works? I haven't had the fortune to develop something that big, but the general advice is that scaling out is hard to get right, so assume you made a mistake in your design somewhere... Measure, measure
If the goal is reliability then I guess it's prudent to assume that somehow all servers will get out of sync (eventually). It's definitely nice if they are self repairing as a design goal, eg the difference between a full sync and shipping logs (I ship logs to have a master-master mysql server, but if we have a crash then I use a sync program (maatkit) to check the two servers are in sync and avoid recreating one of the servers from fresh)
If the goal is increased storage capacity on commodity hardware then it needs a useful bunch of tools to manage the replication and make sure there is redundancy and it's easy to find the required storage. I guess look at Mogilefs, if you think you can do better then at least remember it was quite hard work to get to that stage, so doing it again is likely to be non trivial?
If the goal were making it simpler to build a backend storage engine then this would be excellent - I find myself wanting to benchmark ideas like S3 or sticking things in a database, but I looked at the API recently and it's going to require a bit of investment to get started - certainly more than a couple of evenings poking around... Hopefully others would write interesting backends, regardless of whether it's sensible to use them on high performance setups, some folks simply want/need to do unusual things...
Finally I am a bit sad that offline distributed multi-master isn't in the roadmap anymore... :-( - My situation is we have a lot of boats boating around with intermittent expensive satellite connections and the users are fluid and need to get access to their data from land and different vessels. Currently we build software inhouse to make this possible, but it would be fantastic to see more features enabling this on the server side (CouchDB / Lotus Notes is cool...)
Good luck - sounds fun implementing all this anyway!
Ed W
On Aug 12, 2009, at 11:26 AM, Ed W wrote:
Hi
- Mail data on the other hand is just written once and usually read maybe once or a couple of times. Caching mail data in memory probably doesn't help all that much. Latency isn't such a horrible issue as
long as multiple mails can be fetched at once / in parallel, so there's
only a single latency wait.
This logically seems correct. Couple of questions then:
- Since latency requirements are low, why did performance drop so
much previously when you implemented a very simple mysql storage
backend? I glanced at the code a few weeks ago and whilst it's
surprisingly complicated right now to implement a backend, I was
also surprised that a database storage engine "sucked" I think you
phrased it? Possibly the code also placed the indexes on the DB?
Certainly this could very well kill performance? (Note I'm not
arguing sql storage is a good thing, I just want to understand the
latency to backend requirements)
Yes, it placed indexes also to SQL. That's slow. But even without it,
Dovecot code needs to be changed to access more mails in parallel
before the performance can be good for high-latency mail storages.
- I would be thinking that with some care, even very high latency
storage would be workable, eg S3/Gluster/MogileFs ? I would love to
see a backend using S3 - If nothing else I think it would quickly
highlight all the bottlenecks in any design...
Yes, S3 should be possible. With dbox it could even be used to store
the old mails and keep new mails in lower latency storage.
- Implement filesystem backend for dbox and permanent index storage using some scalable distributed database, such as maybe Cassandra.
CouchDB? It is just the Lotus Notes database after all, and
personally I have built some *amazing* applications using that as
the backend. (I just love the concept of Notes - the gui is another
matter...)
Note that CouchDB is interesting in that it is multi-master with
"eventual" synchronisation. This potentially has some interesting
issues/benefits for offline use
CouchDB seems like it would still be more difficult than necessary to
scale. I'd really just want something that distributes the load and
disk usage evenly across all servers and allows easily plugging in
more servers and it automatically rebalances the load. CouchDB seems
like much of that would have to be done manually (or building scripts
to do it).
For the filesystem backend have you looked at the various log structured filesystems appearing? Whenever I watch the debate between Maildir vs Mailbox I always think that a hybrid is the best solution, because you are optimising for a write-once, read-many situation, where you have an increased probability of having good cache localisation on any given read.
To me this ends up looking like log structured storage... (which feels like a hybrid between maildir/mailbox)
Hmm. I don't really see how it looks like log structured storage.. But
you do know that multi-dbox is kind of a maildir/mbox hybrid, right?
- Scalability, of course. It'll be as scalable as the distributed database being used to store mails.
I would be very interested to see a kind of "where the time goes"
benchmark of dovecot. Have you measured and found that latency of
this part accounts for x% of the response time and CPU bound here is
another y%, etc? eg if you deliberately introduce X ms of latency
in the index lookups, what does that do to the response time of the
system once the cache warms up? What about if the response time to
the storage backend changes? I would have thought this would help
you determine how to scale this thing?
I haven't really done any explicit benchmarks, but there are a few
reasons why I think low-latency for indexes is really important:
- All commands that access mails in any way need to first do an index lookup to find the mail.
- Anything using IMAP UIDs needs to do a binary search on the index to find the mail (see the sketch after this list).
- Anything accessing mail metadata needs to do dovecot.index.cache lookups, often many of them. For example FETCH ENVELOPE does something like 10 lookups to cache for each mail.
- After each command Dovecot needs to check if there are new mails by checking if dovecot.index.log has changed.
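The binary search mentioned in the list: with fixed-size records sorted by UID, mapping an IMAP UID to its record is a plain binary search over the mmap()ed record array. A minimal sketch, with an invented record struct:

#include <stdint.h>
#include <stddef.h>

struct index_rec {
    uint32_t uid;
    uint32_t flags;
};

/* Returns the record's array index (its sequence), or -1 if the
 * UID doesn't exist. Records are assumed sorted by ascending UID. */
static long uid_to_seq(const struct index_rec *recs, size_t count, uint32_t uid)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (recs[mid].uid < uid)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (lo < count && recs[lo].uid == uid)
        return (long)lo;
    return -1;
}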
I think it's pretty obvious that if any of those lookups had latency
the performance would soon become pretty horrible. And the reasons why
I think the actual mail storage can live with high latency:
- Whenever processing a command, Dovecot knows beforehand what kind of data it needs. It can quickly go through the index/cache file to find out what message contents it needs, and then send requests for all of those immediately. (Or if there are hundreds, maybe always have something like 20 queued, or whatever is good - see the sketch after this list.) After the first one has arrived, the rest should already be available immediately for access.
- That first initial latency hit is a bit bad, but it probably isn't horrible. Gmail IMAP seems to do ok with pretty high latencies..
- If message data lives in multiple servers, commands that access a large number of mails can run faster since the data can be fetched from multiple servers in parallel, so there's less disk I/O wait.
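That sliding window of queued requests could look something like this sketch; storage_fetch_async(), storage_fetch_wait() and process_mail() are hypothetical stand-ins for whatever async storage API ends up existing, declared but not implemented here:

#include <stdint.h>

#define PREFETCH_WINDOW 20

struct storage;
struct mail;

/* hypothetical async storage API */
void storage_fetch_async(struct storage *st, uint32_t uid);
struct mail *storage_fetch_wait(struct storage *st);
void process_mail(struct mail *mail);

void fetch_all(struct storage *st, const uint32_t *uids, unsigned int count)
{
    unsigned int sent = 0, done = 0;

    while (done < count) {
        /* Top up the window: keep up to PREFETCH_WINDOW requests
         * in flight before blocking on the next result. */
        while (sent < count && sent - done < PREFETCH_WINDOW)
            storage_fetch_async(st, uids[sent++]);

        /* Only the first wait pays full latency; later mails have
         * already been fetched in parallel. */
        process_mail(storage_fetch_wait(st));
        done++;
    }
}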
And why I don't really care much about CPU bottlenecks: As far as I
know, there aren't any. CPU load is typically close to 0%.
All in all sounds very interesting. However, couple of thoughts:
- What is the goal?
- If the goal is performance by allowing a scale-out in quantity of
servers then I guess you need to measure it carefully to make sure
it actually works? I haven't had the fortune to develop something
that big, but the general advice is that scaling out is hard to get
right, so assume you made a mistake in your design somewhere...
Measure, measure
I don't think it's all that much about performance of a single user,
but more about distributing the load more evenly in an easier way.
That's basically done by outsourcing the problem to the underlying
storage (database).
- If the goal is reliability then I guess it's prudent to assume
that somehow all servers will get out of sync (eventually). It's
definitely nice if they are self repairing as a design goal, eg the
difference between a full sync and shipping logs (I ship logs to
have a master-master mysql server, but if we have a crash then I use
a sync program (maatkit) to check the two servers are in sync and
avoid recreating one of the servers from fresh)
Yes, resolving conflicts due to split brain merging back is something
I really want to make work as well as it can. The backend database can
hopefully again help here (by noticing there was a conflict and
allowing the program to resolve it).
- If the goal is increased storage capacity on commodity hardware
then it needs a useful bunch of tools to manage the replication and
make sure there is redundancy and it's easy to find the required
storage. I guess look at Mogilefs, if you think you can do better
then at least remember it was quite hard work to get to that stage,
so doing it again is likely to be non trivial?
This is again something I'm hoping to outsource to the backend database.
- If the goal were making it simpler to build a backend storage
engine then this would be excellent - I find myself wanting to
benchmark ideas like S3 or sticking things in a database, but I
looked at the API recently and it's going to require a bit of
investment to get started - certainly more than a couple of evenings
poking around... Hopefully others would write interesting backends,
regardless of whether it's sensible to use them on high performance
setups, some folks simply want/need to do unusual things...
This is also one of its goals :) Even if I make a mistake in choosing
a bad database first, it should be somewhat easy to implement another
backend again. The backend FS API will be pretty simple. Basically
it's going to be:
- fd = open(path, mode), where mode is one of:
  - recreate atomically once writing is finished (for recreating dovecot.index)
  - create atomically, fail if it already exists (for creating new files)
  - append to existing file
- read(fd, offset, size)
- write(fd, data) - not visible to others until flush() is called
- flush(fd, &offset) - if offset is specified, it's set to the offset where data was actually written to
- unallocate(fd, size) - to free disk space from beginning of file
Then perhaps some kind of readdir() for listing mailboxes, but I
haven't thought of that yet. Since there's no practical way to do
unallocate() with POSIX, it can be done by creating a couple of
different files and rotating them (the way dovecot.index.log is done).
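Rendered as a C header, the operations listed above might look roughly like this; all names and types are invented, since the details were explicitly still open:

#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>

enum fs_open_mode {
    FS_OPEN_RECREATE,   /* replace atomically when writing finishes */
    FS_OPEN_CREATE,     /* create atomically, fail if it exists */
    FS_OPEN_APPEND      /* append to an existing file */
};

struct fs_file;

struct fs_file *fs_open(const char *path, enum fs_open_mode mode);
ssize_t fs_read(struct fs_file *f, uint64_t offset, void *buf, size_t size);

/* Buffered: not visible to other readers until fs_flush(). */
ssize_t fs_write(struct fs_file *f, const void *data, size_t size);

/* If offset_r is non-NULL, it's set to where the data actually got
 * written (needed because concurrent appenders can interleave). */
int fs_flush(struct fs_file *f, uint64_t *offset_r);

/* Free disk space from the beginning of the file. No POSIX
 * equivalent; emulated by rotating files, as described above. */
int fs_unallocate(struct fs_file *f, uint64_t size);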
I'll probably write a more detailed explanation how this is going to
work at some point. Although there are a couple of details I'd still
like to improve.
- Finally I am a bit sad that offline distributed multi-master isn't
in the roadmap anymore... :-(
I think dsync can do that. It'll do two-way syncing between Dovecots and resolve all conflicts. Is the syncing itself still done with very high latencies, i.e. something like USB sticks? That's currently not really working, but it probably wouldn't be too difficult. The protocol is currently something like:
1. Get list of remote's mailboxes and mails.
2. Sync mailbox.
3. Wait for "all messages saved ok" reply. Goto 2.
Steps 2 and 3 are done in both directions at the same time. The extra wait there is to allow some good way to handle COPY failures (dsync prefers to copy messages instead of re-sending them, if possible), and I think there were some other reasons too.
So for USB-stick-like sync I guess it would need to be something like:
1. Read remote's mailbox list and highestmodseq values from a file.
2. Write changes based on modseqs to the file, saving each mail separately instead of using copy. (Or perhaps it could do some kind of COPY fallbacking: send all possible COPYs followed by SAVEs. It wouldn't reduce the traffic, but it could reduce disk space if copying can be done by e.g. hard linking.)
3. Move the USB stick to the other machine.
4. It reads the file, applies the changes, and saves the mailbox list and highest modseq values to the file.
- Since latency requirements are low, why did performance drop so much previously when you implemented a very simple mysql storage backend? I glanced at the code a few weeks ago and whilst it's surprisingly complicated right now to implement a backend, I was also surprised that a database storage engine "sucked" I think you phrased it? Possibly the code also placed the indexes on the DB? Certainly this could very well kill performance? (Note I'm not arguing sql storage is a good thing, I just want to understand the latency to backend requirements)
Yes, it placed indexes also to SQL. That's slow. But even without it, Dovecot code needs to be changed to access more mails in parallel before the performance can be good for high-latency mail storages.
My expectation then is that with local index and sql message storage the performance should be very reasonable for a large class of users... (ok, other problems perhaps arise)
- I would be thinking that with some care, even very high latency storage would be workable, eg S3/Gluster/MogileFs ? I would love to see a backend using S3 - If nothing else I think it would quickly highlight all the bottlenecks in any design...
Yes, S3 should be possible. With dbox it could even be used to store the old mails and keep new mails in lower latency storage.
Mogile doesn't handle S3, but I always thought it would be terrific to have one copy of your data on fast local storage and be able to use slower (sometimes cheaper) storage for backups or less valuable data (eg older messages), ie replicating some data to other storage boxes
CouchDB seems like it would still be more difficult than necessary to scale. I'd really just want something that distributes the load and disk usage evenly across all servers and allows easily plugging in more servers and it automatically rebalances the load. CouchDB seems like much of that would have to be done manually (or building scripts to do it).
Ahh fair enough - I thought it being massively multi-master would allow simply querying different machines for different users. Not a perfect scale-out, but good enough for a whole class of requirements...
For the filesystem backend have you looked at the various log structured filesystems appearing? Whenever I watch the debate between Maildir vs Mailbox I always think that a hybrid is the best solution, because you are optimising for a write-once, read-many situation, where you have an increased probability of having good cache localisation on any given read.
To me this ends up looking like log structured storage... (which feels like a hybrid between maildir/mailbox)
Hmm. I don't really see how it looks like log structured storage.. But you do know that multi-dbox is kind of a maildir/mbox hybrid, right?
Well the access is largely append only, with some deletes and noise at the writing end, but the older storage mostly stays static, with much longer gaps between deletes (and extremely infrequent edits)
So maildir is optimised really for deletes, but improves random access for a subset of operations. Mailbox is optimised for writes and seems like it's generally fast for most operations except deletes (people do worry about having a lot of eggs in one basket, but I think this is really a symptom of other problems at work). Mailbox also has improved packing for small messages and probably has improved cache locality on certain read patterns
So one obvious hybrid would be a mailbox type structure which perhaps splits messages up into variable sized sub mailboxes based on various criteria, perhaps including message age, type of message or message size...? The rapid write/delete activity would happen at the head, perhaps even in a maildir layout, and gradually the storage would become larger and ever more compressed mailboxes as the age/frequency of access/etc declines.
Perhaps this is exactly dbox?
It would also be interesting to consider separating message headers from body content. Have heavy localisation of message headers, and slower, higher latency access to the message body. Would this improve access speeds in general? Also the mime structure could be torn apart to store attachments individually - the motivation being single instance storage of large attachments with identical content... Anyway, these seem like very speculative directions...
I haven't really done any explicit benchmarks, but there are a few reasons why I think low-latency for indexes is really important:
I think low latency for indexes is a given. You appear to have architected the system so that all responses are delivered from the index, and barring an increase in index efficiency the remaining time is spent doing the initial generation and maintenance of those indexes. I would have thought that, barring downloading an entire INBOX, the access time of individual mails was very much secondary?
- If the goal is performance by allowing a scale-out in quantity of servers then I guess you need to measure it carefully to make sure it actually works? I haven't had the fortune to develop something that big, but the general advice is that scaling out is hard to get right, so assume you made a mistake in your design somewhere... Measure, measure
I don't think it's all that much about performance of a single user, but more about distributing the load more evenly in an easier way. That's basically done by outsourcing the problem to the underlying storage (database).
So perhaps something like CouchDB can work then? One user localises per replica and you keep reusing that replica?
Yes, resolving conflicts due to split brain merging back is something I really want to make work as well as it can. The backend database can hopefully again help here (by noticing there was a conflict and allowing the program to resolve it).
In general conflict resolution is thrown back to the application, so likely this is going to become a dovecot problem. It seems that the general class of problem is too hard to solve at the storage side
This is also one of its goals :) Even if I make a mistake in choosing a bad database first, it should be somewhat easy to implement another backend again. The backend FS API will be pretty simple. Basically it's going to be:
I wouldn't get too held back by posix semantics. For sure they are memorable, but definitely consider that transactions are the key to any kind of database performance improvement and make sure you can batch together stuff to make good use of the backend. Your "flush" command seems to be the implicit end of transaction, but I guess give it plenty of thought that you might have a super slow system (eg S3) and the backend might want the flexibility to mark something "kind of done", while uploading for 30 seconds in the background, then marking it properly done once S3 actually acks the data saved?
- Finally I am a bit sad that offline distributed multi-master isn't in the roadmap anymore... :-(
I think dsync can do that. It'll do two-way syncing between Dovecots and resolves all conflicts. Is the syncing itself still done with very high latencies, i.e. something like USB sticks? That's currently not really working, but it probably wouldn't be too difficult.
What is dsync? There is a dsync.org which is some kind of directory synchroniser?
Aha, google suggests that I might have missed an email from you recently... Will read up...
OK, this sounds like a better implementation of the kind of thing we are building here - likely this is the way ahead!
Cheers
Ed W
On Wed, 2009-08-12 at 17:46 +0100, Ed W wrote:
My expectation then is that with local index and sql message storage the performance should be very reasonable for a large class of users... (ok, other problems perhaps arise)
If messages are stored to SQL in dummy blobs then the performance is probably comparable to any other database I'm thinking about.
Yes, S3 should be possible. With dbox it could even be used to store the old mails and keep new mails in lower latency storage.
Mogile doesn't handle S3, but I always thought it would be terrific to have one copy of your data on fast local storage and be able to use slower (sometimes cheaper) storage for backups or less valuable data (eg older messages), ie replicating some data to other storage boxes
dsync can do the replication, dbox can have primary/secondary partitions for message data (if mail is not found from primary, it's looked up from secondary). All that's needed is lib-storage backend for S3, or using some filesystem layer to it. :)
CouchDB seems like it would still be more difficult than necessary to scale. I'd really just want something that distributes the load and disk usage evenly across all servers and allows easily plugging in more servers and it automatically rebalances the load. CouchDB seems like much of that would have to be done manually (or building scripts to do it).
Ahh fair enough - I thought it being massively multi-master would allow simply querying different machines for different users. Not a perfect scale-out, but good enough for a whole class of requirements...
If a user's mails are all stuck on a particular cluster of servers, it's possible that several users on those servers suddenly start increasing their disk load or disk usage and start killing the performance / available space for others. If a user's mails were spread across 100 servers, this would be much less likely.
Hmm. I don't really see how it looks like log structured storage.. But you do know that multi-dbox is kind of a maildir/mbox hybrid, right?
Well the access is largely append only, with some deletes and noise at the writing end, but the older storage mostly stays static, with much longer gaps between deletes (and extremely infrequent edits)
Ah, right. I guess if you think about it from a "single user's mails" point of view.
So maildir is optimised really for deletes, but improves random access for a subset of operations. Mailbox is optimised for writes and seems like it's generally fast for most operations except deletes (people do worry about having a lot of eggs in one basket, but I think this is really a symptom of other problems at work). Mailbox also has improved packing for small messages and probably has improved cache locality on certain read patterns
Yes, this is why I'm also using mbox on dovecot.org for mailing list archives.
So one obvious hybrid would be a mailbox type structure which perhaps splits messages up into variable sized sub mailboxes based on various criteria, perhaps including message age, type of message or message size...? The rapid write/delete activity would happen at the head, perhaps even in a maildir layout, and gradually the storage would become larger and ever more compressed mailboxes as the age/frequency of access/etc declines.
Perhaps this is exactly dbox?
Something like that. In dbox you have one storage directory containing all mailboxes' mails (so that copying can be done by simple index updates). Then you have a bunch of files, each about n MB (configurable, 2 MB by default). Expunging initially only marks the message as expunged in index. Then later (or immediately, configurable) you run a cronjob that goes through all dboxes and actually removes the used space by recreating those dbox files.
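A minimal sketch of the size-based rotation described here (the constant and names are mine; real multi-dbox keeps more per-file metadata):

#include <sys/types.h>
#include <sys/stat.h>

#define DBOX_ROTATE_SIZE (2 * 1024 * 1024)   /* the 2 MB default above */

/* Append new mails to the current storage file until it reaches the
 * size limit, then switch to a fresh one. Expunges only flag the
 * index; a later purge pass rewrites files without the dead mails. */
static int dbox_file_full(const char *path)
{
    struct stat st;

    if (stat(path, &st) < 0)
        return 0;   /* doesn't exist yet: start writing here */
    return st.st_size >= DBOX_ROTATE_SIZE;
}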
It would also be interesting to consider separating message headers from body content. Have heavy localisation of message headers, and slower, higher latency access to the message body. Would this improve access speeds in general?
Probably not much. Usually I think clients download a specific set of headers, and those can be looked up from the dovecot.index.cache file. Although if a header that's not already in the cache is looked up from all messages, it would be faster to go through the headers if they were packed together separately. But then again, that would maybe make it a bit slower to download the full message, since it's split into two places.
I don't really know, but my feeling is that it wouldn't benefit all that much.
Also the mime structure could be torn apart to store attachments individually - the motivation being single instance storage of large attachments with identical content... Anyway, these seem like very speculative directions...
Yes, this is also something in dbox's far future plans.
I haven't really done any explicit benchmarks, but there are a few reasons why I think low-latency for indexes is really important:
I think low latency for indexes is a given. You appear to have architected the system so that all responses are delivered from the index, and barring an increase in index efficiency the remaining time is spent doing the initial generation and maintenance of those indexes. I would have thought that, barring downloading an entire INBOX, the access time of individual mails was very much secondary?
There are of course clients that can download lots of mails, one command at a time.. I guess with those some kind of predictive prefetching could help.
Yes, resolving conflicts due to split brain merging back is something I really want to make work as well as it can. The backend database can hopefully again help here (by noticing there was a conflict and allowing the program to resolve it).
In general conflict resolution is thrown back to the application, so likely this is going to become a dovecot problem. It seems that the general class of problem is too hard to solve at the storage side
Right. I really want to be able to handle the conflict resolution myself.
This is also one of its goals :) Even if I make a mistake in choosing a bad database first, it should be somewhat easy to implement another backend again. The backend FS API will be pretty simple. Basically it's going to be:
I wouldn't get too held back by posix semantics. For sure they are memorable, but definitely consider that transactions are the key to any kind of database performance improvement and make sure you can batch together stuff to make good use of the backend. Your "flush" command seems to be the implicit end of transaction, but I guess give it plenty of thought that you might have a super slow system (eg S3) and the backend might want the flexibility to mark something "kind of done", while uploading for 30 seconds in the background, then marking it properly done once S3 actually acks the data saved?
Well, the API's details can change, but those are the basic operations it needs. The important points are that it doesn't need to overwrite data in existing files and it doesn't need locking, but it does need atomic appends with ability to know the offset where the append actually saved the data.
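The atomic-append-with-offset primitive is the only non-obvious one, and plain POSIX can provide it: with O_APPEND the kernel appends and advances the descriptor's offset atomically, so the start offset of the write can be recovered afterwards. A sketch, assuming the descriptor isn't shared with other writers in the same process (and a filesystem where O_APPEND actually works, which NFS famously isn't):

#include <fcntl.h>
#include <unistd.h>

/* Appends buf to a file opened with O_APPEND and stores the offset
 * the data landed at in *offset_r. Returns 0 on success, -1 on error. */
int append_get_offset(int fd, const void *buf, size_t size, off_t *offset_r)
{
    ssize_t ret = write(fd, buf, size);

    if (ret < 0 || (size_t)ret != size)
        return -1;
    /* write() atomically appended and moved this fd's offset to the
     * end of our data, so subtracting the byte count gives the start. */
    *offset_r = lseek(fd, 0, SEEK_CUR) - ret;
    return 0;
}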
CouchDB seems like it would still be more difficult than necessary to scale. I'd really just want something that distributes the load and disk usage evenly across all servers and allows easily plugging in more servers and it automatically rebalances the load. CouchDB seems like much of that would have to be done manually (or building scripts to do it).
Ahh fair enough - I thought it being massively multi-master would allow simply querying different machines for different users. Not a perfect scale-out, but good enough for a whole class of requirements...
If a user's mails are all stuck on a particular cluster of servers, it's possible that several users on those servers suddenly start increasing their disk load or disk usage and start killing the performance / available space for others. If a user's mails were spread across 100 servers, this would be much less likely.
Sure - I'm not a couchdb expert, but I think the point is that we would need to check the replication options because you would simply balance the requests across all the servers holding those users' data. I'm kind of assuming that data would be replicated across more than one server and there would be some way of choosing which server to use for a given user
I only know couchdb to the extent of having glanced at the website some time back, but I liked the way it looks and thinks like Lotus Notes (I did love building things using that tool about 15 years ago - the replication was just years ahead of its time. The robustness was extraordinary: I remember when the IRA blew up a chunk of Manchester (including one of our servers), everyone just went home and started using the Edinburgh or London office servers and carried on as though nothing had happened...)
Actually its materialised views are rather clever also...
Hmm. I don't really see how it looks like log structured storage.. But you do know that multi-dbox is kind of a maildir/mbox hybrid, right?
Well the access is largely append only, with some deletes and noise at the writing end, but the older storage mostly stays static, with much longer gaps between deletes (and extremely infrequent edits)
Ah, right. I guess if you think about it from a "single user's mails" point of view.
Well, single folder really
So maildir is optimised really for deletes, but improves random access for a subset of operations. Mailbox is optimised for writes and seems like it's generally fast for most operations except deletes (people do worry about having a lot of eggs in one basket, but I think this is really a symptom of other problems at work). Mailbox also has improved packing for small messages and probably has improved cache locality on certain read patterns
Yes, this is why I'm also using mbox on dovecot.org for mailing list archives.
Actually I use maildir, but apart from delete performance (and deletes are usually rare), mailbox seems better for nearly all use patterns
Seems like if it were possible to "solve" delete performance then mailbox becomes the preferred choice for many requirements (also let's solve the backup problem where the whole file changes every day)
So one obvious hybrid would be a mailbox type structure which perhaps splits messages up into variable sized sub mailboxes based on various criteria, perhaps including message age, type of message or message size...? The rapid write/delete activity would happen at the head, perhaps even in a maildir layout, and gradually the storage would become larger and ever more compressed mailboxes as the age/frequency of access/etc declines.
Perhaps this is exactly dbox?
Something like that. In dbox you have one storage directory containing all mailboxes' mails (so that copying can be done by simple index updates). Then you have a bunch of files, each about n MB (configurable, 2 MB by default). Expunging initially only marks the message as expunged in index. Then later (or immediately, configurable) you run a cronjob that goes through all dboxes and actually removes the used space by recreating those dbox files.
Yeah, sounds good.
You might consider some kind of "head optimisation", where we can already assume that the latest chunk of mails will be noisy and have a mixture of deletes/appends, etc. Typically mail arrives, gets responded to, gets deleted quickly, but I would *guess* that if a mail survives for XX hours in a mailbox then likely it's going to continue to stay there for quite a long time until some kind of purge event happens (user goes on a purge, archive task, etc)
Sounds good anyway
Oh, have you considered some "optional" api calls in the storage API?
The logic might be to assume that someone wanted to do something clever
and split the message up in some way, eg store headers separately to
bodies or bodies carved up into mime parts. The motivation would be if
there was a certain access pattern to optimise. Eg for an SQL database
it may well be sensible to split headers and the message body in order
to optimise searching - the current API may not take advantage of that?
Ed W
On Wed, 2009-08-12 at 18:42 +0100, Ed W wrote:
Something like that. In dbox you have one storage directory containing all mailboxes' mails (so that copying can be done by simple index updates). Then you have a bunch of files, each about n MB (configurable, 2 MB by default). Expunging initially only marks the message as expunged in index. Then later (or immediately, configurable) you run a cronjob that goes through all dboxes and actually removes the used space by recreating those dbox files.
Yeah, sounds good.
You might consider some kind of "head optimisation", where we can already assume that the latest chunk of mails will be noisy and have a mixture of deletes/appends, etc. Typically mail arrives, gets responded to, gets deleted quickly, but I would *guess* that if a mail survives for XX hours in a mailbox then likely it's going to continue to stay there for quite a long time until some kind of purge event happens (user goes on a purge, archive task, etc)
If disk space usage isn't such a huge problem, I think the nightly purges solve this issue too. During the day a user may get mails and delete them, and at night the deleted mails are purged. Perhaps it could help a bit if new mails were all stored in separate file(s) and at night then appended to some larger existing file, but that optimization can be left until later. :)
Oh, have you considered some "optional" api calls in the storage API? The logic might be to assume that someone wanted to do something clever and split the message up in some way, eg store headers separately to bodies or bodies carved up into mime parts. The motivation would be if there was a certain access pattern to optimise. Eg for an SQL database it may well be sensible to split headers and the message body in order to optimise searching - the current API may not take advantage of that?
Well, files have paths. I think the storage backend can determine from that what type the data is. So if you're writing to mails/foo/bar/123 it means you're storing a message with ID 123 to mailbox "foo/bar". It could then internally parse the message and store its header/body/mime separately.
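As a tiny sketch of that path-as-type idea (the layout is the mails/foo/bar/123 example from the text; the function and enum are invented):

#include <string.h>

enum fs_data_type { FS_TYPE_MESSAGE, FS_TYPE_INDEX, FS_TYPE_OTHER };

/* The backend classifies what it's given purely from the virtual
 * path, so it can e.g. parse and split messages but not indexes. */
static enum fs_data_type classify_path(const char *path)
{
    if (strncmp(path, "mails/", 6) == 0)
        return FS_TYPE_MESSAGE;  /* e.g. mails/foo/bar/123 */
    if (strstr(path, "dovecot.index") != NULL)
        return FS_TYPE_INDEX;
    return FS_TYPE_OTHER;
}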
Oh, have you considered some "optional" api calls in the storage API? The logic might be to assume that someone wanted to do something clever and split the message up in some way, eg store headers separately to bodies or bodies carved up into mime parts. The motivation would be if there was a certain access pattern to optimise. Eg for an SQL database it may well be sensible to split headers and the message body in order to optimise searching - the current API may not take advantage of that?
Well, files have paths. I think the storage backend can determine from that what type the data is. So if you're writing to mails/foo/bar/123 it means you're storing a message with ID 123 to mailbox "foo/bar". It could then internally parse the message and store its header/body/mime separately.
But would the storage be used optimally if there was a requirement to read in all headers from all emails, say in order to build the cache of messages on "Subject"? Or what about a backend which has some sort of search capability that we could usefully leverage? It's worth considering anyway, because this looks like a design that moves the main storage away from the IMAP server side and scales out (massively), so network capacity might be worth treating as a limited resource?
Does it make sense to push some of the understanding of a message
structure down to the storage backend? Perhaps it could be in some way
optional with a more bruteforce option available on the dovecot side?
ie like fuse, implement what you need and not more?
Ed W
Oh, have you considered some "optional" api calls in the storage API? The logic might be to assume that someone wanted to do something clever and split the message up in some way, eg store headers separately to bodies or bodies carved up into mime parts. The motivation would be if there was a certain access pattern to optimise. Eg for an SQL database it may well be sensible to split headers and the message body in order to optimise searching - the current API may not take advantage of that?
Well, files have paths. I think the storage backend can determine from that what type the data is. So if you're writing to mails/foo/bar/123 it means you're storing a message with ID 123 to mailbox "foo/bar". It could then internally parse the message and store its header/body/mime separately.
I actually thought your idea of having a bunch of cut down IMAP type servers as the backend storage talking to a bunch of beefier frontend servers was quite an interesting idea!
Certainly though a simplification of the on-disk API would encourage new storage engines, so perhaps a three tier infrastructure is worth considering? (Frontend, intelligent backend, storage)
Ed W
On Wed, 2009-08-12 at 19:19 +0100, Ed W wrote:
I actually thought your idea of having a bunch of cut down IMAP type servers as the backend storage talking to a bunch of beefier frontend servers was quite an interesting idea!
Certainly though a simplification of the on-disk API would encourage new storage engines, so perhaps a three tier infrastructure is worth considering? (Frontend, intelligent backend, storage)
I guess this is something similar to what I wrote in my "v3.0 architecture" mail. This new FS abstraction solves some of those problems that v3.0 was supposed to solve, so I'm not that excited about it anymore. But sure, maybe some day. :) For now I'm anyway more interested about getting a simple FS abstraction done.
Ed W wrote:
Actually I use maildir, but apart from delete performance (and deletes are usually rare), mailbox seems better for nearly all use patterns
Seems like if it were possible to "solve" delete performance then mailbox becomes the preferred choice for many requirements (also let's solve the backup problem where the whole file changes every day)
I think dbox's hybrid mbox/maildir approach would combine the best of both, but I haven't looked at dbox progress lately.
~Seth
Timo Sirainen wrote:
Also the mime structure could be torn apart to store attachments individually - the motivation being single instance storage of large attachments with identical content... Anyway, these seem like very speculative directions...
Yes, this is also something in dbox's far future plans.
Speaking as a pathetic little admin of a small site of 20 users, my needs for replication & scalability are quite minor. However, single-instance storage would be a miracle of biblical proportions. Has any progress been made on this? Do you have a roadmap for how you plan on implementing it?
I don't know if you've considered this at all - this was my first thought:
If you're able to store a message with the attachments separately, then you can come up with an attachment database (not meaning to imply SQL backend). Then after breaking the message up into message + attachments, you scan the attachment database to see if it is already present prior to saving it. This could mean that not only could we save on the huge space wasted by idiots merrily forwarding large attachments to multiple people, but even received mails with embedded graphical signatures would benefit.
Daniel
On Wed, 2009-08-12 at 11:35 -0700, Daniel L. Miller wrote:
Timo Sirainen wrote:
Also the mime structure could be torn apart to store attachments individually - the motivation being single instance storage of large attachments with identical content... Anyway, these seem like very speculative directions...
Yes, this is also something in dbox's far future plans.
Speaking as a pathetic little admin of a small site of 20 users, my needs for replication & scalability are quite minor. However, single-instance storage would be a miracle of biblical proportions. Has any progress been made on this?
Do you need per-MIME part single instance storage, or would per-email be enough? Since per-email can already be done with hard links.
Do you have a roadmap for how you plan on implementing it?
I've written about it a couple of times I think, but no specific plans. Something about using hashes anyway.
I don't know if you've considered this at all - this was my first thought:
If you're able to store a message with the attachments separately, then you can come up with an attachment database (not meaning to imply SQL backend). Then after breaking the message up into message + attachments, you scan the attachment database to see if it is already present prior to saving it. This could mean that not only could we save on the huge space wasted by idiots merrily forwarding large attachments to multiple people, but even received mails with embedded graphical signatures would benefit.
Yes, that's pretty much how I thought about it. It's anyway going to be a dbox-only feature. Would be way too much trouble with other formats.
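A speculative sketch of what such an attachment database could look like: content-address each MIME part by hash and hard-link duplicates to a single stored copy. It uses OpenSSL's SHA256() and invented paths; error handling and concurrent-write races are trimmed for brevity.

#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <openssl/sha.h>

/* Builds "store/<hex sha256>" for the attachment's content. */
static void attachment_path(const unsigned char *data, size_t len,
                            char *path, size_t path_size)
{
    unsigned char digest[SHA256_DIGEST_LENGTH];
    size_t i, off;

    SHA256(data, len, digest);
    off = (size_t)snprintf(path, path_size, "store/");
    for (i = 0; i < sizeof(digest) && off + 2 < path_size; i++)
        off += (size_t)snprintf(path + off, path_size - off, "%02x", digest[i]);
}

/* Makes part_file a hard link to the single stored copy of this
 * content, storing the content first if it's new. Returns 0 on ok. */
static int dedup_attachment(const char *part_file,
                            const unsigned char *data, size_t len)
{
    char stored[512];
    FILE *f;

    attachment_path(data, len, stored, sizeof(stored));
    if (link(stored, part_file) == 0)
        return 0;                    /* seen before: share the copy */
    if (errno != ENOENT)
        return -1;
    f = fopen(stored, "wb");         /* first occurrence: store it */
    if (f == NULL || fwrite(data, 1, len, f) != len)
        return -1;
    fclose(f);
    return link(stored, part_file);
}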
On 8/12/2009, Timo Sirainen (tss@iki.fi) wrote:
Do you need per-MIME part single instance storage, or would per-email be enough? Since per-email can already be done with hard links.
Our users are constantly in-line forwarding the same emails with (20+MB) attachment(s) to different people, but completely altering the body content, so we would definitely need per mime-part, since only the large binary attachments would be identical.
So, I would also regard this as a miracle (dunno about biblical proportions, but close), as long as it applies server wide - ie, all domains hosted by one particular dovecot instance.
--
Best regards,
Charles
On 8/12/2009, Timo Sirainen (tss@iki.fi) wrote:
Do you need per-MIME part single instance storage, or would per-email be enough? Since per-email can already be done with hard links.
The only thing I can find about this on the wiki is where it says single instance attachment storage (for dbox) is planned. Is how to accomplish single instance email storage documented anywhere? And is this reliable enough to use on a production system?
The reason I ask is, this would solve *one* of our problems, namely, my having to limit attachments on our mail lists. Since these emails would be identical, I could start allowing large attachments to them and there would be only one actual message stored with the subsequent deliveries being hard links?
--
Best regards,
Charles
On Wed, 2009-08-12 at 15:33 -0400, Charles Marcus wrote:
On 8/12/2009, Timo Sirainen (tss@iki.fi) wrote:
Do you need per-MIME part single instance storage, or would per-email be enough? Since per-email can already be done with hard links.
The only thing I can find about this on the wiki is where it says single instance attachment storage (for dbox) is planned. Is how to accomplish single instance email storage documented anywhere? And is this reliable enough to use on a production system?
Two possible ways:
a) Just write a script to find identical mails and replace them with hard links to the same file. :)
b) Use deliver -p <file> for delivering mails. You'll probably need to write some kind of a script for delivering mails, so that when it gets called with multiple recipients it can write the mail to a temp file and call deliver -p for each recipient using the same file.
So yeah, either one is as reliable as the script :)
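A sketch of the wrapper from option b) done in C (deliver's install path is an assumption; -p and -d are as described above; error handling mostly omitted): spool the message once, then hand the same file to deliver for each recipient, so all copies come from one file and can share storage.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

#define DELIVER "/usr/libexec/dovecot/deliver"  /* assumed install path */

int main(int argc, char **argv)
{
    char tmpl[] = "/tmp/delivery-XXXXXX";
    char buf[8192];
    ssize_t n;
    int i, fd = mkstemp(tmpl);

    if (fd < 0)
        return 1;

    /* spool the incoming message to the temp file once */
    while ((n = read(0, buf, sizeof(buf))) > 0)
        if (write(fd, buf, (size_t)n) != n)
            return 1;
    close(fd);

    /* deliver the same file to each recipient given on the command line */
    for (i = 1; i < argc; i++) {
        pid_t pid = fork();

        if (pid == 0) {
            execl(DELIVER, DELIVER, "-p", tmpl, "-d", argv[i], (char *)NULL);
            _exit(127);          /* exec failed */
        }
        if (pid > 0)
            waitpid(pid, NULL, 0);
    }
    unlink(tmpl);
    return 0;
}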
On 8/12/2009 4:02 PM, Timo Sirainen wrote:
Do you need per-MIME part single instance storage, or would per-email be enough? Since per-email can already be done with hard links.
The only thing I can find about this on the wiki is where it says single instance attachment storage (for dbox) is planned. Is how to accomplish single instance email storage documented anywhere? And is this reliable enough to use on a production system?
Two possible ways:
Heh... ok, so when you said 'it is possible', you didn't mean dovecot has native support for it...
Sadly, since ianap, I will have to wait for something that is officially supported... but thanks for the explanation.
--
Best regards,
Charles
On Wed, 12 Aug 2009, Timo Sirainen wrote:
So yeah, either one is as reliable as the script :)
Well, I think this is not what the OP intended :)
This will work for only a small number of mails, I see, because:
a) forwarded messages differ, b) re-sent messages differ in headers, c) many mailing lists send one mail per subscriber to catch user-specific bounces (headers differ), d) some mail relays or MTAs split the recipients list if it is too large (headers differ).
Although I would like to have it in Dovecot, it certainly makes some administration stuff on the server more difficult, so I'm not sure if I would actually use it...
Bye,
Steffen Kaiser
Timo Sirainen wrote:
On Wed, 2009-08-12 at 11:35 -0700, Daniel L. Miller wrote:
Timo Sirainen wrote:
Also the mime structure could be torn apart to store attachments individually - the motivation being single instance storage of large attachments with identical content... Anyway, these seem like very speculative directions...
Yes, this is also something in dbox's far future plans.
Speaking as a pathetic little admin of a small site of 20 users, my needs for replication & scalability are quite minor. However, single-instance storage would be a miracle of biblical proportions. Has any progress been made on this?
Do you need per-MIME part single instance storage, or would per-email be enough? Since per-email can already be done with hard links.
Definitely per MIME part.
Do you have a roadmap for how you plan on implementing it?
I've written about it a couple of times I think, but no specific plans. Something about using hashes anyway.
I don't know if you've considered this at all - this was my first thought:
If you're able to store a message with the attachments separately, then you can come up with an attachment database (not meaning to imply SQL backend). Then after breaking the message up into message + attachments, you scan the attachment database to see if it is already present prior to saving it. This could mean that not only could we save on the huge space wasted by idiots merrily forwarding large attachments to multiple people, but even received mails with embedded graphical signatures would benefit.
Yes, that's pretty much how I thought about it. It's anyway going to be a dbox-only feature. Would be way too much trouble with other formats.
dbox-only is fine. I couldn't care less about the storage method chosen - filesystem, db, encrypted, whatever - but I believe the impact on storage - and possibly indexes & searching - would be huge.
On the personal greedy side, if you want to see a mass corporate migration to Dovecot, with potential service contracts - that would be a feature worth talking about. I can see IT managers' eyes light up at hearing about such an item - and I've never heard of any other mail server supporting such a thing.
Daniel
On 8/12/2009, Daniel L. Miller (dmiller@amfes.com) wrote:
dbox-only is fine. I couldn't care less about the storage method chosen - filesystem, db, encrypted, whatever - but I believe the impact on storage - and possibly indexes & searching - would be huge.
It would be huge for us and anyone else that deals with a lot of large attachments (we're in the advertising industry).
On the personal greedy side, if you want to see a mass corporate migration to Dovecot, with potential service contracts - that would be a feature worth talking about. I can see IT managers' eyes light up at hearing about such an item
Mine are shining right now... ;)
- and I've never heard of any other mail server supporting such a thing.
Exchange does... and is the single and only reason I have *considered* switching to it <shudder> in the past few years...
--
Best regards,
Charles
On Wed, 2009-08-12 at 15:42 -0400, Charles Marcus wrote:
- and I've never heard of any other mail server supporting such a thing.
Exchange does... and is the single and only reason I have *considered* switching to it <shudder> in the past few years...
I heard that the next Exchange version drops that feature, because it typically causes more disk I/O when reading. I don't know if it's still possible to enable it optionally though.
Timo Sirainen wrote:
On Wed, 2009-08-12 at 15:42 -0400, Charles Marcus wrote:
- and I've never heard of any other mail server supporting such a thing.
Exchange does... and is the single and only reason I have *considered* switching to it <shudder> in the past few years...
I heard that the next Exchange version drops that feature, because it typically causes more disk I/O when reading. I don't know if it's still possible to enable it optionally though.
It also had a bunch of limitations; it was basically only single-instance for CC recipients on a message (more or less). Quite a lot of things, such as certain types of virus scanning, would (I think) easily disable the single instance storage also?
So I doubt it would help in most of the cases mentioned here, ie each time it was re-forwarded internally it would not be single instanced
I still think it would be instructive to do some benchmarks though - often these things look good on paper, but are surprisingly less effective (given the implementation cost) when measured. I'm not disagreeing, just would be interested to see some numbers...
I think Perl's MIME-tools would make it pretty easy to build something which scanned all files and created a hash of all interesting attachments. Quite possibly there is an even more clever way to get the same result by misusing some Dovecot feature?
Good luck!
Ed W
On 8/12/2009, Ed W (lists@wildgooses.com) wrote:
It also had a bunch of limitations; it was basically only single-instance for CC recipients on a message (more or less). Quite a lot of things, such as certain types of virus scanning, would (I think) easily disable the single instance storage also?
So I doubt it would help in most of the cases mentioned here, ie each time it was re-forwarded internally it would not be single instanced
I still think it would be instructive to do some benchmarks though - often these things look good on paper, but are surprisingly less effective (given the implementation cost) when measured. I'm not disagreeing, just would be interested to see some numbers...
Amazing... I mean, since Exchange is already a 'database', how hard would it be to do it right (checksum each mime part, and use hardlinks for subsequent duplicate checksummed mimeparts)? As long as everything was properly and effectively indexed, it should be easily doable.
Make it do the work at delivery, and have a background task that de-dupes all messages not flagged as having already been de-duped at delivery time, running when the load is light enough.
--
Best regards,
Charles
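To make the checksum-and-hardlink idea concrete, here is a minimal Python sketch of the background de-dup pass; the directory layout, the SHA-256 choice, and the assumption that stored mails are never rewritten in place are all illustrative, not anything Exchange or Dovecot actually does.

#!/usr/bin/env python3
"""Hash every file under a mail store and hardlink duplicates.

Sketch only: assumes everything lives on one filesystem (hardlinks
cannot cross filesystems) and that stored mails are never modified
in place (a write through one link would change all linked copies).
"""
import hashlib
import os
import sys

def file_digest(path, bufsize=1 << 16):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def dedup(root):
    seen = {}   # digest -> path of the first copy we kept
    saved = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            digest = file_digest(path)
            if digest not in seen:
                seen[digest] = path
                continue
            # Atomically swap the duplicate for a hardlink to the
            # first copy: link under a temp name, rename over it.
            tmp = path + '.dedup-tmp'
            os.link(seen[digest], tmp)
            saved += os.path.getsize(path)
            os.replace(tmp, path)
    return saved

if __name__ == '__main__':
    print('saved bytes:', dedup(sys.argv[1]))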
On 8/12/2009 3:58 PM, Timo Sirainen wrote:
- and I've never heard of any other mail server supporting such a thing.
Exchange does... and is the single and only reason I have *considered* switching to it <shudder> in the past few years...
I heard that the next Exchange version drops that feature, because it typically causes more disk I/O when reading. I don't know if it's still possible to enable it optionally though.
Wow... I can hear a lot of sysadmins screaming at the top of their lungs if/when they discover this the hard way.
I'm also having trouble figuring out how using hard links (or their equivalent) for messages with large attachments and having only one instance of the attachment could cause *more* disk I/O than having dozens or hundreds of copies of the message.
Guess it's an Exchange 'feature'... ;)
--
Best regards,
Charles
On Aug 13, 2009, at 8:13 AM, Charles Marcus wrote:
I'm also having trouble figuring out how using hard links (or their equivalent) for messages with large attachments and having only one instance of the attachment could cause *more* disk I/O than having dozens or hundreds of copies of the message.
The thinking is that nowadays seeks are what's killing disk I/O, so
whenever possible just do a single large read. With single instance
storage there would be one additional seek (if the message wasn't
already in memory).
-------- Original Message --------
Date: Wed, 12 Aug 2009 12:34:40 -0700 From: "Daniel L. Miller" <dmiller@amfes.com> To: Dovecot Mailing List <dovecot@dovecot.org> Subject: Re: [Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem
On Wed, 2009-08-12 at 11:35 -0700, Daniel L. Miller wrote:
Timo Sirainen wrote:
Also the mime structure could be torn apart to store attachments individually - the motivation being single instance storage of large attachments with identical content... Anyway, these seem like very speculative directions...
Yes, this is also something in dbox's far future plans.
Speaking as a pathetic little admin of a small site of 20 users, my needs for replication & scalability are quite minor. However, single-instance storage would be a miracle of biblical proportions. Has any progress been made on this?
Do you need per-MIME part single instance storage, or would per-email be enough? Since per-email can already be done with hard links.
Do you have a roadmap for how you plan on implementing it?
I've written about it a couple of times I think, but no specific plans. Something about using hashes anyway.
Definitely per MIME part. I don't know if you've considered this at all - this was my first thought:
If you're able to store a message with the attachments separately, then you can come up with an attachment database (not meaning to imply SQL backend). Then after breaking the message up into message + attachments, you scan the attachment database to see if it is already present prior to saving it. This could mean that not only could we save on the huge space wasted by idiots merrily forwarding large attachments to multiple people, but even received mails with embedded graphical signatures would benefit.
Yes, that's pretty much how I thought about it. It's anyway going to be a dbox-only feature. Would be way too much trouble with other formats.
dbox-only is fine. I could care less about the storage method chosen - filesystem, db, encrypted, whatever - but I believe the impact on storage - and possibly indexes & searching - would be huge.
On the personal greedy side, if you want to see a mass corporate migration to Dovecot, with potential service contracts - that would be a feature worth talking about. I can see IT managers' eyes light up at hearing about such an item - and I've never heard of any other mail server supporting such a thing.
-- Daniel
IBM Lotus Domino has had that feature for ages (they call it shared mail). And they don't have it just for normal mails but for archives as well (called single instance store). This feature was first introduced in cc:Mail and then got integrated into Domino and is still there, and has even been extended to work with various backends (like the new DB2 backend). Microsoft copied the concept from them (from my viewpoint the way MS did it in the past was horrible; I think newer versions work better, but I am not sure).
From my experience doing messaging for two decades I can tell you that it is not worth doing single instance store (or whatever you call it). Storage is ultra cheap these days and backup systems are so fast that all the benefits which were valid some years ago are gone today.
It might rock your geek heart to implement something like that, but doing the math on costs versus benefits will sooner or later show you that today it's not worth doing.
Steve
Steve wrote:
dbox-only is fine. I could care less about the storage method chosen - filesystem, db, encrypted, whatever - but I believe the impact on storage - and possibly indexes & searching - would be huge.
On the personal greedy side, if you want to see a mass corporate migration to Dovecot, with potential service contracts - that would be a feature worth talking about. I can see IT managers' eyes light up at hearing about such an item - and I've never heard of any other mail server supporting such a thing.
IBM Lotus Domino has had that feature for ages (they call it shared mail). And they don't have it just for normal mails but for archives as well (called single instance store). This feature was first introduced in cc:Mail and then got integrated into Domino and is still there, and has even been extended to work with various backends (like the new DB2 backend). Microsoft copied the concept from them (from my viewpoint the way MS did it in the past was horrible; I think newer versions work better, but I am not sure).
From my experience doing messaging for two decades I can tell you that it is not worth doing single instance store (or whatever you call it). Storage is ultra cheap these days and backup systems are so fast that all the benefits which were valid some years ago are gone today.
It might rock your geek heart to implement something like that, but doing the math on costs versus benefits will sooner or later show you that today it's not worth doing.
I have no experience with Domino, but I just did a Google for "lotus
domino shared mail" and read the brief on lotus.com. Based on what I
read, it has potential - only splits message headers from bodies and
stores the bodies as complete images, without separating attachments.
That helps reduce the load when somebody blasts out a flier to everyone
in the company in a single message - but I'm asking for something more
ambitious.
If every attachment in a given message is individually scanned to generate some unique identifier, and that identifier is then used to determine whether or not it already exists in the database - this could have HUGE effects. This addresses not just the simple broadcast case, but some really crazy possibilities.
User A receives a message with an attachment (like a product brochure), likes it, and forwards it to Users B-Z. User F recognizes that product, but has a counter-proposal, so he attaches another brochure and replies to A-Z; being an idiot, he keeps the original attachment in the reply. User H forwards this message to a buddy at another company for discussion. [...time passes...] Three weeks later, User 101 at the other company gets back from vacation and has just received a message with the original brochure. He forwards it to User A (who started this mess). User A, being a total dimwit, doesn't recognize that he already spread this junk throughout the company last month - so he broadcasts it again.
Under the structure I've proposed, net storage consumed by the attachments should be one copy of attachment 1 and one copy of attachment 2, plus headers and any comments in the messages times the number of recipients. Domino would store one copy of attachment 1, then a copy of attachment 1 + attachment 2, then another copy of attachment 1.
This is a minor example - but I just wanted to show SOMETHING to justify the effort.
As far as cheap storage - I agree costs are a fraction of what they once were. But by reducing the amount stored, consider the tradeoffs of reduced caching, smaller differential backups, and reduced archival costs (off-site storage is often priced per GB), just to name a few.
To me the only downside (other than requiring Timo to invest more blood, sweat, & tears in this project) is how much this costs in message READ time. For me, typical user interaction is reading. As I believe was previously mentioned, if the server implements some type of delayed delete function, then delete times are not a concern. And write times are also (I think) a minor concern. But the primary issue is how fast we can retrieve a message + attachments and stream it to the client.
It seems to me header lists won't be impacted, so simply pointing the mail client at the server to see a list of mail shouldn't change at all. So the question is the potential latency from when a user selects a message to when it appears on their screen. Will the time spent searching the disk, and assembling the message, be significant compared with the network communication between server & client?
-- Daniel
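A rough Python sketch of the save path Daniel describes - split a message into parts, hash each attachment, and write it to a content-addressed store only if the hash is new. The store layout, the SHA-256 digest, and the ATTACH_DIR path are illustrative assumptions, not an existing Dovecot mechanism.

import email
import email.policy
import hashlib
import os

ATTACH_DIR = '/var/mail/attachments'   # hypothetical store location

def store_attachments(raw_message):
    """Return (message, refs): each attachment is written at most
    once, keyed by the digest of its decoded content."""
    msg = email.message_from_bytes(raw_message,
                                   policy=email.policy.default)
    refs = []
    for part in msg.walk():
        if part.get_content_disposition() != 'attachment':
            continue
        payload = part.get_payload(decode=True) or b''
        digest = hashlib.sha256(payload).hexdigest()
        # Fan out into subdirectories so one directory doesn't
        # accumulate millions of entries.
        path = os.path.join(ATTACH_DIR, digest[:2], digest)
        if not os.path.exists(path):
            # A real implementation would write a temp file and
            # rename it into place to stay crash-safe.
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, 'wb') as f:
                f.write(payload)
        refs.append((part.get_filename(), digest))
    return msg, refs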
On Wed, 2009-08-12 at 14:54 -0700, Daniel L. Miller wrote:
If every attachment in a given message is individually scanned to generate some unique identifier, and that identifier is then used to determine whether or not it already exists in the database - this could have HUGE effects. This addresses not just the simple broadcast case, but some really crazy possibilities.
Oh BTW. I think dbmail 2.3 does that. Then again I haven't yet seen a stable dbmail version. But looks like they've released 2.3.6 recently that I haven't tested yet.
On Wed, 2009-08-12 at 18:18 -0400, Timo Sirainen wrote:
Oh BTW. I think dbmail 2.3 does that. Then again I haven't yet seen a stable dbmail version. But looks like they've released 2.3.6 recently that I haven't tested yet.
Looks like it even does single instance header values:
The header caching tables used since 2.2 have been replaced with a new schema, optimized for a much smaller storage footprint, and therefore faster access. Headers are now cached using a single-instance storage pattern, similar to the one used for the message parts. This change also introduces for the first time the appearance of views in the database, which is somewhat experimental because of some uncertainties with regard to the possible performance impact this may have.
But somehow I think the performance isn't going to be very good for downloading the full header if it has to piece it together from lots of fields stored all around the database.
Timo Sirainen wrote:
On Wed, 2009-08-12 at 18:18 -0400, Timo Sirainen wrote:
Oh BTW. I think dbmail 2.3 does that. Then again I haven't yet seen a stable dbmail version. But looks like they've released 2.3.6 recently that I haven't tested yet.
Looks like it even does single instance header values:
LOL - I started off hijacking this thread for SIS - and now you just invited the next one: Have you done, or are you aware of, recent comparisons between Dovecot & dbmail? I'd like to think Dovecot is faster, more stable, more feature-rich, and less fattening...
I don't WANT dbmail!
The header caching tables used since 2.2 have been replaced with a new schema, optimized for a much smaller storage footprint, and therefore faster access. Headers are now cached using a single-instance storage pattern, similar to the one used for the message parts. This change also introduces for the first time the appearance of views in the database, which is somewhat experimental because of some uncertainties with regard to the possible performance impact this may have.
But somehow I think the performance isn't going to be very good for downloading the full header if it has to piece it together from lots of fields stored all around the database.
Do you have performance concerns for what we've been discussing for SIS in Dovecot?
We can spin off some other threads if you'd prefer to return to your original question - but I guess the question on everybody's mind (well, at least mine) right now is: will YOU try to implement SIS in the near future? Regardless of the backend used?
-- Daniel
On Wed, 2009-08-12 at 18:18 -0400, Timo Sirainen wrote:
On Wed, 2009-08-12 at 14:54 -0700, Daniel L. Miller wrote:
If every attachment in a given message is individually scanned to generate some unique identifier, and that identifier is then used to determine whether or not it already exists in the database - this could have HUGE effects. This addresses not just the simple broadcast case, but some really crazy possibilities.
Oh BTW. I think dbmail 2.3 does that. Then again I haven't yet seen a stable dbmail version. But looks like they've released 2.3.6 recently that I haven't tested yet.
Tested again. Still crashes in the middle of imaptest runs. And imaptest now reports more bugs than last time I tried..
Archiveopteryx probably does SIS and works better.
On 8/12/2009, Daniel L. Miller (dmiller@amfes.com) wrote:
Under the structure I've proposed, net storage consumed by the attachments should be one copy of attachment 1, and one copy of attachment two, plus headers and any comments in the messages times the number of recipients. Domino would store one copy of attachment 1, then a copy of attachment 1 + attachment 2, then another copy of attachment 1.
Personally, I only care about binary attachments over a certain size.
As I have said before, I don't see the value in doing this for every message and for every MIME part. That said, if it doesn't really cost anything extra to do the entire message and all MIME parts, then fine, I don't really have anything against it, as long as it is robust and reliable.
But for example, what I'd really like to be able to do is say something like:
SiS_mode = binary,64K
So only binary attachments over 64KB in size would be checksummed and single instance stored. I don't care about the headers, or body text, or tiny (< 64KB) sig attachments, or text attachments (PGP sigs, etc).
Again - for shops that must deal with large binary attachments, this would be a god-send.
Our max allowed message size is 50MB, and we typically get anywhere from 2-10 messages a day containing 20, 30, or even 40MB attachments sent to our distribution lists - so these would go to 50+ people, who then forward them to others, etc., etc., ad nauseam.
Currently, I have mailman set to hold these, then I go in and strip off the attachment, put it in a shared location, then let the message (minus the attachment) through. But we still have a *lot* of messages like this that don't go through our lists, but are sent to 2, 3, or 10 of our reps individually.
I did a manual approximation on one person's mail store once, and determined that our total storage requirements, if SiS was implemented for large attachments, would be reduced by about 90-95%. So, from about 2TB currently, to about 100-200GB. That is HUGE, from both a storage *and* backup standpoint.
--
Best regards,
Charles
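The rule Charles sketches is cheap to test per MIME part. A small Python illustration, treating SiS_mode = binary,64K as the purely hypothetical setting it is:

def parse_sis_mode(value):
    """Parse the hypothetical 'binary,64K' syntax into
    (kind, minimum size in bytes)."""
    kind, size = value.split(',')
    units = {'K': 1 << 10, 'M': 1 << 20}
    return kind, int(size[:-1]) * units[size[-1].upper()]

def part_qualifies(part, kind, min_bytes):
    # Only non-text parts at or above the threshold get de-duped;
    # headers, body text and small sig attachments are left alone.
    if kind == 'binary' and part.get_content_maintype() == 'text':
        return False
    payload = part.get_payload(decode=True) or b''
    return len(payload) >= min_bytes

# parse_sis_mode('binary,64K') == ('binary', 65536)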
Hi,
On 2009-08-13 20:46, Charles Marcus wrote: [...]
Again - for shops that must deal with large binary attachments, this would be a god-send.
Our max allowed message size is 50MB, and we typically get anywhere from 2-10 messages a day containing 20, 30, or even 40MB attachments sent to our distribution lists - so these would go to 50+ people, who then forward them to others, etc., etc., ad nauseam.
Currently, I have mailman set to hold these, then I go in and strip off the attachment, put it in a shared location, then let the message (minus the attachment) through. But we still have a *lot* of messages like this that don't go through our lists, but are sent to 2, 3, or 10 of our reps individually. [...]
I implemented a solution that has been working well for us for a couple of months already. It has one serious limitation though, which will make it unsuitable for many environments: all mail receivers who are part of the process will be able to see all attachments of all other mail receivers. So this only works in a cooperative environment.
In short, a script (implemented as a filter, called by postfix) extracts all attachments on arrival, using ripmime [1]. The attachments are then moved to a Samba share which all receivers can access. Furthermore, the original mail gets altered by altermime [2], which inserts a file:/// link to the attachment(s) at the bottom of the mail and removes the attachment(s) from the mail. Finally, during the weekend, a file deduplication script (hardlink.py [3]) on the aforementioned Samba server checksums all files in the attachments directory and hardlinks identical files. So this way we save the base64 overhead, and duplicate attachments sent to multiple persons are stored only once. Also, file handling on a Samba share is much easier than having to extract attachments via the MUA first.
Patrick.
STAR Software (Shanghai) Co., Ltd. http://www.star-group.net/ Phone: +86 (21) 3462 7688 x 826 Fax: +86 (21) 3462 7779
PGP key: E883A005 https://stshacom1.star-china.net/keys/patrick_nagel.asc Fingerprint: E09A D65E 855F B334 E5C3 5386 EF23 20FC E883 A005
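For anyone curious what the filter half of such a setup amounts to, here is a much-simplified Python sketch. Patrick's actual setup uses ripmime and altermime; the share path below and the replacement text are assumptions for illustration.

import email
import email.policy
import hashlib
import os

SHARE = '/srv/samba/attachments'   # assumed share path

def strip_attachments(raw_message):
    """Move each attachment to the share and replace it in the
    mail with a short note pointing at its new location."""
    msg = email.message_from_bytes(raw_message,
                                   policy=email.policy.default)
    for part in msg.walk():
        if part.get_content_disposition() != 'attachment':
            continue
        data = part.get_payload(decode=True) or b''
        # Name the file after its content hash, so the weekend
        # de-dup pass becomes a no-op for repeated attachments.
        name = hashlib.sha1(data).hexdigest()
        dest = os.path.join(SHARE, name)
        if not os.path.exists(dest):
            with open(dest, 'wb') as f:
                f.write(data)
        part.set_content('[attachment moved to file://%s]' % dest)
    return msg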
On Fri, 14 Aug 2009, Patrick Nagel wrote:
able to see all attachments of all other mail receivers. So this only works in a cooperative environment.
One can extend that scheme a bit.
In short, a script (implemented as filter, getting called by postfix) extracts all attachments on arrival, using ripmime [1]. The attachments
We use MIMEDefang on the receiving MTA.
There I remove certain MIME parts and put them on a webspace; the filename (and hence the URL) is the seeded SHA1 of the content. So it is not easy to guess a URL without already knowing the seed and the content, or the mail itself.
However, the reactions to this vary quite widely. Some are glad, because they can download attachments on demand; others hate the extra step. Some users think the mail is altered and the copyright of the sender is infringed. In a few cases, I ripped some pictures from an HTML mail, which caused an uproar. Also, S/MIME and PGP signing won't work if the signature is transmitted in a separate MIME part.
Bye,
Steffen Kaiser
Hi,
On 2009-08-14 17:36, Steffen Kaiser wrote:
able to see all attachments of all other mail receivers. So this only works in a cooperative environment.
One can extend that scheme a bit.
In short, a script (implemented as filter, getting called by postfix) extracts all attachments on arrival, using ripmime [1]. The attachments
We use MIMEDefang on the receiving MTA.
There I remove certain MIME parts and put them on a webspace; the filename (and hence the URL) is the seeded SHA1 of the content. So it is not easy to guess a URL without already knowing the seed and the content, or the mail itself.
However, the reactions to this vary quite widely. Some are glad, because they can download attachments on demand; others hate the extra step. Some users think the mail is altered and the copyright of the sender is infringed. In a few cases, I ripped some pictures from an HTML mail, which caused an uproar. Also, S/MIME and PGP signing won't work if the signature is transmitted in a separate MIME part.
Yes, with the security comes the hassle (as usual) - what I forgot to mention was that the script also inserts a file:/// link to the directory that contains the attachment(s) (for each mail with attachments a new directory is created on the share). So the users can just click that link and their file browser opens. They can then see all attachments of a mail, and they can really "work" with them, not just download them (one by one).
The rest sounds familiar ;) I inserted some conditions on which the script stops processing the message, and just passes it along, as if it didn't have an attachment - for example when it finds any signs of PGP signing or encryption...
Another thing that I didn't mention: We do the same for sent mails - a cronjob periodically checks the users' Sent folders for mails that don't already contain the "has been checked for attachments" header (we use maildir; it only checks mails of the last 24 hours for obvious performance reasons). If it finds one, it gets processed by the script - and in any case (attachments or not) it gets the "has been checked for attachments" header set. Afterwards it gets passed to deliver, which files it back into the Sent folder. I wonder if there is a better solution for this... something with inotify would probably help a lot. And sieve being able to call the script would also help a lot ;)
The biggest catch for our users seems to be that they have to re-attach the file(s) if they want to forward an e-mail. But I think they got used to it, and maybe it helps in promoting protocols that are actually made for transferring files ;)
Patrick.
STAR Software (Shanghai) Co., Ltd. http://www.star-group.net/ Phone: +86 (21) 3462 7688 x 826 Fax: +86 (21) 3462 7779
PGP key: E883A005 https://stshacom1.star-china.net/keys/patrick_nagel.asc Fingerprint: E09A D65E 855F B334 E5C3 5386 EF23 20FC E883 A005
On 8/14/2009, Steffen Kaiser (skdovecot@smail.inf.fh-brs.de) wrote:
However, the reactions to this vary quite widely. Some are glad, because they can download attachments on demand; others hate the extra step. Some users think the mail is altered and the copyright of the sender is infringed. In a few cases, I ripped some pictures from an HTML mail, which caused an uproar. Also, S/MIME and PGP signing won't work if the signature is transmitted in a separate MIME part.
Yeah, I toyed with the idea of hiring someone to do something like this a long time ago, but these types of things prevented me from pulling the trigger.
I'd much prefer the users to not be able to tell anything different - the attachment should look just like it does now to them...
--
Best regards,
Charles
On 8/14/2009, Patrick Nagel (patrick.nagel@star-group.net) wrote:
I implemented a solution that has been working well for us for a couple of months already. It has one serious limitation though, which will make it unsuitable for many environments: all mail receivers who are part of the process will be able to see all attachments of all other mail receivers. So this only works in a cooperative environment.
Heh... I was interested, until I got to the limitation part...
For many emails/attachments this wouldn't be a problem, but for some it would, so it wouldn't work for us...
Also, since ianap, I would much prefer something this fundamental to be a fully supported feature of dovecot, as opposed to something I have to worry about breaking somehow with every update of dovecot...
--
Best regards,
Charles
But for example, what I'd really like to be able to do is say something like:
SiS_mode = binary,64K
So only binary attachments over 64KB in size would be checksummed and single instance stored. I don't care about the headers, or body text, or tiny (< 64KB) sig attachments, or text attachments (PGP sigs, etc).
Also, I don't care about putting them in an SQL db...
It would be good enough for me to also be able to do:
SiS_dir = /var/virtual/mail/attachments
Have all attachments dumped in there and hardlinked to each message, and just use a simple index file in the directory with the attachment name and MD5 checksum (if MD5 is good enough - I'd like to avoid collisions too).
This way the attachments could even be stored in some other filesystem, to keep the big stuff off the main server.
--
Best regards,
Charles
Daniel L. Miller wrote:
Timo Sirainen wrote:
Also the mime structure could be torn apart to store attachments individually - the motivation being single instance storage of large attachments with identical content... Anyway, these seem like very speculative directions...
Yes, this is also something in dbox's far future plans.
Speaking as a pathetic little admin of a small site of 20 users, my needs for replication & scalability are quite minor. However, single-instance storage would be a miracle of biblical proportions.
Has any progress been made on this? Do you have a roadmap for how you plan on implementing it? I don't know if you've considered this at all - this was my first thought:
If you're able to store a message with the attachments separately, then you can come up with an attachment database (not meaning to imply SQL backend). Then after breaking the message up into message + attachments, you scan the attachment database to see if it is already present prior to saving it. This could mean that not only could we save on the huge space wasted by idiots merrily forwarding large attachments to multiple people, but even received mails with embedded graphical signatures would benefit.
It would be interesting to quickly script something in perl (see one of the MIME parsers) to simply scan every email on your system, do an MD5 of each MIME part, then stick this in a dictionary (with the size) and count the number of hits greater than one (i.e. duplicate parts). Count the bytes saved and share the script so we can all have a play.
I do like the idea of single instance storage, but I'm actually willing to bet it makes only a few percent difference in storage cost for the majority of mail servers (I dare say your mileage will vary, but my point is to benchmark it).
I don't mean this as a negative - I nearly scripted this a couple of months back for my own needs and then ran out of time. I think it won't be more than 50 lines of perl, and it would be interesting to see how people's numbers vary.
Ed W
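In the absence of real numbers, the script Ed describes is indeed short. Here is roughly the same idea in Python instead of perl, assuming a maildir-style one-file-per-message layout; MD5 is kept only because it's what was suggested for a quick estimate.

#!/usr/bin/env python3
"""Estimate single-instance-storage savings: hash every MIME part
in a mail store and count the bytes taken up by duplicate parts."""
import email
import hashlib
import os
import sys
from collections import Counter

sizes = {}          # digest -> size of one copy of that part
copies = Counter()  # digest -> how many times the part occurs

for dirpath, _dirs, files in os.walk(sys.argv[1]):
    for name in files:
        with open(os.path.join(dirpath, name), 'rb') as f:
            msg = email.message_from_binary_file(f)
        for part in msg.walk():
            if part.is_multipart():
                continue
            payload = part.get_payload(decode=True) or b''
            digest = hashlib.md5(payload).hexdigest()
            sizes[digest] = len(payload)
            copies[digest] += 1

dup = sum(sizes[d] * (n - 1) for d, n in copies.items() if n > 1)
total = sum(sizes[d] * n for d, n in copies.items())
print('duplicate parts account for %d of %d bytes' % (dup, total))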
Ha! Fooled you! I'm going to reply to the original question instead of SIS!
Timo Sirainen wrote:
Index files are really more like memory dumps. They're already in an optimal format for keeping them in memory, so they can be just mmap()ed and used. Doing some kind of translation to another format would just make it more complex and slower.
Index and mail data is very different. Index data is accessed constantly and it must be very low latency or performance will be horrible. It practically should be in memory in local machine and there shouldn't normally be any network lookups when accessing it.
Ok, I lied. I'm going to start something new.
Do the indexes contain any of the header information? In particular, since I know nothing of the communication between IMAP clients & servers in general, is the information that is shown in typical client mail lists (subject, sender, date, etc.) stored in the indexes? I guess I'm asking if any planned changes will have an impact on retrieving message lists in any way.
-- Daniel
On Aug 12, 2009, at 6:54 PM, Daniel L. Miller wrote:
Do the indexes contain any of the header information?
Yes.
In particular, since I know nothing of the communication between
IMAP clients & servers in general, is the information that is shown
in typical client mail lists (subject, sender, date, etc.) stored in
the indexes?
Yes. Dovecot adds to the cache file those headers that the client requests.
I guess I'm asking if any planned changes will have an impact in
retrieving message lists in any way.
Usually not - unless the client fetches the entire header. Some do, I think, but usually not.
On 10.8.2009, at 20.01, Timo Sirainen wrote:
(3.5. Implement async I/O filesystem backend.)
You know what I found out today? Linux doesn't support async IO for regular buffered files. I had heard there were issues, but I thought it was mainly about some annoying APIs and such. Anyone know if some project has successfully figured out some usable way to do async disk IO? The possibilities seem to be:
a) Use Linux's native AIO, which requires direct-io for files. This *might* not be horribly bad for mail files. After all, same mail is rarely read multiple times. Except when parsing its headers first and then its body. Maybe the process could do some internal buffering?..
I guess no one ever tried my posix_fadvise() patch? The idea was that it would tell the kernel after closing a mail file that it's no longer needed in memory, so kernel could remove it from page cache. I never heard any positive or negative comments about how it affected performance.. http://dovecot.org/patches/1.1/fadvise.diff
b) Use threads, either via some library or implement yourself. Each thread of course uses some extra memory. Also enabling threads causes glibc to start using a thread-safe version of malloc() (I think?), which slows things down (unless that can be avoided, maybe by using clone() directly instead of pthreads?).
c) I read someone's idea about using posix_fadvise() and fincore() functions to somehow make it "kind of work, usually, maybe". I'm not sure if there's a practical way to make them work though. And of course I don't think fincore() has even been accepted by Linus yet.
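For what it's worth, option (b) together with the fadvise idea is easy to prototype outside Dovecot. A Python sketch (the mail path is purely illustrative): blocking reads run in a small thread pool, and POSIX_FADV_DONTNEED is issued after each read so a once-read mail doesn't linger in the page cache.

import os
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=8)  # the "async" workers

def read_mail(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while chunk := os.read(fd, 1 << 16):
            chunks.append(chunk)
        # The fadvise patch's idea: tell the kernel these pages
        # won't be needed again. Not available on every platform.
        if hasattr(os, 'posix_fadvise'):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return b''.join(chunks)
    finally:
        os.close(fd)

future = pool.submit(read_mail, '/var/mail/example')  # hypothetical path
data = future.result()  # a real server would poll, not block here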
Timo Sirainen put forth on 3/10/2010 3:19 PM:
On 10.8.2009, at 20.01, Timo Sirainen wrote:
(3.5. Implement async I/O filesystem backend.)
You know what I found out today? Linux doesn't support async IO for regular buffered files. I had heard there were issues, but I thought it was mainly about some annoying APIs and such. Anyone know if some project has successfully figured out some usable way to do async disk IO? The possibilities seem to be:
a) Use Linux's native AIO, which requires direct-io for files. This *might* not be horribly bad for mail files. After all, same mail is rarely read multiple times. Except when parsing its headers first and then its body. Maybe the process could do some internal buffering?..
I guess no one ever tried my posix_fadvise() patch? The idea was that it would tell the kernel after closing a mail file that it's no longer needed in memory, so kernel could remove it from page cache. I never heard any positive or negative comments about how it affected performance.. http://dovecot.org/patches/1.1/fadvise.diff
b) Use threads, either via some library or implement yourself. Each thread of course uses some extra memory. Also enabling threads causes glibc to start using a thread-safe version of malloc() (I think?), which slows things down (unless that can be avoided, maybe by using clone() directly instead of pthreads?).
c) I read someone's idea about using posix_fadvise() and fincore() functions to somehow make it "kind of work, usually, maybe". I'm not sure if there's a practical way to make them work though. And of course I don't think fincore() has even been accepted by Linus yet.
Considering the extent to which Linus hates O_DIRECT, I would think that if he were a fan of async I/O at all, he'd have pushed its use via the buffer cache. Given that async I/O is implemented via O_DIRECT, I'd say Linus isn't a fan of async I/O either. I've not read anything Linus has written on async I/O, if he even has; I'm merely making an educated guess based on the current implementation of async I/O in Linux.
-- Stan
On 10/03/2010 21:19, Timo Sirainen wrote:
On 10.8.2009, at 20.01, Timo Sirainen wrote:
(3.5. Implement async I/O filesystem backend.)
You know what I found out today? Linux doesn't support async IO for regular buffered files. I had heard there were issues, but I thought it was mainly about some annoying APIs and such. Anyone know if some project has successfully figured out some usable way to do async disk IO? The possibilities seem to be:
a) Use Linux's native AIO, which requires direct-io for files. This *might* not be horribly bad for mail files. After all, same mail is rarely read multiple times. Except when parsing its headers first and then its body. Maybe the process could do some internal buffering?..
I guess no one ever tried my posix_fadvise() patch? The idea was that it would tell the kernel after closing a mail file that it's no longer needed in memory, so kernel could remove it from page cache. I never heard any positive or negative comments about how it affected performance.. http://dovecot.org/patches/1.1/fadvise.diff
b) Use threads, either via some library or implement yourself. Each thread of course uses some extra memory. Also enabling threads causes glibc to start using a thread-safe version of malloc() (I think?), which slows things down (unless that can be avoided, maybe by using clone() directly instead of pthreads?).
c) I read someone's idea about using posix_fadvise() and fincore() functions to somehow make it "kind of work, usually, maybe". I'm not sure if there's a practical way to make them work though. And of course I don't think fincore() has even been accepted by Linus yet.
Perhaps mail this question to the kernel list, stand back and watch it ignite?
Ed
b) Use threads, either via some library or implement yourself. Each thread of course uses some extra memory. Also enabling threads causes glibc to start using a thread-safe version of malloc() (I think?), which slows things down (unless that can be avoided, maybe by using clone() directly instead of pthreads?).
Perhaps libeio (http://software.schmorp.de/pkg/libeio.html) is a good starting point? I don't have any experience with it but it's used by node.js (http://nodejs.org/) for the async I/O stuff.
-Sebastian
participants (17)
- Charles Marcus
- Curtis Maloney
- Daniel L. Miller
- Ed W
- Eric Jon Rostetter
- Eric Rostetter
- ja nein
- Jeff Grossman
- Patrick Nagel
- paulmon
- Robert Schetterer
- Sebastian Färber
- Seth Mattinen
- Stan Hoeppner
- Steffen Kaiser
- Steve
- Timo Sirainen