[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem

Timo Sirainen tss at iki.fi
Wed Aug 12 19:12:25 EEST 2009


On Aug 12, 2009, at 11:26 AM, Ed W wrote:

> Hi
>
>> * Mail data on the other hand is just written once and usually read
>> maybe once or a couple of times. Caching mail data in memory probably
>> doesn't help all that much. Latency isn't such a horrible issue as long
>> as multiple mails can be fetched at once / in parallel, so there's only
>> a single latency wait.
>>
>
> This logically seems correct.  Couple of questions then:
>
> 1) Since latency requirements are low, why did performance drop so
> much previously when you implemented a very simple mysql storage
> backend?  I glanced at the code a few weeks ago and whilst it's
> surprisingly complicated right now to implement a backend, I was
> also surprised that a database storage engine "sucked", as I think
> you phrased it?  Possibly the code also placed the indexes in the DB?
> Certainly that could very well kill performance?  (Note I'm not
> arguing sql storage is a good thing, I just want to understand the
> latency requirements for the backend)

Yes, it placed the indexes in SQL too. That's slow. But even without
that, Dovecot's code needs to be changed to access more mails in
parallel before performance can be good for high-latency mail storage.

> 2) I would be thinking that with some care, even very high latency  
> storage would be workable, eg S3/Gluster/MogileFs ?  I would love to  
> see a backend using S3 - If nothing else I think it would quickly  
> highlight all the bottlenecks in any design...

Yes, S3 should be possible. With dbox it could even be used to store
old mails, while keeping new mails in lower-latency storage.

>> 5. Implement filesystem backend for dbox and permanent index storage
>> using some scalable distributed database, such as maybe Cassandra.
>
> CouchDB?  It is just the Lotus Notes database after all, and  
> personally I have built some *amazing* applications using that as  
> the backend. (I just love the concept of Notes - the gui is another  
> matter...)
>
> Note that CouchDB is interesting in that it is multi-master with  
> "eventual" synchronisation.  This potentially has some interesting  
> issues/benefits for offline use

CouchDB seems like it would still be more difficult than necessary to
scale. I really just want something that distributes the load and disk
usage evenly across all servers, allows easily plugging in more
servers, and automatically rebalances the load. With CouchDB it seems
much of that would have to be done manually (or by building scripts to
do it).

> For the filesystem backend have you looked at the various log
> structured filesystems appearing?  Whenever I watch the debate
> between Maildir vs Mailbox I always think that a hybrid is the best
> solution, because you are optimising for a write-once, read-many
> situation, where you have an increased probability of having good
> cache localisation on any given read.
>
> To me this ends up looking like log structured storage... (which
> feels like a hybrid between maildir/mailbox)

Hmm. I don't really see how it looks like log-structured storage. But
you do know that multi-dbox is kind of a maildir/mbox hybrid, right?

>> * Scalability, of course. It'll be as scalable as the distributed
>> database being used to store mails.
>>
>
> I would be very interested to see a kind of "where the time goes"  
> benchmark of dovecot.  Have you measured and found that latency of  
> this part accounts for x% of the response time and CPU bound here is  
> another y%, etc?  eg if you deliberately introduce X ms of latency  
> in the index lookups, what does that do to the response time of the  
> system once the cache warms up?  What about if the response time to  
> the storage backend changes?  I would have thought this would help  
> you determine how to scale this thing?

I haven't really done any explicit benchmarks, but there are a few
reasons why I think low latency for indexes is really important:

  * All commands that access mails in any way need to do an index
lookup first to find the mail.

  * Anything using IMAP UIDs needs to do a binary search on the index
to find the mail.

  * Anything accessing mail metadata needs to do dovecot.index.cache
lookups, often many of them. For example FETCH ENVELOPE does something
like 10 cache lookups for each mail.

  * After each command Dovecot needs to check if there are new mails
by checking if dovecot.index.log has changed.
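The UID binary search mentioned above can be sketched roughly like this
(a hypothetical sketch; Dovecot's real index records carry more fields,
and the struct and function names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative index record: real records also hold flags,
 * cache offsets, etc. Records are sorted by ascending UID. */
struct index_record {
	uint32_t uid;
};

/* Return the 1-based sequence number of the record with the given
 * UID, or 0 if no such UID exists. */
static uint32_t uid_to_seq(const struct index_record *recs,
			   uint32_t count, uint32_t uid)
{
	uint32_t lo = 0, hi = count;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;

		if (recs[mid].uid < uid)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo < count && recs[lo].uid == uid ? lo + 1 : 0;
}
```

Since UIDs are strictly ascending but can have gaps (expunged mails),
the lookup has to be a search rather than simple arithmetic, which is
why index latency matters for every UID-based command.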

I think it's pretty obvious that if any of those lookups had
noticeable latency, performance would soon become pretty horrible. And
the reasons why I think the actual mail storage can live with high
latency:

  * Whenever processing a command, Dovecot knows beforehand what kind
of data it needs. It can quickly go through the index/cache file to
find out what message contents it needs, and then send requests for
all of them immediately. (Or if there are hundreds, maybe always keep
something like 20 queued, or whatever works well.) After the first one
arrives, the rest should already be available for immediate access.

  * That initial latency hit is a bit bad, but it probably isn't
horrible. Gmail IMAP seems to do OK with pretty high latencies.

  * If message data lives in multiple servers, commands that access a  
large number of mails can run faster since the data can be fetched  
from multiple servers in parallel, so there's less disk I/O wait.
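The prefetch idea in the first point above could look something like
this control flow (a sketch with stubbed-out storage calls that only
simulate completion; `issue_fetch()`, `wait_any()` and `MAX_PENDING`
are illustrative names, not Dovecot API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_PENDING 20	/* the "something like 20 queued" from the text */

struct mail_request { uint32_t uid; bool issued; bool done; };

static unsigned in_flight = 0, max_in_flight = 0, processed = 0;

/* Stub: a real backend would start an async fetch here. */
static void issue_fetch(struct mail_request *req)
{
	req->issued = true;
	if (++in_flight > max_in_flight)
		max_in_flight = in_flight;
}

/* Stub: simulate completion of the oldest outstanding request. */
static void wait_any(struct mail_request *reqs, unsigned issued_count)
{
	for (unsigned i = 0; i < issued_count; i++) {
		if (reqs[i].issued && !reqs[i].done) {
			reqs[i].done = true;
			in_flight--;
			return;
		}
	}
}

static void process_mail(struct mail_request *req)
{
	(void)req;
	processed++;
}

/* Keep up to MAX_PENDING fetches outstanding; handle results in order. */
static void fetch_mails_pipelined(struct mail_request *reqs, unsigned count)
{
	unsigned issued = 0, completed = 0;

	while (completed < count) {
		while (issued < count && issued - completed < MAX_PENDING)
			issue_fetch(&reqs[issued++]);
		wait_any(reqs, issued);
		while (completed < issued && reqs[completed].done)
			process_mail(&reqs[completed++]);
	}
}
```

The point of the bounded queue is that only the first mail incurs a
full round-trip wait; the rest overlap their latency with processing.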

And why I don't really care much about CPU bottlenecks: As far as I  
know, there aren't any. CPU load is typically close to 0%.

> All in all sounds very interesting.  However, couple of thoughts:
>
> - What is the goal?
> - If the goal is performance by allowing a scale-out in quantity of  
> servers then I guess you need to measure it carefully to make sure  
> it actually works? I haven't had the fortune to develop something  
> that big, but the general advice is that scaling out is hard to get  
> right, so assume you made a mistake in your design somewhere...  
> Measure, measure

I don't think it's so much about the performance of a single user, but
more about distributing the load more evenly, in an easier way. That's
basically done by outsourcing the problem to the underlying storage
(database).

> - If the goal is reliability then I guess it's prudent to assume  
> that somehow all servers will get out of sync (eventually). It's  
> definitely nice if they are self repairing as a design goal, eg the  
> difference between a full sync and shipping logs (I ship logs to  
> have a master-master mysql server, but if we have a crash then I use  
> a sync program (maatkit) to check the two servers are in sync and  
> avoid recreating one of the servers from fresh)

Yes, resolving conflicts when merging back after a split brain is
something I really want to make work as well as it can. The backend
database can hopefully help here again (by noticing there was a
conflict and allowing the program to resolve it).

> - If the goal is increased storage capacity on commodity hardware
> then it needs a useful bunch of tools to manage the replication and
> make sure there is redundancy and it's easy to find the required
> storage. I guess look at MogileFS; if you think you can do better
> then at least remember it was quite hard work to get to that stage,
> so doing it again is likely to be non-trivial?

This is again something I'm hoping to outsource to the backend database.

> - If the goal were making it simpler to build a backend storage  
> engine then this would be excellent - I find myself wanting to  
> benchmark ideas like S3 or sticking things in a database, but I  
> looked at the API recently and it's going to require a bit of  
> investment to get started - certainly more than a couple of evenings  
> poking around...  Hopefully others would write interesting backends,  
> regardless of whether it's sensible to use them on high performance  
> setups, some folks simply want/need to do unusual things...

This is also one of its goals :) Even if I initially choose a bad
database, it should be fairly easy to implement another backend later.
The backend FS API will be pretty simple. Basically it's going to be:

1) fd = open(path, mode) where mode is one of
- recreate atomically once writing is finished (for recreating  
dovecot.index)
- create atomically, fail if it already exists (for creating new files)
- append to existing file

2) read(fd, offset, size)

3) write(fd, data) - not visible to others until flush() is called

4) flush(fd, &offset) - if offset is specified, it's set to the offset
where the data was actually written.

5) unallocate(fd, size) - to free disk space from beginning of file

Then perhaps some kind of readdir() for listing mailboxes, but I
haven't thought that through yet. Since there's no practical way to do
unallocate() with POSIX, it can be implemented by creating a couple of
different files and rotating them (the way dovecot.index.log works).
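In C, the API sketched above might be declared roughly like this (a
hypothetical sketch; the names, types and error conventions are
illustrative, not the final Dovecot API):

```c
#include <stdint.h>
#include <sys/types.h>

enum fs_open_mode {
	FS_OPEN_MODE_REPLACE,	/* recreate atomically once writing finishes */
	FS_OPEN_MODE_CREATE,	/* create atomically, fail if it exists */
	FS_OPEN_MODE_APPEND	/* append to an existing file */
};

struct fs_file;

/* 1) open */
struct fs_file *fs_open(const char *path, enum fs_open_mode mode);
/* 2) pread()-style read at an explicit offset */
ssize_t fs_read(struct fs_file *file, uint64_t offset,
		void *buf, size_t size);
/* 3) buffered write; not visible to others until fs_flush() */
ssize_t fs_write(struct fs_file *file, const void *data, size_t size);
/* 4) flush; if offset_r != NULL, it's set to where the data landed */
int fs_flush(struct fs_file *file, uint64_t *offset_r);
/* 5) free disk space from the beginning of the file */
int fs_unallocate(struct fs_file *file, uint64_t size);
```

Note that reads take an explicit offset rather than using a seek
position, which makes the interface easier to map onto remote or
object storage backends.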

I'll probably write a more detailed explanation of how this is going
to work at some point, although there are a couple of details I'd
still like to improve.

> - Finally I am a bit sad that offline distributed multi-master isn't  
> in the roadmap anymore... :-(

I think dsync can do that. It'll do two-way syncing between Dovecots
and resolve all conflicts. Is the syncing itself still done over very
high latencies, i.e. something like USB sticks? That's currently not
really working, but it probably wouldn't be too difficult. The
protocol is currently something like:

1. Get list of remote's mailboxes and mails.

2. Sync mailbox

3. Wait for "all messages saved ok" reply. Goto 2.

Steps 2 and 3 are done in both directions at the same time. The extra
wait is there to allow a good way of handling COPY failures (dsync
prefers to copy messages instead of re-sending them, if possible), and
I think there were some other reasons too.
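The per-mailbox loop above could be sketched like this (stubbed I/O
simulating three mailboxes; all names are illustrative, and real dsync
runs this in both directions concurrently):

```c
#include <assert.h>
#include <stdbool.h>

enum sync_state { SYNC_LIST, SYNC_MAILBOX, SYNC_WAIT_ACK, SYNC_DONE };

struct dsync {
	enum sync_state state;
	unsigned mailbox_idx, mailbox_count;
};

/* Stubs standing in for the real protocol I/O. */
static void request_mailbox_list(struct dsync *d) { d->mailbox_count = 3; }
static void sync_mailbox(struct dsync *d, unsigned idx)
{
	(void)d; (void)idx;
}
static bool wait_saved_ok(struct dsync *d) { (void)d; return true; }

static void dsync_run(struct dsync *d)
{
	d->state = SYNC_LIST;
	request_mailbox_list(d);		/* 1. mailboxes and mails */
	d->mailbox_idx = 0;

	while (d->mailbox_idx < d->mailbox_count) {
		d->state = SYNC_MAILBOX;
		sync_mailbox(d, d->mailbox_idx);	/* 2. sync mailbox */
		d->state = SYNC_WAIT_ACK;
		if (!wait_saved_ok(d))	/* 3. "all messages saved ok" */
			return;		/* e.g. handle a COPY failure */
		d->mailbox_idx++;	/* goto 2 */
	}
	d->state = SYNC_DONE;
}
```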

So for USB-stick-like syncing I guess it would need to be something like:

1. Read remote's mailbox list and highestmodseq values from a file.

2. Write changes based on the modseqs to a file, saving each mail
separately instead of using COPY. (Or perhaps it could do some kind of
COPY fallback: send all possible COPYs followed by a SAVE. It wouldn't
reduce the traffic, but it could reduce disk space if copying can be
done by e.g. hard linking.)

3. Move the USB stick to the other machine.

4. The other machine reads the file, applies the changes, and saves
the mailbox list and highest-modseq values to a file.
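The change selection in step 2 could be sketched as follows (a
hypothetical sketch; the struct and field names are illustrative, not
dsync's actual data structures):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative per-mail change record. */
struct mail_change {
	uint32_t uid;
	uint64_t modseq;
};

/* Copy into out[] every change the remote hasn't seen yet, i.e. with
 * modseq above the remote's last known highestmodseq; returns how
 * many were selected. out[] must have room for count entries. */
static size_t export_changes(const struct mail_change *mails, size_t count,
			     uint64_t remote_highestmodseq,
			     struct mail_change *out)
{
	size_t n = 0;

	for (size_t i = 0; i < count; i++) {
		if (mails[i].modseq > remote_highestmodseq)
			out[n++] = mails[i];
	}
	return n;
}
```

Because only the single highestmodseq value has to survive the round
trip on the stick, each side can independently work out what the other
is missing without a live connection.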


