[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem

Wed Aug 12 18:26:25 EEST 2009

Hi

>  * Mail data on the other hand is just written once and usually read
> maybe once or a couple of times. Caching mail data in memory probably
> doesn't help all that much. Latency isn't such a horrible issue as long
> as multiple mails can be fetched at once / in parallel, so there's only
> a single latency wait.
>   

This logically seems correct.  Couple of questions then:

1) Since latency requirements are low, why did performance drop so much 
previously when you implemented a very simple mysql storage backend?  I 
glanced at the code a few weeks ago and whilst it's surprisingly 
complicated right now to implement a backend, I was also surprised that 
a database storage engine "sucked" I think you phrased it? Possibly the 
code also placed the indexes on the DB? Certainly this could very well 
kill performance?  (Note I'm not arguing sql storage is a good thing, I 
just want to understand the latency to backend requirements)

2) I would be thinking that with some care, even very high latency 
storage would be workable, eg S3/Gluster/MogileFs ?  I would love to see 
a backend using S3 - If nothing else I think it would quickly highlight 
all the bottlenecks in any design...

> 4. Implement a multi-master filesystem backend for index files. The idea
> would be that all servers accessing the same mailbox must be talking to
> each others via network and every time something is changed, push the
> change to other servers. This is actually very similar to my previous
> multi-master plan. One of the servers accessing the mailbox would still
> act as a master and handle conflict resolution and writing indexes to
> disk more or less often.
>   

Take a look at Mogilefs for some ideas here.  I doubt it's a great fit, 
but they certainly need to solve a lot of the same problems

> 5. Implement filesystem backend for dbox and permanent index storage
> using some scalable distributed database, such as maybe Cassandra.

CouchDB?  It is just the Lotus Notes database after all, and personally 
I have built some *amazing* applications using that as the backend. (I 
just love the concept of Notes - the gui is another matter...)

Note that CouchDB is interesting in that it is multi-master with 
"eventual" synchronisation.  This potentially has some interesting 
issues/benefits for offline use

For the filesystem backend have you looked at the various log structured 
filesystems appearing?  Whenever I watch the debate between Maildir vs 
Mailbox I always think that a hybrid is the best solution because you 
are optimising for a write one, read many situation, where you have an 
increased probability of having good cache localisation on any given read.

To me this ends up looking like log structured storage... (which feels 
like a hybrid between maildir/mailbox)

>  * Scalability, of course. It'll be as scalable as the distributed
> database being used to store mails.
>   

I would be very interested to see a kind of "where the time goes" 
benchmark of dovecot.  Have you measured and found that latency of this 
part accounts for x% of the response time and CPU bound here is another 
y%, etc?  eg if you deliberately introduce X ms of latency in the index 
lookups, what does that do to the response time of the system once the 
cache warms up?  What about if the response time to the storage backend 
changes?  I would have thought this would help you determine how to 
scale this thing?

All in all sounds very interesting.  However, couple of thoughts:

- What is the goal?
- If the goal is performance by allowing a scale-out in quantity of 
servers then I guess you need to measure it carefully to make sure it 
actually works? I haven't had the fortune to develop something that big, 
but the general advice is that scaling out is hard to get right, so 
assume you made a mistake in your design somewhere... Measure, measure
- If the goal is reliability then I guess it's prudent to assume that 
somehow all servers will get out of sync (eventually). It's definitely 
nice if they are self repairing as a design goal, eg the difference 
between a full sync and shipping logs (I ship logs to have a 
master-master mysql server, but if we have a crash then I use a sync 
program (maatkit) to check the two servers are in sync and avoid 
recreating one of the servers from fresh)
- If the goal is increased storage capacity on commodity hardware then 
it needs a useful bunch of tools to manage the replication and make sure 
there is redundancy and it's easy to find the required storage. I guess 
look at Mogilefs, if you think you can do better then at least remember 
it was quite hard work to get to that stage, so doing it again is likely 
to be non trivial?
- If the goal were making it simpler to build a backend storage engine 
then this would be excellent - I find myself wanting to benchmark ideas 
like S3 or sticking things in a database, but I looked at the API 
recently and it's going to require a bit of investment to get started - 
certainly more than a couple of evenings poking around...  Hopefully 
others would write interesting backends, regardless of whether it's 
sensible to use them on high performance setups, some folks simply 
want/need to do unusual things...

- Finally I am a bit sad that offline distributed multi-master isn't in 
the roadmap anymore... :-(  - My situation is we have a lot of boats 
boating around with intermittent expensive satellite connections and the 
users are fluid and need to get access to their data from land and 
different vessels.  Currently we build software inhouse to make this 
possible, but it would be fantastic to see more features enabling this 
on the server side (CouchDB / Lotus Notes is cool...)

Good luck - sounds fun implementing all this anyway!

Ed W