Hi
- Mail data on the other hand is just written once and usually read maybe once or a couple of times. Caching mail data in memory probably doesn't help all that much. Latency isn't such a horrible issue as long as multiple mails can be fetched at once / in parallel, so there's only a single latency wait.
This logically seems correct. A couple of questions then:
Since latency requirements are low, why did performance drop so much when you previously implemented a very simple MySQL storage backend? I glanced at the code a few weeks ago, and whilst it's surprisingly complicated right now to implement a backend, I was also surprised that a database storage engine "sucked", as I think you phrased it? Possibly that code also placed the indexes in the DB, which could very well kill performance? (Note I'm not arguing SQL storage is a good thing, I just want to understand the latency requirements on the backend.)
I would think that with some care even very high latency storage would be workable, e.g. S3/Gluster/MogileFS? I would love to see a backend using S3 - if nothing else I think it would quickly highlight all the bottlenecks in any design...
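Just to make that concrete, here's a toy sketch (Python, boto3) of what a write-once message store on S3 might look like - the bucket name and key layout are invented, and of course the real backend API is C, so treat it purely as an illustration of where the per-request latency cost sits:

import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "example-mail-store"  # hypothetical bucket name

def save_message(user, mailbox, raw_message):
    """Write each message exactly once under an immutable key."""
    key = "%s/%s/%s" % (user, mailbox, uuid.uuid4().hex)
    s3.put_object(Bucket=BUCKET, Key=key, Body=raw_message)
    return key

def fetch_message(key):
    """One GET per message - this per-request latency is exactly the
    cost such a backend would make visible."""
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()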
- Implement a multi-master filesystem backend for index files. The idea would be that all servers accessing the same mailbox must be talking to each other via the network, and every time something is changed, push the change to the other servers. This is actually very similar to my previous multi-master plan. One of the servers accessing the mailbox would still act as a master and handle conflict resolution and writing indexes to disk more or less often.
Take a look at MogileFS for some ideas here. I doubt it's a great fit, but they certainly need to solve a lot of the same problems - there's a toy sketch of the change-push idea below.
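For what it's worth, the "push every change to the other servers" part might look something like this - the peer list, transport and conflict rule are all made up here, it's only meant to show the shape of the protocol:

import json
import socket

PEERS = [("mail2.example.com", 9000), ("mail3.example.com", 9000)]  # hypothetical

def push_index_change(mailbox, change):
    """Broadcast one index change record to every peer touching the mailbox."""
    record = json.dumps({"mailbox": mailbox, "change": change}).encode()
    for host, port in PEERS:
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(record + b"\n")

def resolve_conflict(local, remote):
    """The elected master applies some deterministic rule - here,
    last-writer-wins on a per-change timestamp."""
    return local if local["ts"] >= remote["ts"] else remote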
- Implement filesystem backend for dbox and permanent index storage using some scalable distributed database, such as maybe Cassandra.
CouchDB? It is just the Lotus Notes database after all, and personally I have built some *amazing* applications using that as the backend. (I just love the concept of Notes - the GUI is another matter...)
Note that CouchDB is interesting in that it is multi-master with "eventual" synchronisation. This potentially has some interesting issues/benefits for offline use - see the sketch below.
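Kicking off that replication between two servers holding the same index database is about this simple - the host and database names are made up, but /_replicate is CouchDB's actual HTTP API:

import requests

def replicate(source, target):
    """Ask the local CouchDB to continuously replicate source into target."""
    resp = requests.post(
        "http://localhost:5984/_replicate",
        json={"source": source, "target": target, "continuous": True},
    )
    resp.raise_for_status()
    return resp.json()

# Run this on each server pointing at the other and you have multi-master;
# CouchDB keeps conflicting revisions around rather than losing them, which
# is what makes the offline case interesting.
replicate("mail_indexes", "http://mail2.example.com:5984/mail_indexes")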
For the filesystem backend, have you looked at the various log-structured filesystems now appearing? Whenever I watch the maildir vs mbox debate I always think that a hybrid is the best solution, because you are optimising for a write-once, read-many situation where you have an increased probability of good cache locality on any given read.
To me this ends up looking like log-structured storage... (which feels like a hybrid of maildir and mbox - rough sketch below)
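Something like this toy layout is what I have in mind - one append-only segment for mbox-like write locality, plus a small fixed-width offset index for maildir-like random reads (the file format is invented):

import struct

SEGMENT = "mailbox.log"
INDEX = "mailbox.idx"  # fixed-width records: (offset, length)

def append_message(raw_message):
    """Write once, sequentially - the operation we optimise for."""
    with open(SEGMENT, "ab") as seg, open(INDEX, "ab") as idx:
        offset = seg.tell()
        seg.write(raw_message)
        idx.write(struct.pack("<QQ", offset, len(raw_message)))
    return offset

def read_message(seq):
    """Random read via the index: one seek into the segment, and nearby
    messages are likely to be in the page cache already."""
    with open(INDEX, "rb") as idx:
        idx.seek(seq * 16)
        offset, length = struct.unpack("<QQ", idx.read(16))
    with open(SEGMENT, "rb") as seg:
        seg.seek(offset)
        return seg.read(length)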
- Scalability, of course. It'll be as scalable as the distributed database being used to store mails.
I would be very interested to see a kind of "where the time goes" benchmark of Dovecot. Have you measured and found that latency in this part accounts for x% of the response time, and that being CPU-bound over here accounts for another y%, etc? E.g. if you deliberately introduce X ms of latency in the index lookups, what does that do to the response time of the system once the cache warms up? What if the latency to the storage backend changes? I would have thought this would help you determine how to scale this thing? (A toy version of the experiment is sketched below.)
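The experiment itself is cheap to mock up - wrap one layer at a time in an artificial delay and see what it does to the total (the wrapped lookup below is a stand-in, not the real index code):

import functools
import time

def with_latency(delay_ms):
    """Inject a fixed artificial delay in front of a storage/index call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            time.sleep(delay_ms / 1000.0)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@with_latency(delay_ms=20)  # pretend the index lives 20 ms away
def index_lookup(uid):      # hypothetical stand-in
    return {"uid": uid, "offset": 0}

start = time.perf_counter()
for uid in range(100):      # warm-cache loop: every hit still pays the delay
    index_lookup(uid)
print("mean response: %.1f ms" % ((time.perf_counter() - start) / 100 * 1000))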
All in all it sounds very interesting. However, a couple of thoughts:
What is the goal?
If the goal is performance, by scaling out across a quantity of servers, then I guess you need to measure it carefully to make sure it actually works? I haven't had the fortune to develop something that big, but the general advice is that scaling out is hard to get right, so assume you made a mistake in your design somewhere... Measure, measure.
If the goal is reliability, then I guess it's prudent to assume that somehow all servers will get out of sync (eventually). It's definitely nice if they are self-repairing as a design goal - cf. the difference between a full sync and shipping logs. I ship logs to run a master-master MySQL pair, but after a crash I use a sync tool (Maatkit) to check that the two servers really are in sync, rather than recreating one of them from fresh. (Sketch of that "check, don't rebuild" idea below.)
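Applied to mailboxes, the same idea might be: compare a cheap per-mailbox digest on both servers and only resync the ones that differ. The paths and digest recipe here are invented purely for illustration:

import hashlib
import os

def mailbox_digest(path):
    """Digest over message filenames and sizes - cheap to compute, and a
    mismatch pinpoints exactly which mailbox diverged after a crash."""
    h = hashlib.sha256()
    for name in sorted(os.listdir(path)):
        size = os.path.getsize(os.path.join(path, name))
        h.update(("%s:%d\n" % (name, size)).encode())
    return h.hexdigest()

def diverged(local_path, remote_digest):
    return mailbox_digest(local_path) != remote_digest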
If the goal is increased storage capacity on commodity hardware, then it needs a useful bunch of tools to manage the replication, make sure there is redundancy, and make it easy to find the required storage. I guess look at MogileFS - if you think you can do better, then at least remember it was quite hard work for them to get to that stage, so doing it again is likely to be non-trivial?
If the goal is making it simpler to build a backend storage engine, then this would be excellent - I find myself wanting to benchmark ideas like S3 or sticking things in a database, but I looked at the API recently and it's going to require a bit of investment to get started - certainly more than a couple of evenings poking around... Hopefully others would write interesting backends; regardless of whether it's sensible to use them in high performance setups, some folks simply want/need to do unusual things...
Finally, I am a bit sad that offline distributed multi-master isn't in the roadmap anymore... :-( My situation is that we have a lot of boats sailing around with intermittent, expensive satellite connections, and the users are fluid and need to get access to their data from land and from different vessels. Currently we build software in-house to make this possible, but it would be fantastic to see more features enabling this on the server side (CouchDB / Lotus Notes is cool...).
Good luck - sounds fun implementing all this anyway!
Ed W