[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem

Ed W lists at wildgooses.com
Wed Aug 12 19:46:29 EEST 2009



>> 1) Since latency requirements are low, why did performance drop so 
>> much previously when you implemented a very simple mysql storage 
>> backend?  I glanced at the code a few weeks ago and whilst it's 
>> surprisingly complicated right now to implement a backend, I was also 
>> surprised that a database storage engine "sucked" I think you phrased 
>> it? Possibly the code also placed the indexes on the DB? Certainly 
>> this could very well kill performance?  (Note I'm not arguing sql 
>> storage is a good thing, I just want to understand the latency to 
>> backend requirements)
>
> Yes, it placed indexes also to SQL. That's slow. But even without it, 
> Dovecot code needs to be changed to access more mails in parallel 
> before the performance can be good for high-latency mail storages.

My expectation then is that with local indexes and SQL message storage, 
performance should be very reasonable for a large class of users... 
(though other problems perhaps arise)
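To make that split concrete, here is a minimal, hypothetical sketch (not Dovecot's actual backend code) of keeping only raw message bodies in SQL while index data stays in local files - the database is touched only on a body fetch:

```python
import sqlite3

# Hypothetical sketch: message bodies live in SQL, while index data
# (flags, offsets, cache) would stay in fast local files.  Only the
# raw message fetch ever touches the database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        mailbox TEXT    NOT NULL,
        uid     INTEGER NOT NULL,
        body    BLOB    NOT NULL,
        PRIMARY KEY (mailbox, uid)
    )
""")

def store_message(mailbox, uid, body):
    conn.execute("INSERT INTO messages VALUES (?, ?, ?)",
                 (mailbox, uid, body))
    conn.commit()

def fetch_message(mailbox, uid):
    row = conn.execute(
        "SELECT body FROM messages WHERE mailbox = ? AND uid = ?",
        (mailbox, uid)).fetchone()
    return row[0] if row else None

store_message("INBOX", 1, b"Subject: hi\r\n\r\nhello")
print(fetch_message("INBOX", 1))
```

The point of the split is that the latency-sensitive operations (flag lookups, envelope listing) never pay the round trip; only the comparatively rare full-body fetch does.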


>> 2) I would be thinking that with some care, even very high latency 
>> storage would be workable, eg S3/Gluster/MogileFs ?  I would love to 
>> see a backend using S3 - If nothing else I think it would quickly 
>> highlight all the bottlenecks in any design...
>
> Yes, S3 should be possible. With dbox it could even be used to store 
> the old mails and keep new mails in lower latency storage.

Mogile doesn't handle S3, but I always thought it would be terrific to 
be able to keep one copy of your data on fast local storage while using 
slower (sometimes cheaper) storage for backups or less valuable data 
(e.g. older messages), i.e. replicating some data to other storage boxes


> CouchDB seems like it would still be more difficult than necessary to 
> scale. I'd really just want something that distributes the load and 
> disk usage evenly across all servers and allows easily plugging in 
> more servers and it automatically rebalances the load. CouchDB seems 
> like much of that would have to be done manually (or building scripts 
> to do it).

Ah, fair enough - I thought that its being massively multi-master would 
allow simply querying different machines for different users.  Not a 
perfect scale-out, but good enough for a whole class of requirements...

>> For the filesystem backend have you looked at the various log 
>> structured filesystems appearing?  Whenever I watch the debate 
>> between Maildir vs Mailbox I always think that a hybrid is the best 
>> solution because you are optimising for a write one, read many 
>> situation, where you have an increased probability of having good 
>> cache localisation on any given read.
>>
>> To me this ends up looking like log structured storage... (which 
>> feels like a hybrid between maildir/mailbox)
>
> Hmm. I don't really see how it looks like log structured storage.. But 
> you do know that multi-dbox is kind of a maildir/mbox hybrid, right?

Well, the access is largely append-only, with some deletes and churn at 
the writing end; the older storage stays mostly static, with much longer 
gaps between deletes (and extremely infrequent edits)

So maildir is really optimised for deletes, but improves random access 
for a subset of operations.  Mbox is optimised for writes and seems 
generally fast for most operations except deletes (people do worry about 
having a lot of eggs in one basket, but I think this is really a symptom 
of other problems at work).  Mbox also packs small messages better and 
probably has improved cache locality on certain read patterns

So one obvious hybrid would be an mbox-type structure which splits 
messages into variable-sized sub-mailboxes based on various criteria, 
perhaps including message age, type of message, or message size...?  The 
rapid write/delete activity would happen at the head, perhaps even in a 
maildir layout, and the storage would gradually shift into larger, ever 
more compressed mailboxes as the age and access frequency decline.
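The bucketing rule for such a hybrid could be very simple. A hypothetical sketch (thresholds and bucket names are illustrative, not from dbox):

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the hybrid layout described above: choose a
# storage bucket for a message based on its age.  New mail lands in a
# maildir-like "head" with cheap per-message deletes; older mail
# migrates into ever larger packed mailbox files.
def storage_bucket(received, now):
    age = now - received
    if age < timedelta(days=1):
        return "head"          # one file per message, cheap deletes
    if age < timedelta(days=30):
        return "pack-small"    # small packed mailboxes
    return "pack-archive"      # large, compressed, rarely touched

now = datetime(2009, 8, 12)
print(storage_bucket(now - timedelta(hours=2), now))   # head
print(storage_bucket(now - timedelta(days=10), now))   # pack-small
print(storage_bucket(now - timedelta(days=365), now))  # pack-archive
```

A background task would then periodically re-bucket messages as they age, which is where the "gradually becomes more compressed" behaviour comes from.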

Perhaps this is exactly dbox?

It would also be interesting to consider separating message headers from 
body content: keep heavy localisation of the message headers, with 
slower, higher-latency access to the message body.  Would this improve 
access speeds in general?  The MIME structure could also be torn apart 
to store attachments individually - the motivation being single-instance 
storage of large attachments with identical content...  Anyway, these 
seem like very speculative directions...
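Speculative or not, the single-instance idea reduces to content-addressed storage with reference counting. A hypothetical in-memory sketch:

```python
import hashlib

# Hypothetical sketch of single-instance attachment storage: each
# attachment torn out of the MIME structure is stored once, keyed by a
# digest of its content, with a reference count so the blob is only
# deleted when the last message referencing it is expunged.
blobs = {}      # sha256 hex -> bytes
refcounts = {}  # sha256 hex -> int

def store_attachment(data):
    key = hashlib.sha256(data).hexdigest()
    if key not in blobs:
        blobs[key] = data              # first copy: actually store it
    refcounts[key] = refcounts.get(key, 0) + 1
    return key

def release_attachment(key):
    refcounts[key] -= 1
    if refcounts[key] == 0:            # last reference gone
        del blobs[key], refcounts[key]

k1 = store_attachment(b"big pdf bytes")
k2 = store_attachment(b"big pdf bytes")  # same content, stored once
assert k1 == k2 and len(blobs) == 1
```

The message record would then hold the digest keys in place of the attachment bodies.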

>
> I haven't really done any explicit benchmarks, but there are a few 
> reasons why I think low-latency for indexes is really important:

I think low latency for indexes is a given.  You appear to have 
architected the system so that all responses are delivered from the 
index, and barring an increase in index efficiency, the remaining time 
is spent on the initial generation and maintenance of those indexes.  I 
would have thought that, bar downloading an entire INBOX, the access 
time of individual mails was very much secondary?


>> - If the goal is performance by allowing a scale-out in quantity of 
>> servers then I guess you need to measure it carefully to make sure it 
>> actually works? I haven't had the fortune to develop something that 
>> big, but the general advice is that scaling out is hard to get right, 
>> so assume you made a mistake in your design somewhere... Measure, 
>> measure
>
> I don't think it's all that much about performance of a single user, 
> but more about distributing the load more evenly in an easier way. 
> That's basically done by outsourcing the problem to the underlying 
> storage (database).

So perhaps something like CouchDB can work then?  One user localises per 
replica and you keep reusing that replica?

> Yes, resolving conflicts due to split brain merging back is something 
> I really want to make work as well as it can. The backend database can 
> hopefully again help here (by noticing there was a conflict and 
> allowing the program to resolve it).

In general, conflict resolution is thrown back to the application, so 
this is likely to become a Dovecot problem.  It seems that the general 
class of problem is too hard to solve at the storage side
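For mail, at least the per-message flag state merges quite naturally at the application level. A hypothetical sketch of one simple policy (union of flags, so a flag set on either side of the split brain survives - this is my illustration, not dsync's actual algorithm):

```python
# Hypothetical sketch of application-level conflict resolution after a
# split brain: merge per-message flag sets from two replicas by taking
# the union over all UIDs seen on either side.
def merge_flags(replica_a, replica_b):
    merged = {}
    for uid in set(replica_a) | set(replica_b):
        merged[uid] = replica_a.get(uid, set()) | replica_b.get(uid, set())
    return merged

a = {1: {"\\Seen"}, 2: set()}
b = {1: {"\\Answered"}, 3: {"\\Seen"}}
merged = merge_flags(a, b)
# uid 1 ends up with both \Seen and \Answered; uid 3 (only on b) is kept
```

Deletions are the hard part - a union can't distinguish "expunged on A" from "delivered only to B" without extra bookkeeping, which is exactly why the storage layer alone can't resolve this.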

> This is also one of its goals :) Even if I make a mistake in choosing 
> a bad database first, it should be somewhat easy to implement another 
> backend again. The backend FS API will be pretty simple. Basically 
> it's going to be:

I wouldn't get too held back by POSIX semantics.  For sure they are 
memorable, but consider that transactions are the key to any kind of 
database performance improvement, and make sure you can batch operations 
together to make good use of the backend.  Your "flush" command seems to 
be the implicit end of a transaction, but give some thought to the case 
of a very slow backend (e.g. S3): the backend might want the flexibility 
to mark something "kind of done" while uploading for 30 seconds in the 
background, then mark it properly done once S3 actually acks that the 
data is saved.
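That two-phase flush could look something like the following hypothetical sketch (`WriteBehindStore` and its states are my invention, purely to illustrate the "kind of done" idea):

```python
import threading
import time

# Hypothetical sketch of a two-phase flush: flush() returns once the
# write is recorded as "pending" locally, and a background worker marks
# it "committed" only after the slow backend (e.g. S3) acks the upload.
class WriteBehindStore:
    def __init__(self, slow_put):
        self.slow_put = slow_put          # blocking backend upload call
        self.state = {}                   # key -> "pending" | "committed"
        self.lock = threading.Lock()

    def flush(self, key, data):
        with self.lock:
            self.state[key] = "pending"   # "kind of done"
        worker = threading.Thread(target=self._upload, args=(key, data))
        worker.start()
        return worker                     # caller may join() if it cares

    def _upload(self, key, data):
        self.slow_put(key, data)          # may take many seconds
        with self.lock:
            self.state[key] = "committed" # backend acked the write

# Stand-in for a slow S3 PUT.
store = WriteBehindStore(lambda key, data: time.sleep(0.05))
worker = store.flush("msg-1", b"mail body")
print(store.state["msg-1"])  # still "pending" while the upload runs
worker.join()
print(store.state["msg-1"])  # "committed" once the backend has acked
```

The FS API would then only need to expose the state transition, leaving the caller free to batch many pending writes into one backend round trip.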


>> - Finally I am a bit sad that offline distributed multi-master isn't 
>> in the roadmap anymore... :-(
>
> I think dsync can do that. It'll do two-way syncing between Dovecots 
> and resolves all conflicts. Is the syncing itself still done with very 
> high latencies, i.e. something like USB sticks? That's currently not 
> really working, but it probably wouldn't be too difficult.

What is dsync?  There is a dsync.org, which is some kind of directory 
synchroniser?

Aha, google suggests that I might have missed an email from you 
recently... Will read up...

OK, this sounds like a better implementation of the kind of thing we are 
building here - likely this is the way ahead!

Cheers

Ed W

