On Aug 12, 2009, at 11:26 AM, Ed W wrote:
Hi
- Mail data on the other hand is just written once and usually read maybe once or a couple of times. Caching mail data in memory probably doesn't help all that much. Latency isn't such a horrible issue as
long as multiple mails can be fetched at once / in parallel, so there's
only a single latency wait.

This logically seems correct. A couple of questions then:
- Since latency requirements are low, why did performance drop so
much previously when you implemented a very simple mysql storage
backend? I glanced at the code a few weeks ago and whilst it's
surprisingly complicated right now to implement a backend, I was
also surprised that a database storage engine "sucked", as I think you
phrased it? Possibly the code also placed the indexes in the DB?
Certainly this could very well kill performance? (Note I'm not
arguing sql storage is a good thing, I just want to understand the
latency to backend requirements)
Yes, it placed the indexes in SQL too. That's slow. But even without
that, Dovecot's code needs to be changed to access more mails in
parallel before performance can be good for high-latency mail storage.
- I would be thinking that with some care, even very high latency
storage would be workable, eg S3/Gluster/MogileFs ? I would love to
see a backend using S3 - If nothing else I think it would quickly
highlight all the bottlenecks in any design...
Yes, S3 should be possible. With dbox it could even be used to store
the old mails and keep new mails in lower latency storage.
- Implement filesystem backend for dbox and permanent index storage using some scalable distributed database, such as maybe Cassandra.
CouchDB? It is just the Lotus Notes database after all, and
personally I have built some *amazing* applications using that as
the backend. (I just love the concept of Notes - the gui is another
matter...)

Note that CouchDB is interesting in that it is multi-master with
"eventual" synchronisation. This potentially has some interesting
issues/benefits for offline use.
CouchDB seems like it would still be more difficult than necessary to
scale. I'd really just want something that distributes the load and
disk usage evenly across all servers and allows easily plugging in
more servers, rebalancing the load automatically. With CouchDB it
seems much of that would have to be done manually (or by building
scripts to do it).
For the filesystem backend have you looked at the various log
structured filesystems appearing? Whenever I watch the debate
between Maildir vs Mailbox I always think that a hybrid is the best
solution, because you are optimising for a write-once, read-many
situation, where you have an increased probability of good cache
localisation on any given read.

To me this ends up looking like log structured storage... (which
feels like a hybrid between maildir/mailbox)
Hmm. I don't really see how it looks like log structured storage... But
you do know that multi-dbox is kind of a maildir/mbox hybrid, right?
- Scalability, of course. It'll be as scalable as the distributed database being used to store mails.
I would be very interested to see a kind of "where the time goes"
benchmark of dovecot. Have you measured and found that latency of
this part accounts for x% of the response time and CPU bound here is
another y%, etc? eg if you deliberately introduce X ms of latency
in the index lookups, what does that do to the response time of the
system once the cache warms up? What about if the response time to
the storage backend changes? I would have thought this would help
you determine how to scale this thing?
I haven't really done any explicit benchmarks, but there are a few
reasons why I think low-latency for indexes is really important:
- All commands that access mails in any way first need to do an index
lookup to find the mail.
- Anything using IMAP UIDs needs to do a binary search on the index to
find the mail.
- Anything accessing mail metadata needs to do dovecot.index.cache
lookups, often many of them. For example, FETCH ENVELOPE does something
like 10 cache lookups for each mail.
- After each command Dovecot needs to check if there are new mails by
checking if dovecot.index.log has changed.
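The UID lookup mentioned above is just a binary search over the index's
ascending UID column. A minimal sketch, assuming a plain sorted list as a
stand-in for the index (Dovecot's real index is a binary mmap'ed file,
not this format):

```python
import bisect

def uid_to_seq(uids, uid):
    """Map an IMAP UID to its 1-based sequence number, or 0 if not found.

    `uids` stands in for the index's UID column, which is always kept
    in ascending order; this is an illustration, not Dovecot's format.
    """
    i = bisect.bisect_left(uids, uid)
    if i < len(uids) and uids[i] == uid:
        return i + 1
    return 0

print(uid_to_seq([1, 4, 5, 9, 12], 5))  # → 3: the third mail in the mailbox
print(uid_to_seq([1, 4, 5, 9, 12], 7))  # → 0: no mail with that UID
```

Every lookup is O(log n) in memory, which is why latency on the index
file itself hurts so much more than latency on the mail data.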
I think it's pretty obvious that if any of those lookups had latency
the performance would soon become pretty horrible. And the reasons why
I think the actual mail storage can live with high latency:
- Whenever processing a command, Dovecot knows beforehand what kind
of data it needs. It can quickly go through the index/cache file to find
out what message contents it needs, and then send requests for all of
them immediately. (Or if there are hundreds, maybe always keep
something like 20 queued, or whatever is good.) After the first one
has arrived, the rest should already be available for immediate access.
- That first initial latency hit is a bit bad, but it probably isn't
horrible. Gmail IMAP seems to do OK with pretty high latencies...
- If message data lives on multiple servers, commands that access a
large number of mails can run faster, since the data can be fetched
from multiple servers in parallel, so there's less disk I/O wait.
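The "keep something like 20 queued" idea above amounts to a sliding
prefetch window. A sketch, assuming a hypothetical asynchronous fetch
function that returns future-like objects (the names here are mine, not
Dovecot's):

```python
from collections import deque

def fetch_all(storage_fetch_async, wanted_mails, window=20):
    """Yield mail contents in order, keeping up to `window` fetches in flight.

    `storage_fetch_async` is a hypothetical function that starts a fetch
    and returns a future-like object with a .result() method.
    """
    pending = deque()
    mails = iter(wanted_mails)
    # Fill the initial window of in-flight requests.
    for mail in mails:
        pending.append(storage_fetch_async(mail))
        if len(pending) >= window:
            break
    for mail in mails:
        # Wait only for the oldest request; the others overlap the latency.
        yield pending.popleft().result()
        pending.append(storage_fetch_async(mail))
    # Drain whatever is still in flight.
    while pending:
        yield pending.popleft().result()
```

With a thread pool, `storage_fetch_async` could be e.g.
`lambda m: executor.submit(fetch_one, m)`; only the first `.result()`
pays a full round-trip latency, the rest overlap.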
And why I don't really care much about CPU bottlenecks: As far as I
know, there aren't any. CPU load is typically close to 0%.
All in all this sounds very interesting. However, a couple of thoughts:
- What is the goal?
- If the goal is performance by allowing a scale-out in quantity of
servers then I guess you need to measure it carefully to make sure
it actually works? I haven't had the fortune to develop something
that big, but the general advice is that scaling out is hard to get
right, so assume you made a mistake in your design somewhere...
Measure, measure
I don't think it's all that much about performance of a single user,
but more about distributing the load more evenly in an easier way.
That's basically done by outsourcing the problem to the underlying
storage (database).
- If the goal is reliability then I guess it's prudent to assume
that somehow all servers will get out of sync (eventually). It's
definitely nice if they are self repairing as a design goal, eg the
difference between a full sync and shipping logs (I ship logs to
have a master-master mysql server, but if we have a crash then I use
a sync program (maatkit) to check the two servers are in sync and
avoid recreating one of the servers from fresh)
Yes, resolving conflicts due to split brain merging back is something
I really want to make work as well as it can. The backend database can
hopefully again help here (by noticing there was a conflict and
allowing the program to resolve it).
- If the goal is increased storage capacity on commodity hardware
then it needs a useful bunch of tools to manage the replication and
make sure there is redundancy and it's easy to find the required
storage. I guess look at MogileFS; if you think you can do better,
then at least remember it was quite hard work to get to that stage,
so doing it again is likely to be non-trivial?
This is again something I'm hoping to outsource to the backend database.
- If the goal were making it simpler to build a backend storage
engine then this would be excellent - I find myself wanting to
benchmark ideas like S3 or sticking things in a database, but I
looked at the API recently and it's going to require a bit of
investment to get started - certainly more than a couple of evenings
poking around... Hopefully others would write interesting backends,
regardless of whether it's sensible to use them on high performance
setups, some folks simply want/need to do unusual things...
This is also one of its goals :) Even if my first choice of database
turns out to be a bad one, it should be somewhat easy to implement
another backend. The backend FS API will be pretty simple. Basically
it's going to be:
- fd = open(path, mode), where mode is one of:
  - recreate atomically once writing is finished (for recreating
    dovecot.index)
  - create atomically, fail if it already exists (for creating new files)
  - append to existing file
- read(fd, offset, size)
- write(fd, data) - not visible to others until flush() is called
- flush(fd, &offset) - if offset is specified, it's set to the offset
  where the data was actually written
- unallocate(fd, size) - to free disk space from the beginning of the file
Then perhaps some kind of readdir() for listing mailboxes, but I
haven't thought of that yet. Since there's no practical way to do
unallocate() with POSIX, it can be done by creating a couple of
different files and rotating them (the way dovecot.index.log is done).
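To make the semantics concrete, here is an in-memory toy implementation
of that API. The mode names and behaviors are my reading of the list
above, not a real Dovecot interface:

```python
# In-memory toy of the proposed backend FS API; names and semantics are
# paraphrased from the description above, not an actual Dovecot API.
MODE_CREATE = "create"    # create atomically, fail if the file exists
MODE_APPEND = "append"    # append to an existing (or new) file

class MemBackendFS:
    def __init__(self):
        self.files = {}       # path -> flushed contents
        self._bufs = {}       # fd -> writes not yet visible to readers

    def open(self, path, mode):
        if mode == MODE_CREATE and path in self.files:
            raise FileExistsError(path)
        self.files.setdefault(path, bytearray())
        fd = path             # use the path itself as the handle
        self._bufs[fd] = bytearray()
        return fd

    def read(self, fd, offset, size):
        return bytes(self.files[fd][offset:offset + size])

    def write(self, fd, data):
        # Buffered: not visible to others until flush() is called.
        self._bufs[fd] += data

    def flush(self, fd):
        # Returns the offset the buffered data was actually written to.
        offset = len(self.files[fd])
        self.files[fd] += self._bufs[fd]
        self._bufs[fd] = bytearray()
        return offset

    def unallocate(self, fd, size):
        # Free space from the beginning of the file. POSIX can't do this,
        # hence the file-rotation workaround mentioned above; in memory
        # we can simply drop the prefix.
        del self.files[fd][:size]
```

The flush()-returns-offset detail is what lets several writers append
records concurrently without coordinating offsets beforehand.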
I'll probably write a more detailed explanation how this is going to
work at some point. Although there are a couple of details I'd still
like to improve.
- Finally I am a bit sad that offline distributed multi-master isn't
in the roadmap anymore... :-(
I think dsync can do that. It'll do two-way syncing between Dovecots
and resolves all conflicts. Or did you mean the syncing itself is done
with very high latencies, i.e. over something like USB sticks? That's
currently not really working, but it probably wouldn't be too difficult. The
protocol is currently something like:
1. Get a list of the remote's mailboxes and mails.
2. Sync a mailbox.
3. Wait for an "all messages saved ok" reply. Goto 2.
Steps 2 and 3 are done in both directions at the same time. The extra
wait there is to allow a good way of handling COPY failures (dsync
prefers to copy messages instead of re-sending them, if possible), and
I think there were some other reasons too.
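A sketch of that loop for one direction; the object and method names are
illustrative, not dsync's actual wire protocol:

```python
# Illustrative sketch of the sync loop above. In the real protocol the
# sync/wait steps run in both directions concurrently; this shows only
# one direction, and the method names are made up.
def sync_mailboxes(remote, local):
    mailboxes = remote.list_mailboxes()          # step 1
    for mailbox in mailboxes:
        local.sync_mailbox(mailbox)              # step 2: push changes
        reply = remote.wait_reply()              # step 3: don't move on
        if reply != "all messages saved ok":     # until saves are confirmed
            raise RuntimeError("sync failed for %s" % mailbox)
```

The wait in step 3 is what gives the sender a chance to fall back from a
failed COPY to re-sending the message before the next mailbox starts.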
So for a USB-stick-like sync I guess it would need to be something like:
1. Read the remote's mailbox list and highestmodseq values from a file.
2. Write changes based on the modseqs to the file, saving each mail
separately instead of using copy. (Or perhaps it could do some kind of
COPY fallback: send all possible COPYs followed by SAVEs. That wouldn't
reduce the traffic, but it could reduce disk space if copying can be
done by e.g. hard linking.)
3. Move the USB stick to the other machine.
4. It reads the file, applies the changes, and saves the mailbox list
and highestmodseq values back to the file.
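A sketch of that file-based exchange, with a hypothetical JSON file
format and store interface (dsync's actual format would differ):

```python
import json

# Sketch of the USB-stick exchange above. The JSON file format and the
# store methods (state, changes_since, apply) are all hypothetical.
def export_changes(store, stick_path):
    # Steps 1-2: read the remote's highestmodseq values from the stick,
    # then write back our changes since those modseqs, saving each mail
    # in full instead of relying on COPY.
    with open(stick_path) as f:
        remote_state = json.load(f)              # mailbox -> highestmodseq
    payload = {"state": store.state(),
               "changes": store.changes_since(remote_state)}
    with open(stick_path, "w") as f:
        json.dump(payload, f)

def import_changes(store, stick_path):
    # Step 4, on the other machine: apply the changes and record the
    # sender's state so the next export only ships newer changes.
    with open(stick_path) as f:
        payload = json.load(f)
    store.apply(payload["changes"])
    with open(stick_path, "w") as f:
        json.dump(payload["state"], f)
```

Since each side only ever ships changes above the other side's recorded
highestmodseq, repeating an exchange is harmless, which is what you want
when the "transport" is a stick that may be carried back and forth.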