librmb: Mail storage on RADOS with Dovecot

Peter Mauritius p.mauritius at tallence.com
Mon Sep 25 18:05:02 EEST 2017


Hi Timo,

I am one of the authors of the software Wido announced in his mail. First, I'd like to say that Dovecot is a wonderful piece of software; thank you for it. I would like to give some explanations regarding the design we chose.

From: Timo Sirainen <tss at iki.fi>
Reply-To: Dovecot Mailing List <dovecot at dovecot.org>
Date: 24 September 2017 at 02:43:44
To: Dovecot Mailing List <dovecot at dovecot.org>
Subject: Re: librmb: Mail storage on RADOS with Dovecot

It would have been nicer if RADOS support was implemented as a lib-fs driver, and the fs-API had been used all over the place elsewhere. So 1) LibRadosMailBox wouldn't have been relying so much on RADOS specifically and 2) fs-rados could have been used for other purposes. There are already fs-dict and dict-fs drivers, so the RADOS dict driver may not have been necessary to implement if fs-rados was implemented instead (although I didn't check it closely enough to verify). (We've had fs-rados on our TODO list for a while also.)

Actually I considered using the fs-api to build a RADOS driver. But I did not follow that path:

The dict-fs mapping is quite simplistic. For example, I would not be able to use RADOS read/write operations to batch requests or to model the dictionary transactions. There is also no async support if you hide the RADOS dictionary behind an fs-api module, which would make it harder to use dict-rados behind the dict-proxy. Using dict-rados in the dict-proxy would help to lower the price you have to pay for the process model Dovecot relies on so heavily.
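To make this concrete, here is a minimal librados C++ sketch of what I mean by batching and async submission. It is an illustration only, not our dict-rados code; the pool name, object id and keys are invented.

// Sketch: collect several dictionary-style K/V updates into one RADOS
// operation and submit it asynchronously. All names are hypothetical.
#include <rados/librados.hpp>
#include <map>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init(nullptr);                  // default client id
  cluster.conf_read_file(nullptr);        // ceph.conf from the usual places
  cluster.connect();

  librados::IoCtx io;
  cluster.ioctx_create("mail_dict", io);  // hypothetical pool name

  // Several updates of one dictionary transaction, batched together.
  std::map<std::string, librados::bufferlist> kv;
  kv["priv/quota/storage"].append(std::string("123456"));
  kv["priv/quota/messages"].append(std::string("42"));

  librados::ObjectWriteOperation op;
  op.omap_set(kv);

  // Submit without blocking; the completion fires when the OSDs ack.
  librados::AioCompletion *c = librados::Rados::aio_create_completion();
  io.aio_operate("u-12345-dict", c, &op); // hypothetical object id
  c->wait_for_complete();                 // a real daemon would stay async here
  int r = c->get_return_value();
  c->release();
  return r;
}

Behind the fs-api such a transaction falls apart into individual file operations, and the asynchronous submission is lost.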

Using an fs-rados module behind a storage module, let's say sdbox, would IMO not fit our goals. We planned to store mails as RADOS objects and their (immutable) metadata as RADOS omap K/V. We want to be able to access the objects without Dovecot, which is not possible if RADOS is hidden behind an fs-rados module: the format of the stored objects would be different and would depend on the storage module sitting in front of fs-rados.
Another reason is that at the fs level the operations are too decomposed. As with the dictionaries, we would not have any transactional contexts etc., and this context information is what allows us to use the RADOS operations in an optimized way. The storage API is IMO the right level of abstraction, especially if we follow our long-term goal of eliminating the fs needs for index data too. I like the internal abstraction of sdbox/mdbox a lot, but for our purpose it should have been at the mail level and not the file level.
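To illustrate the layout we aim for (a sketch under assumptions only: pool, namespace, object id and metadata keys are invented here, not librmb's real attribute names), the mail body lives in the object data and the immutable metadata in its omap, written in one atomic operation and readable again with plain librados, without any Dovecot code:

// Sketch: one RADOS object per mail; body as object data, immutable
// metadata in the object's omap, written atomically in one operation.
// All names below are hypothetical.
#include <rados/librados.hpp>
#include <iostream>
#include <map>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init(nullptr);
  cluster.conf_read_file(nullptr);
  cluster.connect();

  librados::IoCtx io;
  cluster.ioctx_create("mail_storage", io);  // hypothetical pool
  io.set_namespace("user-12345");            // one namespace per user

  const std::string oid = "mail-guid-0001";  // hypothetical mail GUID

  librados::bufferlist body;
  body.append(std::string("From: alice@example.com\r\n\r\nHello Bob\r\n"));

  std::map<std::string, librados::bufferlist> meta;
  meta["mailbox_guid"].append(std::string("0123456789abcdef"));
  meta["received_time"].append(std::string("1506351902"));

  // Body and metadata leave the client in a single operation, so the
  // object never becomes visible half-written.
  librados::ObjectWriteOperation write_op;
  write_op.write_full(body);
  write_op.omap_set(meta);
  io.operate(oid, &write_op);

  // Any librados client can read the mail and its metadata back without
  // going through Dovecot.
  uint64_t size = 0;
  time_t mtime = 0;
  io.stat(oid, &size, &mtime);
  librados::bufferlist out;
  io.read(oid, out, size, 0);
  std::map<std::string, librados::bufferlist> meta_out;
  io.omap_get_vals(oid, "", 100, &meta_out);
  std::cout << std::string(out.c_str(), out.length()) << std::endl;
  return 0;
}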

But building an fs-rados should not be very hard.

BTW. We've also been planning on open sourcing some of the obox pieces, mainly fs-drivers (e.g. fs-s3). The obox format maybe too, but without the "metacache" piece. The current obox code is a bit too much married into the metacache though to make open sourcing it easy. (The metacache is about storing the Dovecot index files in object storage and efficiently caching them on local filesystem, which isn't planned to be open sourced in near future. That's pretty much the only difficult piece of the obox plugin, with Cassandra integration coming as a good second. I wish there had been a better/easier geo-distributed key-value database to use - tombstones are annoyingly troublesome.)


That would be great.

And using rmb-mailbox format, my main worries would be:
* doesn't store index files (= message flags) - not necessarily a problem, as long as you don't want geo-replication

Your index management is awesome, highly optimized and not easily reimplemented. Very nice work. Unfortunately it does not use the fs-api and therefore cannot be located on non-fs storage. We believe that CephFS will be a good and stable solution for the time being. Of course it would be nicer to have a lib-index that allows us to plug in different backends.

* index corruption means rebuilding them, which means rescanning list of mail files, which means rescanning the whole RADOS namespace, which practically means rescanning the RADOS pool. That most likely is a very very slow operation, which you want to avoid unless it's absolutely necessary. Need to be very careful to avoid that happening, and in general to avoid losing mails in case of crashes or other bugs.

Yes, disaster recovery is a problem. We are trying to build as many rescue tools as possible, but in the end scanning mails is involved. All mails are stored within separate RADOS namespaces, each representing a different user, which helps us avoid scanning the whole pool. But you are right, this should not be a regular operation.
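A rescue scan then only has to iterate one user's namespace instead of the whole pool, roughly like this sketch (pool and namespace names are hypothetical, and a real tool would do much more than print object ids):

// Sketch: list only one user's mail objects by restricting the IoCtx to
// that user's namespace. Pool and namespace names are hypothetical.
#include <rados/librados.hpp>
#include <iostream>

int main() {
  librados::Rados cluster;
  cluster.init(nullptr);
  cluster.conf_read_file(nullptr);
  cluster.connect();

  librados::IoCtx io;
  cluster.ioctx_create("mail_storage", io);
  io.set_namespace("user-12345");   // the listing is limited to this namespace

  for (librados::NObjectIterator it = io.nobjects_begin();
       it != io.nobjects_end(); ++it) {
    // Each oid found here would be fed into the index rebuild.
    std::cout << it->get_oid() << std::endl;
  }
  return 0;
}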

* I think copying/moving mails physically copies the full data on disk

We tried to optimize this. Moves within a user's mailboxes are done without copying the mails, by just changing the index data. Copies, when really necessary, are done by native RADOS commands (OSD to OSD) without transferring the data to the client and back. There is potential for even more optimization: we could build a mechanism similar to the mdbox reference counters to reduce copying. I am sure we will give it a try in a later version.
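RADOS offers a server-side copy for this in the form of ObjectWriteOperation::copy_from. A rough sketch of such an OSD-to-OSD copy follows; the object ids are hypothetical, the exact copy_from overloads vary between Ceph releases, and a real implementation needs proper error handling:

// Sketch: server-side copy of a mail object. The destination OSD pulls
// data, xattrs and omap directly from the source OSD; the mail body never
// passes through the Dovecot process. Object ids are hypothetical.
#include <rados/librados.hpp>

int main() {
  librados::Rados cluster;
  cluster.init(nullptr);
  cluster.conf_read_file(nullptr);
  cluster.connect();

  librados::IoCtx io;
  cluster.ioctx_create("mail_storage", io);
  io.set_namespace("user-12345");

  // Stat the source first so get_last_version() reflects its current version.
  uint64_t size = 0;
  time_t mtime = 0;
  io.stat("mail-guid-0001", &size, &mtime);

  librados::ObjectWriteOperation op;
  op.copy_from("mail-guid-0001", io, io.get_last_version());
  return io.operate("mail-guid-0001-copy", &op);
}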

* Each IMAP/POP3/LMTP/etc process connects to RADOS separately from each others - some connection pooling would likely help here

Dovecot uses separate processes a lot. You are right that this is a problem for protocols/libraries that have a high setup cost. You built mechanisms like login process reuse and the dict-proxy to overcome that problem.

Ceph is a low-latency object store. One reason for Ceph's speed is that the cluster structure is known to the clients: a client has a direct connection to the OSD that hosts the object it is looking for. If we place any intermediaries between the client process and the OSD (as with the dict-proxy), performance will suffer.

IMO the processes you mentioned should be reused to reduce the setup cost per session (or be implemented multithreaded or async). I am aware that this might be a potential security risk.
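The expensive part is the one-time cluster connect (monitor handshake, fetching the cluster map); the per-user state is only an IoCtx plus a namespace. A reused worker could therefore serve many sessions over a single connection, roughly like this sketch (names are hypothetical and this is not a proposal for Dovecot's actual process model):

// Sketch: pay the cluster connect once per process and hand out cheap
// per-user IoCtx handles afterwards. All names are hypothetical.
#include <rados/librados.hpp>
#include <string>

class RadosConnection {
 public:
  int connect() {                         // expensive, once per process
    cluster_.init(nullptr);
    cluster_.conf_read_file(nullptr);
    return cluster_.connect();
  }

  // Cheap per-session setup: no new cluster handshake happens here.
  int open_for_user(const std::string &user, librados::IoCtx &io) {
    int r = cluster_.ioctx_create("mail_storage", io);  // hypothetical pool
    if (r == 0)
      io.set_namespace(user);
    return r;
  }

 private:
  librados::Rados cluster_;
};

int main() {
  RadosConnection conn;
  if (conn.connect() < 0)
    return 1;

  librados::IoCtx alice, bob;             // two sessions, one connection
  conn.open_for_user("user-alice", alice);
  conn.open_for_user("user-bob", bob);
  return 0;
}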

Right now we do not know the price of the connection setup in a real cluster in a Dovecot context. We are curious about the results of the tests with Danny's cluster and will change the design of the software if necessary to get the best results.

Best regards

Peter


