I would like to report that from Ceph version 10.2.3 (the next release) on, Ceph FS works really well with Dovecot systems.
For those who don't know, Ceph is a "distributed object store and file system designed to provide excellent performance, reliability and scalability". We have used it here since 2013 very successfully, but never with our Dovecot setup. For more information see http://ceph.com/.
Since Ceph Jewel (the current version), Ceph FS is considered production-ready by their team. With Ceph FS you get a cache-coherent, POSIX-compliant [1] clustered file system, without most of NFS's shortcomings.
Ceph has very nice features like online upgrades, online maintenance, constant deep scrubbing of replicated data, cache tiering (HD -> SSD -> etc.), erasure coding (clustered RAID6, good for really old email; I'm not using it), etc. Ceph is very complex to operate but very flexible and robust.
This year we moved our Dovecot servers to a Ceph-based system and found one bug there (http://tracker.ceph.com/issues/15920), triggered when Dovecot's LMTP was delivering an email; the fix is about to be released in version 10.2.3. I have been running a patched build here for a couple of months without issue. So, now I'm glad to share with you guys that it works really well!
My setup involves two clusters, each with about 30k-40k users. Each cluster will have two HD storage servers (with 6TB HDs), two SSD storage servers (with Intel 480GB SSDs) and two frontends. In a few months we will add a third server of each type; Ceph clusters work better in threes.
Here we used mdbox, as it performs better on Ceph for maintenance tasks: each file is an object on Ceph, and maintenance costs grow with the number of objects. We created two base directories (see the sketches after this list):
- /srv/dovecot/mail/%d/%n - stored on HDs with the most recent files cached on SSDs, thanks to Ceph cache tiering. Also, the directory structure itself is stored on SSDs, so directory listings are very fast (the Ceph FS metadata pool is on SSDs).
- /srv/dovecot/index/%d/%n - stored only on SSDs, thanks to Ceph FS file layouts.
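To make the split concrete, this is roughly how those two directories map to Dovecot's mail_location (a minimal sketch, not a copy of our actual config):

    # dovecot.conf (sketch): mdbox mailboxes on the Ceph FS mount,
    # with the indexes kept on the SSD-only path
    mail_location = mdbox:/srv/dovecot/mail/%d/%n:INDEX=/srv/dovecot/index/%d/%n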
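And this is the general shape of the Ceph side, as a sketch only: the pool names mail-hd, mail-cache and mail-ssd are placeholders, and the commands are the Jewel-era ones, so check the docs for your release:

    # Pin the index directory to an SSD-backed data pool via a Ceph FS file layout
    # (the pool has to be added to the file system first, e.g. "ceph fs add_data_pool")
    setfattr -n ceph.dir.layout.pool -v mail-ssd /srv/dovecot/index

    # Put an SSD cache tier in front of the HD-backed pool that holds the mail data
    ceph osd tier add mail-hd mail-cache
    ceph osd tier cache-mode mail-cache writeback
    ceph osd tier set-overlay mail-hd mail-cache
    ceph osd pool set mail-cache hit_set_type bloom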
In our setup about 17% of the IOPS go to the HDs and the rest go to the SSDs, even though the SSDs are less than 5% of the total space. This is a matter of tuning the cache tiering parameters, but we haven't looked into that yet.
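For reference, these are the kinds of knobs I mean (again a sketch: the pool name mail-cache and the values are made up, size them for your own SSDs):

    # Cache tiering tuning knobs on the cache pool (Jewel-era names)
    ceph osd pool set mail-cache target_max_bytes 400000000000   # cap the cache size
    ceph osd pool set mail-cache cache_target_dirty_ratio 0.4    # start flushing dirty objects
    ceph osd pool set mail-cache cache_target_full_ratio 0.8     # start evicting objects
    ceph osd pool set mail-cache cache_min_flush_age 600         # seconds before a flush
    ceph osd pool set mail-cache cache_min_evict_age 1800        # seconds before an eviction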
This setup is working like a charm: performance is about 53% better than when we were using NFS on the same hardware. Our previous DRBD+Heartbeat setup didn't allow for online maintenance and had a few problems. Now we can do 100% online maintenance on the storage servers without users noticing, and on the frontends with just a reconnect but without any downtime.
Ceph is hard to learn at first, but those with bigger setups and stronger SLAs will want to take a look at it. I really recommend that the Dovecot community take a look at this kind of setup.
Good luck!
Best, Daniel Colchete
[1] http://docs.ceph.com/docs/hammer/dev/differences-from-posix/