I would like to report that from Ceph version 10.2.3 (the next release) on, Ceph FS works really well with Dovecot systems.
For those who don't know, Ceph is a "distributed object store and file system designed to provide excellent performance, reliability and scalability". We have used it here since 2013 very successfully, but never with our Dovecot setup. For more information see http://ceph.com/.
Since Ceph Jewel (the current version), Ceph FS is considered production-ready by their team. With Ceph FS you get a cache-coherent, POSIX-compliant [1] clustered file system, without most of NFS's shortcomings.
Ceph has very nice features like online upgrades, online maintenance, constant deep scrubbing of replicated data, cache tiering (HD -> SSD -> etc.), erasure coding (clustered RAID6, good for really old email; I'm not using it), etc. Ceph is very complex to operate but very flexible and robust.
This year we moved our Dovecot servers to a Ceph-based system and found one bug there (http://tracker.ceph.com/issues/15920), triggered when Dovecot's LMTP was delivering an email; the fix is about to be released in version 10.2.3. I have been running a patched build here for a couple of months without issue. So, now I'm glad to share with you guys that it works really well!
My setup involves two clusters, each with about 30k-40k users. Each cluster will have two HD storage servers (with 6TB HDs), two SSD storage servers (with Intel 480GB SSDs) and two frontends. In a few months we will add a third server of each type; Ceph clusters work better in threes.
Here we used mdbox, as it performs better on Ceph for maintenance tasks: each file is an object on Ceph, and maintenance costs grow with the number of objects. We created two base directories (see the sketches after this list):
- /srv/dovecot/mail/%d/%n - stored on HDs with the most recent files cached on SSDs, thanks to Ceph cache tiering. Also, the directory structure itself is stored on SSDs, so directory listings are very fast (the Ceph FS metadata pool is on SSDs).
- /srv/dovecot/index/%d/%n - stored only on SSDs, thanks to Ceph FS file layouts.
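To make the split concrete, this is roughly how those two directories map to Dovecot's mail_location (a minimal sketch, not a copy of our actual config):

    # dovecot.conf (sketch): mdbox mailboxes on the Ceph FS mount,
    # with the indexes kept on the SSD-only path
    mail_location = mdbox:/srv/dovecot/mail/%d/%n:INDEX=/srv/dovecot/index/%d/%n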
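And this is the general shape of the Ceph side, as a sketch only: the pool names mail-hd, mail-cache and mail-ssd are placeholders, and the commands are the Jewel-era ones, so check the docs for your release:

    # Pin the index directory to an SSD-backed data pool via a Ceph FS file layout
    # (the pool has to be added to the file system first, e.g. "ceph fs add_data_pool")
    setfattr -n ceph.dir.layout.pool -v mail-ssd /srv/dovecot/index

    # Put an SSD cache tier in front of the HD-backed pool that holds the mail data
    ceph osd tier add mail-hd mail-cache
    ceph osd tier cache-mode mail-cache writeback
    ceph osd tier set-overlay mail-hd mail-cache
    ceph osd pool set mail-cache hit_set_type bloom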
In our setup about 17% of the IOPS go to the HDs and the rest go to the SSDs, even though the SSDs are less than 5% of the total space. This is a matter of tuning the cache tiering parameters, but we haven't looked into that yet.
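For reference, these are the kinds of knobs I mean (again a sketch: the pool name mail-cache and the values are made up, size them for your own SSDs):

    # Cache tiering tuning knobs on the cache pool (Jewel-era names)
    ceph osd pool set mail-cache target_max_bytes 400000000000   # cap the cache size
    ceph osd pool set mail-cache cache_target_dirty_ratio 0.4    # start flushing dirty objects
    ceph osd pool set mail-cache cache_target_full_ratio 0.8     # start evicting objects
    ceph osd pool set mail-cache cache_min_flush_age 600         # seconds before a flush
    ceph osd pool set mail-cache cache_min_evict_age 1800        # seconds before an eviction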
This setup is working like a charm: performance is about 53% better than when we were using NFS on the same hardware. Our previous DRBD+Heartbeat setup didn't allow for online maintenance and had a few problems. Now we can do 100% online maintenance on the storage servers without users noticing, and on the frontends with just a reconnect but without any downtime.
Ceph is hard to learn at first, but those with bigger setups and stronger SLAs will want to take a look at it. I really recommend that the Dovecot community take a look at this kind of setup.
Good luck!
Best, Daniel Colchete
[1] http://docs.ceph.com/docs/hammer/dev/differences-from-posix/