Reporting on CephFS being ready to use with Dovecot
I would like to report that from version 10.2.3 on (next release), Ceph FS is working really well with Dovecot systems.
For those that don't know, Ceph is a "distributed object store and file system designed to provide excellent performance, reliability and scalability". We have used it here very successfully since 2013, but never with our Dovecot setup. For more information, see http://ceph.com/.
Since Ceph Jewel (the current version), Ceph FS is considered production ready by their team. With Ceph FS you have a cache-coherent POSIX-compliant [1] clustered file system, without most of the NFS shortcomings.
Ceph has very nice features like online upgrades, online maintenance, constant deep scrubbing of replicated data, cache tiering (HD -> SSD -> etc.), erasure coding (clustered RAID6 for really old email - I'm not using it), etc. Ceph is very complex to operate, but very flexible and robust.
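To give an idea of how the cache tiering fits in, here is a rough sketch of putting an SSD pool in front of an HD pool on Jewel. The pool names and PG counts are just placeholders for this example, and you still need CRUSH rules mapping each pool onto the right class of OSDs:

  # create the backing (HD) and cache (SSD) pools - placeholder names/PG counts
  ceph osd pool create mail-hd 1024 1024
  ceph osd pool create mail-ssd 128 128

  # attach the SSD pool as a writeback cache tier in front of the HD pool
  ceph osd tier add mail-hd mail-ssd
  ceph osd tier cache-mode mail-ssd writeback
  ceph osd tier set-overlay mail-hd mail-ssd

  # the cache tier needs a hit set to track which objects are hot
  ceph osd pool set mail-ssd hit_set_type bloom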
This year we moved our Dovecot servers to a Ceph based system, found one bug there (http://tracker.ceph.com/issues/15920) when Dovecot's LMTP was delivering an email, and the fix is about to be released in version 10.2.3. I have been running a patched build here for a couple of months without issue. So, now I'm glad to share with you guys that it works really well!
My setup involves two clusters, each with about 30k-40k users. Each cluster will have two HD storages (with 6TB HDs), two SSD storages (with Intel 480GB SSDs) and two frontends. In a few months we will add a third server of each type. Clusters work better in threes.
Here we used mdbox as it performs better on Ceph for maintenance tasks, since each file is an object on Ceph and maintenance costs increase with the number of objects. We created two base directories:
- /srv/dovecot/mail/%d/%n - stored on HDs with the most recent files cached on SSDs, thanks to Ceph Cache Tiering. Also, the directory structure itself is stored on SSDs, so dir listings are very fast (Ceph FS Metadata).
- /srv/dovecot/index/%d/%n - stored only on SSDs, thanks to Ceph FS file layouts (see the sketch below).
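Roughly, the relevant bits look like this - a sketch rather than our exact config, with "cephfs-ssd" standing in for the SSD-backed data pool (which has to be added to the file system as an extra data pool first; layouts only affect files created afterwards):

  # dovecot.conf: mdbox mail under the HD-backed tree, indexes under the SSD-backed tree
  mail_location = mdbox:/srv/dovecot/mail/%d/%n:INDEX=/srv/dovecot/index/%d/%n

  # Ceph FS file layout: pin everything created under the index tree to the SSD pool
  setfattr -n ceph.dir.layout.pool -v cephfs-ssd /srv/dovecot/index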
On our setup about 17% of the IOPs go to the HDs and the rest go to the SSDs, even though the SSDs are less than 5% of the space. This is a matter of tuning the cache tiering parameters, but we haven't looked at that yet.
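When we do get around to tuning it, these are the kinds of knobs involved. The values below are made up for illustration (not what we run), and "mail-ssd" is the placeholder cache pool name from the sketch above:

  # how much data the cache pool may hold before flushing/eviction kicks in
  ceph osd pool set mail-ssd target_max_bytes 400000000000

  # start flushing dirty objects at 40% full, evict clean objects at 80% full
  ceph osd pool set mail-ssd cache_target_dirty_ratio 0.4
  ceph osd pool set mail-ssd cache_target_full_ratio 0.8

  # hit-set history and how recently an object must have been used to be promoted
  ceph osd pool set mail-ssd hit_set_count 8
  ceph osd pool set mail-ssd hit_set_period 3600
  ceph osd pool set mail-ssd min_read_recency_for_promote 2
  ceph osd pool set mail-ssd min_write_recency_for_promote 2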
That setup is working like a charm, performance is about 53% better than when we were using NFS on the same hardware. Our previous DRBD+Heartbeat setup didn't allow for online maintenance and had a few problems. Now we can do 100% online maintenance on storage without users noticing, and on frontends with just a reconnect but without any downtime.
Ceph is hard to learn at first, but those with bigger setups and stronger SLAs will want to take a look at it. I really recommend that the Dovecot community take a look at this setup.
Good luck!
Best, Daniel Colchete
[1] http://docs.ceph.com/docs/hammer/dev/differences-from-posix/
Hello Daniel,
Firstly, I've been using dovecot since the very early days and Ceph for nearly 3 years and am quite happy and quite familiar with both. However I currently have no plans to change from a DRBD cluster pair setup for mailbox servers to anything based on Ceph, mostly for performance and cost reasons.
I'm definitely not trying to rain on your parade, but I do have a few points and issues, see in-line below.
On Wed, 17 Aug 2016 17:29:29 -0300 Daniel van Ham Colchete wrote:
> I would like to report that from version 10.2.3 on (next release), Ceph FS is working really well with Dovecot systems.
> For those that don't know, Ceph is a "distributed object store and file system designed to provide excellent performance, reliability and scalability". We have used it here very successfully since 2013, but never with our Dovecot setup. For more information, see http://ceph.com/.
> Since Ceph Jewel (the current version), Ceph FS is considered production ready by their team. With Ceph FS you have a cache-coherent POSIX-compliant [1] clustered file system, without most of the NFS shortcomings.
> Ceph has very nice features like online upgrades, online maintenance, constant deep scrubbing of replicated data, cache tiering (HD -> SSD -> etc.), erasure coding (clustered RAID6 for really old email - I'm not using it), etc. Ceph is very complex to operate, but very flexible and robust.
For the record, the "deep scrubbing" is not constant (by default it runs once a week), and with the current "filestore" storage backend, finding out which is the good replica in case of a scrub error is left as an exercise for the operator. Something that is going to be addressed by "Bluestore", which is going to be stable in about 2 releases from now.
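In ceph.conf terms, the stock scrub cadence looks roughly like this (values in seconds):

  [osd]
  # shallow scrubs at most once a day, deep scrubs once a week by default
  osd scrub min interval = 86400
  osd deep scrub interval = 604800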
Also (AFAIK) cache tiering (which is quite nice) doesn't offer more than 2 layers (so SSD pool in front of HDD pool) at this point in time.
> This year we moved our Dovecot servers to a Ceph based system, found one bug there (http://tracker.ceph.com/issues/15920) when Dovecot's LMTP was delivering an email, and the fix is about to be released in version 10.2.3. I have been running a patched build here for a couple of months without issue. So, now I'm glad to share with you guys that it works really well!
> My setup involves two clusters, each with about 30k-40k users.
Two separate Ceph clusters? If so, why? A larger, shared Ceph cluster would give you more peak performance and more flexibility with your frontends.
> Each cluster will have two HD storages (with 6TB HDs), two SSD storages (with Intel 480GB SSDs) and two frontends. In a few months we will add a third server of each type. Clusters work better in threes.
Ceph tends to perform better with more and smaller storage devices, but of course that conflicts with keeping things dense and costs down. There are two things in that paragraph which set off alarms here:
480GB Intel SSDs sound like DC S3510, which have an endurance of 0.3 DWPD (over 5 years), 150GB per day. Given that Ceph needs a journal, that's 75GB/day. Now this might be fine if you have many of them and/or not much write activity. But I'd religiously monitor the wearout levels of these SSDs. On a mailbox cluster with similar user numbers to yours I see about 80GB/day write activity.
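Back of the envelope, assuming the journals sit on the same SSDs (so every client write lands on the SSD twice):

  480 GB x 0.3 DWPD  ~= 144 GB of rated writes per day per SSD (call it 150)
  journal + data on the same device -> each client write is written twice
  => sustainable client writes ~= 150 / 2 = 75 GB per day per SSD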
Since you're using 2x replication, with a dual node cluster and plain HDDs, you're running the equivalent to a RAID5 when it comes to reliability and fault tolerance. Danger, Will Robinson.
> Here we used mdbox as it performs better on Ceph for maintenance tasks, since each file is an object on Ceph and maintenance costs increase with the number of objects.
Yes, that's one of the reasons I haven't considered Ceph; I do like and prefer the transparency of maildir.
> We created two base directories:
> - /srv/dovecot/mail/%d/%n - stored on HDs with the most recent files cached on SSDs, thanks to Ceph Cache Tiering. Also, the directory structure itself is stored on SSDs, so dir listings are very fast (Ceph FS Metadata).
> - /srv/dovecot/index/%d/%n - stored only on SSDs, thanks to Ceph FS file layouts.
Yup, that's a nice feature of CephFS.
> On our setup about 17% of the IOPs go to the HDs and the rest go to the SSDs, even though the SSDs are less than 5% of the space. This is a matter of tuning the cache tiering parameters, but we haven't looked at that yet.
See above about the SSD endurance issues; for cache-tiering tips, pipe up on the Ceph mailing list.
> That setup is working like a charm, performance is about 53% better than when we were using NFS on the same hardware.
You used NFS on top of RBD, if I read the tracker correctly, right?
Any reason for not doing something similar to the DRBD setup you were familiar with, that is Pacemaker and mounting RBD (and FS) from it? That should have been significantly more performant.
> Our previous DRBD+Heartbeat setup didn't allow for online maintenance and had a few problems. Now we can do 100% online maintenance on storage without users noticing, and on frontends with just a reconnect but without any downtime.
DRBD and Pacemaker can have issues, especially with some buggy resource agents around. Failing over a node in a controlled fashion takes a few seconds at most here, also in the "not noticeable" ballpark.
Given that:
a) with DRBD, reads are local,
b) considering a), Ceph will always have the disadvantage of having to go via the network for everything, with the resulting latency issues, and
c) to get roughly the same level of performance and reliability, one needs at least 33% more hardware (storage) with Ceph, and that's not including the additional frontends.
So again, for the time being I'm happier to stay with DRBD pairs. Especially since we have a custom, in-house made migration system in place that will move dead-ish/large/low-usage mailboxes to slower clusters and smallish/high-usage mailboxes to faster ones.
> Ceph is hard to learn at first, but those with bigger setups and stronger SLAs will want to take a look at it. I really recommend that the Dovecot community take a look at this setup.
I agree with all parts of this, particularly if you're not trying to squeeze the last ounce of speed from the least amount of rack space.
There's another aspect of Ceph that may be of interest with Dovecot: using the object storage interface. However, that doesn't support native Ceph interfaces and is by its very nature also slowish, but it has nice scalability.
Regards,
Christian
--
Christian Balzer Network/Systems Engineer
chibi@gol.com Global OnLine Japan/Rakuten Communications
http://www.gol.com/
Hi,
On 17 Aug 2016, at 23:29, Daniel van Ham Colchete <daniel.colchete@gmail.com> wrote:
> My setup involves two clusters, each with about 30k-40k users. Each cluster will have two HD storages (with 6TB HDs), two SSD storages (with Intel 480GB SSDs) and two frontends. In a few months we will add a third server of each type. Clusters work better in threes.
Just a question: looking at your user count, this setup of yours still includes just one Dovecot backend system, and you are not actually running a clustered Dovecot setup with multiple instances accessing the users' mailboxes?
Sami