Reporting on CephFS being ready to use with Dovecot

Christian Balzer chibi at gol.com
Thu Aug 18 03:12:15 UTC 2016


Hello Daniel,

Firstly, I've been using dovecot since the very early days and Ceph for
nearly 3 years and am quite happy and quite familiar with both.
However I currently have no plans to change from a DRBD cluster pair setup
for mailbox servers to anything based on Ceph, mostly for performance and
cost reasons.

I'm definitely not trying to rain on your parade, but I do have a few points
and issues; see in-line below.

On Wed, 17 Aug 2016 17:29:29 -0300 Daniel van Ham Colchete wrote:

> I would like to report that from version 10.2.3 on (next release), Ceph FS
> is working really well with Dovecot systems.
> 
> For those that don't know, Ceph is a "distributed object store and file
> system designed to provide excellent performance, reliability and
> scalability.". We have used it here since 2013 very successfully, but never
> with our Dovecot setup. For more information go on http://ceph.com/.
> 
> Since Ceph Jewel (the current version), Ceph FS is considered production
> ready by their team. With Ceph FS you have a cache-coherent POSIX-compliant
> [1] clustered file system, without most of the NFS shortcomings.
> 
> Ceph have very nice features like online upgrades, online maintenance,
> constant deep scrubbing of replicated data, cache tiering (HD -> SSD ->
> etc), erasure coding (clustered RAID6 for really old email - I'm not
> using), etc. Ceph is very complex to operate but very flexible and robust.
> 
For the record, the "deep scrubbing" is not constant (by default it runs once
a week), and with the current "filestore" storage backend, figuring out which
replica is the good one in case of a scrub error is left as an exercise for
the operator.
That is something that will be addressed by "Bluestore", which is expected to
become stable about two releases from now.
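
For reference, a minimal sketch of the relevant knobs (the values shown are
the defaults as I remember them, check the docs for your release):

  [osd]
  # deep scrub each PG at most once per interval; default is one week
  osd deep scrub interval = 604800
  # lighter (non-deep) scrubs run more often, default once a day
  osd scrub min interval = 86400

You can also kick one off by hand with "ceph pg deep-scrub <pgid>", and be
aware that "ceph pg repair" on filestore simply overwrites the other replicas
with the primary's copy, which is precisely why you want to know which
replica is the good one first.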

Also (AFAIK) cache tiering (which is quite nice) doesn't offer more than two
layers (i.e. an SSD pool in front of an HDD pool) at this point in time.
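
For anyone curious, setting up a tier looks roughly like this (pool names
are made up for illustration):

  # put an SSD-backed cache pool in front of the HDD-backed base pool
  ceph osd tier add mail-hdd mail-ssd-cache
  ceph osd tier cache-mode mail-ssd-cache writeback
  ceph osd tier set-overlay mail-hdd mail-ssd-cache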

> This year we moved our Dovecot servers to a Ceph based system, found one
> bug there (http://tracker.ceph.com/issues/15920) when Dovecot's LMTP was
> delivering an email, and the fix is about to be released in version 10.2.3.
> I have been using a fixed build here for a couple of months without issue.
> So, now I'm glad to share with you guys that it works really well!
> 
> My setup involves two clusters, each with about 30k-40k users. 
Two separate Ceph clusters?
If so, why? A larger, shared Ceph cluster would give you more peak
performance and more flexibility with your frontends.

> Each cluster
> will have two HDD storage servers (with 6TB HDDs), two SSD storage servers
> (with 480GB Intel SSDs) and two frontends. In a few months we will add a
> third server of each type. Clusters work better in threes.
> 
Ceph tends to perform better with more, smaller storage devices, but of
course that conflicts with keeping things dense and costs down.
There are two things in that paragraph which set off alarms here:

1. 480GB Intel SSDs sound like the DC S3510, which has an endurance rating of
0.3 DWPD (over 5 years), i.e. about 150GB of writes per day. Given that
Ceph's filestore journals every write (so everything hits the SSD twice),
that leaves roughly 75GB/day of actual client writes.
Now this might be fine if you have many of them and/or not much write
activity, but I'd religiously monitor the wearout levels of these SSDs
(see the smartctl sketch after point 2 below).
On a mailbox cluster with similar user numbers to yours I see about
80GB/day of write activity.

2. Since you're using 2x replication, with a dual-node cluster and plain
HDDs you're running the equivalent of a RAID5 when it comes to reliability
and fault tolerance: one failure is survivable, but a second one during
recovery means data loss.  Danger, Will Robinson.
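
Regarding 1., if those are indeed Intel DC SSDs, something along these lines
(run from cron) is what I mean by religious monitoring; the attribute names
are what Intel drives expose via smartmontools, verify against your model:

  # wearout counts down from 100; Total_LBAs_Written tracks lifetime writes
  smartctl -A /dev/sdX | egrep 'Media_Wearout_Indicator|Total_LBAs_Written'

And regarding 2., if you do stay at two storage nodes for a while, at least
keep an eye on the replication settings; the usual recommendation once a
third node is in place is size 3, min_size 2 (pool name below is made up):

  ceph osd pool get mail-hdd size        # number of replicas
  ceph osd pool get mail-hdd min_size    # replicas required to accept I/O
  ceph osd pool set mail-hdd size 3      # once the third node arrives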

> Here we used mdbox as it performs better on Ceph for maintenance
> tasks, since each file is an object on Ceph and maintenance costs increase
> with the number of objects. 
Yes, that's one of the reasons I haven't considered Ceph; I like and
prefer the transparency of maildir.

>We created two base directories:
> 
>  - /srv/dovecot/mail/%d/%n - stored on HDs with the most recent files
> cached on SSDs, thanks to Ceph Cache Tiering. Also, the directory structure
> itself is stored on SSDs, so dir listings are very fast (Ceph FS Metadata).
>  - /srv/dovecot/index/%d/%n - stored only on SSDs, thanks to Ceph FS file
> layouts.
>
Yup, that's a nice feature of CephFS.
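
For the archives, I assume that boils down to something like the following;
the pool name and exact settings are my guesses, not Daniel's actual config:

  # pin the index tree to an SSD-backed CephFS data pool via a file layout
  setfattr -n ceph.dir.layout.pool -v cephfs-ssd /srv/dovecot/index

  # and in dovecot.conf, point mdbox at the two trees
  mail_location = mdbox:/srv/dovecot/mail/%d/%n:INDEX=/srv/dovecot/index/%d/%n

(The SSD pool needs to be added as a data pool of the filesystem first, via
"ceph fs add_data_pool".)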
 
> On our setup about 17% of the IOPS go to the HDs, the rest go to the
> SSDs, even though the SSDs are less than 5% of the space. This is a matter
> of tuning the cache tiering parameters, but we haven't looked into that yet.
>
See above about the SSD endurance issues; for cache-tiering tips, pipe up on
the Ceph ML.
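
The knobs in question live on the cache pool itself; purely illustrative
values (and a made-up pool name) below:

  ceph osd pool set mail-ssd-cache hit_set_type bloom
  ceph osd pool set mail-ssd-cache target_max_bytes 800000000000
  ceph osd pool set mail-ssd-cache cache_target_dirty_ratio 0.4
  ceph osd pool set mail-ssd-cache cache_target_full_ratio 0.8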
 
> That setup is working like a charm; performance is about 53% better than
> when we were using NFS on the same hardware. 
You used NFS on top of RBD if I read the tracker correctly, right?

Any reason for not doing something similar to the DRBD setup you were
familiar with, that is, Pacemaker mapping RBD images and mounting a
filesystem on them?
That should have been significantly more performant. 
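
In its simplest form that would be something like the below on the active
frontend, with Pacemaker doing the map/mount (and unmap) on failover; the
names are of course made up:

  # map a RADOS block device image and put a plain local FS on top of it
  rbd map mail-rbd/mailstore01
  mkfs.xfs /dev/rbd/mail-rbd/mailstore01    # only once, at creation time
  mount /dev/rbd/mail-rbd/mailstore01 /srv/dovecot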

> Our previous DRBD+Heartbeat
> setup didn't allow for online maintenance and had a few problems. Now we
> can do 100% online maintenance on storage without users noticing, and on
> frontends with just a reconnect but without any downtime.
> 
DRBD and Pacemaker can have issues, especially with some of the buggy
resource agents out there.
Failing over a node in a controlled fashion takes a few seconds at most
here, which is also in the "not noticeable" ballpark.

Given that:
a) with DRBD, reads are local;
b) because of a), Ceph will always have the disadvantage of having to go
over the network for everything, with the resulting latency penalty;
c) to get roughly the same level of performance and reliability, one needs
at least 33% more storage hardware with Ceph, and that's not including the
additional frontends.

So again, for the time being I'm happier to stay with DRBD pairs,
especially since we have a custom, in-house migration system in place
that moves dead-ish/large/low-usage mailboxes to slower clusters and
smallish/high-usage mailboxes to faster ones.

> Ceph is hard to learn at first, but those with bigger setups and stronger
> SLAs will want to take a look at it. I really recommend that the Dovecot
> community take a look at this setup.
> 
I agree with all of this, particularly if you're not trying to squeeze
the last ounce of speed out of the least amount of rack space.

There's another aspect of Ceph that may be of interest with Dovecot: using
the object storage interface.
However, that doesn't use the native Ceph interfaces, and by its very
nature it is also on the slow side, but it scales nicely.

Regards,

Christian
> Good luck!
> 
> Best,
> Daniel Colchete
> 
> [1] http://docs.ceph.com/docs/hammer/dev/differences-from-posix/
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/

