<div dir="ltr">Hello Danny,<div><br></div><div>I actually saw that thread and I was very excited about it. Thank you all for the idea and for all the effort being put into it.</div><div>I haven't tried your plugin yet, but I intend to, and to contribute back. I think when it's ready for production it will be unbeatable.</div><div><br></div><div>I have watched your talk at Cephalocon (on YouTube). I'll look at your slides; maybe they'll give me more insight into our infrastructure architecture.<br clear="all"><div><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><br></div><div>As you can see, our business is still taking baby steps compared to Deutsche Telekom's, but we have faced infrastructure challenges every day from the start.</div><div>For now, I think we can still make do with cephfs, but we definitely need some improvements.</div><div><br></div>Regards,<div><br></div><div>Webert Lima</div><div>DevOps Engineer at MAV Tecnologia</div><div><b>Belo Horizonte - Brasil</b></div><div><b>IRC NICK - WebertRLZ</b></div></div></div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr">On Wed, May 16, 2018 at 4:42 PM Danny Al-Gaaf &lt;<a href="mailto:danny.al-gaaf@bisect.de">danny.al-gaaf@bisect.de</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

some time ago we had similar discussions when we, as an email provider,<br>

considered moving away from traditional NAS/NFS storage to Ceph.<br>

<br>

The problem with POSIX file systems and dovecot is that, e.g. with mdbox,<br>

only about 20% of the IO operations are READ/WRITE; the rest are<br>

metadata IOs. Using CephFS will not change this, since it will<br>

basically behave the same way as e.g. NFS.<br>

<br>

We decided to develop librmb to store emails as objects directly in<br>

RADOS instead of CephFS. The project is still under development, so you<br>

should not use it in production, but you can try it out in a POC.<br>

<br>

For more information check out my slides from Ceph Day London 2018:<br>

<a href="https://dalgaaf.github.io/cephday-london2018-emailstorage/#/cover-page" rel="noreferrer" target="_blank">https://dalgaaf.github.io/cephday-london2018-emailstorage/#/cover-page</a><br>

<br>

The project can be found on github:<br>

<a href="https://github.com/ceph-dovecot/" rel="noreferrer" target="_blank">https://github.com/ceph-dovecot/</a><br>

<br>

-Danny<br>

<br>

On 16.05.2018 at 20:37, Webert de Souza Lima wrote:<br>

> I'm sending this message to both dovecot and ceph-users ML so please don't<br>

> mind if something seems too obvious to you.<br>

> <br>

> Hi,<br>

> <br>

> I have a question for both dovecot and ceph lists and below I'll explain<br>

> what's going on.<br>

> <br>

> Regarding dbox format (<a href="https://wiki2.dovecot.org/MailboxFormat/dbox" rel="noreferrer" target="_blank">https://wiki2.dovecot.org/MailboxFormat/dbox</a>), when<br>

> using sdbox, a new file is stored for each email message.<br>

> When using mdbox, multiple messages are appended to a single file until it<br>

> reaches/passes the rotate limit.<br>

> <br>

> I would like to better understand how the mdbox format affects IO<br>

> performance.<br>

> I think it's generally expected that fewer, larger files translate to less IO<br>

> and more throughput when compared to many small files, but how does dovecot<br>

> handle that with mdbox?<br>

> If dovecot flushes data to storage each time a new email<br>

> arrives and is appended to the corresponding file, would that mean that it<br>

> generates the same amount of IO as it would with one file per message?<br>

> Also, when using mdbox, many messages will be appended to a given file before a<br>

> new file is created. That should mean that a file descriptor is kept open<br>

> for some time by the dovecot process.<br>

> With cephfs as the backend, how would this impact cluster performance<br>

> regarding MDS caps and cached inodes when files from thousands of users are<br>

> opened and appended to all over?<br>

> <br>

> I would like to understand this better.<br>

> <br>

> Why?<br>

> We are a small business email hosting provider with bare-metal, self-hosted<br>

> systems, using dovecot to serve mailboxes and cephfs for email storage.<br>

> <br>

> We are currently working on a dovecot and storage redesign that should be in<br>

> production ASAP. The main objective is to serve more users with better<br>

> performance, high availability and scalability.<br>

> * high availability and load balancing are extremely important to us *<br>

> <br>

> In our current model, we're using the mdbox format with dovecot, with<br>

> dovecot's INDEXes stored in a replicated pool of SSDs, and messages stored<br>

> in a replicated pool of HDDs (under a Cache Tier with a pool of SSDs).<br>

> All of this uses cephfs with the filestore backend.<br>

> <br>

> Currently there are 3 clusters running dovecot 2.2.34 and ceph Jewel<br>

> (10.2.9-4).<br>

>  - ~25K users from a few thousand domains per cluster<br>

>  - ~25TB of email data per cluster<br>

>  - ~70GB of dovecot INDEX [meta]data per cluster<br>

>  - ~100MB of cephfs metadata per cluster<br>

> <br>

> Our goal is to build a single ceph cluster for storage that could expand in<br>

> capacity, be highly available and perform well enough. I know, that's what<br>

> everyone wants.<br>

> <br>

> Cephfs is an important choice because:<br>

>  - there can be multiple mountpoints, thus multiple dovecot instances on<br>

> different hosts<br>

>  - the same storage backend is used for all dovecot instances<br>

>  - no need to shard domains<br>

>  - dovecot is easily load balanced (with director sticking users to the<br>

> same dovecot backend)<br>

> <br>

> In the upcoming upgrade we intend to:<br>

>  - upgrade ceph to 12.X (Luminous)<br>

>  - drop the SSD Cache Tier (because it's deprecated)<br>

>  - use bluestore engine<br>

> <br>

> I was told on freenode/#dovecot that there are many cases where SDBOX would<br>

> perform better with NFS sharing.<br>

> In the case of cephfs, at first I wouldn't think that would be true, because<br>

> more files == more generated IO, but given what I said at the<br>

> beginning regarding sdbox vs mdbox, that could be wrong.<br>

> <br>

> Any thoughts would be highly appreciated.<br>

> <br>

> Regards,<br>

> <br>

> Webert Lima<br>

> DevOps Engineer at MAV Tecnologia<br>

> *Belo Horizonte - Brasil*<br>

> *IRC NICK - WebertRLZ*<br>


</blockquote></div>
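<div><br></div>
<div>The sdbox vs. mdbox question in the quoted message can be sketched with a toy model. This is only an illustration, not dovecot's actual code: the function names, the <code>u.N</code> / <code>m.N</code> file names and the rule "fsync after every delivered message" are assumptions made for the example. It counts how many files are created (metadata work) and how many fsync calls are issued (data flushes) when the same messages are delivered in each layout.</div>
<pre>
```python
import os
import tempfile


def deliver_sdbox(maildir, messages):
    """Toy sdbox-style delivery: one new file per message, fsync each one."""
    creates = 0
    fsyncs = 0
    for i, body in enumerate(messages):
        path = os.path.join(maildir, f"u.{i + 1}")  # hypothetical file name
        with open(path, "wb") as f:
            creates += 1
            f.write(body)
            f.flush()
            os.fsync(f.fileno())  # assumption: every delivery is flushed
            fsyncs += 1
    return {"creates": creates, "fsyncs": fsyncs}


def deliver_mdbox(maildir, messages, rotate_size):
    """Toy mdbox-style delivery: append to m.N, rotate once the limit is reached."""
    creates = 0
    fsyncs = 0
    seq = 0
    path = None
    for body in messages:
        # Start a new storage file when none exists yet or the current one
        # has reached the rotate limit.
        if path is None or os.path.getsize(path) >= rotate_size:
            seq += 1
            path = os.path.join(maildir, f"m.{seq}")  # hypothetical file name
            creates += 1
        with open(path, "ab") as f:
            f.write(body)
            f.flush()
            os.fsync(f.fileno())  # assumption: every delivery is flushed
            fsyncs += 1
    return {"creates": creates, "fsyncs": fsyncs}


if __name__ == "__main__":
    msgs = [b"x" * 100 for _ in range(10)]
    with tempfile.TemporaryDirectory() as sdir:
        print("sdbox:", deliver_sdbox(sdir, msgs))
    with tempfile.TemporaryDirectory() as mdir:
        print("mdbox:", deliver_mdbox(mdir, msgs, rotate_size=400))
```
</pre>
<div>Under these assumptions both layouts issue one fsync per delivered message, so appending to a larger file does not by itself reduce the number of flushes; what changes is the number of files created, and therefore the metadata operations the file system has to handle.</div>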