Definitive guide to running Dovecot on top of a Clustered File System?
Hey there,
I am working on migrating a Dovecot system with 50,000+ users / 5 TB of mail data to a new set-up that scales better, but I found it difficult to find a good solution. In this email I lay out what I learned, in the hope that others can profit from it, and that others can help me with this design before I actually start migrating.
*My aim is a clustered file system setup for Dovecot, with the additional constraint of not having to pay some vendor on a per-mailbox basis.*
Why a clustered file system? I believe this yields the best solution: your machines become stateless consumers of services, and scaling or recovery becomes relatively easy.
Why the constraint of not paying per mailbox? If you paid a fee per mailbox, you would probably be better off just using a managed mail service altogether.
High-level
We are going to use ObjectiveFS as our clustered file system (I am not involved with them in any way). It uses FUSE to make it behave like a POSIX filesystem. Due to the way it compacts files that belong together, it handles a large number of small files relatively well. The great thing is: we can mount this filesystem on multiple machines. This piece of software is proprietary, but its price is very reasonable and it is a fixed monthly cost. We can use S3, Google Cloud Storage, or others as the actual storage system.
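To make that concrete, a minimal sketch of what mounting could look like on each Dovecot machine (filesystem name, paths and cache settings are placeholders; check the ObjectiveFS documentation for the exact option names):

    # license, passphrase and object store credentials live in /etc/objectivefs.env
    # (one file per variable, e.g. OBJECTIVEFS_LICENSE, OBJECTIVEFS_PASSPHRASE,
    #  AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)

    # optional local SSD cache for better small-file performance
    # (assumed variable names and sizes):
    #   DISKCACHE_SIZE=20G
    #   DISKCACHE_PATH=/var/cache/objectivefs

    # mount the same filesystem on every machine
    sudo mount.objectivefs <filesystem name> /srv/mail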
We store metadata about the mailboxes in a database that is available to all machines. We can use any managed database service for this. We explicitly set the proxy_host for each user in this database, so every user is always routed to the same machine. We make sure to distribute the users among the available machines.
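For example, with an SQL passdb the routing decision can come straight from that database. A minimal sketch, assuming a hypothetical users table with a proxy_host column:

    # /etc/dovecot/dovecot-sql.conf.ext
    driver = mysql
    connect = host=db.internal dbname=mail user=dovecot password=secret
    password_query = SELECT username, password, proxy_host AS host, 'Y' AS proxy_maybe FROM users WHERE username = '%u'

    # dovecot.conf
    passdb {
      driver = sql
      args = /etc/dovecot/dovecot-sql.conf.ext
    }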
All traffic is received by a TCP load balancer, which forwards the traffic to the underlying machines. Ideally this is a managed cloud service. I think there are two valid choices, but I prefer the first one:
- Use a TCP pass-through load balancer. This way the source IP is retained without requiring any change to the protocol (an example of this is Google Cloud external TCP/UDP Network Load Balancing https://cloud.google.com/load-balancing/docs/network).
- Use a TCP SSL-terminating load balancer and make it forward in HAProxy's PROXY protocol format so we retain the source IP.
That traffic is now routed to N machines, each of which runs Dovecot. Dovecot is configured as a Dovecot proxy (not to be confused with HAProxy's PROXY protocol), so that it will proxy the connection to another machine if required (or continue on this machine if you are already on the correct proxy_host). I believe this can be done using proxy_maybe https://doc.dovecot.org/configuration_manual/authentication/proxies/. Depending on the choice of load balancer you may also need to configure Dovecot to understand HAProxy's PROXY protocol, as sketched below.
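A minimal sketch of that PROXY-protocol side, assuming the load balancer's addresses fall in 10.0.0.0/8 (network and port are just examples):

    # dovecot.conf: trust the load balancer to send the PROXY header
    haproxy_trusted_networks = 10.0.0.0/8

    service imap-login {
      inet_listener imap {
        port = 143
        haproxy = yes
      }
    }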
We are going to use the maildir format and store those files on the ObjectiveFS filesystem. Dovecot's indexes are stored on a local SSD.
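Concretely, the mail location could look something like this (paths are just examples):

    # mail data on the shared ObjectiveFS mount, indexes on a local SSD
    mail_location = maildir:/srv/mail/%d/%n/Maildir:INDEX=/var/dovecot/index/%d/%n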
Choice motivations
ObjectiveFS:
- Could probably be another clustered file system as well.
- But ObjectiveFS works quite well because it handles many small files and can be configured with a local SSD cache.
- Using this software makes it quite easy to get a clustered file system running.
Regarding the mailbox format:
- dbox is said to be more performant, and having fewer files is an advantage in general. But one big disadvantage: because the index contains crucial information, you would need to place the index on the clustered file system as well. Maildir does not have this problem, and that is why we prefer it.
- obox is only available in Dovecot Pro and solves a lot of problems, but you can really only get Dovecot Pro if you pay a fee per mailbox. We don't want that!
Why not use Dovecot Director as well?
- You cannot run Dovecot Director on the same machine as your Dovecot backend/proxy. Thus we would need to introduce two additional hosts just to accept and route the traffic, which would increase the complexity of our solution.
- But it could be built on top of this set-up fairly easily.
- Currently, however, I don't see too many advantages in Director.
- One big advantage *could* be if it automatically removed failed nodes. However, this only works if you use Dovemon https://doc.dovecot.org/configuration_manual/dovemon/, and that is packaged with OX/Dovecot and thus requires paying per user.
- So in case you want to take a node down, Director gives you a slightly easier API to do so, namely *doveadm director add <backend server ip> 0*. However, we can simply imitate this by updating our database and changing the proxy_host to another destination, as sketched below.
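A sketch of that imitation, assuming the hypothetical users table from above (IP addresses are placeholders):

    -- drain backend 10.0.1.5 by pointing its users at another backend
    UPDATE users SET proxy_host = '10.0.1.6' WHERE proxy_host = '10.0.1.5';

    -- existing sessions on the old backend can then be kicked with doveadm kick,
    -- or you simply wait for clients to reconnect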
Load balancer
- You could also set up your MX records to point directly to your Dovecot nodes. I guess this should work as well.
- But I think the load balancer gives better control than changing and managing DNS.
Questions
In general I'd like to know whether this is a good idea!
But some other questions I have at this point that could make or break this setup:
- Can a Dovecot backend instance be configured to *also* run Dovecot proxy?
- Can a Dovecot Proxy node receive HaProxy PROXY traffic?
Details
Once I am sure I am on the right track I'd like to post the settings I use for the software as well. But let's first discuss the stuff above!
Best, Roel van Duijnhoven
On 02.06.21 at 17:09, Roel van Duijnhoven wrote:
> I am working on migrating a Dovecot system with 50,000+ users / 5 TB of mail data to a new set-up that scales better, but I found it difficult to find a good solution. In this email I lay out what I learned, in the hope that others can profit from it, and that others can help me with this design before I actually start migrating.
Hello Roel,
We're also planning a similar migration but chose a slightly different approach:
- we split our users into a number of shards (10-20k users per shard)
- LMTP and IMAP traffic hits a number of director servers (via a TCP load balancer or DNS round-robin)
- the directors determine the user's shard and direct the connection to the right shard backend
- a backend (a pair of two machines) authenticates IMAP traffic and dsyncs to its secondary (see the replication sketch below)
advantages:
- no cold standby machines
- scalable by adding pairs of backends
- scalable by adding new single directors
- every machine could be taken offline for maintenance
- every machine could fail without data loss or service degradation
- no proprietary software needed
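The backend pairing is essentially Dovecot's standard replication plugin; roughly like this on each backend of a pair (hostnames, users, port and password are placeholders):

    mail_plugins = $mail_plugins notify replication

    service replicator {
      process_min_avail = 1
      unix_listener replicator-doveadm {
        mode = 0600
      }
    }
    service aggregator {
      fifo_listener replication-notify-fifo {
        user = vmail
      }
      unix_listener replication-notify {
        user = vmail
      }
    }
    service doveadm {
      inet_listener {
        port = 12345
      }
    }
    doveadm_port = 12345
    doveadm_password = secret

    plugin {
      mail_replica = tcp:backend2.example.com
    }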
Those are our ideas ... Andreas