It has now happened twice that replication created folders and mails in the wrong mailbox :(
Here's the architecture we use:
- 2 Dovecot (2.2.32) backends in two different datacenters replicating via a VPN connection
- Dovecot directors in both datacenters talk to both backends, with a vhost_count of 100 vs. 1 for the local vs. the remote backend
- backends use a proxy dict via a unix domain socket plus socat to talk via TCP to a dict on a different server (Kubernetes cluster); see the command sketch after this list
- backends have a local SQLite userdb for iteration (it also contains the home directories, since a userdb doing only iteration is not possible)
- serving around 7000 mailboxes in roughly 200 different domains
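To make the director weighting and the socat bridge concrete, this is roughly what it looks like (the backend IPs, the dict hostname and port 2001 are placeholders, not our real values; the socket path is the one from the proxy dict config below):

# on each director: prefer the local backend, keep the remote one as fallback
doveadm director add 10.0.1.10 100   # local backend (placeholder IP)
doveadm director add 10.0.2.10 1     # remote backend (placeholder IP)

# on each backend: socat bridges the unix socket used by the proxy dict
# to the dict service in the Kubernetes cluster (placeholder host/port)
socat UNIX-LISTEN:/var/run/dovecot_auth_proxy/socket,fork,unlink-early \
    TCP:dict.k8s.example.org:2001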
Everything works as expected until the dict is not reachable, e.g. due to a server failure or a planned reboot of a node of the Kubernetes cluster. In that situation it can happen that some requests are not answered, even though Kubernetes runs multiple instances of the dict. I can only speculate about what happens then: it seems the connection failure to the remote dict is not handled correctly and leads to a situation in which the last mailbox/home directory is reused for the replication :(
When it happened the first time we attributed it to the fact that the SQLite database at that time contained no home directory information, which we fixed afterwards. That first incident (a server failure) lasted a couple of minutes and led to many mailboxes containing mostly folders, but also some newly arrived mails, belonging to other mailboxes/users. We could only resolve that situation by rolling back to a ZFS snapshot taken before the downtime.
The second time was last Friday night during a (much shorter) reboot of a Kubernetes node and led to only a single mailbox containing folders and mails of other mailboxes. That was verified by looking at the timestamps of directories below $home/mdbox/mailboxes and of files in $home/mdbox/storage. I cannot tell whether adding the home directories to the SQLite database or the shorter duration of the failure limited the wrong replication to a single mailbox.
Could someone with more knowledge of the Dovecot code please check/verify how replication deals with failures in the proxy dict? I'm of course happy to provide more information about our configuration if needed.
Here is an excerpt of our configuration (full doveconf -n is attached):
passdb {
  args = /etc/dovecot/dovecot-dict-master-auth.conf
  driver = dict
  master = yes
}
passdb {
  args = /etc/dovecot/dovecot-dict-auth.conf
  driver = dict
}
userdb {
  driver = prefetch
}
userdb {
  args = /etc/dovecot/dovecot-dict-auth.conf
  driver = dict
}
userdb {
  args = /etc/dovecot/dovecot-sql.conf
  driver = sql
}
dovecot-dict-auth.conf:
uri = proxy:/var/run/dovecot_auth_proxy/socket:backend
password_key = passdb/%u/%w
user_key = userdb/%u
iterate_disable = yes
dovecot-dict-master-auth.conf:
uri = proxy:/var/run/dovecot_auth_proxy/socket:backend
password_key = master/%{login_user}/%u/%w
iterate_disable = yes
dovecot-sql.conf:
driver = sqlite
connect = /etc/dovecot/users.sqlite
user_query = SELECT home,NULL AS uid,NULL AS gid FROM users WHERE userid = '%n' AND domain = '%d'
iterate_query = SELECT userid AS username, domain FROM users
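For reference, the two queries above assume a users table roughly like the following; this is only a minimal sketch of the columns the queries reference, the real table may have more fields:

sqlite3 /etc/dovecot/users.sqlite <<'SQL'
-- minimal schema implied by user_query/iterate_query above
CREATE TABLE IF NOT EXISTS users (
  userid TEXT NOT NULL,   -- local part, matched against %n
  domain TEXT NOT NULL,   -- domain part, matched against %d
  home   TEXT NOT NULL    -- home directory, returned by user_query
);
SQL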
--
Ralf Becker
EGroupware GmbH [www.egroupware.org]
Commercial register: HRB Kaiserslautern 3587
Managing directors: Birgit and Ralf Becker
Leibnizstr. 17, 67663 Kaiserslautern, Germany
Phone: +49 631 31657-0