Hi,
I'm currently debugging some replication issues between two Dovecot 2.3.9.2 servers, where one is live and the other is just a copy used for backup with no IMAP user access. After the initial alignment (which produced various error messages, such as the stalled I/O and fcntl lock messages below), I am seeing replication miss messages or stop altogether on some mailboxes, without any further error messages.
doveadm: Error: dsync(REMOTE_HOSTNAME): I/O has stalled, no activity for 600 seconds (last sent=mail_change (EOL), last recv=mailbox)
doveadm: Error: Couldn't lock /var/vmail/DOMAIN/USER//.dovecot-sync.lock: fcntl(/var/vmail/DOMAIN/USER//.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by pid 30307)
I was surprised by this because, although I know there were replication issues in 2.3.8, I understood these were resolved in 2.3.9 once both servers were running 2.3.9.
I am still investigating and will post further if I get any useful insights.
However, I have a question which, despite my having used Dovecot in this configuration for many years, has never occurred to me before. I configured Dovecot following the wiki page https://wiki.dovecot.org/Replication, using TCP with SSL. Both servers have an identical Dovecot configuration except for:
1. The hostnames are different.
2. On the backup server I have removed the expire and quota plugins from the global mail_plugins setting.
3. In the mail_replica setting (tcps://hostname:port), each server points to the other server's hostname.
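For reference, the replication-related part of the configuration on each server follows the wiki example, roughly as sketched below. This is an illustration and not a copy of my exact configuration: the port number, the password, the vmail user, the CA file path and the hostname are all placeholders, and the mail_replica value is written in the tcps:host:port form that the wiki uses.

```
# Sketch based on the wiki example; all concrete values are placeholders.
mail_plugins = $mail_plugins notify replication

service aggregator {
  fifo_listener replication-notify-fifo {
    user = vmail
  }
  unix_listener replication-notify {
    user = vmail
  }
}

service replicator {
  process_min_avail = 1
  unix_listener replicator-doveadm {
    mode = 0600
    user = vmail
  }
}

# Listener that accepts incoming dsync connections from the other server.
service doveadm {
  inet_listener {
    port = 12345
    ssl = yes
  }
}

doveadm_port = 12345
doveadm_password = secret

# Needed so the tcps connection can verify the other server's certificate.
ssl_client_ca_file = /etc/ssl/certs/ca.pem

plugin {
  # On each server this names the OTHER server.
  mail_replica = tcps:other-server.example.com:12345
}
```

With this layout the only line that differs between the two servers is the hostname in mail_replica, which is what prompts the question below.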
What I have just realized is that nowhere in the wiki does it state that both servers should be set up for replication. I had always assumed that was the logical thing to do. So the question is: for successful replication, is it sufficient to set up one master with the replication configuration and just have a replication process listening on the other master, or should both servers be set up for replication in an almost identical way (with the three exceptions above)?
Thanks for any insights.
John