Replication question

12 Jan 2020

Hi,
I'm currently debugging some replication issues between two Dovecot
2.3.9.2 servers, where one is live and the other is just a copy used for
backup with no IMAP user access. After initial alignment (which produced
various error messages, such as the stalled I/O and fcntl lock messages
shown below), I am seeing replication miss messages or stop altogether
on mailboxes, even with no further error messages.

doveadm: Error: dsync(REMOTE_HOSTNAME): I/O has stalled, no activity for 600 seconds (last sent=mail_change (EOL), last recv=mailbox)

doveadm: Error: Couldn't lock /var/vmail/DOMAIN/USER//.dovecot-sync.lock: fcntl(/var/vmail/DOMAIN/USER//.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by pid 30307)

I was surprised by this because, although I know there were replication
issues in 2.3.8, I understood these were resolved once both servers were
running 2.3.9.
I am still investigating and will post again if I get any useful insights.
However, I have a question which, despite my using Dovecot for many
years in this configuration, has never occurred to me before. I
configured Dovecot following the wiki page
https://wiki.dovecot.org/Replication, using TCP and SSL. Both servers
have an identical Dovecot configuration except for:

different hostnames

on the backup server I have removed expire and quota plugins in the
global mail_plugins

in the mail_replica setting (tcps://hostname:port), each server points
to the other server's hostname
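
For reference, here is a minimal sketch of the replication-related
settings, following the layout I remember from the wiki page. It is
illustrative only and not a copy of my running configuration: the
hostname backup.example.com, port 12345, the vmail user and the
password are all placeholders, and mail_replica is written in the
tcps:host:port form that the wiki uses.

```
# Illustrative only: hostname, port, user and password are placeholders.
mail_plugins = $mail_plugins notify replication

service replicator {
  process_min_avail = 1
}

service aggregator {
  fifo_listener replication-notify-fifo {
    user = vmail
  }
  unix_listener replication-notify {
    user = vmail
  }
}

service doveadm {
  inet_listener {
    port = 12345
    ssl = yes
  }
}

doveadm_port = 12345
doveadm_password = CHANGE_ME

plugin {
  # On the live server this names the backup; on the backup, the live server.
  mail_replica = tcps:backup.example.com:12345
}
```

In my setup this same block is present on both servers, with only the
mail_replica target differing, which is the arrangement the question
below is about.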

What I just realized is that nowhere in the wiki does it state that both
servers should be set up for replication. I had always assumed that was
the logical thing to do. So the question is: for successful replication,
is it sufficient to set up one master configuration and just have a
replication process listening on the other master, or should both
servers be set up for replication in an almost identical way (with the 3
exceptions above)?
Thanks for any insights.
John