Dovecot Replication Errors (only) when using tcps: as the mail_replica Protocol
James Pattinson
james at pattinson.org
Thu Nov 19 10:30:21 EET 2020
On 18/11/2020 19:37, Aakash Patel wrote:
> Hello,
>
> I have two mail servers and am also experiencing sporadic replication
> errors over tcps, similar to Reuben. Each server is running Dovecot
> 2.3.11.3 (502c39af9) on Debian 10.6.
>
> *Log entries from MX1*
> Nov 18 00:39:26 mx1 dovecot:
> dsync-local(user at example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error:
> dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds
> (last sent=mailbox, last recv=mailbox_state)
> Nov 18 00:39:26 mx1 dovecot:
> dsync-local(user at example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: Timeout
> during state=sync_mails (send=mailbox recv=mailbox)
> Nov 18 06:39:32 mx1 dovecot:
> dsync-local(user at example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error:
> dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds
> (last sent=mailbox, last recv=mailbox_state)
> Nov 18 06:39:32 mx1 dovecot:
> dsync-local(user at example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: Timeout
> during state=sync_mails (send=mailbox recv=mailbox)
> *End*
>
> *Log entries from MX2*
> Nov 18 00:29:55 mx2 dovecot:
> dsync-local(user at example.com)<fKK3JzWxtF9zAgAA5XpYKg>: Error: Couldn't
> lock /var/vmail/user at example.com/.dovecot-sync.lock:
> fcntl(/var/vmail/user at example.com/.dovecot-sync.lock, write-lock,
> F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held
> by pid 628)
> Nov 18 00:34:56 mx2 dovecot:
> dsync-local(user at example.com)<9IKaB2KytF92AgAA5XpYKg>: Error: Couldn't
> lock /var/vmail/user at example.com/.dovecot-sync.lock:
> fcntl(/var/vmail/user at example.com/.dovecot-sync.lock, write-lock,
> F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held
> by pid 628)
> Nov 18 00:39:26 mx2 dovecot: doveadm: Error: dsync(mx1.example.com):
> I/O has stalled, no activity for 600 seconds (last sent=mail_change
> (EOL), last recv=mailbox)
> Nov 18 06:39:32 mx2 dovecot: doveadm: Error: dsync(mx1.example.com):
> I/O has stalled, no activity for 600 seconds (last sent=mail_change
> (EOL), last recv=mailbox)
> *End*
>
> I have configured "replication_full_sync_interval = 1 hours", which
> explains why some of the sync errors occur at the same increment on
> the hour (if the error does occur).
>
> I've tested replication over tcps using either IPv6 or IPv4 -- this
> did not appear to make a difference.
>
> Changing replication to occur over tcp solves the issue (with "ssl =
> yes" commented out, as well).
>
> IMAP clients are primarily connecting to MX1 using SSL, which works
> well (SSL connections to MX2 also work). These are very low traffic
> machines at the moment (just 1 user as I continue testing).
>
> I've attached the output of "dovecot -n" from each server.
>
> Are there known bugs with replication using SSL? I'd appreciate any
> guidance.
>
> Thank you,
> AP
>
For what it's worth, I had the same issue when setting this up a few
weeks ago. I switched to SSH-based transport and it's been great
ever since. Is that an option for you?
dsync_remote_cmd = ssh -l%{login} %{host} doveadm dsync-server -u%u
mail_replica = remote:root@xx.xx.xx.xx
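
In case the placement is unclear, here is a rough sketch of where these two settings would normally sit, assuming the layout described in the Dovecot 2.3 replication documentation: dsync_remote_cmd at the top level and mail_replica inside the plugin block. The address is the same placeholder as above. It also assumes the account running replication can reach the other host over SSH with key-based login and no password prompt.

```
# Top-level setting: the command dsync runs to reach the remote side.
# %{login} and %{host} are filled in from the mail_replica value below.
dsync_remote_cmd = ssh -l%{login} %{host} doveadm dsync-server -u%u

plugin {
  # remote: makes replication use dsync_remote_cmd (SSH) instead of tcp:/tcps:
  mail_replica = remote:root@xx.xx.xx.xx
}
```

The mirror-image settings would go on the other server, with its mail_replica pointing back at this one.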
Cheers
James