Dovecot Replication Errors (only) when using tcps: as the mail_replica Protocol

James Pattinson james at pattinson.org
Thu Nov 19 10:30:21 EET 2020


On 18/11/2020 19:37, Aakash Patel wrote:
> Hello,
>
> I have two mail servers and am also experiencing sporadic replication 
> errors over tcps, similar to Reuben. Each server is running Dovecot 
> 2.3.11.3 (502c39af9) on Debian 10.6.
>
> *Log entries from MX1*
> Nov 18 00:39:26 mx1 dovecot:
> dsync-local(user@example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error:
> dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds
> (last sent=mailbox, last recv=mailbox_state)
> Nov 18 00:39:26 mx1 dovecot:
> dsync-local(user@example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: Timeout
> during state=sync_mails (send=mailbox recv=mailbox)
> Nov 18 06:39:32 mx1 dovecot:
> dsync-local(user@example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error:
> dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds
> (last sent=mailbox, last recv=mailbox_state)
> Nov 18 06:39:32 mx1 dovecot:
> dsync-local(user@example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: Timeout
> during state=sync_mails (send=mailbox recv=mailbox)
> *End*
>
> *Log entries from MX2*
> Nov 18 00:29:55 mx2 dovecot:
> dsync-local(user@example.com)<fKK3JzWxtF9zAgAA5XpYKg>: Error: Couldn't
> lock /var/vmail/user@example.com/.dovecot-sync.lock:
> fcntl(/var/vmail/user@example.com/.dovecot-sync.lock, write-lock,
> F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held
> by pid 628)
> Nov 18 00:34:56 mx2 dovecot:
> dsync-local(user@example.com)<9IKaB2KytF92AgAA5XpYKg>: Error: Couldn't
> lock /var/vmail/user@example.com/.dovecot-sync.lock:
> fcntl(/var/vmail/user@example.com/.dovecot-sync.lock, write-lock,
> F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held
> by pid 628)
> Nov 18 00:39:26 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): 
> I/O has stalled, no activity for 600 seconds (last sent=mail_change 
> (EOL), last recv=mailbox)
> Nov 18 06:39:32 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): 
> I/O has stalled, no activity for 600 seconds (last sent=mail_change 
> (EOL), last recv=mailbox)
> *End*
>
> I have configured "replication_full_sync_interval = 1 hours", which
> would explain why the sync errors, when they occur, recur at the same
> minute past the hour.
>
> I've tested replication over tcps using either IPv6 or IPv4 -- this 
> did not appear to make a difference.
>
> Changing replication to occur over tcp solves the issue (with "ssl = 
> yes" commented out, as well).
>
> IMAP clients are primarily connecting to MX1 using SSL, which works 
> well (SSL connections to MX2 also work). These are very low traffic 
> machines at the moment (just 1 user as I continue testing).
>
> I've attached the output of "dovecot -n" from each server.
>
> Are there known bugs with replication using SSL? I'd appreciate any 
> guidance.
>
> Thank you,
> AP
>
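A note on the timestamps in the quoted logs: subtracting each reported timeout from the time its error was logged gives a rough idea of when each wait began. The sketch below only does that arithmetic, using the times and timeouts copied from the log lines above (the year is an assumption, since syslog omits it). It is not a diagnosis.

```python
from datetime import datetime, timedelta

def window_start(logged_at, timeout_s, year=2020):
    """Return HH:MM:SS at which a timeout window began (log time minus timeout).

    The year is an assumption: syslog-style timestamps do not carry one.
    """
    t = datetime.strptime(f"{year} {logged_at}", "%Y %b %d %H:%M:%S")
    return (t - timedelta(seconds=timeout_s)).strftime("%H:%M:%S")

# (label, time the error was logged, timeout reported in that error)
events = [
    ("mx2 lock timeout", "Nov 18 00:29:55", 30),
    ("mx2 lock timeout", "Nov 18 00:34:56", 30),
    ("mx1 I/O stalled",  "Nov 18 00:39:26", 600),
    ("mx2 I/O stalled",  "Nov 18 00:39:26", 600),
    ("mx1 I/O stalled",  "Nov 18 06:39:32", 600),
]

for label, logged_at, timeout_s in events:
    start = window_start(logged_at, timeout_s)
    print(f"{label:17} logged {logged_at[-8:]}  wait began about {start}")
```

By this arithmetic the 600-second stall reported at 00:39:26 began around 00:29:26, about a second after the first failed lock attempt on MX2 started waiting (00:29:25). That suggests the hung sync and the held .dovecot-sync.lock may be part of the same event.
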
For what it's worth, I had the same issue when setting this up a few
weeks ago. I switched to SSH-based transport and it has been working
well ever since. Is that an option for you?

dsync_remote_cmd = ssh -l%{login} %{host} doveadm dsync-server -u%u
mail_replica = remote:root@xx.xx.xx.xx
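
To show how the two settings fit together, here is a small Python sketch of how I understand the variables to be filled in: for a replica of the form remote:login@host, %{login} and %{host} come from the two sides of the "@", and %u is the mail user being synced. The function name, the host mx2.example.com and the user are hypothetical stand-ins, and this only mimics the substitution. It is not Dovecot's own code.

```python
import shlex

def expand_remote_cmd(template, mail_replica, user):
    """Illustrative expansion of %{login}, %{host} and %u (hypothetical helper)."""
    target = mail_replica
    if target.startswith("remote:"):
        target = target[len("remote:"):]
    # Split "login@host" at the last "@"; login is empty if no "@" is present.
    login, _, host = target.rpartition("@")
    expanded = (
        template.replace("%{login}", login)
        .replace("%{host}", host)
        .replace("%u", user)
    )
    return shlex.split(expanded)

cmd = expand_remote_cmd(
    "ssh -l%{login} %{host} doveadm dsync-server -u%u",
    "remote:root@mx2.example.com",  # hypothetical host in place of xx.xx.xx.xx
    "user@example.com",             # hypothetical mail user
)
print(" ".join(cmd))
```

Running the printed command by hand as the user Dovecot runs replication under is a reasonable way to check that key-based SSH login works without a password prompt before relying on it.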

Cheers
James



