Dovecot Replication Errors (only) when using tcps: as the mail_replica Protocol

Philipp Faeustlin philipp.faeustlin at uni-hohenheim.de
Fri Dec 18 14:58:10 EET 2020


On 19.11.20 at 09:30, James Pattinson wrote:
> 
> On 18/11/2020 19:37, Aakash Patel wrote:
>> Hello,
>>
>> I have two mail servers and am also experiencing sporadic replication 
>> errors over tcps, similar to Reuben. Each server is running Dovecot 
>> 2.3.11.3 (502c39af9) on Debian 10.6.
>>
>> *Log entries from MX1*
>> Nov 18 00:39:26 mx1 dovecot: 
>> dsync-local(user@example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: 
>> dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds 
>> (last sent=mailbox, last recv=mailbox_state)
>> Nov 18 00:39:26 mx1 dovecot: 
>> dsync-local(user@example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: Timeout 
>> during state=sync_mails (send=mailbox recv=mailbox)
>> Nov 18 06:39:32 mx1 dovecot: 
>> dsync-local(user@example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: 
>> dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds 
>> (last sent=mailbox, last recv=mailbox_state)
>> Nov 18 06:39:32 mx1 dovecot: 
>> dsync-local(user@example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: Timeout 
>> during state=sync_mails (send=mailbox recv=mailbox)
>> *End*
>>
>> *Log entries from MX2*
>> Nov 18 00:29:55 mx2 dovecot: 
>> dsync-local(user@example.com)<fKK3JzWxtF9zAgAA5XpYKg>: Error: Couldn't 
>> lock /var/vmail/user@example.com/.dovecot-sync.lock: 
>> fcntl(/var/vmail/user@example.com/.dovecot-sync.lock, write-lock, 
>> F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held 
>> by pid 628)
>> Nov 18 00:34:56 mx2 dovecot: 
>> dsync-local(user@example.com)<9IKaB2KytF92AgAA5XpYKg>: Error: Couldn't 
>> lock /var/vmail/user@example.com/.dovecot-sync.lock: 
>> fcntl(/var/vmail/user@example.com/.dovecot-sync.lock, write-lock, 
>> F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held 
>> by pid 628)
>> Nov 18 00:39:26 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): 
>> I/O has stalled, no activity for 600 seconds (last sent=mail_change 
>> (EOL), last recv=mailbox)
>> Nov 18 06:39:32 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): 
>> I/O has stalled, no activity for 600 seconds (last sent=mail_change 
>> (EOL), last recv=mailbox)
>> *End*
>>
>> I have configured "replication_full_sync_interval = 1 hours", which 
>> explains why some of the sync errors occur at the same increment on 
>> the hour (if the error does occur).
>>
>> I've tested replication over tcps using either IPv6 or IPv4 -- this 
>> did not appear to make a difference.
>>
>> Changing replication to occur over tcp solves the issue (with "ssl = 
>> yes" commented out, as well).
>>
>> IMAP clients are primarily connecting to MX1 using SSL, which works 
>> well (SSL connections to MX2 also work). These are very low traffic 
>> machines at the moment (just 1 user as I continue testing).
>>
>> I've attached the output of "dovecot -n" from each server.
>>
>> Are there known bugs with replication using SSL? I'd appreciate any 
>> guidance.
>>
>> Thank you,
>> AP
>>
> For what it's worth, I had the same issue when setting this up a few 
> weeks ago. I switched to using SSH based transport and it's been great 
> ever since. Is that an option for you?
> 
> dsync_remote_cmd = ssh -l%{login} %{host} doveadm dsync-server -u%u
> mail_replica = remote:root@xx.xx.xx.xx
> 
> Cheers
> James
> 
> 
I am seeing the same errors with tcps, in my case with version 2.3.11.3 
on CentOS.
A test without SSL looks better so far; I will probably also try it over ssh.
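
For reference, the three transports discussed in this thread differ roughly 
as sketched below. This is a minimal sketch, not a tested configuration: the 
port number 12345 is an assumption and does not come from anyone's setup here, 
and the hostname is the placeholder from the logs above. It only shows where 
the switch between tcps, tcp and ssh is made; it does not fix the stall itself.

```
# Sketch only. Port 12345 is an assumed example value.
# Enable exactly one of the three mail_replica variants.

# 1) TLS transport (the variant that stalls in the reports above)
service doveadm {
  inet_listener {
    port = 12345
    ssl = yes        # comment this out when switching to plain tcp
  }
}
doveadm_port = 12345
# doveadm_password must also be set for the tcp/tcps transports.
# For tcps the client side normally also needs ssl_client_ca_dir or
# ssl_client_ca_file so the peer certificate can be verified.
plugin {
  mail_replica = tcps:mx2.example.com
}

# 2) Plain TCP (reported to work; traffic is unencrypted)
#plugin {
#  mail_replica = tcp:mx2.example.com
#}

# 3) SSH transport, as posted by James (no doveadm inet_listener needed)
#dsync_remote_cmd = ssh -l%{login} %{host} doveadm dsync-server -u%u
#plugin {
#  mail_replica = remote:root@xx.xx.xx.xx
#}
```

With plain tcp the mail data crosses the network unencrypted, so the ssh 
variant is probably the safer workaround unless the link between the two 
servers is already protected.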

Best regards
Philipp
