Dovecot Replication Errors (only) when using tcps: as the mail_replica Protocol

13 Jun 2020


Hi,

For some time I've been seeing errors logged by the replication
processes: replication sessions appear to time out periodically.

This is with Dovecot version 2.3.10.1 (a3d0e1171) on both hosts, and
both run Gentoo x86_64.

After some investigation I've determined that these timeouts only ever
occur with tcps as the replication connection type.  They never occur
when unencrypted tcp is configured.  I validated this by changing only
the connection type in mail_replica to tcp on both ends of the
replication configuration: with no other changes, and after a few days
of operation, not a single error has been logged.

mail_replica = tcps:lightning.reub.net:4813   <<< periodic timeouts
mail_replica = tcp:lightning.reub.net:4814   <<< works perfectly
Example of the error:
Jun 12 15:45:44 thunderstorm.reub.net dovecot[21149]: dsync-local(kaylene)<zx+WKTAU416UMwAAzkCIew>: Error: dsync(lightning.reub.net): I/O has stalled, no activity for 600 seconds (last sent=mailbox_delete, last recv=handshake)
Jun 12 15:45:44 thunderstorm.reub.net dovecot[21149]: dsync-local(kaylene)<zx+WKTAU416UMwAAzkCIew>: Error: Timeout during state=recv_mailbox_tree
doveadm: Error: Timeout during state=slave_recv_mailbox: 6 Time(s)
doveadm: Error: Timeout during state=sync_mails (send=mail_requests recv=attributes): 31 Time(s)
doveadm: Error: dsync(thunderstorm.reub.net): I/O has stalled, no activity for 600 seconds (last sent=mail_change (EOL), last recv=mailbox): 31 Time(s)
doveadm: Error: dsync(thunderstorm.reub.net): I/O has stalled, no activity for 600 seconds (last sent=mailbox_delete, last recv=mailbox_delete): 6 Time(s)
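
To see which dsync states the stalls cluster around, errors like the ones above can be tallied with a small script. This is only a rough sketch: the regular expressions are derived from the messages quoted in this post rather than from any documented Dovecot log format, and the function name tally_dsync_errors and the SAMPLE_LOG constant are made up for illustration.

```python
import re
from collections import Counter

# Two of the log lines quoted above, each joined onto a single line.
SAMPLE_LOG = """\
Jun 12 15:45:44 thunderstorm.reub.net dovecot[21149]: dsync-local(kaylene)<zx+WKTAU416UMwAAzkCIew>: Error: dsync(lightning.reub.net): I/O has stalled, no activity for 600 seconds (last sent=mailbox_delete, last recv=handshake)
Jun 12 15:45:44 thunderstorm.reub.net dovecot[21149]: dsync-local(kaylene)<zx+WKTAU416UMwAAzkCIew>: Error: Timeout during state=recv_mailbox_tree
"""

# Assumption: these patterns only mirror the wording seen in this post.
STALL_RE = re.compile(
    r"I/O has stalled, no activity for (?P<secs>\d+) seconds "
    r"\(last sent=(?P<sent>[^,]+), last recv=(?P<recv>[^)]+)\)"
)
TIMEOUT_RE = re.compile(r"Timeout during state=(?P<state>\w+)")


def tally_dsync_errors(text):
    """Count stall errors per (last sent, last recv) pair and timeouts per state."""
    stalls = Counter()
    timeouts = Counter()
    for line in text.splitlines():
        match = STALL_RE.search(line)
        if match:
            stalls[(match["sent"], match["recv"])] += 1
            continue
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts[match["state"]] += 1
    return stalls, timeouts


if __name__ == "__main__":
    stalls, timeouts = tally_dsync_errors(SAMPLE_LOG)
    for (sent, recv), count in stalls.most_common():
        print(f"stalled: last sent={sent}, last recv={recv}: {count}")
    for state, count in timeouts.most_common():
        print(f"timeout: state={state}: {count}")
```

Wrapped log lines need to be joined onto one line each before they are passed in, since the patterns do not match across line breaks.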

The errors are seen on both sides of the replication setup.  The
replica is offsite but only a few ms of latency away, and there is no
packet loss.  Replication runs over IPv6, and the local firewall logs
show that sessions are always permitted and only ever finish due to
tcp-fin or tcp-rst-from-client.

SSL appears to be correctly configured, and replication itself seems to
be working for the most part.  Clients are able to use imaps just fine,
so I don't think there's much wrong from an SSL perspective; otherwise
I'd expect to see complete replication failure and/or client devices
unable to connect.

Can anyone suggest how we can further debug this problem?
Thanks,
Reuben

Reuben Farrelly