Dovecot Replication Errors (only) when using tcps: as the mail_replica Protocol
Hi,
I've been seeing errors logged for some time with replication processes, whereby replication sessions seem to be timing out periodically.
This is with dovecot version 2.3.10.1 (a3d0e1171) and both are Gentoo x86_64.
After some investigation I've determined that these timeouts are only ever occurring with tcps as the replication connection type. These errors never occur if non-encrypted tcp is configured. I've been able to validate this by changing only the replica_type on both ends of the replication configuration to tcp, and with no other changes and after a few days of operation there is not a single error logged.
mail_replica = tcps:lightning.reub.net:4813 <<< periodic timeouts mail_replica = tcp:lightning.reub.net:4814 <<< works perfectly
Example of the error:
Jun 12 15:45:44 thunderstorm.reub.net dovecot[21149]: dsync-local(kaylene)<zx+WKTAU416UMwAAzkCIew>: Error: dsync(lightning.reub.net): I/O has stalled, no activity for 600 seconds (last sent=mailbox_delete, last recv=handshake) Jun 12 15:45:44 thunderstorm.reub.net dovecot[21149]: dsync-local(kaylene)<zx+WKTAU416UMwAAzkCIew>: Error: Timeout during state=recv_mailbox_tree
doveadm: Error: Timeout during state=slave_recv_mailbox: 6 Time(s) doveadm: Error: Timeout during state=sync_mails (send=mail_requests recv=attributes): 31 Time(s) doveadm: Error: dsync(thunderstorm.reub.net): I/O has stalled, no activity for 600 seconds (last sent=mail_change (EOL), last recv=mailbox): 31 Time(s) doveadm: Error: dsync(thunderstorm.reub.net): I/O has stalled, no activity for 600 seconds (last sent=mailbox_delete, last recv=mailbox_delete): 6 Time(s)
It is seen on both sides of the replication setup. The replica is offsite but only a few ms of latency away and there is no packet loss. The replication is happening over IPv6, and the local firewall is logging that sessions are always permitted, and only ever finishing due to tcp-fin or tcp-rst-from-client .
SSL appears to be correctly configured, and it seems that the replication itself is for the most part working. Clients are able to use imaps just fine so I don't think there's anything much wrong from an SSL perspective else I'd be seeing complete replication failure and/or client devices unable to connect.
Can anyone suggest how we can further debug this problem?
Thanks, Reuben
Hello,
I have two mail servers and am also experiencing sporadic replication errors over tcps, similar to Reuben. Each server is running Dovecot 2.3.11.3 (502c39af9) on Debian 10.6.
*Log entries from MX1* Nov 18 00:39:26 mx1 dovecot: dsync-local(user@example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds (last sent=mailbox, last recv=mailbox_state) Nov 18 00:39:26 mx1 dovecot: dsync-local(user@example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: Timeout during state=sync_mails (send=mailbox recv=mailbox) Nov 18 06:39:32 mx1 dovecot: dsync-local(user@example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds (last sent=mailbox, last recv=mailbox_state) Nov 18 06:39:32 mx1 dovecot: dsync-local(user@example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: Timeout during state=sync_mails (send=mailbox recv=mailbox) *End*
*Log entries from MX2* Nov 18 00:29:55 mx2 dovecot: dsync-local(user@example.com)<fKK3JzWxtF9zAgAA5XpYKg>: Error: Couldn't lock /var/vmail/user@example.com/.dovecot-sync.lock: fcntl(/var/vmail/user@example.com/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by pid 628) Nov 18 00:34:56 mx2 dovecot: dsync-local(user@example.com)<9IKaB2KytF92AgAA5XpYKg>: Error: Couldn't lock /var/vmail/user@example.com/.dovecot-sync.lock: fcntl(/var/vmail/user@example.com/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by pid 628) Nov 18 00:39:26 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): I/O has stalled, no activity for 600 seconds (last sent=mail_change (EOL), last recv=mailbox) Nov 18 06:39:32 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): I/O has stalled, no activity for 600 seconds (last sent=mail_change (EOL), last recv=mailbox) *End*
I have configured "replication_full_sync_interval = 1 hours", which explains why some of the sync errors occur at the same increment on the hour (if the error does occur).
I've tested replication over tcps using either IPv6 or IPv4 -- this did not appear to make a difference.
Changing replication to occur over tcp solves the issue (with "ssl = yes" commented out, as well).
IMAP clients are primarily connecting to MX1 using SSL, which works well (SSL connections to MX2 also work). These are very low traffic machines at the moment (just 1 user as I continue testing).
I've attached the output of "dovecot -n" from each server.
Are there known bugs with replication using SSL? I'd appreciate any guidance.
Thank you, AP
On 18/11/2020 19:37, Aakash Patel wrote:
Hello,
I have two mail servers and am also experiencing sporadic replication errors over tcps, similar to Reuben. Each server is running Dovecot 2.3.11.3 (502c39af9) on Debian 10.6.
*Log entries from MX1* Nov 18 00:39:26 mx1 dovecot: dsync-local(user@example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds (last sent=mailbox, last recv=mailbox_state) Nov 18 00:39:26 mx1 dovecot: dsync-local(user@example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: Timeout during state=sync_mails (send=mailbox recv=mailbox) Nov 18 06:39:32 mx1 dovecot: dsync-local(user@example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds (last sent=mailbox, last recv=mailbox_state) Nov 18 06:39:32 mx1 dovecot: dsync-local(user@example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: Timeout during state=sync_mails (send=mailbox recv=mailbox) *End*
*Log entries from MX2* Nov 18 00:29:55 mx2 dovecot: dsync-local(user@example.com)<fKK3JzWxtF9zAgAA5XpYKg>: Error: Couldn't lock /var/vmail/user@example.com/.dovecot-sync.lock: fcntl(/var/vmail/user@example.com/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by pid 628) Nov 18 00:34:56 mx2 dovecot: dsync-local(user@example.com)<9IKaB2KytF92AgAA5XpYKg>: Error: Couldn't lock /var/vmail/user@example.com/.dovecot-sync.lock: fcntl(/var/vmail/user@example.com/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by pid 628) Nov 18 00:39:26 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): I/O has stalled, no activity for 600 seconds (last sent=mail_change (EOL), last recv=mailbox) Nov 18 06:39:32 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): I/O has stalled, no activity for 600 seconds (last sent=mail_change (EOL), last recv=mailbox) *End*
I have configured "replication_full_sync_interval = 1 hours", which explains why some of the sync errors occur at the same increment on the hour (if the error does occur).
I've tested replication over tcps using either IPv6 or IPv4 -- this did not appear to make a difference.
Changing replication to occur over tcp solves the issue (with "ssl = yes" commented out, as well).
IMAP clients are primarily connecting to MX1 using SSL, which works well (SSL connections to MX2 also work). These are very low traffic machines at the moment (just 1 user as I continue testing).
I've attached the output of "dovecot -n" from each server.
Are there known bugs with replication using SSL? I'd appreciate any guidance.
Thank you, AP
For what it's worth, I had the same issue when setting this up a few weeks ago. I switched to using SSH based transport and it's been great ever since. Is that an option for you?
dsync_remote_cmd = ssh -l%{login} %{host} doveadm dsync-server -u%u mail_replica = remote:root@xx.xx.xx.xx
Cheers James
Am 19.11.20 um 09:30 schrieb James Pattinson:
On 18/11/2020 19:37, Aakash Patel wrote:
Hello,
I have two mail servers and am also experiencing sporadic replication errors over tcps, similar to Reuben. Each server is running Dovecot 2.3.11.3 (502c39af9) on Debian 10.6.
*Log entries from MX1* Nov 18 00:39:26 mx1 dovecot: dsync-local(user@example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds (last sent=mailbox, last recv=mailbox_state) Nov 18 00:39:26 mx1 dovecot: dsync-local(user@example.com)<Ow3zAjWxtF+TDgAAPHKnuQ>: Error: Timeout during state=sync_mails (send=mailbox recv=mailbox) Nov 18 06:39:32 mx1 dovecot: dsync-local(user@example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: dsync(mx2.example.com): I/O has stalled, no activity for 600 seconds (last sent=mailbox, last recv=mailbox_state) Nov 18 06:39:32 mx1 dovecot: dsync-local(user@example.com)<6bScGpwFtV+vEQAAPHKnuQ>: Error: Timeout during state=sync_mails (send=mailbox recv=mailbox) *End*
*Log entries from MX2* Nov 18 00:29:55 mx2 dovecot: dsync-local(user@example.com)<fKK3JzWxtF9zAgAA5XpYKg>: Error: Couldn't lock /var/vmail/user@example.com/.dovecot-sync.lock: fcntl(/var/vmail/user@example.com/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by pid 628) Nov 18 00:34:56 mx2 dovecot: dsync-local(user@example.com)<9IKaB2KytF92AgAA5XpYKg>: Error: Couldn't lock /var/vmail/user@example.com/.dovecot-sync.lock: fcntl(/var/vmail/user@example.com/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by pid 628) Nov 18 00:39:26 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): I/O has stalled, no activity for 600 seconds (last sent=mail_change (EOL), last recv=mailbox) Nov 18 06:39:32 mx2 dovecot: doveadm: Error: dsync(mx1.example.com): I/O has stalled, no activity for 600 seconds (last sent=mail_change (EOL), last recv=mailbox) *End*
I have configured "replication_full_sync_interval = 1 hours", which explains why some of the sync errors occur at the same increment on the hour (if the error does occur).
I've tested replication over tcps using either IPv6 or IPv4 -- this did not appear to make a difference.
Changing replication to occur over tcp solves the issue (with "ssl = yes" commented out, as well).
IMAP clients are primarily connecting to MX1 using SSL, which works well (SSL connections to MX2 also work). These are very low traffic machines at the moment (just 1 user as I continue testing).
I've attached the output of "dovecot -n" from each server.
Are there known bugs with replication using SSL? I'd appreciate any guidance.
Thank you, AP
For what it's worth, I had the same issue when setting this up a few weeks ago. I switched to using SSH based transport and it's been great ever since. Is that an option for you?
dsync_remote_cmd = ssh -l%{login} %{host} doveadm dsync-server -u%u mail_replica = remote:root@xx.xx.xx.xx
Cheers James
I am seeing the same errors with tcps. For me with version 2.3.11.3 under CentOS. An attempt without SSL looks better, I will probably also try it via ssh.
Best regards Philipp
participants (4)
-
Aakash Patel
-
James Pattinson
-
Philipp Faeustlin
-
Reuben Farrelly