Hi,
Some of our customers have discovered a replication issue after upgrading from 2.3.7.2 to 2.3.8.
On 2.3.8, several replication connections hang until the configured timeout is reached, so after a few seconds there are $replication_max_conns hanging connections. Other replications run quickly and complete successfully.
Running doveadm sync tcp:... manually also works fine for all users.
I can't say for certain, but I haven't seen the same mailboxes timing out again and again, so I assume it's not related to specific mailboxes.
From the logs:
server1:
Oct 16 08:29:25 server1 dovecot[5715]: dsync-local(username1@domain.com)<FXnVDW22pl0tGAAA1cwDxA>: Error: dsync(172.16.0.1): I/O has stalled, no activity for 600 seconds (version not received)
Oct 16 08:29:25 server1 dovecot[5715]: dsync-local(username1@domain.com)<FXnVDW22pl0tGAAA1cwDxA>: Error: Timeout during state=master_recv_handshake
server2:
Oct 16 08:29:25 server2 dovecot[8113]: doveadm: Error: read(server1) failed: EOF (last sent=handshake, last recv=handshake)
There aren't any additional logs regarding the replication.
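To check whether the same mailboxes time out repeatedly, the timeout lines can be counted per user. The following is only a minimal sketch, assuming the syslog format of the lines quoted above; the names TIMEOUT_RE, count_timeouts and SAMPLE are illustrative and not part of Dovecot.

```python
import re
from collections import Counter

# Assumed format: the "Timeout during state=..." line logged by dsync-local,
# as quoted from server1 above.
TIMEOUT_RE = re.compile(
    r"dsync-local\((?P<user>[^)]+)\)<[^>]*>: "
    r"Error: Timeout during state=(?P<state>\S+)"
)


def count_timeouts(lines):
    """Count dsync timeout lines per (user, state) pair."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group("user"), match.group("state"))] += 1
    return counts


# The two server1 log lines from this report, used as sample input.
SAMPLE = [
    "Oct 16 08:29:25 server1 dovecot[5715]: "
    "dsync-local(username1@domain.com)<FXnVDW22pl0tGAAA1cwDxA>: "
    "Error: dsync(172.16.0.1): I/O has stalled, no activity for 600 seconds "
    "(version not received)",
    "Oct 16 08:29:25 server1 dovecot[5715]: "
    "dsync-local(username1@domain.com)<FXnVDW22pl0tGAAA1cwDxA>: "
    "Error: Timeout during state=master_recv_handshake",
]

# Users with the most timeouts are listed first.
for (user, state), n in count_timeouts(SAMPLE).most_common():
    print(n, user, state)
```

Fed the full mail log instead of SAMPLE, a user appearing with a high count would suggest a mailbox-specific problem, while evenly spread counts would point away from the mailboxes.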
I have tried increasing vsz_limit and reducing replication_max_conns; neither changed anything.
--
Both customers have 10k+ users. So far I haven't been able to reproduce this on smaller test systems.
Both installations were downgraded to 2.3.7.2 to work around the issue for now.
--
I've attached a tcpdump showing that the client stops sending any data after the mailbox_guid table headers.
Any idea what could be wrong here, or how to debug this issue?
Thanks.
Carsten Rosenberg