I have the same problem here.
All systems are running Debian 9 amd64.

My Dovecot director servers are running 2.3.8, but the mailbox servers had sync/replication problems with 2.3.8. So I downgraded the mailbox servers to 2.3.7, and now everything works fine again.

On 18 October 2019 13:52:37 CEST, Carsten Rosenberg via dovecot <dovecot@dovecot.org> wrote:
Hi,

some of our customers have discovered a replication issue after
upgrading from 2.3.7.2 to 2.3.8.

With 2.3.8, several replication connections hang until the configured
timeout is reached. So after a few seconds there are
$replication_max_conns hanging connections.
Other replications run quickly and complete successfully.

Running doveadm sync tcp:... manually also works fine for all users.

I can't say for certain, but I haven't seen the same mailboxes timing
out again and again. So I would assume it's not related to a particular
mailbox.

From the logs:

server1:
Oct 16 08:29:25 server1 dovecot[5715]: dsync-local(username1@domain.com)<FXnVDW22pl0tGAAA1cwDxA>: Error: dsync(172.16.0.1): I/O has stalled, no activity for 600 seconds (version not received)
Oct 16 08:29:25 server1 dovecot[5715]: dsync-local(username1@domain.com)<FXnVDW22pl0tGAAA1cwDxA>: Error: Timeout during state=master_recv_handshake

server2:

Oct 16 08:29:25 server2 dovecot[8113]: doveadm: Error: read(server1) failed: EOF (last sent=handshake, last recv=handshake)

There aren't any additional logs regarding the replication.
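
One way to check whether the same users keep hitting the timeout is to
count the handshake timeout errors per user in the log. The following is
only a minimal sketch: it assumes each log entry sits on a single line,
as in the syslog itself, and that the entries look like the server1
lines above. The function name count_dsync_timeouts and the sample input
are illustrative and not part of Dovecot.

```python
import re
from collections import Counter

# Assumed log format, taken from the server1 lines in this report:
#   ... dsync-local(user@domain)<session>: Error: Timeout during state=<state>
TIMEOUT_RE = re.compile(
    r"dsync-local\((?P<user>[^)]+)\)<[^>]*>: Error: "
    r"Timeout during state=(?P<state>\S+)"
)


def count_dsync_timeouts(lines):
    """Return a Counter mapping each user to its number of dsync timeouts."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("user")] += 1
    return counts


if __name__ == "__main__":
    # Sample input built from the two server1 log entries quoted above.
    sample = [
        "Oct 16 08:29:25 server1 dovecot[5715]: "
        "dsync-local(username1@domain.com)<FXnVDW22pl0tGAAA1cwDxA>: Error: "
        "dsync(172.16.0.1): I/O has stalled, no activity for 600 seconds "
        "(version not received)",
        "Oct 16 08:29:25 server1 dovecot[5715]: "
        "dsync-local(username1@domain.com)<FXnVDW22pl0tGAAA1cwDxA>: Error: "
        "Timeout during state=master_recv_handshake",
    ]
    # List users with the most timeouts first.
    for user, n in count_dsync_timeouts(sample).most_common():
        print(user, n)
```

Any iterable of lines works as input, including an open log file. If a
handful of users account for most of the timeouts, that would point to
specific mailboxes; an even spread would support the assumption that the
mailbox is not the cause.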

I have tried increasing vsz_limit and reducing replication_max_conns,
but nothing changed.

--

Both customers have 10k+ users. So far I haven't been able to reproduce
this on smaller test systems.

Both installations were downgraded to 2.3.7.2 to work around the issue for now.

--

I've attached a tcpdump showing that the client stops sending any data
after the mailbox_guid table headers.



Any idea what could be wrong here, or how to debug this issue?

Thanks.

Carsten Rosenberg
