[2.3.8] possible replication issue

Piper Andreas piper at hrz.uni-marburg.de
Fri Dec 6 08:09:34 EET 2019


Hello Timo,

upgrading both replicators did the job! Both replicators now run v2.3.9 
and replication works fine. All sync jobs that queued up during the 
upgrade have been processed successfully.

Thanks for the reassurance and all your great work on Dovecot,

Andreas


On 05.12.19 at 13:15, Timo Sirainen via dovecot wrote:
> I think there's a good chance that upgrading both will fix it. The bug 
> already existed in old versions, it just wasn't normally triggered. 
> Since v2.3.8 this situation is triggered on one dsync side, so the 
> v2.3.9 fix needs to be on the other side.
> 
>> On 5. Dec 2019, at 8.34, Piper Andreas via dovecot 
>> <dovecot at dovecot.org> wrote:
>>
>> Hello,
>>
>> upgrading to 2.3.9 unfortunately does *not* solve this issue:
>>
>> I upgraded one of my replicators from 2.3.7.2 to 2.3.9 and after a few 
>> seconds replication stopped. The other replicator remained on 
>> 2.3.7.2. After downgrading to 2.3.7.2, replication is working fine again.
>>
>> I have not yet tried upgrading both replicators, as this is a live 
>> production system. Is there a chance that upgrading both replicators 
>> will solve the problem?
>>
>> The machines are running Ubuntu 18.04.
>>
>> Any help is appreciated.
>>
>> Thanks,
>> Andreas
>>
>> On 18.10.19 at 13:52, Carsten Rosenberg via dovecot wrote:
>>> Hi,
>>> some of our customers have discovered a replication issue after
>>> upgrading from 2.3.7.2 to 2.3.8.
>>> On 2.3.8, several replication connections hang until the configured
>>> timeout. So after a few seconds there are $replication_max_conns hanging
>>> connections.
>>> Other replications run quickly and successfully.
>>> Running a doveadm sync tcp:... also works fine for all users.
>>> I can't tell for certain, but I haven't seen the same mailboxes timing
>>> out again and again, so I assume it's not related to the mailbox.
>>> From the logs:
>>> server1:
>>> Oct 16 08:29:25 server1 dovecot[5715]:
>>> dsync-local(username1 at domain.com)<FXnVDW22pl0tGAAA1cwDxA>: Error:
>>> dsync(172.16.0.1): I/O has stalled, no activity for 600 seconds (version
>>> not received)
>>> Oct 16 08:29:25 server1 dovecot[5715]:
>>> dsync-local(username1 at domain.com)<FXnVDW22pl0tGAAA1cwDxA>: Error:
>>> Timeout during state=master_recv_handshake
>>> server2:
>>> Oct 16 08:29:25 server2 dovecot[8113]: doveadm: Error: read(server1)
>>> failed: EOF (last sent=handshake, last recv=handshake)
>>> There aren't any additional logs regarding the replication.
>>> I have tried increasing vsz_limit or reducing replication_max_conns.
>>> Nothing changed.
>>> --
>>> Both customers have 10k+ users. So far I have not been able to
>>> reproduce this on smaller test systems.
>>> Both installations were downgraded to 2.3.7.2 to fix the issue for now.
>>> --
>>> I've attached a tcpdump showing that the client stops
>>> sending any data after the mailbox_guid table headers.
>>> Any idea what could be wrong here, or how to debug this issue?
>>> Thanks.
>>> Carsten Rosenberg
>>
>>
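
For anyone who wants to check how widespread the stalls are on their own systems, here is a minimal sketch that counts stalled dsync sessions per user. It assumes each log entry sits on a single line and follows the layout of the excerpt quoted above; the name count_stalls and the regular expression are illustrative only and are not part of Dovecot. The archive renders "@" as " at ", so the sample uses the address form a real log would presumably show.

```python
import re
from collections import Counter

# Pattern modelled on the "I/O has stalled" message quoted in this thread.
# Assumption: each log entry is a single line; other setups may log differently.
STALL_RE = re.compile(
    r"dsync-local\((?P<user>[^)]+)\)<(?P<session>[^>]+)>: Error: "
    r"dsync\((?P<peer>[^)]+)\): I/O has stalled, "
    r"no activity for (?P<secs>\d+) seconds"
)


def count_stalls(lines):
    """Return a Counter mapping user -> number of stalled dsync sessions."""
    stalls = Counter()
    for line in lines:
        match = STALL_RE.search(line)
        if match:
            stalls[match.group("user")] += 1
    return stalls


# Sample built from the two server1 entries quoted above, joined onto one line each.
sample = [
    "Oct 16 08:29:25 server1 dovecot[5715]: "
    "dsync-local(username1@domain.com)<FXnVDW22pl0tGAAA1cwDxA>: Error: "
    "dsync(172.16.0.1): I/O has stalled, no activity for 600 seconds "
    "(version not received)",
    "Oct 16 08:29:25 server1 dovecot[5715]: "
    "dsync-local(username1@domain.com)<FXnVDW22pl0tGAAA1cwDxA>: Error: "
    "Timeout during state=master_recv_handshake",
]

if __name__ == "__main__":
    for user, count in count_stalls(sample).most_common():
        print(user, count)
```

If the same few users dominate the counts, that would point at specific mailboxes; an even spread across many users would fit the observation above that the problem is not tied to a particular mailbox.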


