Host ... is being updated before previous update had finished

Tue Apr 4 04:04:20 EEST 2017

We just had a bunch of backend boxes go down due to a DDoS in our director
cluster. When the DDoS died down, our director ring was a mess.

Each box had thousands (and hundreds per second, which is a bit much) of
log lines like the following:

Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
10.1.17.15 is being updated before previous update had finished (up ->
down) - setting to state=down vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
10.1.17.15 is being updated before previous update had finished (down ->
up) - setting to state=up vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
10.1.17.15 is being updated before previous update had finished (up ->
down) - setting to state=down vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
10.1.17.15 is being updated before previous update had finished (down ->
up) - setting to state=up vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
10.1.17.15 is being updated before previous update had finished (up ->
down) - setting to state=down vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
10.1.17.15 is being updated before previous update had finished (down ->
up) - setting to state=up vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
10.1.17.15 is being updated before previous update had finished (up ->
down) - setting to state=down vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
10.1.17.15 is being updated before previous update had finished (down ->
up) - setting to state=up vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
10.1.17.15 is being updated before previous update had finished (up ->
down) - setting to state=down vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
10.1.17.15 is being updated before previous update had finished (down ->
up) - setting to state=up vhosts=100

This was on every director box and the status of all of the directors in
'doveadm director ring status' was 'handshaking'.

Here's a sample packet between directors:

19:51:23.552280 IP 10.1.20.10.56670 > 10.1.20.1.9090: Flags [P.], seq
4147:5128, ack 0, win 0, options [nop,nop,TS val 1373505883 ecr
1721203906], length 981

Q.      [f.|.HOST       10.1.20.10      9090    1006732 10.1.17.15      100
            D1491260800
HOST    10.1.20.10      9090    1006733 10.1.17.15      100
U1491260800
HOST    10.1.20.10      9090    1006734 10.1.17.15      100
D1491260800
HOST    10.1.20.10      9090    1006735 10.1.17.15      100
U1491260800
HOST    10.1.20.10      9090    1006736 10.1.17.15      100
D1491260800
HOST    10.1.20.10      9090    1006737 10.1.17.15      100
U1491260800
HOST    10.1.20.10      9090    1006738 10.1.17.15      100
D1491260800
HOST    10.1.20.10      9090    1006739 10.1.17.15      100
U1491260800
HOST    10.1.20.10      9090    1006740 10.1.17.15      100
D1491260800
HOST    10.1.20.10      9090    1006741 10.1.17.15      100
U1491260800
HOST    10.1.20.10      9090    1006742 10.1.17.15      100
D1491260800
HOST    10.1.20.10      9090    1006743 10.1.17.15      100
U1491260800
HOST    10.1.20.10      9090    1006744 10.1.17.15      100
D1491260800
HOST    10.1.20.10      9090    1006745 10.1.17.15      100
U1491260800
HOST    10.1.20.10      9090    1006746 10.1.17.15      100
D1491260800
HOST    10.1.20.10      9090    1006747 10.1.17.15      100
U1491260800
SYNC    10.1.20.10      9090    1011840 7       1491263483      3377546382

I'm guessing that D1491260800 is the user hash (with D for down), and the U
version is for 'up'.

I'm happy to provide the full tcpdump (and/or doveconf -a), though the
tcpdump is basically all identical the one I pasted (same hash, same host).

This seems pretty fragile. There should be some sort of tie break for that,
instead of bringing the entire cluster to its knees. Or just drop the
backend host completely. Or something, anything besides hosing things
pretty badly.

This is 2.2.27, on both the directors and backend. If the answer is upgrade
to 2.2.28, then I'll upgrade immediately. I see
commit a9ade104616bbb81c34cc6f8bfde5dab0571afac mentions the same error but
the commit predates 2.2.27 by a month and a half.

In the meantime, is there any doveadm command I could've done to fix this?
I tried removing the host (doveadm director remove 10.1.17.15) but that
didn't do anything. I didn't think to try to flush the mapping for that
user till just now. I suspect that with the ring unsync'd, flushing the
user wouldn't have helped.

The only remedy was to kill dovecot on every box in the director cluster
and then (with dovecot down on *all* of them) start dovecot back up.
Restarting each director's dovecot (with other directors' dovecots still
running) did nothing. Only by brining the entire cluster down did dovecot
stop furiously logging "Host ... is being updated before previous update
had finished" on every director host.

Any advice is most welcome.