Host ... is being updated before previous update had finished
Mark Moseley
moseleymark at gmail.com
Tue Apr 4 04:04:20 EEST 2017
We just had a bunch of backend boxes go down due to a DDoS in our director
cluster. When the DDoS died down, our director ring was a mess.
Each box had thousands (and hundreds per second, which is a bit much) of
log lines like the following:
Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
10.1.17.15 is being updated before previous update had finished (up ->
down) - setting to state=down vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
10.1.17.15 is being updated before previous update had finished (down ->
up) - setting to state=up vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
10.1.17.15 is being updated before previous update had finished (up ->
down) - setting to state=down vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
10.1.17.15 is being updated before previous update had finished (down ->
up) - setting to state=up vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
10.1.17.15 is being updated before previous update had finished (up ->
down) - setting to state=down vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
10.1.17.15 is being updated before previous update had finished (down ->
up) - setting to state=up vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
10.1.17.15 is being updated before previous update had finished (up ->
down) - setting to state=down vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
10.1.17.15 is being updated before previous update had finished (down ->
up) - setting to state=up vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
10.1.17.15 is being updated before previous update had finished (up ->
down) - setting to state=down vhosts=100
Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
10.1.17.15 is being updated before previous update had finished (down ->
up) - setting to state=up vhosts=100
This was on every director box and the status of all of the directors in
'doveadm director ring status' was 'handshaking'.
Here's a sample packet between directors:
19:51:23.552280 IP 10.1.20.10.56670 > 10.1.20.1.9090: Flags [P.], seq
4147:5128, ack 0, win 0, options [nop,nop,TS val 1373505883 ecr
1721203906], length 981
Q. [f.|.HOST 10.1.20.10 9090 1006732 10.1.17.15 100
D1491260800
HOST 10.1.20.10 9090 1006733 10.1.17.15 100
U1491260800
HOST 10.1.20.10 9090 1006734 10.1.17.15 100
D1491260800
HOST 10.1.20.10 9090 1006735 10.1.17.15 100
U1491260800
HOST 10.1.20.10 9090 1006736 10.1.17.15 100
D1491260800
HOST 10.1.20.10 9090 1006737 10.1.17.15 100
U1491260800
HOST 10.1.20.10 9090 1006738 10.1.17.15 100
D1491260800
HOST 10.1.20.10 9090 1006739 10.1.17.15 100
U1491260800
HOST 10.1.20.10 9090 1006740 10.1.17.15 100
D1491260800
HOST 10.1.20.10 9090 1006741 10.1.17.15 100
U1491260800
HOST 10.1.20.10 9090 1006742 10.1.17.15 100
D1491260800
HOST 10.1.20.10 9090 1006743 10.1.17.15 100
U1491260800
HOST 10.1.20.10 9090 1006744 10.1.17.15 100
D1491260800
HOST 10.1.20.10 9090 1006745 10.1.17.15 100
U1491260800
HOST 10.1.20.10 9090 1006746 10.1.17.15 100
D1491260800
HOST 10.1.20.10 9090 1006747 10.1.17.15 100
U1491260800
SYNC 10.1.20.10 9090 1011840 7 1491263483 3377546382
I'm guessing that D1491260800 is the user hash (with D for down), and the U
version is for 'up'.
I'm happy to provide the full tcpdump (and/or doveconf -a), though the
tcpdump is basically all identical the one I pasted (same hash, same host).
This seems pretty fragile. There should be some sort of tie break for that,
instead of bringing the entire cluster to its knees. Or just drop the
backend host completely. Or something, anything besides hosing things
pretty badly.
This is 2.2.27, on both the directors and backend. If the answer is upgrade
to 2.2.28, then I'll upgrade immediately. I see
commit a9ade104616bbb81c34cc6f8bfde5dab0571afac mentions the same error but
the commit predates 2.2.27 by a month and a half.
In the meantime, is there any doveadm command I could've done to fix this?
I tried removing the host (doveadm director remove 10.1.17.15) but that
didn't do anything. I didn't think to try to flush the mapping for that
user till just now. I suspect that with the ring unsync'd, flushing the
user wouldn't have helped.
The only remedy was to kill dovecot on every box in the director cluster
and then (with dovecot down on *all* of them) start dovecot back up.
Restarting each director's dovecot (with other directors' dovecots still
running) did nothing. Only by brining the entire cluster down did dovecot
stop furiously logging "Host ... is being updated before previous update
had finished" on every director host.
Any advice is most welcome.
More information about the dovecot
mailing list