Host ... is being updated before previous update had finished

Fri Apr 7 20:10:43 EEST 2017

On Mon, Apr 3, 2017 at 6:04 PM, Mark Moseley <moseleymark at gmail.com> wrote:

> We just had a bunch of backend boxes go down due to a DDoS in our director
> cluster. When the DDoS died down, our director ring was a mess.
>
> Each box had thousands (and hundreds per second, which is a bit much) of
> log lines like the following:
>
> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
> 10.1.17.15 is being updated before previous update had finished (up ->
> down) - setting to state=down vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
> 10.1.17.15 is being updated before previous update had finished (down ->
> up) - setting to state=up vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
> 10.1.17.15 is being updated before previous update had finished (up ->
> down) - setting to state=down vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
> 10.1.17.15 is being updated before previous update had finished (down ->
> up) - setting to state=up vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host
> 10.1.17.15 is being updated before previous update had finished (up ->
> down) - setting to state=down vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
> 10.1.17.15 is being updated before previous update had finished (down ->
> up) - setting to state=up vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
> 10.1.17.15 is being updated before previous update had finished (up ->
> down) - setting to state=down vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
> 10.1.17.15 is being updated before previous update had finished (down ->
> up) - setting to state=up vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
> 10.1.17.15 is being updated before previous update had finished (up ->
> down) - setting to state=down vhosts=100
> Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host
> 10.1.17.15 is being updated before previous update had finished (down ->
> up) - setting to state=up vhosts=100
>
> This was on every director box and the status of all of the directors in
> 'doveadm director ring status' was 'handshaking'.
>
> Here's a sample packet between directors:
>
> 19:51:23.552280 IP 10.1.20.10.56670 > 10.1.20.1.9090: Flags [P.], seq
> 4147:5128, ack 0, win 0, options [nop,nop,TS val 1373505883 ecr
> 1721203906], length 981
>
> Q.      [f.|.HOST       10.1.20.10      9090    1006732 10.1.17.15
>  100             D1491260800
> HOST    10.1.20.10      9090    1006733 10.1.17.15      100
> U1491260800
> HOST    10.1.20.10      9090    1006734 10.1.17.15      100
> D1491260800
> HOST    10.1.20.10      9090    1006735 10.1.17.15      100
> U1491260800
> HOST    10.1.20.10      9090    1006736 10.1.17.15      100
> D1491260800
> HOST    10.1.20.10      9090    1006737 10.1.17.15      100
> U1491260800
> HOST    10.1.20.10      9090    1006738 10.1.17.15      100
> D1491260800
> HOST    10.1.20.10      9090    1006739 10.1.17.15      100
> U1491260800
> HOST    10.1.20.10      9090    1006740 10.1.17.15      100
> D1491260800
> HOST    10.1.20.10      9090    1006741 10.1.17.15      100
> U1491260800
> HOST    10.1.20.10      9090    1006742 10.1.17.15      100
> D1491260800
> HOST    10.1.20.10      9090    1006743 10.1.17.15      100
> U1491260800
> HOST    10.1.20.10      9090    1006744 10.1.17.15      100
> D1491260800
> HOST    10.1.20.10      9090    1006745 10.1.17.15      100
> U1491260800
> HOST    10.1.20.10      9090    1006746 10.1.17.15      100
> D1491260800
> HOST    10.1.20.10      9090    1006747 10.1.17.15      100
> U1491260800
> SYNC    10.1.20.10      9090    1011840 7       1491263483      3377546382
>
> I'm guessing that D1491260800 is the user hash (with D for down), and the
> U version is for 'up'.
>
> I'm happy to provide the full tcpdump (and/or doveconf -a), though the
> tcpdump is basically all identical the one I pasted (same hash, same host).
>
> This seems pretty fragile. There should be some sort of tie break for
> that, instead of bringing the entire cluster to its knees. Or just drop the
> backend host completely. Or something, anything besides hosing things
> pretty badly.
>
> This is 2.2.27, on both the directors and backend. If the answer is
> upgrade to 2.2.28, then I'll upgrade immediately. I see commit
> a9ade104616bbb81c34cc6f8bfde5dab0571afac mentions the same error but the
> commit predates 2.2.27 by a month and a half.
>
> In the meantime, is there any doveadm command I could've done to fix this?
> I tried removing the host (doveadm director remove 10.1.17.15) but that
> didn't do anything. I didn't think to try to flush the mapping for that
> user till just now. I suspect that with the ring unsync'd, flushing the
> user wouldn't have helped.
>
> The only remedy was to kill dovecot on every box in the director cluster
> and then (with dovecot down on *all* of them) start dovecot back up.
> Restarting each director's dovecot (with other directors' dovecots still
> running) did nothing. Only by brining the entire cluster down did dovecot
> stop furiously logging "Host ... is being updated before previous update
> had finished" on every director host.
>
> Any advice is most welcome.
>

Dovecot guys,

Any thing I can do to fix this? Any thoughts on that error in general?