Timo/Aki/Docecot guys, any hints here? Is this a bug? Design issue?
On Fri, Apr 7, 2017 at 10:10 AM Mark Moseley moseleymark@gmail.com wrote:
On Mon, Apr 3, 2017 at 6:04 PM, Mark Moseley moseleymark@gmail.com wrote:
We just had a bunch of backend boxes go down due to a DDoS in our director cluster. When the DDoS died down, our director ring was a mess.
Each box had thousands (and hundreds per second, which is a bit much) of log lines like the following:
Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host 10.1.17.15 is being updated before previous update had finished (up -> down) - setting to state=down vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host 10.1.17.15 is being updated before previous update had finished (down -> up) - setting to state=up vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host 10.1.17.15 is being updated before previous update had finished (up -> down) - setting to state=down vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host 10.1.17.15 is being updated before previous update had finished (down -> up) - setting to state=up vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.10:9090/left): Host 10.1.17.15 is being updated before previous update had finished (up -> down) - setting to state=down vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host 10.1.17.15 is being updated before previous update had finished (down -> up) - setting to state=up vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host 10.1.17.15 is being updated before previous update had finished (up -> down) - setting to state=down vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host 10.1.17.15 is being updated before previous update had finished (down -> up) - setting to state=up vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host 10.1.17.15 is being updated before previous update had finished (up -> down) - setting to state=down vhosts=100 Apr 03 19:59:29 director: Warning: director(10.1.20.2:9090/right): Host 10.1.17.15 is being updated before previous update had finished (down -> up) - setting to state=up vhosts=100
This was on every director box and the status of all of the directors in 'doveadm director ring status' was 'handshaking'.
Here's a sample packet between directors:
19:51:23.552280 IP 10.1.20.10.56670 > 10.1.20.1.9090: Flags [P.], seq 4147:5128, ack 0, win 0, options [nop,nop,TS val 1373505883 ecr 1721203906], length 981
Q. [f.|.HOST 10.1.20.10 9090 1006732 10.1.17.15 100 D1491260800 HOST 10.1.20.10 9090 1006733 10.1.17.15 100 U1491260800 HOST 10.1.20.10 9090 1006734 10.1.17.15 100 D1491260800 HOST 10.1.20.10 9090 1006735 10.1.17.15 100 U1491260800 HOST 10.1.20.10 9090 1006736 10.1.17.15 100 D1491260800 HOST 10.1.20.10 9090 1006737 10.1.17.15 100 U1491260800 HOST 10.1.20.10 9090 1006738 10.1.17.15 100 D1491260800 HOST 10.1.20.10 9090 1006739 10.1.17.15 100 U1491260800 HOST 10.1.20.10 9090 1006740 10.1.17.15 100 D1491260800 HOST 10.1.20.10 9090 1006741 10.1.17.15 100 U1491260800 HOST 10.1.20.10 9090 1006742 10.1.17.15 100 D1491260800 HOST 10.1.20.10 9090 1006743 10.1.17.15 100 U1491260800 HOST 10.1.20.10 9090 1006744 10.1.17.15 100 D1491260800 HOST 10.1.20.10 9090 1006745 10.1.17.15 100 U1491260800 HOST 10.1.20.10 9090 1006746 10.1.17.15 100 D1491260800 HOST 10.1.20.10 9090 1006747 10.1.17.15 100 U1491260800 SYNC 10.1.20.10 9090 1011840 7 1491263483 3377546382
I'm guessing that D1491260800 is the user hash (with D for down), and the U version is for 'up'.
I'm happy to provide the full tcpdump (and/or doveconf -a), though the tcpdump is basically all identical the one I pasted (same hash, same host).
This seems pretty fragile. There should be some sort of tie break for that, instead of bringing the entire cluster to its knees. Or just drop the backend host completely. Or something, anything besides hosing things pretty badly.
This is 2.2.27, on both the directors and backend. If the answer is upgrade to 2.2.28, then I'll upgrade immediately. I see commit a9ade104616bbb81c34cc6f8bfde5dab0571afac mentions the same error but the commit predates 2.2.27 by a month and a half.
In the meantime, is there any doveadm command I could've done to fix this? I tried removing the host (doveadm director remove 10.1.17.15) but that didn't do anything. I didn't think to try to flush the mapping for that user till just now. I suspect that with the ring unsync'd, flushing the user wouldn't have helped.
The only remedy was to kill dovecot on every box in the director cluster and then (with dovecot down on *all* of them) start dovecot back up. Restarting each director's dovecot (with other directors' dovecots still running) did nothing. Only by brining the entire cluster down did dovecot stop furiously logging "Host ... is being updated before previous update had finished" on every director host.
Any advice is most welcome.
Dovecot guys,
Any thing I can do to fix this? Any thoughts on that error in general?