Director+NFS Experiences

Thu Feb 23 22:08:55 UTC 2017

I'm about to begin moving ~6.5m mailboxes from maildir to mdbox on NFS
(and am therefore just about to start the 'director-ization' of
everything), and I'm curious whether anyone can share their experiences
with it. The list is surprisingly quiet on this subject, and the articles
I find on Google are mainly about setting director up. I've yet to stumble
across one about someone's experience of actually running it.

* How big of a director cluster do you use? I'm going to have millions of
mailboxes behind 10 directors. I'm guessing that's plenty. It's actually
split over two datacenters. In the larger, we've got about 200k connections
currently, so in a perfectly-balanced world, each director would have 20k
connections on it. I'm guessing that's child's play. Any good rule of thumb
for ratio of 'backend servers::director servers'? In my larger DC, it's
about 5::1.
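
The back-of-envelope arithmetic above, as a small Python sketch. It assumes
the 200k connections are spread evenly over 10 directors (as the 20k figure
implies) and adds a hypothetical `down` parameter to show the per-director
load if some directors are out of the ring. The function name is made up
for illustration.

```python
def connections_per_director(connections, directors, down=0):
    """Even-split estimate of connections per surviving director."""
    alive = directors - down
    if alive <= 0:
        raise ValueError("no directors left to take connections")
    return connections / alive


# Perfectly balanced case from the post: 200k connections, 10 directors.
print(connections_per_director(200_000, 10))
# Same load with one director down (hypothetical failure scenario).
print(connections_per_director(200_000, 10, down=1))
```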

* Do you use the Perl poolmon script or something else? The Perl script was
acting weird for me, so I rewrote it in Python, but it does basically the
exact same things.

* Seen any issues with director? In testing, I managed to wedge things by
having my poolmon script running on all the cluster boxes (I think). I've
since rewritten it to run *only* on the lowest-numbered director. When it
wedged, I had piles (read: hundreds per second) of log entries that said:

Feb 12 06:25:03 director: Warning: director(10.1.20.5:9090/right): Host 10.1.17.3 is being updated before previous update had finished (down -> up) - setting to state=up vhosts=0
Feb 12 06:25:03 director: Warning: director(10.1.20.5:9090/right): Host 10.1.17.3 is being updated before previous update had finished (up -> down) - setting to state=down vhosts=0
Feb 12 06:25:03 director: Warning: director(10.1.20.3:9090/left): Host 10.1.17.3 is being updated before previous update had finished (down -> up) - setting to state=up vhosts=0
Feb 12 06:25:03 director: Warning: director(10.1.20.3:9090/left): Host 10.1.17.3 is being updated before previous update had finished (up -> down) - setting to state=down vhosts=0

Because it was in testing, I didn't notice it, and it stayed like this for
several days until dovecot was restarted on all the director nodes. I'm not
100% sure what happened, but my *guess* is that two boxes tried to update
the status of the same backend server in rapid succession.
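
To make the "run *only* on the lowest-numbered director" rule concrete,
here is a minimal Python sketch. It assumes "lowest-numbered" means the
numerically lowest IP address in the director ring. The function names are
made up for illustration, and 10.1.20.10 is a hypothetical extra director
(the other two addresses come from the log above). It is not taken from
the actual poolmon script.

```python
import ipaddress


def lowest_director(director_ips):
    """Return the numerically lowest IP among the given directors."""
    # Compare as IP addresses, not strings: as strings, "10.1.20.10"
    # would sort before "10.1.20.3".
    return min(director_ips, key=ipaddress.ip_address)


def should_run_poolmon(my_ip, director_ips):
    """True only on the director that holds the lowest IP in the ring."""
    lowest = ipaddress.ip_address(lowest_director(director_ips))
    return ipaddress.ip_address(my_ip) == lowest


directors = ["10.1.20.5", "10.1.20.3", "10.1.20.10"]
print(should_run_poolmon("10.1.20.3", directors))
print(should_run_poolmon("10.1.20.5", directors))
```

One obvious trade-off of a static rule like this: if the lowest-numbered
director itself is down, nothing is monitoring the backends until it comes
back.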

* Assuming you're using NFS, do you still see non-trivial amounts of
indexes getting corrupted?

* Again, assuming NFS and assuming at least some corrupted indexes, what's
your guess at the success rate (%) of dovecot recovering them
automatically? And what about the success rate for the ones dovecot
couldn't recover automatically, where you had to use doveadm to repair
them? Really, what I'm trying to figure out is 1) how often sysops will
need to manually recover indexes; and 2) how often admins *can't* manually
recover indexes.
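
For context, the kind of manual doveadm repair I have in mind is
`doveadm force-resync`. A minimal Python sketch of wrapping it is below.
The helper names and the example user are hypothetical, and the default
mailbox name of INBOX is just an assumption for illustration; check the
doveadm-force-resync man page for what your version expects.

```python
import subprocess


def force_resync_cmd(user, mailbox="INBOX"):
    """Build the argv for a doveadm force-resync of one user's mailbox."""
    return ["doveadm", "force-resync", "-u", user, mailbox]


def force_resync(user, mailbox="INBOX"):
    """Run the resync; return True if doveadm exited with status 0."""
    result = subprocess.run(
        force_resync_cmd(user, mailbox),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


# Only prints the command; call force_resync() on a box that has doveadm.
print(" ".join(force_resync_cmd("bob@example.com")))
```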

* If you have unrecoverable indexes (and assuming you have snapshots on
your NFS server), does grabbing the most recent indexes from the snapshots
always work for recovery (obviously, only up to the point that the snapshot
was taken)?

* Any gotchas you've seen anywhere in a director-fied stack? I realize
that's a broad question :)

* Does one of your director nodes going down cause any issues? E.g. issues
with the left and right nodes syncing with each other? Or when the director
node comes back up?

* Does a backend node going down cause a storm of reconnects? In the time
between deploying director and getting mailboxes converted to mdbox,
reconnects for us will mean cold local-disk dovecot caches. But hopefully
consistent hashing helps with that?
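
Why I'm hoping consistent hashing helps: in a generic hash ring, removing
one backend should only remap the users that were on that backend, so
everyone else keeps their warm caches. Below is a minimal Python sketch of
that property. It is a textbook ring with virtual nodes, not dovecot's
actual implementation. The `Ring` class, the user names, and all backend
IPs other than 10.1.17.3 are made up for illustration.

```python
import bisect
import hashlib


def _hash(key):
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, backends, vnodes=100):
        # Each backend gets `vnodes` points on the ring to even out load.
        self._points = sorted(
            (_hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._points]

    def lookup(self, user):
        """Return the backend owning the first point after the user's hash."""
        idx = bisect.bisect(self._keys, _hash(user)) % len(self._keys)
        return self._points[idx][1]


backends = ["10.1.17.1", "10.1.17.2", "10.1.17.3", "10.1.17.4", "10.1.17.5"]
users = [f"user{i}@example.com" for i in range(10000)]

before = Ring(backends)
after = Ring([b for b in backends if b != "10.1.17.3"])  # one backend down

moved = [u for u in users if before.lookup(u) != after.lookup(u)]
# Users that moved even though their original backend is still up.
collateral = [u for u in moved if before.lookup(u) != "10.1.17.3"]
print(len(moved), len(collateral))
```

With a plain `hash(user) % number_of_backends` scheme, by contrast, most
users would land on a different backend after the same failure.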

* Do you have consistent hashing turned on? I can't think of any reason not
to have it turned on, but who knows.

* Any other configuration knobs (including sysctl) that you needed to futz
with, vs the default?

I appreciate any feedback!