As someone who is about to begin moving from Maildir to mdbox on NFS (and therefore just about to start the 'director-ization' of everything) for ~6.5m mailboxes, I'm curious whether anyone can share their experiences with it. The list is surprisingly quiet on this subject, and the articles I find on Google are mostly just about setting director up; I've yet to stumble across anything describing someone's actual experience running it.
How big of a director cluster do you use? I'm going to have millions of mailboxes behind 10 directors. I'm guessing that's plenty. It's actually split over two datacenters. In the larger, we've got about 200k connections currently, so in a perfectly-balanced world, each director would have 20k connections on it. I'm guessing that's child's play. Any good rule of thumb for ratio of 'backend servers::director servers'? In my larger DC, it's about 5::1.
Do you use the Perl poolmon script or something else? The Perl script was behaving oddly for me, so I rewrote it in Python, but it does basically the same things.
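For reference, this is roughly the shape of what my Python version boils down to (a simplified sketch, not the actual script; the backend IPs, port, vhost count and interval are made up for illustration):

#!/usr/bin/env python
# Simplified sketch of a poolmon-style health check (not my actual script).
# Backend IPs, port, vhost count and interval are made up for illustration.
import socket
import subprocess
import time

BACKENDS = ["10.1.17.1", "10.1.17.2", "10.1.17.3"]   # hypothetical backend IPs
IMAP_PORT = 143
VHOST_COUNT = 100        # vhosts to assign when a backend looks healthy
CHECK_INTERVAL = 10      # seconds between sweeps


def backend_alive(host, port=IMAP_PORT, timeout=5):
    """Treat a backend as up if its IMAP port accepts a TCP connection."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return True
    except (socket.error, socket.timeout):
        return False


def set_vhosts(host, count):
    """Tell the local director to (de)weight a backend."""
    subprocess.check_call(["doveadm", "director", "update", host, str(count)])


def main():
    last_state = {}
    while True:
        for host in BACKENDS:
            up = backend_alive(host)
            if last_state.get(host) != up:    # only push changes on transitions
                set_vhosts(host, VHOST_COUNT if up else 0)
                last_state[host] = up
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    main()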
Seen any issues with director? In testing, I managed to wedge things by having my poolmon script running on all the cluster boxes (I think). I've since rewritten it to run *only* on the lowest-numbered director (there's a sketch of that guard a bit further down). When it wedged, I had piles (read: hundreds per second) of log entries like:
Feb 12 06:25:03 director: Warning: director(10.1.20.5:9090/right): Host 10.1.17.3 is being updated before previous update had finished (down -> up) - setting to state=up vhosts=0
Feb 12 06:25:03 director: Warning: director(10.1.20.5:9090/right): Host 10.1.17.3 is being updated before previous update had finished (up -> down) - setting to state=down vhosts=0
Feb 12 06:25:03 director: Warning: director(10.1.20.3:9090/left): Host 10.1.17.3 is being updated before previous update had finished (down -> up) - setting to state=up vhosts=0
Feb 12 06:25:03 director: Warning: director(10.1.20.3:9090/left): Host 10.1.17.3 is being updated before previous update had finished (up -> down) - setting to state=down vhosts=0
Because it was in testing, I didn't notice, and it stayed like that for several days until Dovecot was restarted on all the director nodes. I'm not 100% sure what happened, but my *guess* is that two boxes tried to update the status of the same backend server in rapid succession.
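The "only run on the lowest-numbered director" guard I mentioned is nothing fancy, roughly this (again a sketch; the director list is hypothetical here, the real script reads it from config):

# Sketch of the "only run on the lowest-numbered director" guard.
# DIRECTORS is hypothetical; the real script reads the list from config.
import socket
import sys

DIRECTORS = ["10.1.20.1", "10.1.20.2", "10.1.20.3", "10.1.20.4", "10.1.20.5"]


def local_addresses():
    """Best-effort list of this host's IP addresses (simplified)."""
    return set(socket.gethostbyname_ex(socket.gethostname())[2])


# Sort numerically by octet so e.g. "10.1.20.10" doesn't sort before "10.1.20.5".
lowest = min(DIRECTORS, key=lambda ip: tuple(int(o) for o in ip.split(".")))
if lowest not in local_addresses():
    sys.exit(0)   # another director is responsible for running the monitor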
Assuming you're using NFS, do you still see a non-trivial number of indexes getting corrupted?
Again, assuming NFS and at least some corrupted indexes, what's your guess at the success rate for Dovecot recovering them automatically? And what about the success rate for the ones Dovecot couldn't recover automatically, where you had to repair them with doveadm? Really what I'm trying to figure out is 1) how often sysops will need to manually recover indexes, and 2) how often admins *can't* manually recover indexes?
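(For the doveadm case I'm assuming force-resync is the right hammer; what I'd end up wrapping around it is basically just this, with the username and mailbox mask as placeholders:)

# Rough sketch of the manual repair step; assumes doveadm force-resync is
# the right tool. The user and mailbox mask are placeholders.
import subprocess


def force_resync(user, mailbox_mask="*"):
    """Ask Dovecot to rebuild the indexes for one user's mailboxes."""
    subprocess.check_call(["doveadm", "force-resync", "-u", user, mailbox_mask])


force_resync("someuser@example.com")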
If you have unrecoverable indexes (and assuming you have snapshots on your NFS server), does grabbing the most recent indexes from the snapshots always work for recovery (obviously, only up to the point the snapshot was taken)?
Any gotchas you've seen anywhere in a director-fied stack? I realize that's a broad question :)
Does one of your director nodes going down cause any issues? E.g. issues with the left and right nodes syncing with each other? Or when the director node comes back up?
Does a backend node going down cause a storm of reconnects? In the time between deploying director and getting mailboxes converted to mdbox, reconnects for us will mean cold local-disk Dovecot caches. But hopefully consistent hashing helps with that?
Do you have consistent hashing turned on? I can't think of any reason not to have it on, but who knows.
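For context, the director bits of my config are basically stock, something like the below (trimmed down, and the addresses/ranges are placeholders rather than my real ones):

# conf.d/10-director.conf (trimmed; addresses are placeholders)
director_servers = 10.1.20.1 10.1.20.2 10.1.20.3 10.1.20.4 10.1.20.5
director_mail_servers = 10.1.17.1-10.1.17.25
director_user_expire = 15 min
director_consistent_hashing = yes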
Any other configuration knobs (including sysctl) that you needed to futz with, vs the default?
I appreciate any feedback!