Quoting Stan Hoeppner <stan@hardwarefreak.com>:
Johan Hendriks put forth on 11/3/2010 3:32 AM:
Hello, I am working primarily with FreeBSD, and the latest release has a service called HAST. Think of it as a mirrored disk over the network.
This is similar to the DRBD solution.
With CARP in the mix, when the master machine fails, it starts dovecot on the slave. This way you have a failover without user intervention.
This is similar to heartbeat, or RHCS, etc.
- How do you automatically redirect clients to the IP address of the slave when the master goes down? Is this seamless? What is the duration of "server down" seen by clients? Seconds, minutes?
Usually there is a "floating IP" that the clients use. Whichever server is active has this IP assigned (usually in addition to another IP used for management and such).
The transition time depends on how the master goes down. If you do an administrative shutdown or transfer, it is usually just a fraction of a second for the change to take effect, and maybe a bit longer for the router/switch to learn the new MAC address for the IP and route things correctly.
If the primary crashes/dies, then it is usually several seconds before the secondary confirms the primary is in trouble, makes sure it is really down (STONITH), takes over the IP, mounts any needed filesystems, and starts any needed services... In this case, the ARP/MAC issue isn't really a problem because the transition itself takes longer.
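To make the crash case concrete, here is a rough sketch of what the standby node does. It is only an illustration: the class name, the three-missed-heartbeats threshold, and the step strings are all made up, and real cluster managers (heartbeat, RHCS, CARP scripts) each do this in their own way.

```python
from dataclasses import dataclass, field


@dataclass
class Standby:
    """Hypothetical standby node; names and thresholds are illustrative only."""
    missed_limit: int = 3            # assumed: heartbeats missed before acting
    missed: int = 0
    active: bool = False
    log: list = field(default_factory=list)

    def heartbeat(self, received: bool) -> None:
        """Record one heartbeat interval; take over after too many misses."""
        if self.active:
            return
        if received:
            self.missed = 0          # primary is alive, reset the counter
            return
        self.missed += 1
        if self.missed >= self.missed_limit:
            self.take_over()

    def take_over(self) -> None:
        # Order matters: fence first, so the old primary cannot keep
        # writing to the mirrored storage while we start using it.
        for step in ("fence old primary (STONITH)",
                     "assign floating IP",
                     "mount filesystems",
                     "start services"):
            self.log.append(step)
        self.active = True
```

The waiting for several missed heartbeats, plus the fencing step, is where the "several seconds" in a crash failover comes from.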
- When you bring the master back up after repairing the cause of the failure, does it automatically and correctly resume mirroring of the HAST device so it obtains the new emails that were saved to the slave while it was offline? How do you then put the master back into service and make the slave offline again?
DRBD does (or at least can, it is configurable). Sometimes you might just do role reversal (old primary becomes secondary, old secondary stays the primary). Other times you might have the original primary become primary again (say, if the original primary has "better" hardware, etc).
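The two choices (role reversal versus failing back to the original primary) boil down to a single policy decision once the repaired node has resynced. A minimal sketch, where the function and the `failback` flag are hypothetical and not a DRBD or RHCS setting:

```python
def roles_after_recovery(home: str, survivor: str, failback: bool) -> dict:
    """Decide roles once the repaired node ('home') has finished resyncing.

    failback=False: role reversal, the survivor stays primary.
    failback=True:  the original primary takes the primary role back.
    """
    if failback:
        return {"primary": home, "secondary": survivor}
    return {"primary": survivor, "secondary": home}
```

With identical hardware you would typically leave `failback` off and avoid a second service interruption; with a "better" original primary you might turn it on.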
So, these things really depend on the use case, and the failure case... And are usually configurable. :)
I can give two personal examples. First, I have a file server, which is an active-passive cluster. Since the hardware is identical, when the primary fails, the secondary is promoted to primary. When the dead one comes back, it stays as secondary. It is all automatic via RHCS and DRBD using ext3. It always feels like I'm wasting a machine, but it is rock solid...
Second I have a mail cluster which is active-active (still RHCS but with DRBD+GFS2). When both nodes are up, one does the pop/imap, mailing list web/cli/email interface, and slave LDAP services, while the other node does the mailing list processing, SMTP processing, anti-virus/spam processing, etc. When one machine goes down, the services on that machine migrate automatically to the other machine. When the machine comes back up, the services migrate back to their "home" machine.
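The "home machine" behaviour can be pictured as a small placement function. This is just a sketch of the idea with invented node and service names; in practice the cluster manager handles this through its own failover domain configuration.

```python
def place_services(home: dict, up: set) -> dict:
    """Map each service to its home node if that node is up,
    otherwise to a surviving node (illustrative logic only)."""
    if not up:
        raise RuntimeError("no nodes available")
    fallback = sorted(up)[0]         # with two nodes, this is the survivor
    return {svc: (node if node in up else fallback)
            for svc, node in home.items()}
```

Calling it again once both nodes are up returns every service to its home node, which is the migrate-back step.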
Time for failover is a second or two for an admin failover, and for a crash/etc maybe 15-30 seconds max for the fileserver, and 10-15 seconds for the mail server. During the failover, connections may hang or fail, but most clients just retry the connection and get the new machine without user intervention (or in the case of email clients, sometimes they annoyingly ask for the password again, but that is not too bad). I've never had anyone contact me during either type of failover, which makes me think they either don't notice, or they write it off as a "normal network hiccup" kind of thing (well, they did contact me once, when the failover failed and the service went completely down, but that was my fault).
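The client side of this is nothing more than a retry loop against the floating IP. A minimal sketch, assuming a caller-supplied `connect` function; the attempt count and delay are arbitrary values chosen to outlast a 10-30 second failover, not what any particular mail client uses:

```python
import time


def connect_with_retry(connect, attempts=5, delay=2.0, sleep=time.sleep):
    """Call connect() until it succeeds or attempts run out.

    connect: zero-argument callable that returns a connection or raises
             OSError (refused, timed out, unreachable) on failure.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except OSError as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay)         # give the standby time to take over
    raise last_error
```

For example, `connect_with_retry(lambda: socket.create_connection(("mail.example.org", 143), timeout=5))` (after `import socket`, with a placeholder hostname) would keep trying the same address and end up on whichever node holds the floating IP.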
So, again, the answer is, as always, "it depends..."
-- Stan
-- Eric Rostetter The Department of Physics The University of Texas at Austin
This message is provided "AS IS" without warranty of any kind, either expressed or implied. Use this message at your own risk.