Eric Jon Rostetter wrote:
Quoting Mario Antonio <support@webjogger.net>:
How does the system behave when you shut down one server and bring it back later? (Are you using an IP load balancer, heartbeat, etc.?)
I'm just using RHCS with GFS over DRBD. DRBD and LVM are started by the system (not managed by the cluster), and everything else (including GFS) is managed by RHCS. So there is no load balancer, and nothing external to RHCS like heartbeat et al. (There is a two-node active/passive firewall cluster in front of these that acts as a traffic director, but it isn't concerned with load balancing, and it's a separate stand-alone cluster from the one running DRBD and GFS.)
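In case it helps picture the split: on CentOS 5 it mostly comes down to which init scripts the OS runs at boot versus what rgmanager owns in cluster.conf. Roughly (assuming clustered LVM, and simplifying a bit):

  # started by the OS at boot, outside cluster control
  chkconfig drbd on
  chkconfig clvmd on       # the LVM layer on top of DRBD
  # cluster infrastructure and the resource manager
  chkconfig cman on
  chkconfig rgmanager on
  # GFS is mounted by rgmanager (a clusterfs resource), not by the gfs init script
  chkconfig gfs off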
The DRBD+GFS cluster is a simple 3 node RHCS cluster. Two nodes (mailer1 and mailer2) run DRBD+GFS (active/active), while the 3rd node (webmail1) does not (just local ext3 file systems). I may add more nodes in the future if needed, but so far this is sufficient for my needs. The third node is nice as it prevents cluster (not DRBD) split-brain situations, and allows me to maintain real quorum when I need to reboot a node, etc.
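If it helps to see the quorum side, a heavily trimmed cluster.conf looks something like this (the node names are the real ones, everything else is simplified from memory and the fencing stanzas are left out):

  <?xml version="1.0"?>
  <cluster name="mailcluster" config_version="1">
    <clusternodes>
      <clusternode name="mailer1"  nodeid="1" votes="1"/>
      <clusternode name="mailer2"  nodeid="2" votes="1"/>
      <clusternode name="webmail1" nodeid="3" votes="1"/>
    </clusternodes>
    <!-- fencedevices omitted -->
    <rm>
      <resources>
        <clusterfs name="mailgfs" fstype="gfs" force_unmount="0"
                   device="/dev/vg_mail/lv_mail" mountpoint="/var/spool/mail"/>
      </resources>
      <service autostart="1" name="mail">
        <clusterfs ref="mailgfs"/>
      </service>
    </rm>
  </cluster>

With three one-vote nodes, quorum is two, so any single node can be down (or rebooting) and the cluster stays quorate.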
BTW, they are all running CentOS 5.3 (started on RHEL, moved to CentOS which I actually find easier to use for DRBD/GFS/etc than RHEL).
If I do an orderly shutdown of a node, it all works fine. All services fail over at shutdown to the remaining node without a hitch.
At startup, they almost always migrate back automatically, and if not I can migrate them back later by hand. The reason they don't always migrate back at startup seems to be that if the node was down too long, drbd takes a while to resync, which can prevent lvm and gfs from starting at boot, which of course means the services can't migrate back. (I don't have drbd and lvm under cluster control, so if they don't start at boot, I need to fix them manually.)
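When that happens, the manual fix is pretty mechanical; roughly (r0 and the service name are placeholders, and I'm assuming clvmd for the LVM layer):

  cat /proc/drbd                   # wait for cs:Connected, ds:UpToDate/UpToDate
  drbdadm primary r0               # promote this node again once the resync is done
  service clvmd start              # bring the LVM layer back so the LVs appear
  clustat                          # see what rgmanager thinks is down
  clusvcadm -e mail -m mailer2     # re-enable the service here (or -r to relocate it back)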
If I 'crash' a node (kill the power, reboot it via a hardware stonith card, etc.), sometimes it doesn't go so smoothly and I need to intervene manually. Often it will all come up fine, but sometimes the drbd won't come up as primary/primary, and I'll need to fix it by hand. Or sometimes the drbd will come up, but the lvm or gfs won't (like above). So every so often I have to fix things manually.
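The ugliest of those is when DRBD decides it has a split-brain after the crash and refuses to reconnect. The usual DRBD 8.x manual recovery is to pick a victim whose changes get thrown away; roughly (r0 is a placeholder resource name, and be sure you pick the right victim):

  # on the node whose changes you are willing to discard:
  drbdadm secondary r0
  drbdadm -- --discard-my-data connect r0

  # on the surviving node (if it isn't already waiting for a connection):
  drbdadm connect r0

  # once it reports cs:Connected and the resync finishes,
  # promote both nodes again for primary/primary:
  drbdadm primary r0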
But the good news is that in any case (shutdown, crash, etc.) the cluster is always up and running, since only one node is down... So my services are always available, though maybe slower when a node isn't participating properly. Not the best situation, but certainly one I'm able to live with.
My main goal was to be able to do orderly shutdowns, and that works great. That way I can update kernels, tweak hardware (e.g., add RAM or upgrade disks), etc. with no real service interruption. So I'm not as worried about the "crash" situation, since it happens so much less often than the orderly shutdown, which was my main concern.
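For what it's worth, the orderly path is just the normal stop order that the init scripts and rgmanager already run at shutdown; spelled out by hand it's roughly (service name is a placeholder):

  clusvcadm -r mail -m mailer2     # push the service to the other node first
  service rgmanager stop           # stops local services, unmounts the GFS it manages
  service clvmd stop
  service drbd stop                # demotes and brings the DRBD resources down cleanly
  service cman stop                # leave the cluster last
  shutdown -r now                  # or do the kernel/hardware work and power it back up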
In any case, after many shutdowns and crashes and bad software upgrades and such, I've not lost any data or anything like that. Overall I'm very happy. Sure, I could be a bit happier with the recovery after a crash, but I'm tickled with the way it works the rest of the time, and it is a large improvement over my old setup.
Regards,
Mario Antonio
Great! Is there any good documentation on building RHCS with GFS over DRBD? (Or is it just the Red Hat web site?) Just curious: which Dovecot version are you using? Which webmail system? Postfix or Exim? And is the user database in MySQL or LDAP?
M.A.