[Dovecot] Dovecot + DRBD/GFS mailstore

Eric Jon Rostetter eric.rostetter at physics.utexas.edu
Sat Sep 26 00:51:22 EEST 2009


Quoting Mario Antonio <support at webjogger.net>:

> How does the system behave when you shutdown one server, and bring  
> it back later ?  (are you using an IP load balancer/heart beat etc ?)

I'm just using RHCS with GFS over DRBD.  DRBD and LVM are started by
the system (not managed by the cluster), and everything else (including
GFS) is managed by RHCS.  So there is no load balancer, and nothing
external to RHCS like heartbeat et al.  (There is a two-node active/passive
firewall cluster in front of these that acts as a traffic director, but it
isn't concerned with load balancing, and is a separate stand-alone cluster
from the one running DRBD and GFS.)
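
For reference, the active/active side boils down to a dual-primary DRBD
resource definition, roughly like the sketch below (this is DRBD 8.x
syntax; the resource name, backing devices, and addresses are placeholders,
not my actual config):

  resource r0 {
    protocol C;                   # synchronous replication; needed for GFS
    startup {
      become-primary-on both;     # promote both nodes at boot (dual-primary)
    }
    net {
      allow-two-primaries;        # required to run GFS active/active
    }
    on mailer1 {
      device    /dev/drbd0;
      disk      /dev/sda3;        # backing device (placeholder)
      address   192.168.1.1:7788;
      meta-disk internal;
    }
    on mailer2 {
      device    /dev/drbd0;
      disk      /dev/sda3;
      address   192.168.1.2:7788;
      meta-disk internal;
    }
  }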

The DRBD+GFS cluster is a simple 3 node RHCS cluster.  Two nodes (mailer1
and mailer2) run DRBD+GFS (active/active), while the 3rd node (webmail1)
does not (just local ext3 file systems).  I may add more nodes in the
future if needed, but so far this is sufficient for my needs.  The third
node is nice as it prevents cluster (not DRBD) split-brain situations, and
allows me to maintain real quorum when I need to reboot a node, etc.
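
The cluster.conf side is just a stock three-node definition, something
along these lines (the node names are the real ones above; the cluster
name is made up, and the fencing and service definitions are omitted):

  <?xml version="1.0"?>
  <cluster name="mailcluster" config_version="1">
    <clusternodes>
      <clusternode name="mailer1" nodeid="1"/>
      <clusternode name="mailer2" nodeid="2"/>
      <clusternode name="webmail1" nodeid="3"/>
    </clusternodes>
    <!-- fencedevices and rm (services) sections omitted -->
  </cluster>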

BTW, they are all running CentOS 5.3 (started on RHEL, moved to CentOS
which I actually find easier to use for DRBD/GFS/etc than RHEL).

If I do an orderly shutdown of a node, it all works fine.  All
services fail over at shutdown to the remaining node without a hitch.
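
RHCS relocates the services on its own when a node shuts down cleanly, but
you can also push them over explicitly beforehand.  A sketch, with a
made-up service name:

  clusvcadm -r dovecot -m mailer2  # relocate the service to the other node
  clustat                          # confirm where everything is running now
  shutdown -r now                  # then reboot this node at leisure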

At startup, they almost always migrate back automatically, and if not I
can migrate them back later by hand.  When they don't migrate back at
startup, it seems to be because the node was down long enough that drbd
needs a while to resync, which can keep lvm and gfs from starting at
boot, which of course means the services can't migrate back.  (I don't
have drbd and lvm under cluster control, so if they don't start at boot,
I have to fix them manually.)
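
When that happens, the fix is basically walking back up the stack by hand
once drbd has resynced; roughly this (the resource, VG, and service names
here are placeholders):

  cat /proc/drbd                   # wait for the resync (cs:Connected)
  drbdadm primary r0               # promote this node back to primary
  vgchange -ay mailvg              # activate the LVM volumes on top of drbd
  service gfs start                # mount the GFS filesystems from fstab
  clusvcadm -r dovecot -m mailer1  # then migrate the service back by hand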

If I 'crash' a node (kill the power, reboot it via a hardware stonith card,
etc.), it doesn't always go so smoothly, and I sometimes have to intervene
manually.  Often it all comes up fine, but sometimes drbd won't come back
up as primary/primary, and I'll need to fix it by hand.  Or drbd will come
up, but lvm or gfs won't (as above).  So after a crash I fairly often have
to fix things by hand.
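
When it is a genuine DRBD split-brain that won't auto-resolve, the by-hand
fix is the standard recovery procedure from the DRBD documentation, roughly
(r0 is a placeholder again):

  # on the node whose changes are to be thrown away:
  drbdadm secondary r0
  drbdadm -- --discard-my-data connect r0

  # on the surviving node, if it shows StandAlone:
  drbdadm connect r0

  # once the resync completes, promote back to primary/primary:
  drbdadm primary r0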

But the good news is that in any case (shutdown, crash, etc.) the cluster
is always up and running, since only one node is down.  So my services
are always available, though maybe slower when a node isn't participating
properly.  Not the best situation, but certainly I'm able to live with it.

My main goal was to be able to do orderly shutdowns, and that works great.
That way I can update kernels, tweak hardware (e.g., add RAM or upgrade
disks), etc. with no real service interruption.  So I'm not as worried
about the "crash" situation, since it happens so much less often than the
orderly shutdown, which was my main concern.

In any case, after many shutdowns and crashes and bad software upgrades
and such, I've not lost any data or anything like that.  Overall I'm
very happy.  Sure, I could be a bit happier with the recovery after
a crash, but I'm tickled with the way it works the rest of the time,
and it is a large improvement over my old setup.

> Regards,
>
> Mario Antonio

-- 
Eric Rostetter
The Department of Physics
The University of Texas at Austin

This message is provided "AS IS" without warranty of any kind,
either expressed or implied.  Use this message at your own risk.

