[Dovecot] GPFS for mail-storage (Was: Re: Compressing existing maildirs)
Jan-Frode Myklebust
janfrode at tanso.net
Tue Jan 3 10:14:49 EET 2012
On Sat, Dec 31, 2011 at 01:54:32AM -0600, Stan Hoeppner wrote:
> Nice setup. I've mentioned GPFS for cluster use on this list before,
> but I think you're the only operator to confirm using it. I'm sure
> others would be interested in hearing of your first hand experience:
> pros, cons, performance, etc. And a ball park figure on the licensing
> costs, whether one can only use GPFS on IBM storage or if storage from
> other vendors is allowed in the GPFS pool.
I used to work for IBM, so I've been a bit uneasy about pushing GPFS too
hard publicly, for fear of being accused of bias. But I changed jobs in
November, so now I'm only a satisfied customer :-)
Pros:
Extremely simple to configure and manage. Assuming root on all
nodes can ssh freely, and port 1191/tcp is open between the
nodes, these are the commands to create the cluster, create an
NSD (network shared disk), and create a filesystem:
# echo hostname1:manager-quorum > NodeFile # "manager" means this node can be selected as filesystem manager
# echo hostname2:manager-quorum >> NodeFile # "quorum" means this node has a vote in the quorum selection
# echo hostname3:manager-quorum >> NodeFile # all my nodes are usually the same, so they all have same roles.
# mmcrcluster -n NodeFile -p $(hostname) -A
### sdb1 is either a local disk on hostname1 (in which case the other nodes will access it over tcp to
### hostname1), or a SAN-disk that they can access directly over FC/iSCSI.
# echo sdb1:hostname1::dataAndMetadata:: > DescFile # This disk can be used for both data and metadata
# mmcrnsd -F DescFile
# mmstartup -a # starts GPFS services on all nodes
# mmcrfs /gpfs1 gpfs1 -F DescFile
# mount /gpfs1
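To sanity-check the result, the usual GPFS listing commands can be used
(just a sketch of what I'd run; output obviously depends on the setup):
# mmlscluster          # show cluster members and their roles
# mmgetstate -a        # GPFS daemon state on all nodes
# mmlsnsd              # list NSDs and which filesystem they belong to
# df -h /gpfs1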
You can add and remove disks from the filesystem, and change most
settings without downtime. You can scale out your workload by adding
more nodes (SAN attached or not), and scale out your disk performance
by adding more disks on the fly. (IBM uses GPFS to build its
scale-out NAS solutions, http://www-03.ibm.com/systems/storage/network/sonas/ ,
which highlight a few of the features available with GPFS.)
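For example, growing both the cluster and the filesystem online looks
roughly like this (a sketch; hostname4 and sdc1 are made-up names):
# echo sdc1:hostname1::dataAndMetadata:: > NewDescFile
# mmcrnsd -F NewDescFile
# mmadddisk gpfs1 -F NewDescFile          # add the new NSD to the filesystem
# mmaddnode -N hostname4:manager-quorum   # add another node to the cluster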
There's no problem running GPFS on other vendors' disk systems. I've used Nexsan
SATAboy earlier, for an HPC cluster. One can easily move from one disk system to
another without downtime.
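The move is basically the same add-disk step for the new disk system,
followed by mmdeldisk, which migrates the data off the old disks before
removing them (again only a sketch, with placeholder NSD names):
# mmdeldisk gpfs1 "old_nsd1;old_nsd2"   # data is moved off these disks first
# mmrestripefs gpfs1 -b                 # optionally rebalance across the remaining disks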
Cons:
It has its own page cache, statically configured. So you don't get the "all
available memory used for page caching" behaviour that you normally get on Linux.
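That cache is the pagepool parameter, so at least you can pick a size that
fits the node (4G below is just an example value, not a recommendation):
# mmlsconfig                 # list the current cluster configuration
# mmchconfig pagepool=4G     # takes effect when GPFS is restarted on the nodes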
There is a kernel module that needs to be rebuilt on every
upgrade. It's a simple process, but it needs to be done and means we
can't just run "yum update ; reboot" to upgrade.
% export SHARKCLONEROOT=/usr/lpp/mmfs/src
% cp /usr/lpp/mmfs/src/config/site.mcr.proto /usr/lpp/mmfs/src/config/site.mcr
% vi /usr/lpp/mmfs/src/config/site.mcr # correct GPFS_ARCH, LINUX_DISTRIBUTION and LINUX_KERNEL_VERSION
% cd /usr/lpp/mmfs/src/ ; make clean ; make World
% su - root
# export SHARKCLONEROOT=/usr/lpp/mmfs/src
# cd /usr/lpp/mmfs/src/ ; make InstallImages
>
> To this point IIRC everyone here doing clusters is using NFS, GFS, or
> OCFS. Each has its downsides, mostly because everyone is using maildir.
> NFS has locking issues with shared dovecot index files. GFS and OCFS
> have filesystem metadata performance issues. How does GPFS perform with
> your maildir workload?
Maildir is likely a worst-case workload for filesystems: millions
of tiny files, making all IO random, and getting very little use out of the
controller's read cache (unless you can cache all active files). So I've
concluded that our performance issues are mostly design errors (and the
fact that there was no better mail storage format than maildir at the
time these servers were implemented). I expect moving to mdbox will
fix all our performance issues.
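The conversion itself should be doable per user with dsync, something along
these lines (untested sketch; the user name and mdbox location are
placeholders, and mail_location has to be switched over afterwards):
# dsync -u someuser mirror mdbox:~/mdbox   # copy the user's maildir into an mdbox store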
I *think* GPFS is as good as it gets for maildir storage on a cluster filesystem,
but have no numbers to back that up ... It would be very interesting if we
could somehow compare numbers for a few cluster filesystems.
I believe our main limitation in this setup is the iops we can get from
the backend storage system. It's hard to balance the IO over enough
RAID arrays (the fs is spread over 11 RAID5 arrays of 5 disks each),
and we're always having hotspots. Right now two arrays are doing <100 iops,
while others are doing 400-500 iops. I would very much like to replace
it with something smarter where we can use SSDs for active data and
something slower for stale data. GPFS can manage this by itself through
its ILM interface, but we don't have the very fast storage to put in as
tier-1.
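For completeness, the ILM part is driven by storage pools plus an SQL-like
policy language. Assuming pools named 'ssd' and 'sata' had been defined, a
tiering policy could look roughly like this (a sketch only, not something we run):
# cat > policy.txt <<'EOF'
RULE 'place-new' SET POOL 'ssd'  /* new files go to the fast pool */
RULE 'age-out'   MIGRATE FROM POOL 'ssd' TO POOL 'sata'
                 WHERE CURRENT_TIMESTAMP - ACCESS_TIME > INTERVAL '30' DAYS
EOF
# mmchpolicy gpfs1 policy.txt          # install the placement rule
# mmapplypolicy gpfs1 -P policy.txt    # run the migration, e.g. nightly from cron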
-jf