[Dovecot] GPFS for mail-storage (Was: Re: Compressing existing maildirs)

Stan Hoeppner stan at hardwarefreak.com
Tue Jan 3 17:20:28 EET 2012


On 1/3/2012 2:14 AM, Jan-Frode Myklebust wrote:
> On Sat, Dec 31, 2011 at 01:54:32AM -0600, Stan Hoeppner wrote:
>> Nice setup.  I've mentioned GPFS for cluster use on this list before,
>> but I think you're the only operator to confirm using it.  I'm sure
>> others would be interested in hearing of your first hand experience:
>> pros, cons, performance, etc.  And a ball park figure on the licensing
>> costs, whether one can only use GPFS on IBM storage or if storage from
>> others vendors is allowed in the GPFS pool.
> 
> I used to work for IBM, so I've been a bit uneasy about pushing GPFS too
> hard publicly, for fear of being accused of bias. But I changed jobs in
> November, so now I'm only a satisfied customer :-)

Fascinating.  And good timing. :)

> Pros:
> 	Extremely simple to configure and manage. Assuming root on all
> 	nodes can ssh freely, and port 1191/tcp is open between the
> 	nodes, these are the commands to create the cluster, create an
> 	NSD (network shared disk), and create a filesystem:
> 
> 		# echo hostname1:manager-quorum > NodeFile	# "manager" means this node can be selected as filesystem manager
> 		# echo hostname2:manager-quorum >> NodeFile	# "quorum" means this node has a vote in the quorum selection
> 		# echo hostname3:manager-quorum >> NodeFile	# all my nodes are usually the same, so they all have same roles.
> 		# mmcrcluster  -n  NodeFile  -p $(hostname) -A
> 
> 		### sdb1 is either a local disk on hostname1 (in which case the other nodes will access it over tcp to
> 		### hostname1), or a SAN-disk that they can access directly over FC/iSCSI.
> 		# echo sdb1:hostname1::dataAndMetadata:: > DescFile # This disk can be used for both data and metadata
> 		# mmcrnsd -F DescFile
> 
> 		# mmstartup -A	# starts GPFS services on all nodes
> 		# mmcrfs /gpfs1 gpfs1 -F DescFile
> 		# mount /gpfs1
> 
> 	You can add and remove disks from the filesystem, and change most
> 	settings without downtime. You can scale out your workload by adding
> 	more nodes (SAN attached or not), and scale out your disk performance
> 	by adding more disks on the fly. (IBM uses GPFS to create
> 	scale-out NAS solutions http://www-03.ibm.com/systems/storage/network/sonas/ ,
> 	which highlights a few of the features available with GPFS)
> 
> 	There's no problem running GPFS on other vendors' disk systems. I've used a Nexsan
> 	SATABoy earlier, for an HPC cluster. One can easily move from one disk system to
> 	another without downtime.

That's good to know.  The only FC SAN arrays I've installed/used are the
IBM FAStT 600 and the Nexsan SATABlade/SATABoy.  I much prefer the web
management interface on the Nexsan units; it's much more intuitive and
flexible.  The FAStT is obviously much better suited to random IOPS
workloads with its 15K FC disks vs the 7.2K SATA disks in the Nexsan units
(although Nexsan has offered 15K SAS disks and SSDs for a while now).
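
Since this will end up in the list archive, here is a rough sketch of the
grow-it-live workflow described above, i.e. adding another disk to the
filesystem while it stays mounted.  The disk, host and filesystem names are
just examples patterned on the snippet above, and the rebalance step is
optional; check the mm* man pages before trusting any of it:

	# mmlscluster                        # sanity check: show the cluster configuration
	# mmgetstate -a                      # GPFS daemon state on all nodes
	# echo sdc1:hostname2::dataAndMetadata:: > NewDescFile
	# mmcrnsd -F NewDescFile             # turn the disk into an NSD (rewrites NewDescFile)
	# mmadddisk gpfs1 -F NewDescFile     # add it to the mounted filesystem
	# mmrestripefs gpfs1 -b              # optionally rebalance existing data onto the new disk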

> Cons:
> 	It has its own page cache, statically configured. So you don't get the "all
> 	available memory used for page caching" behaviour you normally get on Linux.

Yep, that's ugly.
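
For reference, the statically sized cache in question is the GPFS
"pagepool".  A minimal sketch of inspecting and resizing it (the 4G value
is purely an example, not a recommendation):

	# mmlsconfig | grep -i pagepool      # show the configured pagepool size
	# mmchconfig pagepool=4G             # example size; normally takes effect at the next GPFS restart
	# mmchconfig pagepool=4G -i          # some releases accept -i to apply the change immediately

So it's tunable, but it remains a fixed carve-out rather than the dynamic
page cache behaviour Linux admins are used to.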

> 	There is a kernel module that needs to be rebuilt on every
> 	upgrade. It's a simple process, but it needs to be done and means we
> 	can't just run "yum update ; reboot" to upgrade.
> 
> 		% export SHARKCLONEROOT=/usr/lpp/mmfs/src
> 		% cp /usr/lpp/mmfs/src/config/site.mcr.proto /usr/lpp/mmfs/src/config/site.mcr
> 		% vi /usr/lpp/mmfs/src/config/site.mcr     # correct GPFS_ARCH, LINUX_DISTRIBUTION and LINUX_KERNEL_VERSION
> 		% cd /usr/lpp/mmfs/src/ ; make clean ; make World
> 		% su - root
> 		# export SHARKCLONEROOT=/usr/lpp/mmfs/src
> 		# cd /usr/lpp/mmfs/src/ ; make InstallImages

So is this, but it's totally expected since this is proprietary code and
not in mainline.
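
One small addition: after the rebuild it's worth confirming the freshly
built modules actually load before putting the node back into service.
Something along these lines (module names as on a stock Linux GPFS 3.x
install):

	# /usr/lpp/mmfs/bin/mmstartup        # start GPFS on this node only
	# lsmod | grep mmfs                  # mmfs26 and mmfslinux should be listed
	# /usr/lpp/mmfs/bin/mmgetstate       # the node should report "active"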

>> To this point IIRC everyone here doing clusters is using NFS, GFS, or
>> OCFS.  Each has its downsides, mostly because everyone is using maildir.
>>  NFS has locking issues with shared dovecot index files.  GFS and OCFS
>> have filesystem metadata performance issues.  How does GPFS perform with
>> your maildir workload?
> 
> Maildir is likely a worst-case workload for filesystems: millions of
> tiny files, making all IO random and leaving the controller read cache
> mostly unused (unless you can cache all active files). So I've

Yep.  Which is the reason I've stuck with mbox everywhere I can over the
years, minor warts and all, and will be moving to mdbox at some point.
IMHO maildir solved one set of problems but created a bigger one.  Many
sites hailed maildir as a savior, then decried it as their user base and
IO demands outgrew their storage, scrambling for budget money to fix an
"unforeseen" problem that was absolutely clear from day one, at least to
anyone with more than a cursory knowledge of filesystem design and
hardware performance.

> concluded that our performance issues are mostly design errors (and the
> fact that there were no better mail storage formats than maildir at the
> time these servers were implemented). I expect moving to mdbox will 
> fix all our performance issues.

Yeah, it should decrease FS IOPS by a couple orders of magnitude,
especially if you go with large mdbox files.  The larger the better.
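
For anyone planning the same move, the relevant Dovecot settings are
roughly the following; the sizes are examples only, so test against your
own IO patterns before settling on values:

	mail_location = mdbox:~/mdbox
	# rotate to a new m.* storage file once the current one reaches this size
	mdbox_rotate_size = 32M
	# optionally also rotate on age, so old mail ages out in whole files
	mdbox_rotate_interval = 1d

One mdbox caveat worth keeping in mind: expunged mail is only actually
removed when "doveadm purge" runs, so that needs to be scheduled (cron or
similar) as part of the migration plan.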

> I *think* GPFS is as good as it gets for maildir storage on a cluster
> filesystem, but I have no numbers to back that up ... Would be very
> interesting if we could somehow compare numbers for a few cluster filesystems.

Apparently no one (vendor) with the resources to do so has the desire to
do so.

> I believe our main limitation in this setup is the IOPS we can get from
> the backend storage system. It's hard to balance the IO over enough
> RAID arrays (the fs is spread over 11 RAID5 arrays of 5 disks each),
> and we always have hotspots. Right now two arrays are doing <100 IOPS,
> while others are doing 400-500 IOPS. I would very much like to replace
> it with something smarter where we can use SSDs for active data and
> something slower for stale data. GPFS can manage this by itself through
> its ILM interface, but we don't have the very fast storage to put in as
> tier-1.
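
For the archive, here is a rough sketch of what such a tiering setup looks
like in GPFS's policy language.  The pool names and the 30-day cutoff are
made up for illustration, and it assumes an 'ssd' and a 'sata' storage
pool already exist in the filesystem:

	/* place new files on the fast pool until it is 90% full, then spill over */
	RULE 'newdata' SET POOL 'ssd' LIMIT(90)
	RULE 'spill' SET POOL 'sata'
	/* move files that have not been read for 30 days down to the slow pool */
	RULE 'demote' MIGRATE FROM POOL 'ssd' TO POOL 'sata'
		WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 30

	# mmchpolicy gpfs1 policy.txt        # install the placement rules
	# mmapplypolicy gpfs1 -P policy.txt  # run the migration rules (typically from cron)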

This is obviously not news to you, but balancing mail workload IO across
large filesystems and wide disk farms will always be a problem, because it
depends on which users are logged in at a given moment, and because you
can't stripe every user's small mail files across all disks.  This is true
of all mailbox formats to one degree or another, and worst with maildir.
A properly engineered XFS can get far closer to linear IO distribution
across arrays than most filesystems due to its allocation group design,
but it still won't be perfect.  Simply getting away from maildir, with
its extraneous metadata IOs, is a huge win for decreasing clusterFS and
SAN IOPS.  I'm anxious to see your report on your SAN IOPS after you've
converted to mdbox, especially if you go with 16/32MB or larger mdbox files.
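
To make the XFS point concrete, the knobs involved are set at mkfs time,
roughly as below.  The numbers are purely illustrative for a layout of 11
RAID5 arrays of 5 disks concatenated into one volume, and the device name
is hypothetical:

	# one allocation group per array lets XFS spread new directories, and
	# therefore new mailboxes, across all 11 arrays
	# mkfs.xfs -d agcount=11,su=64k,sw=4 /dev/mapper/mailstore
	# (su = the array's per-disk stripe unit, sw = data spindles per RAID5 array)

With maildir every tiny file lands in whichever AG its parent directory
lives in, so the more evenly the AGs map onto the arrays, the closer you
get to that near-linear distribution.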

-- 
Stan