Re: [Dovecot] GPFS for mail-storage (Was: Re: Compressing existing maildirs)
Great information, thank you. Could you remark on GPFS services hosting mail storage over a WAN between two geographically separated data centers?
----- Reply message -----
From: "Jan-Frode Myklebust" janfrode@tanso.net
To: "Stan Hoeppner" stan@hardwarefreak.com
Cc: "Timo Sirainen" tss@iki.fi, dovecot@dovecot.org
Subject: [Dovecot] GPFS for mail-storage (Was: Re: Compressing existing maildirs)
Date: Tue, Jan 3, 2012 2:14 am
On Sat, Dec 31, 2011 at 01:54:32AM -0600, Stan Hoeppner wrote:
Nice setup. I've mentioned GPFS for cluster use on this list before, but I think you're the only operator to confirm using it. I'm sure others would be interested in hearing of your first-hand experience: pros, cons, performance, etc. And a ballpark figure on the licensing costs, and whether one can only use GPFS on IBM storage or if storage from other vendors is allowed in the GPFS pool.
I used to work for IBM, so I've been a bit uneasy about pushing GPFS too hard publicly, for fear of being accused of being biased. But I changed jobs in November, so now I'm only a satisfied customer :-)
Pros: Extremely simple to configure and manage. Assuming root on all nodes can ssh freely and port 1191/tcp is open between the nodes, these are the commands to create the cluster, create an NSD (network shared disk), and create a filesystem:
# echo hostname1:manager-quorum > NodeFile # "manager" means this node can be selected as filesystem manager
# echo hostname2:manager-quorum >> NodeFile # "quorum" means this node has a vote in the quorum selection
# echo hostname3:manager-quorum >> NodeFile # all my nodes are usually the same, so they all have same roles.
# mmcrcluster -n NodeFile -p $(hostname) -A
### sdb1 is either a local disk on hostname1 (in which case the other nodes will access it over tcp to
### hostname1), or a SAN-disk that they can access directly over FC/iSCSI.
# echo sdb1:hostname1::dataAndMetadata:: > DescFile # This disk can be used for both data and metadata
# mmcrnsd -F DescFile
# mmstartup -A # starts GPFS services on all nodes
# mmcrfs /gpfs1 gpfs1 -F DescFile
# mount /gpfs1
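(For completeness, a few standard GPFS admin commands I'd use to sanity-check the result; these aren't part of the recipe above, just the usual tools, using the same node and filesystem names:)

# mmlscluster           # list the cluster members and their roles
# mmgetstate -a         # verify the GPFS daemon is active on all nodes
# mmlsnsd               # list the NSDs and which filesystem they belong to
# mmlsfs gpfs1          # show the filesystem attributes
# df -h /gpfs1          # confirm the mount and capacity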
You can add and remove disks from the filesystem, and change most
settings without downtime. You can scale out your workload by adding
more nodes (SAN attached or not), and scale out your disk performance
by adding more disks on the fly. (IBM uses GPFS to create
scale-out NAS solutions http://www-03.ibm.com/systems/storage/network/sonas/ ,
which highlights a few of the features available with GPFS)
There's no problem running GPFS on other vendors' disk systems. I've used Nexsan
SATAboy earlier, for an HPC cluster. One can easily move from one disk system to
another without downtime.
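As a rough sketch of what "without downtime" looks like in practice (the disk and host names here are placeholders, not from our setup):

# echo sdc1:hostname2::dataAndMetadata:: > NewDescFile   # descriptor for the new disk
# mmcrnsd -F NewDescFile
# mmadddisk gpfs1 -F NewDescFile          # add the disk to the mounted filesystem
# mmrestripefs gpfs1 -b                   # optionally rebalance existing data onto it
# mmdeldisk gpfs1 gpfs2nsd                # drain and remove an old disk (NSD name as shown by mmlsnsd)
# mmaddnode -N hostname4:manager-quorum   # grow the cluster with another node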
Cons: It has its own page cache, statically configured, so you don't get the "all available memory used for page caching" behaviour you normally get on Linux.
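The size of that cache is the "pagepool" setting. A quick sketch of how to inspect and bump it (4G is just an example value, tune it to your memory):

# mmlsconfig pagepool                   # show the current, statically sized page pool
# mmchconfig pagepool=4G                # raise it cluster-wide (takes effect when GPFS is restarted on a node)
# mmchconfig pagepool=4G -N hostname1   # or set it for a single node only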
There is a kernel module that needs to be rebuilt on every
upgrade. It's a simple process, but it needs to be done and means we
can't just run "yum update ; reboot" to upgrade.
% export SHARKCLONEROOT=/usr/lpp/mmfs/src
% cp /usr/lpp/mmfs/src/config/site.mcr.proto /usr/lpp/mmfs/src/config/site.mcr
% vi /usr/lpp/mmfs/src/config/site.mcr # correct GPFS_ARCH, LINUX_DISTRIBUTION and LINUX_KERNEL_VERSION
% cd /usr/lpp/mmfs/src/ ; make clean ; make World
% su - root
# export SHARKCLONEROOT=/usr/lpp/mmfs/src
# cd /usr/lpp/mmfs/src/ ; make InstallImages
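After installing the freshly built module, GPFS just needs a restart on that node to pick it up (standard commands, nothing special here):

# mmshutdown            # stop GPFS on this node, unloading the old modules
# mmstartup             # start it again with the new kernel module
# mmgetstate            # confirm the node is back in "active" state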
To this point IIRC everyone here doing clusters is using NFS, GFS, or OCFS. Each has its downsides, mostly because everyone is using maildir. NFS has locking issues with shared dovecot index files. GFS and OCFS have filesystem metadata performance issues. How does GPFS perform with your maildir workload?
Maildir is likely a worst-case workload for filesystems: millions of tiny files, making all I/O random and getting minimal use out of the controller's read cache (unless you can cache all active files). So I've concluded that our performance issues are mostly design errors (and the fact that there were no better mail storage formats than maildir at the time these servers were implemented). I expect moving to mdbox will fix all our performance issues.
I *think* GPFS is as good as it gets for maildir storage on a cluster filesystem, but I have no numbers to back that up ... It would be very interesting if we could somehow compare numbers for a few cluster filesystems.
I believe our main limitation in this setup is the IOPS we can get from the backend storage system. It's hard to balance the I/O over enough RAID arrays (the fs is spread over 11 RAID5 arrays of 5 disks each), and we're always having hotspots. Right now two arrays are doing <100 IOPS, while others are doing 400-500 IOPS. I would very much like to replace it with something smarter where we can use SSDs for active data and something slower for stale data. GPFS can manage this by itself through its ILM interface, but we don't have the very fast storage to put in as tier 1.
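For the record, a sketch of what such a tiered setup could look like with the ILM policy language. The pool names "ssd" and "sata" are made up (pools would be defined via the storage pool field of the disk descriptors), and we don't actually run this:

### policy.txt
RULE 'place-new-files' SET POOL 'ssd'
RULE 'age-out' MIGRATE FROM POOL 'ssd' THRESHOLD(80,60)
     WEIGHT(CURRENT_TIMESTAMP - ACCESS_TIME) TO POOL 'sata'

# mmchpolicy gpfs1 policy.txt        # install the placement policy
# mmapplypolicy gpfs1 -P policy.txt  # run the migration rules, e.g. from cron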
-jf
On Wed, Jan 04, 2012 at 12:09:39AM -0600, list@airstreamcomm.net wrote:
Could you remark on GPFS services hosting mail storage over a WAN between two geographically separated data centers?
I haven't tried that, but know the theory quite well. There are 2 or 3 options:
1 - shared SAN between the data centers. Should work the same as
    a single data center, but you'd want to use disk quorum or
    a quorum node at a third site to avoid split brain.
2 - different SANs on the two sites. Disks on SAN1 would belong
    to failure group 1 and disks on SAN2 would belong to failure
    group 2. GPFS will write every block to disks in different
    failure groups. Nodes at location 1 will use SAN1 directly,
    and write to SAN2 via TCP/IP to nodes at location 2 (and vice
    versa). It's configurable whether you want to return success when
    the first replica is written (asynchronous replication), or whether you
    need both replicas to be written. Ref: mmcrfs -K:
    http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fcom.ibm.cluster.gpfs.v3r4.gpfs300.doc%2Fbl1adm_mmcrfs.html
    With asynchronous replication it will try to allocate both
    replicas, but if that fails you can re-establish the
    replication level later using "mmrestripefs".
    Reading will happen from a directly attached disk if possible,
    and over TCP/IP if there is no local replica of the needed
    block.
    Again you'll need a quorum node at a third site to avoid split brain.
    (There's a small sketch of this failure-group setup after the option list below.)
3 - GPFS multi-cluster. Separate GPFS clusters at the two
    locations. Let them mount each other's filesystems over IP,
    and access disks over either the SAN or the IP network. Each cluster is
    managed locally; if one site goes down, the other site also
    loses access to the fs.
I don't have any experience with this kind of config, but I believe
it's quite popular as a way to share filesystems between HPC sites.
http://www.ibm.com/developerworks/systems/library/es-multiclustergpfs/index.html
http://www.cisl.ucar.edu/hss/ssg/presentations/storage/NCAR-GPFS_Elahi.pdf
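To make option 2 a bit more concrete, here is a sketch of how the failure groups and replication could be set up. This is untested by me; the hostnames and disks are placeholders, and the descriptor format is the same old-style one used in my first mail:

### disks reachable via site A go in failure group 1, disks via site B in failure group 2
# echo sdb1:siteA-node1::dataAndMetadata:1:: > DescFile
# echo sdc1:siteB-node1::dataAndMetadata:2:: >> DescFile
# mmcrnsd -F DescFile
# mmcrfs /gpfs1 gpfs1 -F DescFile -m 2 -M 2 -r 2 -R 2 -K always   # two replicas, strict replication
### ... or -K no / -K whenpossible to let writes succeed with a single replica, and later:
# mmrestripefs gpfs1 -r      # re-establish the replication level, as mentioned above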
-jf