On 3/24/2013 1:45 PM, Timo Sirainen wrote:
> On 24.3.2013, at 18.12, Tigran Petrosyan <tpetrosy@gmail.com> wrote:
>> We are going to implement Dovecot for 1 million users and will use more than 100TB of storage space. We are currently evaluating two solutions, NFS or GFS2 (via Fibre Channel storage). Can someone help us make a decision? What kind of storage solution can we use to achieve good performance and scalability?
>
> This greatly depends on whose clustered NFS storage product we're talking about. I remember people complaining about GFS2 (and other cluster filesystems) having bad performance. But in any case, whatever you use, be sure to also use http://wiki2.dovecot.org/Director. Even if it's not strictly needed, it improves performance with GFS2.
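Seconding the Director recommendation: its main job in a setup like this is to give each user a stable home backend, so a mailbox's GFS2 locks and cached metadata stay on one node instead of bouncing around the cluster. A rough conceptual sketch of that idea in Python follows (this is not Dovecot's actual director ring algorithm, and the backend names are made up):

# Conceptual illustration only -- NOT Dovecot's actual director ring algorithm.
# The point: a stable user -> backend mapping keeps each mailbox's GFS2 locks
# and cached metadata warm on a single node.
import hashlib

BACKENDS = ["mail1", "mail2", "mail3", "mail4"]   # hypothetical backend names

def pick_backend(username: str) -> str:
    """Deterministically map a user to one backend node."""
    digest = hashlib.md5(username.encode("utf-8")).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]

print(pick_backend("user12345"))   # same user, same backend, every time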
GFS2 and OCFS2 performance suffers when using maildir because filesystem metadata changes are broadcast among all nodes, creating high latency and low metadata IOPS; the more nodes, the worse this problem becomes. With old-fashioned UNIX mbox, or Dovecot mdbox with a good number of emails per file, this isn't as much of an issue because metadata changes are few. If you use maildir with a cluster filesystem, a small number of fat nodes is mandatory to minimize metadata traffic.
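To put very rough numbers on why per-message files hurt, here is a back-of-envelope sketch; the per-delivery operation counts and the "coordinate with every peer node" model are illustrative assumptions, not measurements of GFS2 or Dovecot:

# Back-of-envelope model only; operation counts below are rough assumptions.

def coordinated_ops_per_day(deliveries: int, meta_ops_per_delivery: int, nodes: int) -> int:
    # On a cluster filesystem each metadata change must be coordinated
    # (lock/invalidate traffic) with the peer nodes, so assume cost grows
    # roughly with the number of other nodes.
    return deliveries * meta_ops_per_delivery * (nodes - 1)

DELIVERIES = 2_000_000   # e.g. ~2 deliveries/user/day for 1M users (assumption)
MAILDIR    = 4           # create in tmp/, write, rename to new/, dir update (rough)
MDBOX      = 1           # mostly appends to an existing mailbox file (rough)

for nodes in (4, 16):
    print(f"{nodes} nodes: maildir ~{coordinated_ops_per_day(DELIVERIES, MAILDIR, nodes):,} "
          f"vs mdbox ~{coordinated_ops_per_day(DELIVERIES, MDBOX, nodes):,} coordinated metadata ops/day")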
If using Fibre Channel SAN storage with your cluster filesystem, keep in mind that a single-port 8Gb FC HBA and its SFP transceiver cost significantly more than a 1U server. Given that a single 8Gb FC port can carry (800MB/s / 32KB) = 25,600 emails per second, or about 2.2 billion emails/day, fat nodes make more sense from a financial standpoint as well. Mail workloads don't require much CPU, but they do need low-latency disk and network IO and lots of memory. Four dual-socket servers with 8-core Opterons (16 cores per server), 128GB RAM, two single-port 8Gb FC HBAs with SCSI multipath, dual GbE ports for user traffic, and dual GbE ports for GFS2 metadata should fit the bill nicely.
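For anyone who wants to check that arithmetic, here it is spelled out; the only assumptions are the ~800MB/s of usable throughput per 8Gb port and the 32KB average message size already used above:

# Sanity check of the per-port figure: ~800 MB/s usable on an 8Gb FC port,
# 32 KB average email size (assumptions as stated above).
PORT_MB_PER_SEC = 800
AVG_EMAIL_KB    = 32

emails_per_sec = PORT_MB_PER_SEC * 1024 // AVG_EMAIL_KB   # 25,600
emails_per_day = emails_per_sec * 86_400                  # ~2.2 billion

print(f"{emails_per_sec:,} emails/s, {emails_per_day:,} emails/day per 8Gb FC port")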
Any quality high-performance SAN head with 2-4 ports per dual controller, or multiple SAN heads, that can expand to 480 or more drives is suitable. If the head has only 4 ports total you will need an FC switch with at least 8 ports, preferably two switches with a minimum of 4 ports each (8 is the smallest typically available); this gives maximum redundancy, as you survive a switch failure.

For transactional workloads you never want parity RAID: the read-modify-write (RMW) cycles caused by smaller-than-stripe-width writes degrade write throughput by a factor of 5:1 or more compared to non-parity RAID. So RAID10 is the only game in town, and thus you need lots of spindles. With 480x 600GB 15K SAS drives (4x 60-bay 4U chassis) and 16 spares, you have 464 drives configured as 29 RAID10 arrays of 16 drives each, 4.8TB usable per array, yielding an optimal 8-spindle x 32KB stripe width of 256KB. You would format each 4.8TB exported LUN with GFS2, giving 29 cluster filesystems with ~35K user mail directories on each. If you have a filesystem problem and must run a check/repair, or worse, restore from tape or D2D, you're only affecting up to 1/29th of your users, i.e. ~35K of the 1M. If you feel this is too many filesystems to manage you can span arrays with the controller firmware or with mdraid/LVM2.

And of course you will need a box dedicated to Director, which will spread connections across your 4 server nodes.
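If it helps, here is the capacity math above in one place, using the same figures:

# The array math spelled out; numbers mirror the layout described above.
TOTAL_DRIVES   = 480
SPARES         = 16
DRIVES_PER_R10 = 16        # 8 mirrored pairs per RAID10 array
DRIVE_TB       = 0.6       # 600GB 15K SAS
STRIP_KB       = 32        # per-spindle strip (chunk) size
USERS          = 1_000_000

data_drives  = TOTAL_DRIVES - SPARES                # 464
arrays       = data_drives // DRIVES_PER_R10        # 29 RAID10 arrays / GFS2 filesystems
tb_per_lun   = (DRIVES_PER_R10 // 2) * DRIVE_TB     # 4.8 TB usable per array
stripe_kb    = (DRIVES_PER_R10 // 2) * STRIP_KB     # 256 KB full stripe width
users_per_fs = USERS // arrays                      # ~34,500 users per filesystem
total_tb     = arrays * tb_per_lun                  # ~139 TB usable, comfortably over 100TB

print(f"{arrays} arrays, {tb_per_lun:.1f} TB/LUN, {stripe_kb} KB stripe, "
      f"~{users_per_fs:,} users/fs, {total_tb:.1f} TB total")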
This is obviously not a complete "how-to", but it should give you some pointers/ideas on overall architecture options and best practices.
-- Stan