[Dovecot] High Performance and Availability
Hello everyone, I am currently running Dovecot as a high performance solution to a particular kind of problem. My userbase is small, but it murders email servers. The volume is moderate, but message retention requirements are stringent, to put it nicely.
Many users receive a high volume of email traffic, but want to keep every message, and *search* them. This produces mail accounts of 14+ GiB. After seeing the failures of my predecessors, I transitioned to Postfix/Dovecot and haven't looked back. Things are running nicely with the setup below.
Postfix and Dovecot running on the same virtual machine on a Dell 2950 with 1x Xeon E5440 on ESXi 4. Maildirs are served by 10x146GB 15K RPM SAS drives in RAID-10 on a direct-attached Dell MD-1000.
We are transitioning other services to high availability, and I'm wondering exactly how to provide some kind of near-realtime failover for my Postfix/Dovecot machine. The MD-1000 provides nothing in the way of iSCSI, but it *does* have two SAS connections available, only one of which is in use.
I have been looking at the Dell EqualLogic stuff and it seems to provide what we need. I can get most of the information I need from the rep, but I wonder if anyone has any experience with high performance requirements on these kinds of storage.
I'd like to continue running my current hardware as the primary mail server, but provide some kind of failover using the SAN. The primary usage of the SAN will be to make our 2TB document store highly available. I'm wondering what kind of options I might have in the way of piggybacking some email failover on this kind of hardware without sacrificing the performance I'm currently enjoying.
Is it possible to go with a virtual machine mounted on iSCSI acting as a backup mail server? How would I sync the two, NBD+MD? Any experience doing this with maildirs? I wonder about the performance.
Can it be as simple as attaching my MD-1000's second controller to the SAN magic box via SAS and pressing the Easy button?
Is it as expensive as running my primary mailserver mounted from the SAN via Fiber Channel? Will that get me under 30ms latency?
I welcome any suggestions the group may have.
--
Wayne Thursby
System Administrator
Physicians Group, LLC
On Tuesday, 16 February 2010 at 06:42:57, Wayne Thursby wrote:
We are transitioning other services to high availability, and I'm wondering exactly how to provide some kind of near-realtime failover for my Postfix/Dovecot machine. The MD-1000 provides nothing in the way of iSCSI, but it *does* have two SAS connections available, only one of which is in use.
Keep away from the MD-1000/3000 stuff unless you're running SLES or RHEL.
-- Best Regards, Dominik
Dominik Schulz put forth on 2/16/2010 3:33 AM:
On Tuesday, 16 February 2010 at 06:42:57, Wayne Thursby wrote:
We are transitioning other services to high availability, and I'm wondering exactly how to provide some kind of near-realtime failover for my Postfix/Dovecot machine. The MD-1000 provides nothing in the way of iSCSI, but it *does* have two SAS connections available, only one of which is in use.
Keep away from the MD-1000/3000 stuff unless you're running SLES or RHEL.
One quick thing to add: the two ports on the MD-1000 are there for two reasons.

1. You can split the backplane into two halves. One backplane connector then services 7 drives on one side, the other connector services the other 7 drives. This allows two separate hosts to control 7 drives each, one port/cable to each host.

2. You can connect two RAID controllers on one host to 7 drives on each controller. This config is meant as an optimization for extremely high bandwidth applications, specifically those requiring RAID5, where a single RAID controller may run out of internal parity processing power before maximizing disk throughput, leaving the disks running at less than full capacity. Another scenario is RAID0 (striping) across 14 fast drives overwhelming the throughput of a single RAID card. Again, splitting the drives across two cards can help alleviate this problem.
-- Stan
Wayne Thursby put forth on 2/15/2010 11:42 PM:
Hello everyone,
Note domain in my email addy Wayne. ;)
I have been looking at the Dell EqualLogic stuff and it seems to provide what we need. I can get most of the information I need from the rep, but I wonder if anyone has any experience with high performance requirements on these kinds of storage.
EqualLogic has nice iSCSI SAN storage arrays with internal multiple snapshot ability and whatnot, but IMHO they're way overpriced for what you get.
I'd like to continue running my current hardware as the primary mail server, but provide some kind of failover using the SAN. The primary usage of the SAN will be to make our 2TB document store highly available. I'm wondering what kind of options I might have in the way of piggybacking some email failover on this kind of hardware without sacrificing the performance I'm currently enjoying.
Give me the specs on your current SAN setup and I'll give you some good options.
- What/how many FC switches do you have, Brocade, Qlogic, etc?
- What make/model is your current SAN array controller(s), what disk config?
Is it possible to go with a virtual machine mounted on iSCSI acting as a backup mail server? How would I sync the two, NBD+MD? Any experience doing this with maildirs? I wonder about the performance.
This isn't the way to go about it. You already have an FC SAN and VMware ESX. ESX+SAN is _THE_ way to do HA/failover with Vmotion. I haven't used it since ESX 3, but I must say, there is no better solution available on the planet. It's nearly perfect.
Can it be as simple as attaching my MD-1000's second controller to the SAN magic box via SAS and pressing the Easy button?
No. The MD-1000 is direct attached storage, i.e. dumb storage, and you have it configured with a hardware RAID controller in a single host. You can't share it with another host. To share storage arrays between/among multiple hosts requires an intelligent controller in the array chassis doing the RAID, multiple host port connections (FC, SCSI, iSCSI), and a cluster filesystem on the hosts to coordinate shared fs access and file locking. This is exactly what ESX does with multiple ESX hosts and a SAN array.
Is it as expensive as running my primary mailserver mounted from the SAN via Fiber Channel? Will that get me under 30ms latency?
I'm not sure what you mean by "expensive" in this context. Also, the latency will be dependent on the SAN storage array(s) and FC network. From experience, it is typically extremely low, adding an extra few milliseconds to disk access time, from less than 1ms with low load to maybe 3-5ms for a loaded good quality SAN array. This is also somewhat dependent on the number of FC switch hops between the ESX hosts and the SAN array box--the more switch hops in the chain, the greater the FC network latency. That said, the greatest latency is going to be introduced by the SAN storage controllers (those smart circuit boards inside the SAN disk boxen that perform the RAID and FC input/output functions).
To give you an idea of the performance you can get from an FC SAN and a couple of decent storage arrays, I architected and implemented a small FC SAN for a 500 user private school. I had 7 blade servers, 2 ESX, 4 Citrix, and one Exchange server, none with local disk. Everything booted and ran from SAN storage, the VMs and all their data, the Exchange server and its store, the Citrix blades, everything.
We had about 20 VMs running across the two ESX blades and I could vmotion just about any VM guest server, IN REAL TIME, from one ESX blade server to the other, in less than 5 seconds. Client network requests were never interrupted. Vmotion is freak'n amazing technology. Anyway, our total CPU and SAN load over the entire infrastructure averaged about 20% utilization. The VMs included two AD DCs, an MS SQL server, Windows file/print servers, myriad SuSE VMs, one running a 400GB iFolder datastore (think network file shares on steroids, fully synchronized roaming laptop filesystems sync'd in real time over the network or internet to the iFolder data store), a Novell ZEN server for SuSE Linux workstation push/pull updates and laptop imaging, a Moodle PHP/MySQL based course management system with a 50GB db, a Debian syslog collector, etc, etc.

Most of the VMs' boot disk images resided on an IBM FAStT600 array with 14 x 73GB 15Krpm disks in two RAID5 arrays, one with 6 disks, one with 7, and one hot spare, with only 128MB write cache. The 4 Citrix blades used FAStT LUNs for their local disks, and the Exchange server booted from a FAStT LUN and had its DB stored on a FAStT LUN. All other data storage, including that of all the VMs, resided on a Nexsan SATABlade SAN storage array consisting of 8 x 500GB 7.2Krpm disks configured in a RAID5 set, no spares, 512MB write cache.

The Bladecenter had an inbuilt 2 port FC switch. I uplinked these two ports via ISL to an 8 port Qlogic 2Gb FC switch. I had one 2Gb FC link from the FAStT into the Qlogic switch and two 2Gb links from the SATABlade into the switch. For 500 users and every disk access going to these two SAN arrays, the hardware was actually overkill for current needs. But it had plenty of headroom for spikes and future growth, in terms of throughput, latency, and storage capacity.
I ran an entire 500 user environment, all systems, all applications, on two relatively low end FC SAN boxen, and you're concerned about the performance of a single mail SMTP/IMAP server over a SAN? I don't think you need to worry about performance, as long as all is setup correctly. ;)
To do this properly, you'll need a second Dell server with an FC HBA, an FC HBA for the existing server, and the ESX vmotion and HA options, which I'm not sure are available for ESXi. You may have to upgrade to ESX, which as you know has some pricey licensing. But it's worth the cost just for vmotion/HA.
You'll export a SAN LUN of sufficient size (500GB-1TB) to cover the IMAP store needs from one of the SAN storage arrays to the WWNs of the HBAs in both of your two ESX hosts, and you'll add that LUN to the ESX storage pool as a raw LUN. Do NOT make it a VMFS volume. It's going to be huge, and you're only storing data on it, not virtual machines. VMFS volumes are for virtual machine storage, not data storage. Performance will suffer if you put large data in VMFS volumes. I cannot stress this enough.

For HA and vmotion to work, you'll also need to export a small SAN LUN (20GB) to both ESX hosts' FC WWNs, format it as a VMFS, and you'll move the Postfix/Dovecot virtual machine to that ESX VMFS volume. I'm assuming you have the Postfix spool in the same VMFS volume as the boot and root filesystems. This will allow both ESX hosts to boot and run the VM and enables vmotion and HA. (I sincerely hope you don't currently have the VM files and data store for your current IMAP store all in a single VMFS volume. That's horrible ESX implementation and will make this migration a bear due to all the data shuffling you'll have to do between partitions/filesystems, and the fact you'll probably have to shut down the server during the file moving.)
You may need to add a soft zone to your FC switches containing the WWNs of the ESX host HBAs and the WWN(s) of the SAN storage array ports you're exporting the LUNs through before you'll see the exposed LUNs on the arrays. Once you have ESX, vmotion, and HA running on both ESX machines, all you have to do is enable HA failover for the Postfix/Dovecot VM. If the ESX host on which it's running, or the VM guest, dies for any reason, the guest will be auto restarted within seconds on the other ESX host. This happens near instantaneously, and transparently, because both hosts have "local disk" access to the same .vmdk files and raw data LUN on the SAN arrays.

Clients probably won't even see an error during the failover, as IMAP clients reconnect and login automatically. The name and IP address of the server stay the same; the underlying server itself, all its config and spool files, metadata files, everything is identical to before the crash. It's just running on a different physical ESX machine. This capability, more than anything else, is what makes VMware ESX worth the licensing costs. Absolutely seamless fault recovery. If an organization can afford it (can any not?), it's the only way to go for x86 based systems.
I welcome any suggestions the group may have.
Unfortunately for this ESX HA architecture, your current MD-1000 isn't reusable. Direct attached storage will never work for any workable/functional HA setup. If I were you, after you migrate your VMs to the SAN such as I mention above, and obviously after you're comfortable all went as planned, I'd direct attach the MD-1000 to another server and use it as a near line network backup server or other meaningful purpose.
If your current FC SAN storage array doesn't have enough spare capacity (performance/space), you can get a suitable unit from Nexsan and other manufacturers for $10-15K in a single controller version. I personally recommend this:
http://www.nexsan.com/sataboy.php
http://www.sandirect.com/product_info.php?cPath=171_208_363&products_id=1434
Get a unit with the FC+iSCSI controller, 14x500GB drives, 1GB cache. This is a standard product configuration at Sandirect. Configure a 13 drive RAID5 array with one spare. You may be balking at 7.2Krpm SATA drives and a RAID5 setup. I can tell you from experience with Nexsan's 8 drive Satablade unit (now discontinued) with "only" 512MB cache and an 8 drive RAID5, you won't come close to hitting any performance limits with your current load. This setup would likely carry 4-6 times your current load before introducing latency. Nexsan uses a PowerPC 64 chip on this controller and a very efficient RAID parity algorithm. The performance hit due to parity calculations going from RAID10 to RAID5 is about 10%, but you gain that back because your stripe width is 13 instead of 7, assuming you use all 14 drives for the RAID10 with no spares--if you use spares you must have two, since RAID10 requires an even number of disks, making your stripe width 6. Thus, with spares for each, the RAID5 stripe width is _double_ the RAID10 width.
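If it helps to see that stripe width arithmetic laid out, here's a rough back-of-the-envelope sketch; the 14-bay count and 500GB drive size are from the configuration above, the rest is plain arithmetic, not Nexsan data:

# Back-of-the-envelope comparison of the two layouts discussed above for a
# 14-bay array of 500GB drives. Drive count and sizes come from the post;
# everything else is simple arithmetic, not vendor data.

DRIVE_GB = 500
BAYS = 14

# RAID5: 13-drive array plus 1 hot spare. Stripe width is 13, and one
# drive's worth of capacity goes to parity.
raid5_stripe_width = 13
raid5_usable_gb = (raid5_stripe_width - 1) * DRIVE_GB

# RAID10 with spares: mirrored pairs need an even drive count, so two bays
# go to spares, leaving 12 drives = 6 mirrored pairs, stripe width 6.
raid10_stripe_width = (BAYS - 2) // 2
raid10_usable_gb = raid10_stripe_width * DRIVE_GB

print(f"RAID5 : stripe width {raid5_stripe_width}, usable ~{raid5_usable_gb} GB")
print(f"RAID10: stripe width {raid10_stripe_width}, usable ~{raid10_usable_gb} GB")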
The web GUI admin interface is fantastic and simple. Configure the RAID5 array, then create your initial volumes that you'll export as LUNs to your two ESX hosts' WWNs. Only connect one FC port to the FC switch, and expose volumes to both ESX hosts out the same port with the same LUN. This is critical. Both ESX hosts need to see the same info or you will break things. If you balk again thinking a single 2Gb/4Gb FC link won't be fast enough, you'd be wrong. In addition, if you want to use both ports, you must have dual FC adapters in each ESX host, and you must expose all LUNs out BOTH SATABoy ports to both FC WWNs on each ESX host. You then have to set up ESX FC multipathing, which IIRC is an additional licensing fee, although I'm not positive on that. To add insult to injury, AFAIK, Nexsan isn't an ESX certified SAN vendor, so if you run into problems getting the multipathing to work, VMware techs probably won't help you. As of 2006 they weren't certified, might be today, not sure. All their gear is fully FC compliant, but I guess they never felt like paying the "VMware tax".
This is probably way too much, less-than-optimally written/organized information for the list, and probably a shade OT. I'd be more than glad to continue this off list with anyone interested in FC SAN stuff. I've got some overly aggressive spam filters, so if I block a direct email, hit postmaster@ my domain and I'll see it.
-- Stan
Stan Hoeppner spake:
Wayne Thursby put forth on 2/15/2010 11:42 PM:
I have been looking at the Dell EqualLogic stuff and it seems to provide what we need. I can get most of the information I need from the rep, but I wonder if anyone has any experience with high performance requirements on these kinds of storage.
EqualLogic has nice iSCSI SAN storage arrays with internal multiple snapshot ability and what not, but IMHO they're way over priced for what you get.
I was planning on using EqualLogic because the devices seem competent, and we already have an account with Dell. Also, being on VMware's HCL is important as we have a support contract with them.
I'd like to continue running my current hardware as the primary mail server, but provide some kind of failover using the SAN. The primary usage of the SAN will be to make our 2TB document store highly available. I'm wondering what kind of options I might have in the way of piggybacking some email failover on this kind of hardware without sacrificing the performance I'm currently enjoying.
Give me the specs on your current SAN setup and I'll give you some good options.
- What/how many FC switches do you have, Brocade, Qlogic, etc?
- What make/model is your current SAN array controller(s), what disk config?
Here's where I think you misunderstood me. I have no SAN at the moment. I'm running a monolithic Postfix/Dovecot virtual machine on an ESXi host that consists of a Dell 2950 directly attached via SAS to a Dell MD-1000 disk array. We have no Fiber Channel anything, so going that route would require purchasing a full complement of cards and switches.
[ skipping dead end questions ]
Is it as expensive as running my primary mailserver mounted from the SAN via Fiber Channel? Will that get me under 30ms latency?
I'm not sure what you mean by "expensive" in this context.
Simply that purchasing FC cards and switches adds to the cost, whereas we already have GbE for iSCSI.
I ran an entire 500 user environment, all systems, all applications, on two relatively low end FC SAN boxen, and you're concerned about the performance of a single mail SMTP/IMAP server over a SAN? I don't think you need to worry about performance, as long as all is setup correctly. ;)
I hope that is correct, thank you for sharing your experiences. I inherited a mail system that had capable hardware but was crippled by bad sysadmin-ing, so I'm trying to make sure I'm going down the right path here.
My main concern is that when Dovecot runs a body search on an inbox with 14,000 emails in it, the rest of the users don't experience any performance degradation. This works beautifully in my current setup; however, the MD-1000 is not supported by VMware, doesn't do vMotion, etc, etc. It sounds like I have nothing to worry about if I go with Fiber Channel, any idea about iSCSI?
enables vmotion and HA. (I sincerely hope you don't currently have the VM files and data store for your current IMAP store all in a single VMFS volume. That's horrible ESX implementation and will make this migration a bear due to all the data shuffling you'll have to do between partitions/filesystems, and the fact you'll probably have to shut down the server during the file moving).
My current disk layout is as follows:

Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             9.5G  4.2G  4.8G  47% /
/dev/sdb1             199G  134G   55G  71% /var/vmail
/dev/sdc1              20G   13G  6.8G  65% /var/sdc1
/dev/sdd1            1012M   20M  941M   3% /var/spool/postfix
/dev/sda1 is a regular VMWare disk. The other three are independent persistent disks so that I can snapshot/restore the VM without destroying the queue or stored email.
/var/vmail         = maildirs on RAID-10
/var/sdc1          = virusmails from Amavis on RAID-5
/var/spool/postfix = Postfix's spool on RAID-10
You certainly clarified a number of things for me by detailing your past setup. I suppose I should clarify exactly what the current plan is.
We are migrating a number of other services to some kind of an HA setup using VMWare and vMotion, that much has been decided. My primary decision centers around choosing either iSCSI or Fiber Channel. We have *no* Fiber Channel infrastructure at the moment, so this would add significantly to the price of our setup (at least 2 cards + switch).
The other applications we are virtualizing are nowhere near as disk i/o intensive as our email server, so I feel confident that an iSCSI SAN would meet all performance requirements for everything *except* the email server.
I'm really looking for a way to get some kind of redundancy/failover for Postfix/Dovecot using just iSCSI, without killing the performance I'm getting from direct attached storage, but it sounds like you're saying I need FC.
This is probably way too much, less-than-optimally written/organized information for the list, and probably a shade OT. I'd be more than glad to continue this off list with anyone interested in FC SAN stuff. I've got some overly aggressive spam filters, so if I block a direct email, hit postmaster@ my domain and I'll see it.
Well, I've got the rest of my virtual infrastructure/SAN already figured out, so my questions are centering around providing redundancy for Dovecot/maildirs. I think you've answered all of my hardware questions (ya' freak). It really seems like Fiber Channel is the way to go if I want to have HA maildirs.
I just don't know if I can justify the extra cost of a FC infrastructure just because a single service would benefit, especially if there's a hybrid solution possible, or if iSCSI was sufficient, thus my questions for the list.
Has anyone attempted to run sizable maildirs over a GbE iSCSI SAN?
--
Wayne Thursby
System Administrator
Physicians Group, LLC
Wayne Thursby put forth on 2/16/2010 9:42 AM:
I was planning on using EqualLogic because the devices seem competent, and we already have an account with Dell. Also, being on VMware's HCL is important as we have a support contract with them.
Using the standard GbE ports on the servers won't work, for two basic reasons:

1. This would require using the ESX iSCSI initiator, which isn't up to the task, as it sucks too many CPU cycles under intense disk workloads, stealing those cycles from the guests and their applications, which, coincidentally, are causing the big disk I/O workload.

2. GbE iSCSI has a maximum raw signaling rate of 125MB/s, 100MB/s after TCP overhead. This is less than a single 15K rpm SAS disk, and you'll have 14 of those in the array. That spells Bottleneck with a CAPS B, a 14:1 bottleneck. It's just not suitable for anything but low demand file transfers or very low transaction databases.
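To show where that 14:1 figure comes from, here's a minimal sketch; the usable GbE rate is from the point above, and the per-drive rate is just the conservative assumption that a 15K SAS drive can stream at least as fast as the GbE link:

# Rough sanity check of the GbE-vs-array bandwidth mismatch described above.
# The usable GbE rate is from the text; the per-drive rate is an assumption
# (a 15K SAS drive streams at least ~100 MB/s, so this is conservative).

GBE_USABLE_MB_S = 100      # ~1 Gb/s minus TCP/iSCSI overhead
SAS_15K_MB_S = 100         # assumed per-drive sequential rate
DRIVES = 14

array_mb_s = DRIVES * SAS_15K_MB_S
print(f"Array streaming potential: ~{array_mb_s} MB/s")
print(f"Single GbE iSCSI link    : ~{GBE_USABLE_MB_S} MB/s")
print(f"Bottleneck ratio         : ~{array_mb_s // GBE_USABLE_MB_S}:1")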
Here's good news. I just looked, and most of Nexsan's SAN arrays are now VMware Ready certified, including all the ones I talk about here:
http://alliances.vmware.com/public_html/catalog/PublicCatalog.php
Here's where I think you misunderstood me. I have no SAN at the moment. I'm running a monolithic Postfix/Dovecot virtual machine on an ESXi host that consists of a Dell 2950 directly attached via SAS to a Dell MD-1000 disk array. We have no Fiber Channel anything, so going that route would require purchasing a full complement of cards and switches.
Yes, I did misunderstand. My apologies. The way you worded your previous post led me to believe your organization had a small SAN being used for other things, and that you were consolidating some other applications to that SAN storage and were thinking of moving some of this VMware stuff onto it. I'm clear now that this isn't the case.
I did, however, fully understand what your current ESXi SMTP/IMAP server platform is and what you want to achieve moving forward.
Is it as expensive as running my primary mailserver mounted from the SAN via Fiber Channel? Will that get me under 30ms latency?
Without actually testing the iSCSI solution I can't state the latency. But, there is no doubt latency is going to be an order of magnitude higher with Gbe iSCSI than with 4Gb FC especially under high load. Make that 2-3 orders of magnitude higher if using software initiators. I can tell you that round trip latency of an FC block request from HBA through Qlogic switch to Nexsan array and back will be less than 10ms, and over 90% of that latency is the disk head reads, which you'll obviously have with any SAN. The magic is the low overhead of FC. With 1Gbe iSCSI, half or more of the total latency will be in the ethernet network and TCP processing.
I'm not sure what you mean by "expensive" in this context.
Simply that purchasing FC cards and switches adds to the cost, whereas we already have GbE for iSCSI.
As I stated above, 1Gbe ethernet with a software initiator is woefully inadequate for your needs. Using 1Gbe iSCSI HBAs would help slightly, 10-20% maybe, but you're still shackled with a maximum 100MB/s data rate. Again, that's slower than a single 15K SAS drive. That's not enough bandwidth for your workload, if I understand it correctly.
I ran an entire 500 user environment, all systems, all applications, on two relatively low end FC SAN boxen, and you're concerned about the performance of a single mail SMTP/IMAP server over a SAN? I don't think you need to worry about performance, as long as all is setup correctly. ;)
I hope that is correct, thank you for sharing your experiences. I inherited a mail system that had capable hardware but was crippled by bad sysadmin-ing, so I'm trying to make sure I'm going down the right path here.
You're welcome. There is no "hope" involved. It's just fact. These Nexsan controllers with big cache and fast disks can easily pump 50K random IOPS to cache and 2,500+ through to disk. They really are beasts. You would have to put 5-10X your current workload, including full body searches, through one of these Nexsan units before you'd come close to seeing any lag due to controller or disk bottlenecking.
My main concern is that when Dovecot runs a body search on an inbox with 14,000 emails in it, the rest of the users don't experience any performance degradation. This works beautifully in my current setup; however, the MD-1000 is not supported by VMware, doesn't do vMotion, etc, etc. It sounds like I have nothing to worry about if I go with Fiber Channel, any idea about iSCSI?
Like I said, you'd have to go with 10GbE iSCSI with HBAs and a 10GbE switch to meet your needs. 1GbE software initiator iSCSI will probably fall over with your workload, and your users will very likely see latency effects. And, as I said, due to this fact your costs will be far greater than the FC solution I've outlined.
My current disk layout is as follows:

Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             9.5G  4.2G  4.8G  47% /
/dev/sdb1             199G  134G   55G  71% /var/vmail
/dev/sdc1              20G   13G  6.8G  65% /var/sdc1
/dev/sdd1            1012M   20M  941M   3% /var/spool/postfix
/dev/sda1 is a regular VMWare disk. The other three are independent persistent disks so that I can snapshot/restore the VM without destroying the queue or stored email.
It's been a while since I worked with the VMware ESX GUI. Suffice it to say that each LUN you expose on the Nexsan will appear to ESX as a big SCSI disk, which you can use as VMFS to store guests, or you can assign it as a raw LUN ("raw device mapping" I think was official VMware jargon) to a particular guest. You've probably got more ESX experience at this point than I do. At the very least your experience is fresh, and mine is stale, back in the 3.0 days. I recall back in the day there were a couple of "gotchas", where if you chose one type of configuration for a LUN (disk) then you couldn't use some of the advanced backup/snapshot features. There were some trade offs one had to make. Man, it's been so long lol. Read the best practices and all the VMware info you can find on using fiber channel SANs with ESX. Avoid any gotchas WRT HA and snapshots.
You certainly clarified a number of things for me by detailing your past setup. I suppose I should clarify exactly what the current plan is.
We are migrating a number of other services to some kind of an HA setup using VMWare and vMotion, that much has been decided. My primary decision centers around choosing either iSCSI or Fiber Channel. We have *no* Fiber Channel infrastructure at the moment, so this would add significantly to the price of our setup (at least 2 cards + switch).
Nah, they're cheap, I'd say maybe $4K total. Let's see...
http://www.cdw.com/shop/products/default.aspx?EDC=1836712 http://www.qlogic.com/SiteCollectionDocuments/Education_and_Resource/Datashe...
http://www.cdw.com/shop/products/default.aspx?EDC=926795 http://download.qlogic.com/datasheet/42737/Datasheet%20-%20QLE2440%20%5BD%5D... http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/SearchByProduct.aspx?ProductCategory=39&Product=937&Os=167
http://www.cdw.com/shop/products/default.aspx?EDC=1021715

Get 4 SFP LC transceivers (always have a spare). You'll populate 3 switch ports with these, plugging the ESX servers into two of them and FC port 0 on the Nexsan into the other. With these products you'll have end-to-end 4 Gb/s links, 800 MB/s total throughput per switch link--400 MB/s each direction, full duplex.
So, let's see how close my guesstimate was:
1 x QLogic SANbox 3810, 8-port 8/4/2 Gb/s FC switch     $1,880
2 x QLogic SANblade QLE2440 host bus adapter            $  790 ea.
4 x IBM 4Gbps SW SFP Transceiver                        $  140 ea.
Total:                                                  $4,020
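For anyone checking the arithmetic on that quote, the switch price is a line total while the HBA and SFP prices only add up if they're taken as per-unit figures:

# Quick check of the quote above: the only way the $4,020 total works out
# is with the HBA and transceiver prices taken as per-unit figures.
total = 1 * 1880 + 2 * 790 + 4 * 140
print(f"Total: ${total:,}")    # -> Total: $4,020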
Yep, about $4K. I underestimated by $20, but then again, CDW isn't the cheapest vendor by far; I used them as an example because I knew they carried all this stuff. They carry all the Nexsan arrays as well, but unfortunately, just like everyone else, for SAN products in this price range you have to call to get a quote. Get yourself quotes from CDW and SANDirect.com on these standard factory configurations:
Nexsan SASBoy,  2 FC, 2 iSCSI, 2GB cache, 14 x 300GB 15K SAS drives
Nexsan SATABoy, 2 FC, 2 iSCSI, 1GB cache, 14 x 500GB 7.2K SATA drives
The first will give you more performance than you can imagine, and will allow for 10 years of performance growth, though at 4.2 raw TB, you may run out of space before 10 years. Depends on whether you store digital x-rays etc. on it. These arrays would really shine in that application BTW. Nexsans have won multiple performance awards for their streaming, and their random I/O is fantastic as well.
The other applications we are virtualizing are nowhere near as disk i/o intensive as our email server, so I feel confident that an iSCSI SAN would meet all performance requirements for everything *except* the email server.
One key point that you are failing to realize is that the advanced storage and backup features of ESX itself demand high bandwidth, low latency access to the SAN storage arrays. Snapshots, backup, etc. VMware snapshots will fill FC links to capacity until completed, unless you lower their priority (not sure if that's possible). Anyway, if you want/need to use any of ESX's advanced capabilities, 1GbE iSCSI isn't going to cut it. We had 2Gb FC, and some operations I performed had to be done at night or on weekends because they filled the SAN pipes. You may run into that even with 4Gb FC. And if you do, you can pat yourself on the back for going FC, as 1GbE iSCSI would take over 4 times as long to complete the same storage operation. :)
I'm really looking for a way to get some kind of redundancy/failover for Postfix/Dovecot using just iSCSI, without killing the performance I'm getting from direct attached storage, but it sounds like you're saying I need FC.
To maintain the level of I/O performance you currently have, but in a SAN environment which allows the VMware magic, you will require either an FC SAN or a 10GbE iSCSI SAN. The 10GbE iSCSI solution will probably be almost twice the total price, will be more difficult to set up and troubleshoot, and will have no more, and likely less, total performance than the 4Gb FC solution.
Well, I've got the rest of my virtual infrastructure/SAN already figured out, so my questions are centering around providing redundancy for Dovecot/maildirs. I think you've answered all of my hardware questions (ya' freak). It really seems like Fiber Channel is the way to go if I want to have HA maildirs.
It's not just maildirs you're making HA but the entire Linux guest server, or all your VM guests if you want. All ESX servers connected to shared SAN storage can start and run any VM guest in the environment residing on those SAN LUNs and can access any raw device mappings (raw LUNs) associated with a VM. This is also what makes vmotion possible. It's incredible technology really. Once you start getting a good grasp on what VMware ESX, Vmotion, HA, Snapshots, etc can really do for you, you'll start buying additional machines and ESX licenses, and you'll end up consolidating every possible standalone server you have onto VMware. The single largest overriding reason for this is single point backup and disaster recovery.
With consolidated backup, and a large enough tape library system, it's possible to do a complete nightly backup of your entire VMware environment including all data on the SAN array(s), and rotate the entire set of tapes off site for catastrophic event recovery, for things such as fire, earthquake, flood, etc. In the immediate aftermath, you can acquire one big machine with appropriate HBAs, an identical SAN array, switch, tape library, etc, and restore the entire system in less than 24 hours, bringing up only critical VMs until you're able to get more new machines in and setup. The beauty of ESX is that there is nothing to restore onto all the ESX hosts. All you do is a fresh install of ESX and configure it to see the SAN LUNs. You can have a copy of the ESX host configuration files sitting on the SAN, and thus in the DR backup.
Normally this is done in a temporary data center colocation facility with internet access so at minimum principals within the organization (CEO, CFO, VPs, etc) can get access to critical information to start rebuilding the organization. This is all basic business continuity 101 stuff, so I won't go into more detail. The key point is that with ESX, an FC SAN, a tape library and consolidated backup, the time to get an organization back up and running after a catastrophe is cut from possibly weeks to a couple of days, most of that time being spent working with insurance folk and waiting on the emergency replacement hardware to arrive.
There is no replacement for off-site tape other than a hot/standby remote datacenter, and most can't afford that. Thus, one needs a high performance, high capacity tape library/silo. Doing consolidated backup of one's VMware environment requires fast access to the storage. 1GbE iSCSI is not even close to appropriate for this purpose. Case in point: say you have 4TB of VMs and data LUNs on your array. If you can get 100% of the iSCSI GbE bandwidth for consolidated backup--which you can't, because the VMs are going to be live at the time, and you can't get 100% out of GbE anyway due to TCP--it would take 11 hours to back up that 4TB, as it all has to come off the array via 1GbE iSCSI. If you have 4Gb FC it would cut that time by 4, that 11 hours becoming a little under 3 hours. An 11 hour backup window is business disruptive, and makes it difficult to properly manage an off-site backup procedure (which everyone should have).
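Here's that backup-window arithmetic spelled out; the 4TB figure is from the example above, and the link rates are nominal usable figures (real throughput will be somewhat lower):

# Backup-window arithmetic for the example above: ~4 TB of VM and data LUNs
# pulled off the array during consolidated backup. Link rates are nominal
# usable figures; real-world throughput will be somewhat lower.

DATA_MB = 4 * 1_000_000                        # 4 TB expressed in MB
LINKS = {"1GbE iSCSI": 100, "4Gb FC": 400}     # rough usable MB/s per link

for name, mb_s in LINKS.items():
    hours = DATA_MB / mb_s / 3600
    print(f"{name:>10}: ~{hours:.1f} hours to move 4 TB")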
I just don't know if I can justify the extra cost of a FC infrastructure just because a single service would benefit, especially if there's a hybrid solution possible, or if iSCSI was sufficient, thus my questions for the list.
I covered the costs above, and again, FC beats iSCSI all around the block and on Sunday, unless Equallogic has dropped their prices considerably since 2006. If by hybrid you mean a SAN array that supports both 4Gb FC and 1Gbe iSCSI, all of the Nexsan units fit that bill with two 4Gb FC and two 1Gbe iSCSI ports per controller.
Sorry this is so freak'n long. Hardware is my passion, and I guess verbosity is my disease. Hope the info helps in one way or another. If nothing else it may put you to sleep faster than a boring book. ;)
-- Stan
I think Stan pretty much covered how to do this stuff *properly*, however, for those following along in the bedroom, there are a couple of interesting projects that might get you some of the ESX features (surely at the expense of far more support and likely reliability, but needs always vary...)
Note, I have no experience with any of these projects, they simply caught my eye for further research...
- Latest KVM+QEMU includes some of the desirable ESX features including hot migration
- Apparently Redhat have a nice management utility for this
- Or try ProxMox: http://pve.proxmox.com/wiki/Main_Page
(cheap) High availability storage seems to come down to:
- iSCSI
- Add redundancy to the storage using DRBD (I believe a successful strategy with Dovecot is pairs of servers, replicated to each other - run each at 50% capacity and if one dies the other picks up the slack)
- Interesting developing ideas are: PVFS, GlusterFS (they have an interesting "appliance" which might get reliability to production levels?), Ceph (reviews suggest it's very early days)
None of these solutions gets you an "enterprise" or proper high end solution as described by Stan, but may give some others some things to investigate
Cheers
Ed W
Ed W put forth on 2/17/2010 12:25 PM:
I think Stan pretty much covered how to do this stuff *properly*,
At least for a VMware ESX + SAN environment, yes.
however, for those following along in the bedroom, there are a couple of interesting projects that might get you some of the ESX features (surely at the expense of far more support and likely reliability, but needs always vary...)
Surely. I hate the licensing cost of VMware ESX and the options. Also, the first time I was told about VMware ESX I was extremely skeptical. Once I started using it, and built out a SAN architecture under it, I was really, really impressed by what it can do, and its management capabilities. It will be a long time until a FOSS equivalent even comes close to its performance, reliability, capability, and ease of management. It really is a great solution. HA, Consolidated Backup, and a couple of other technologies are what really make this an enterprise solution, providing near 24x7x365 uptime and rapid redeployment of an infrastructure after catastrophic loss of the datacenter.
Note, I have no experience with any of these projects, they simply caught my eye for further research...
- Latest KVM+QEMU includes some of the desirable ESX features including hot migration
- Apparently Redhat have a nice management utility for this
I'll have to look into this.
- Or try ProxMox: http://pve.proxmox.com/wiki/Main_Page
(cheap) High availability storage seems to come down to:
- iSCSI
1GbE iSCSI is great for targeted applications on moderate load SANs. For any kind of heavy lifting, you need either 10GbE iSCSI or Fiber Channel. Both of those are a bit more expensive, with 10GbE iSCSI usually costing quite a bit more than FC because of the switch and HBA costs. Either is suitable for an HA SAN with live backup. 1GbE iSCSI is not--simply too little bandwidth and too much latency.
- Add redundancy to the storage using DRBD (I believe a successful strategy with Dovecot is pairs of servers, replicated to each other - run each at 50% capacity and if one dies the other picks up the slack)
DRBD is alright for a couple of replicated hosts with moderate volume. If you run two load balanced hot hosts with DRBD, and your load increases to the point you need more capacity, a 3rd hot host, expanding with DRBD gets a bit messy. With an iSCSI or FC SAN you merely plug in a 3rd host, install and configure the cluster FS software, expose the shared LUN to the host, and basically you're up and running in little time. All 3 hosts share the exact same data on disk, so you have no replication issues, no matter how many systems you stick into the cluster. The only limitation is the throughput of your SAN array.
- Interesting developing ideas are: PVFS, GlusterFS (they have an interesting "appliance" which might get reliability to production levels?), Ceph (reviews suggest it's very early days)
GlusterFS isn't designed as a primary storage system for servers or server clusters. A good description of it would be "cloud storage". It is designed to mask, or make irrelevant, the location of data storage devices and the distance to them. Server and datacenter architects need to know the latency characteristics and bandwidth of storage devices backing the servers. GlusterFS is the antithesis of this.
None of these solutions gets you an "enterprise" or proper high end solution as described by Stan, but may give some others some things to investigate
"Enterprise" capability, performance, and reliability don't necessarily have to come with an "Enterprise" price tag. ;)
Eric Rostetter is already using GFS2 over DRBD with two hot nodes. IIRC he didn't elaborate a lot on the performance or his hardware config. He seemed to think the performance was more than satisfactory.
Eric, can you tell us more about your setup, in detail? I promise I'll sit quiet and just listen. Everyone else may appreciate your information.
-- Stan
Quoting Stan Hoeppner <stan@hardwarefreak.com>:
- Add redundancy to the storage using DRBD (I believe a successful strategy with Dovecot is pairs of servers, replicated to each other - run each at 50% capacity and if one dies the other picks up the slack)
DRBD is alright for a couple of replicated hosts with moderate volume.
Not sure how you define "moderate" load... Seems like in a 2 node cluster it does a nice job for fairly high load, as long as it is setup correctly. Kind of like what you say about the SAN though, the faster the DRBD interconnect, the better it can handle the load (100Mb, 1Gb, 10Gb, other methods, etc).
If you run two load balanced hot hosts with DRBD, and your load increases
to the point you need more capacity, a 3rd hot host, expanding with DRBD gets a bit messy.
Very much so... I'm running GFS on them, and if I need to add more hosts I'll probably do it via GNBD instead of adding more DRBD connections... Growing by adding more DRBD doesn't seem desirable in most cases, but growing by sharing the existing 2 DRBD machines out (NFS, GNBD, Samba, iSCSI, etc) seems easy, and if the additional machines don't need the raw disk speed it should work fine. If the new machines need the same raw disk speed, well, then you either are going to have to do a complex DRBD setup, or go with a more proper SAN setup.
With an iSCSI or FC SAN you merely plug in a 3rd host, install and
configure the cluster FS software, expose the shared LUN to the host, and
basically you're up and running in little time.
Not much different in effort/complexity than my solution of using GFS+GNBD to grow it... But surely better in terms of disk performance to the newly added machine...
RedHat claims GNBD scales well, but I've not yet been able to prove that.
All 3 hosts share the exact same data on disk, so you have no replication issues
If you have no replication issues, you have a single point of failure... Which is why most SAN's support replication of some sort...
no matter how many systems you stick into the cluster. The only limitation is the throughput of your SAN array.
Or licensing costs in some cases...
Eric Rostetter is already using GFS2 over DRBD with two hot nodes. IIRC he didn't elaborate a lot on the performance or his hardware config.
He seemed to think the performance was more than satisfactory.
I've posted the hardware config to the list many times in the past...
The performance is very good, but due to price restrictions it is not great. That is because the cost of building it with 15K SAS drives was 3x the cost of using SATA drives, so I'm stuck with SATA drives... And the cost of faster CPU's would have pushed it over budget also...
The SATA drives are okay, but will never give the performance of the SAS drives, and hence my cluster is not what I would call "very fast". But it is fast enough for our use, which is all that matters. If we need in the future, we can swap the SATA out for SAS, but that probably won't happen unless the price of SAS comes way down, and/or capacity goes way up...
Eric, can you tell us more about your setup, in detail? I promise I'll sit quiet and just listen. Everyone else may appreciate your information.
I have two clusters... One is a SAN, the other is a mail cluster. I'll describe the Mail cluster here, not the SAN. They are the same exact hardware except for the (number, size, configuration) of disks...
I get educational pricing, so your costs may vary, but for us this fit the budget and a proper SAN didn't.
2 Dell PE 2900, dual quad-core E5410 Xeons at 2.33 GHz (8 cores), 8GB RAM, Perc 6/i Raid Controller, 8 SATA disks (2 RAID-1, 4 RAID 10, 1 JBOD, and 1 Global Hot Spare), 6 1Gb nics (we use nic bonding so the mail connections use one bond pair, and the DRBD traffic uses another bond pair... the other two are for clustering and admin use).
Machines mirror shared GFS2 storage with DRBD. Local storage is ext3.
OS is CentOS 5.x. Email software is sendmail+procmail+spamassassin+clamav, mailman, and of course dovecot. Please don't flame me for using sendmail instead of your favorite MTA...
The hardware specs are such that we intend to use this for about 10 years... In case you think that is funny, I'm still running Dell PE 2300 machines in production here that we bought in 1999/2000... We get a lot of years from our machines here...
We have a third machine in the cluster acting as a webmail server (apache, Horde software). It doesn't share any storage though, but it is part of the cluster (helps with split-brain, etc). It is a Dell PE 2650 with dual 3.2 Ghz Xeons, 3GB RAM, SCSI with Software Raid also running CentOS 5.
Both of the above machines mount home directories off the NAS/SAN I mentioned. So the webmail only has the OS and stuff local, the mail cluster has all the inboxes and queues local (but not other folders), and the NAS/SAN has all the home directories (which include mail folders other than the INBOX). This means in effect the INBOX is much faster than the other folders, which meets our design criteria (we needed fast processing of incoming mail, fast INBOX access, but other folder access speed wasn't considered critical).
The mail cluster is active-active DRBD. The NAS/SAN cluster is active-passive DRBD. That means I can take mail machines up and down without anyone noticing (services migrate with only about a 1 second "pause" for a user hitting it at the exact moment), but taking the active NAS/SAN node down results in a longer "pause" (usually 15-30 seconds) from the user's perspective while the active node hands things off to the standby node...
The NAS/SAN was my first DRBD cluster, so active-passive was easy to keep it simple and easy. The mail cluster was my second one, so I had some experience and went active-active.
--
Eric Rostetter
The Department of Physics
The University of Texas at Austin

Go Longhorns!
Thank you to everyone who has contributed to this thread, it has been very educational.
Since my last post, I have had several meetings, including a conference with Dell storage specialists. I have also gathered some metrics to beat around.
The EqualLogic units we are looking at are the baseline models, the PS4000E. We would get two of these with 16x1TB 7200RPM SATA drives and dual controllers for a total for 4xGbE ports dedicated to iSCSI traffic.
I have sent the following information and questions to our Dell reps, but I figured I'd solicit opinions from the group.
The two servers I'm worried about are our mail server (Postfix/Dovecot) and our database server (PostgreSQL). Our mail server regularly (several times an hour) hits 1 second spikes of 1400 IOPS in its current configuration. Our database server runs around 100-200 IOPS during quiet periods, and spikes up to 1200 IOPS randomly, but on average every 15 minutes.
With 4xGbE ports on each EQL device, and also keeping in mind we'll have two of those, is it reasonable to expect 1400 IOPS bursts? What if both of these servers were on the same storage and required closer to 3000 IOPS?
--
Wayne Thursby
System Administrator
Physicians Group, LLC
Wayne Thursby put forth on 2/19/2010 3:40 PM:
Thank you to everyone who has contributed to this thread, it has been very educational.
Since my last post, I have had several meetings, including a conference with Dell storage specialists. I have also gathered some metrics to beat around.
The EqualLogic units we are looking at are the baseline models, the PS4000E. We would get two of these with 16x1TB 7200RPM SATA drives and dual controllers for a total for 4xGbE ports dedicated to iSCSI traffic.
I have sent the following information and questions to our Dell reps, but I figured I'd solicit opinions from the group.
The two servers I'm worried about are our mail server (Postfix/Dovecot) and our database server (PostgreSQL). Our mail server regularly (several times an hour) hits 1 second spikes of 1400 IOPS in its current configuration. Our database server runs around 100-200 IOPS during quiet periods, and spikes up to 1200 IOPS randomly, but on average every 15 minutes.
With 4xGbE ports on each EQL device, and also keeping in mind we'll have two of those, is it reasonable to expect 1400 IOPS bursts? What if both of these servers were on the same storage and required closer to 3000 IOPS?
The first thing you need to do, Wayne, is talk to your VMware rep and set up a 15-30 minute teleconference with a VMware engineer. Or, if you have a local VMware consultant/engineer, set a meet with him. You need to get their thoughts and recommendations on your goals and what you're currently looking at hardware wise to implement them.
It sounds like you're set on using the ESX software iSCSI initiator, and using 2 to 4 standard GigE ports on each of your ESXi servers in some kind of ethernet channel bonding and/or active/active multipathing setup. I cannot say for certain because I don't know the current certified configurations, BUT my instinct, based on prior experience and previous knowledge, says this isn't possible, and if possible, not desirable from a performance standpoint. To do this in a certified configuration, I'm guessing you at the very least will need two single port iSCSI HBAs in each server, or one dual port iSCSI HBA in each server.
Please get the right technical answers to these questions from VMware before shooting yourself in the foot, for your sake. ;)
If it turns out you can't bond 2/4 Gbe iSCSI ports in an active/active setup, you're probably going to need to go 10Gbe iSCSI, stepping up a few models in the Equallogic lineup and stepping up to 10Gbe HBAs. The other option (a better, and cheaper one) is going 4Gb Fiber channel, as I previously mentioned.
-- Stan
On 19/02/2010 21:40, Wayne Thursby wrote:
Thank you to everyone who has contributed to this thread, it has been very educational.
Since my last post, I have had several meetings, including a conference with Dell storage specialists. I have also gathered some metrics to beat around.
The EqualLogic units we are looking at are the baseline models, the PS4000E. We would get two of these with 16x1TB 7200RPM SATA drives and dual controllers for a total for 4xGbE ports dedicated to iSCSI traffic.
I have sent the following information and questions to our Dell reps, but I figured I'd solicit opinions from the group.
The two servers I'm worried about are our mail server (Postfix/Dovecot) and our database server (PostgreSQL). Our mail server regularly (several times an hour) hits 1 second spikes of 1400 IOPS in its current configuration. Our database server runs around 100-200 IOPS during quiet periods, and spikes up to 1200 IOPS randomly, but on average every 15 minutes.
With 4xGbE ports on each EQL device, and also keeping in mind we'll have two of those, is it reasonable to expect 1400 IOPS bursts? What if both of these servers were on the same storage and required closer to 3000 IOPS?
That's a LOT of IOPS for 16 disks to handle. Given you are measuring on your existing hardware, which has 5-10 effective spindles depending on the read/write mix (RAID-10), this surely means you are trying to push even more than you state and are just maxing out at the disks' capacity?
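For what it's worth, here is the rough spindle math behind that concern, assuming a RAID10 layout for illustration; the per-drive IOPS figure is only a rule-of-thumb assumption, and controller cache will absorb short bursts, so treat this as a sustained-rate estimate:

# Rough spindle math for a 16 x 7200 RPM SATA array such as the PS4000E.
# Per-drive random IOPS is an assumed rule-of-thumb figure (~75-100);
# controller cache will absorb short bursts, so this is sustained only.

DRIVES = 16
IOPS_PER_DRIVE = 80          # assumed for a 7.2K SATA spindle

# RAID10: reads can hit every spindle; each write lands on a mirrored pair.
raid10_read_iops = DRIVES * IOPS_PER_DRIVE
raid10_write_iops = (DRIVES // 2) * IOPS_PER_DRIVE

print(f"RAID10 sustained reads : ~{raid10_read_iops} IOPS")
print(f"RAID10 sustained writes: ~{raid10_write_iops} IOPS")
# So 1400 IOPS bursts sit right at (reads) or above (writes) what the
# spindles alone can sustain, which is the point about maxing out.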
I have no experience, but some reading over the last few days suggests you would very much desire an Equallogic with FC if the budget is there. On the other hand buying two Dell/Supermicro machines with lots of disks and using DRBD to make each a duplicate of the other would appear to satisfy your requirements also? (perhaps cheaper, but less scalability). DRBD sounds really nice for scalability up to a certain size?
Good luck
Ed W
Hi
HA, Consolidated Backup, and a couple of other technologies are what really make this an enterprise solution, providing near 24x7x365 uptime and rapid redeployment of an infrastructure after catastrophic loss of the datacenter.
Can you tell me exactly what "Consolidated Backup" means with respect to ESX please? From the brief description on the website I'm not quite sure how it varies to say backing up the raw storage using some kind of snapshot method?
GlusterFS isn't designed as a primary storage system for servers or server clusters. A good description of it would be "cloud storage". It is designed to mask, or make irrelevant, the location of data storage devices and the distance to them. Server and datacenter architects need to know the latency characteristics and bandwidth of storage devices backing the servers. GlusterFS is the antithesis of this.
I can't disagree in terms of achieved performance because I haven't tested, but in terms of theoretical design it is supposed to vary from how you describe?
GlusterFS has a growing number of translators and eventually is likely to have native NFS & CIFS support straight into the cluster. So *in theory* (difference between theory and practice? In theory nothing, in practice everything.) you are getting parallel NFS performance as you add nodes, with the option of also adding redundancy and HA for free...

I get the impression the current implementation deviates somewhat from theory, but long term that's the goal...
I was giving this some thought - essentially the whole problem comes down to either some kind of filesharing system which offers up individual files, or some kind of block level sharing and you have to then run your own filesystem over the block device.
Now, if latency were zero and fileserver had infinite CPU/bandwidth then it would seem like the filesharing system wins because it centralises the locking and all other problems and leaves relatively thin clients
On the flip side since latency/bandwidth very much deviates from perfect then to me the block level storage initially seems more attractive because the client can be given "intelligence" about the constraints and make appropriate choices about fetching blocks, ordering, caching, flushing, etc. However, if we assume active/active clusters are required then we need GFS or similar and we have just added a whole heap of latency and locking management. This plus the latency of translating a disk based protocol (scsi/ata) into network packets suddenly makes the block level option look a lot less attractive...
So the final conclusion seems like it's a hard problem and the "best" solution is going to come down to an engineering decision - ie where theory and practice deviate and which one actually gets the job done fastest in practice?
At least in theory it seems like Gluster should be able to rival the speed of a high end iSCSI SAN - whether the practical engineering problems are ever solved is a different matter... (Random quote - http://www.voicesofit.com/blogs/blog1.php/2009/12/29/gluster-the-red-hat-of-... - Gluster claim 131,000 IOPS on some random benchmark using 8 servers and 18TB of storage...)
Interesting seeing how this stuff is maturing though! Sounds like the SAN is still the king for people just want something fast reliable and off the shelf today...
Ed W
Ed W put forth on 2/22/2010 7:03 AM:
Can you tell me exactly what "Consolidated Backup" means with respect to ESX please? From the brief description on the website I'm not quite sure how it varies to say backing up the raw storage using some kind of snapshot method?
Here's a decent write up Ed that should answer your questions: http://www.petri.co.il/virtual-vmware-consolidated-backup-vcb.htm
I was giving this some thought - essentially the whole problem comes down to either some kind of filesharing system which offers up individual files, or some kind of block level sharing and you have to then run your own filesystem over the block device.
The best solution for this currently existing on this blue planet is SGI's CXFS. It is the clustered version of XFS, sharing an identical on-disk format. It is the highest performance and most reliable parallel/cluster filesystem available. It was initially released simultaneously with XFS in 1994. It is a clustered file system requiring FC SAN storage. One host acts as a CXFS metadata server. All hosts in the cluster directly access the same LUN on the disk array controller. The metadata server coordinates the notification of blocks that are locked for write access by a particular node. The performance is greater than GFS and similar parallel filesystems due to the centralized metadata server reducing chatter and message latency, and the fact that the on-disk filesystem is XFS, which as we've discussed is the fastest filesystem available (aggregate across multiple benchmarks). Unfortunately, SGI did not open source CXFS, only XFS. CXFS must still be licensed from SGI. I do not know the cost. For many environments, the cost is irrelevant, as there simply is no other solution to meet their needs. For something like clustering IMAP server data for redundancy, CXFS is probably overkill. GFS2 should be fine for clustered IMAP storage.
-- Stan