On 1/7/2012 7:55 PM, Sven Hartge wrote:
Stan Hoeppner stan@hardwarefreak.com wrote:
It's highly likely your problems can be solved without the drastic architecture change you describe below, and without the new problems it would introduce.
The main reason is I need to replace the hardware as its service contract ends this year and I am not able to extend it further.
The box so far is fine, there are normally no problems during normal operations with speed or responsiveness towards the end-user.
Sometimes, higher peak loads tend to strain the system a bit, and this is starting to occur more often. ... My first thought was to move this setup into our VMware cluster (yeah, I know, spare me the screams), since the hardware used there is far more powerful than the current hardware and I wouldn't have to buy new servers for my mail system (which is painful to do in a university environment, especially in Germany, once the investment exceeds a certain threshold).
What's wrong with moving it onto VMware? This actually seems like a smart move given your description of the node hardware. It also gives you much greater backup flexibility with VCB (or whatever they call it today). You can snapshot the LUN over the SAN during off peak hours to a backup server and do the actual backup to the library at your leisure. Forgive me if the software names have changed as I've not used VMware since ESX3 back in 07.
But then I thought about the problems with VMs this size and got to the idea with the distributed setup, splitting the one server into 4 or 6 backend servers.
Not sure what you mean by "VMs this size". Do you mean memory requirements or filesystem size? If the nodes have enough RAM that's no issue. And surely you're not thinking of using a .vmdk for the mailbox storage. You'd use an RDM SAN LUN. In fact you should be able to map in the existing XFS storage LUN and use it as is. Assuming it's not going into retirement as well.
If an individual VMware node doesn't have sufficient RAM, you could build a VM-based Dovecot cluster: run the two Dovecot VMs on separate nodes and thin out the other VMs allowed to run on those nodes. Since you can't directly share XFS, build a tiny Debian NFS server VM, map the XFS LUN to it, and export the filesystem to the two Dovecot VMs. You could install the Dovecot director on this NFS server VM as well. Converting from maildir to mdbox should help eliminate the NFS locking problems. I would do the conversion before migrating to this VM setup with NFS.
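A minimal sketch of the export/mount wiring for that NFS server VM; the path, hostname, and subnet below are made up, and the Dovecot wiki's NFS page is the authority on mount options for your version:

```
# /etc/exports on the NFS server VM (path and subnet are hypothetical)
/srv/mailstore  10.10.0.0/24(rw,async,no_root_squash,no_subtree_check)

# /etc/fstab entry on each Dovecot VM; 'hard' is the safe choice for a
# mail spool -- better a brief hang than corruption on a network blip
nfs-vm:/srv/mailstore  /srv/mailstore  nfs  rw,hard,intr  0  0
```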
Also, run the NFS server VM on the same physical node as one of the Dovecot servers. The NFS traffic will be a memory-to-memory copy instead of going over the GbE wire, decreasing IO latency and increasing performance for that Dovecot server. If it's possible to have Dovecot director or your fav load balancer weight more connections to one Dovecot node, funnel 10-15% more connections to this one. (I'm no director guru; in fact I haven't used it yet.)
Assuming the CPUs in the VMware cluster nodes are clocked a decent amount higher than 1.8GHz I wouldn't monkey with configuring virtual smp for these two VMs, as they'll be IO bound not CPU bound.
As I said: "idea". Other ideas making my life easier are more than welcome.
I hope my suggestions contribute to doing so. :)
Ideas? Suggestions? Nudges in the right direction?
Yes. We need more real information. Please provide:
- Mailbox count, total maildir file count and size
about 10,000 Maildir++ boxes
900GB for 1300GB used, "df -i" says 11 million inodes used
Converting to mdbox will take a large burden off your storage, as you've seen. With ~1.3TB consumed of ~15TB you should have plenty of space to convert to mdbox while avoiding filesystem fragmentation. With maildir you likely didn't see heavy fragmentation due to small file sizes. With mdbox, especially at 50MB, you'll likely start seeing more fragmentation. Use this to periodically check the fragmentation level:
$ xfs_db -r -c frag [device]

e.g.

$ xfs_db -r -c frag /dev/sda7
actual 76109, ideal 75422, fragmentation factor 0.90%
I'd recommend running xfs_fsr when the frag factor exceeds ~20-30%. The XFS developers recommend against running xfs_fsr too often, as it can actually increase free space fragmentation while it decreases file fragmentation, especially on filesystems that are relatively full. Heavily fragmented free space is worse than fragmented files, as newly created files will automatically be fragged.
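A nightly cron job along these lines keeps the check honest; the device path and the 25% threshold are my assumptions (tune both for your array), and the parsing matches the xfs_db report line shown above:

```shell
#!/bin/sh
# Periodic defrag check -- a sketch, not a turnkey script.

# Extract the integer fragmentation factor from xfs_db's report line,
# e.g. "actual 76109, ideal 75422, fragmentation factor 0.90%"
frag_pct() {
    awk '{sub(/%/, "", $7); print int($7)}'
}

DEV=/dev/sda7    # hypothetical device, substitute your mailstore LV
THRESHOLD=25

if [ -b "$DEV" ]; then
    frag=$(xfs_db -r -c frag "$DEV" | frag_pct)
    if [ "$frag" -ge "$THRESHOLD" ]; then
        # -t caps xfs_fsr's runtime in seconds so cron can't run all day
        xfs_fsr -t 7200 "$DEV"
    fi
fi
```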
I know, this is very _tiny_ compared to the systems ISPs are using.
Not everyone is an ISP, including me. :)
- Average/peak concurrent user connections
IMAP: Average 800 concurrent user connections, peaking at about 1400. POP3: Average 300 concurrent user connections, peaking at about 600.
- CPU type/speed/total core count, total RAM, free RAM (incl buffers)
Currently dual-core AMD Opteron 2210, 1.8GHz.
Heheh, yeah, a bit long in the tooth, but not horribly underpowered for 1100 concurrent POP/IMAP users. Though this may be the reason for the sluggishness when you hit that 2000 concurrent user peak. Any chance you have some top output for the peak period?
Right now, in the middle of the night (2:30 AM here) on a Sunday, thus a low point in the usage pattern:
             total       used       free     shared    buffers     cached
Mem:      12335820    9720252    2615568          0      53112     680424
-/+ buffers/cache:    8986716    3349104
Swap:      5855676      10916    5844760
Ugh... the "-m" and "-g" options exist for a reason. :) So this box has 12GB RAM, with ~2.5GB free during off-peak hours. It would be interesting to see the free RAM and swap usage values during peak. That would tell us whether we're CPU or RAM starved. If both turned up clean then we'd need to look at iowait. If you're not RAM starved then moving to VMware nodes with 16/24/32GB RAM should work fine, as long as you don't stack many other VMs on top. Enabling memory dedup may help a little.
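For anyone reading along, the "-/+ buffers/cache" row is derived from the other three figures; a quick arithmetic check using the numbers quoted above (all in KB):

```shell
# Values copied from the free(1) output above
free_kb=2615568
buffers_kb=53112
cached_kb=680424

# The "-/+ buffers/cache" free column is simply the sum of the three:
# RAM the kernel can hand back to applications on demand.
avail_kb=$((free_kb + buffers_kb + cached_kb))
echo "available: ${avail_kb} KB (~$((avail_kb / 1024 / 1024)) GB)"
```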
System reaches its 7th year this summer, which is the end of its service contract.
Enjoy your retirement old workhorse. :)
- Storage configuration--total spindles, RAID level, hard or soft RAID
RAID 6 with 12 SATA1.5 disks, external 4Gbit FC
I assume this means a LUN on a SAN array somewhere on the other end of that multi-mode cable, yes? Can you tell us what brand/model the box is?
Back in 2005, a SAS enclosure was way too expensive for us to afford.
How one affords an FC SAN array but not a less expensive direct attach SAS enclosure is a mystery... :)
- Filesystem type
XFS in a LVM to allow snapshots for backup
XFS is the only way to fly, IMNSHO.
I of course aligned the partitions on the RAID correctly, and of course created the filesystem with the correct parameters wrt. spindles, chunk size, etc.
Which is critical for mitigating the RMW penalty of parity RAID. Speaking of which, why RAID6 for maildir? Given that your array is 90% vacant, why didn't you go with RAID10 for 3-5 times the random write performance?
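The usual back-of-envelope numbers behind that claim, assuming ~75 random IOPS per 7.2k SATA spindle (a rule of thumb, not a measurement of this particular array):

```shell
# Small random write throughput, 12-spindle array
spindles=12
disk_iops=75        # assumed per-disk figure for 7.2k SATA

raid6_penalty=6     # read old data+P+Q, then write new data+P+Q
raid10_penalty=2    # write both sides of one mirror

raid6_w=$(( spindles * disk_iops / raid6_penalty ))
raid10_w=$(( spindles * disk_iops / raid10_penalty ))
echo "RAID6: ${raid6_w} writes/s  RAID10: ${raid10_w} writes/s"
```

That's a 3x gap at the same spindle count; controllers that pay the full 6-IO penalty less often do better, ones without write cache do worse, hence the 3-5x range.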
- Backup software/method
Full backup with Bacula, taking about 24 hours right now. Because of this, I switched to virtual full backups, only ever doing incremental and differential backups off of the real system and creating synthetic full backups inside Bacula. It works fine though, with incrementals taking 2 hours and differentials about 4 hours.
Move to VMware and use VCB. You'll fall in love.
The main problem of the backup time is Maildir++. During a test, I copied the mail storage to a spare box, converted it to mdbox (50MB file size) and the backup was lightning fast compared to the Maildir++ format.
Well of course. You were surprised by this? How long has it been since you used mbox? mbox backs up even faster than mdbox. Why? Larger files and fewer of them. Which means the disks can actually do streaming reads, and don't have to beat their heads to death jumping all over the platters to read maildir files, which are scattered all over the place when created. Which is why maildir is described as a "random" IO workload.
Additionally, compressing the mails inside the mdbox instead of having Bacula compress them for me reduces the backup time further (and speeds up access through IMAP and POP3).
Again, no surprise here. When files exist on disk already compressed it takes less IO bandwidth to read the file data for a given actual file size. So if you have say 10MB files that compress down to 5MB, you can read twice as many files when the pipe is saturated, twice as much file data.
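For the archives, enabling transparent compression of newly saved mail looks roughly like this in Dovecot 2.x; check the zlib plugin documentation against your Dovecot version, and note existing messages stay uncompressed until rewritten:

```
# conf.d/10-mail.conf
mail_plugins = $mail_plugins zlib

# conf.d/90-plugin.conf
plugin {
  zlib_save = gz        # compress newly saved mails with gzip
  zlib_save_level = 6   # 1..9, speed vs. ratio trade-off
}
```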
So this is the way to go, I think, regardless of which way I implement the backend mail server.
Which is why I asked my questions. :) mdbox would have been one of my recommendations, but you already discovered it.
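The per-user conversion can be sketched with dsync (Dovecot 2.x); the mdbox location here is illustrative, and as you did, rehearse on a copy of the mailstore before any cutover:

```
# Convert each user in place; 'doveadm user' iterates the userdb
# (requires a userdb that supports iteration).
for user in $(doveadm user '*'); do
    dsync -u "$user" mirror mdbox:~/mdbox
done
```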
- Operating system
Debian Linux Lenny, currently with kernel 2.6.39
:) Debian, XFS, Dovecot, FC SAN storage--I like your style. Lenny with 2.6.39? Is that a backport or rolled kernel? Not Squeeze? Interesting. I'm running Squeeze with rolled vanilla 2.6.38.6. It's been about 6 months so it's 'bout time I roll a new one. :)
Instead of telling us what you think the solution to your unidentified bottleneck is and then asking "yeah or nay", tell us what the problem is and allow us to recommend solutions.
I am not asking for "yay or nay", I just pointed out my idea, but I am open to other suggestions.
I think you've already discovered the best suggestions on your own.
If the general idea is to buy a new big single storage system, I am more than happy to do just this, because this will prevent any problems I might have with a distributed one before they even can occur.
One box is definitely easier to administer and troubleshoot. Though I must say that even though it's more complex, I think the VM architecture I described is worth a serious look. If your current 12x1.5TB SAN array is being retired as well, you could piggyback onto the array(s) feeding the VMware farm, or expand them if necessary/possible. Adding drives is usually much cheaper than buying a new populated array chassis.

Given your service contract comments, it's unlikely you're the type to build your own servers. Being a hardwarefreak, I nearly always build my servers and storage from scratch, and this may be worth a look merely for educational purposes. I just happened to finish spec'ing out a new high-volume 20TB IMAP server recently which should handle 5000 concurrent users without breaking a sweat, for only ~$7500 USD:
Full parts list: http://secure.newegg.com/WishList/PublicWishDetail.aspx?WishListNumber=17069...
Summary:
- 2GHz 8-core Magny Cours Opteron, 12MB L3 cache
- SuperMicro MBD-H8SGL-O w/32GB qualified quad-channel reg ECC DDR3/1333
- dual Intel 82574 GbE ports
- LSI 512MB PCIe 2.0 x8 RAID card with 24-port SAS expander
- 20x 1TB 7.2k WD RE4
- 20-bay SAS/SATA 6G hot-swap Norco chassis
Create a RAID1 pair for /boot, the root filesystem, a swap partition of say 8GB, and a 2GB partition for an external XFS log; that should leave ~900GB for utilitarian purposes. Configure two spares. Configure the remaining 16 drives as RAID10 with a 64KB stripe size (8KB, i.e. 16-sector, strip size), yielding 8TB raw for the XFS-backed mdbox mailstore. Enable the BBWC write cache (dang, forgot the battery module, +$175). This should yield approximately 8*150 = 1200 IOPS peak to/from disk, and many thousands to BBWC, more than plenty for 5000 concurrent users given the IO behavior of most MUAs. Channel bond the NICs to the switch, or round-robin DNS the two IPs if pathing for redundancy.
What's that? You want to support 10K users? Simply drop in another 4 sticks of the 8GB Kingston Reg ECC RAM for 64GB total, and plug one of these into the external SFF8088 port on the LSI card: http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047 populated with 18 of the 1TB RE4 drives. Configure 16 drives the same as the primary array and grow your existing XFS onto the new array. Since the filesystem then spans two identical arrays, the sunit/swidth values are still valid, so you don't need to add mount options. Configure 2 drives as hot spares. The additional 16-drive RAID10 doubles our disk IOPS to ~2400, maintaining our concurrent-user-to-IOPS ratio at ~4:1, and doubles our mail storage to ~16TB.
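A quick sanity check of those combined-array figures, using the same ~150 IOPS-per-mirror-pair rule of thumb as the build above:

```shell
# Two 16-drive RAID10 arrays -> 16 mirror pairs total
pairs=$(( (16 + 16) / 2 ))
iops=$(( pairs * 150 ))     # ~150 IOPS per 7.2k mirror pair (rule of thumb)
users=10000
echo "${iops} IOPS for ${users} users -> ~$(( users / iops )) users per IOPS"
```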
This expansion hardware will run an additional ~$6200. Grand total to support ~10K concurrent users (maybe more) with a quality DIY build is just over $14K USD, or ~$1.40 per mailbox. Not too bad for an 8-core, 64GB server with 32TB of hardware RAID10 mailbox storage and 38 total 1TB disks. I haven't run the numbers for a comparable HP system, but an educated guess says it would be quite a bit more expensive, not the server so much, but the storage. HP's disk drive prices are outrageous, though not approaching anywhere near the level of larceny EMC commits with its drive sales. $2400 for a $300 Seagate drive wearing an EMC cape? Please....
Maybe two HP DL180s (one for production and one as test/standby-system) with an SAS attached enclosure for storage?
If you're hooked on 1U chassis (I hate 'em), go with the DL165 G7. If not, I'd go 2U: the DL385 G7. Magny Cours gives you more bang for the buck in this class of machine. The performance is excellent, and, if everybody buys Intel, AMD goes bankrupt, and then Chipzilla charges whatever it desires. They've already been sanctioned and fined by antitrust regulators more than once. They paid Intergraph $800 million in an antitrust settlement in 2000 after forcing them out of the hardware business. They recently paid AMD $1.25 billion in an antitrust settlement. They're just like Microsoft, putting competitors out of business by any and all means necessary, even if their conduct is illegal. Yes, I'd much rather give AMD my business, given they had superior CPUs to Intel for many years and their current chips are still more than competitive. /end rant. ;)
Keeping in mind the new system has to work for some time (again, 5 to 7 years), I have to be able to extend the storage space without too much hassle.
Given you're currently only using ~1.3TB of ~15TB do you really see this as an issue? Will you be changing your policy or quotas? Will the university double its enrollment? If not I would think a new 12-16TB raw array would be more than plenty.
If you really want growth potential get a SATABeast and start with 14 2TB SATA drives. You'll still have 28 empty SAS/SATA slots in the 4U chassis, 42 total. Max capacity is 84TB. You get dual 8Gb/s FC LC ports and dual GbE iSCSI ports per controller, all ports active, two controllers max. The really basic SKU runs about $20-25K USD with the single controller and a few small drives, before institutional/educational discounts. www.nexsan.com/satabeast
I've used the SATABlade and SATABoy models (8 and 14 drives) and really like the simplicity of design and the httpd management interface. Good products, and one of the least expensive and feature rich in this class.
Sorry this was so windy. I am the hardwarefreak after all. :)
-- Stan