Stan Hoeppner <stan@hardwarefreak.com> wrote:
On 1/7/2012 7:55 PM, Sven Hartge wrote:
Stan Hoeppner <stan@hardwarefreak.com> wrote:
It's highly likely your problems can be solved without the drastic architecture change, and new problems it will introduce, that you describe below.
The main reason is I need to replace the hardware as its service contract ends this year and I am not able to extend it further.
The box so far is fine, there are normally no problems during normal operations with speed or responsiveness towards the end-user.
Sometimes, higher peak loads tend to strain the system a bit and this is starting to occur more often. ... First thought was to move this setup into our VMware cluster (yeah, I know, spare me the screams), since the hardware used there is way more powerful than the hardware used now and I wouldn't have to buy new servers for my mail system (which is kind of painful to do in a university environment, especially in Germany, if you want to spend more than a certain amount of money).
What's wrong with moving it onto VMware? This actually seems like a smart move given your description of the node hardware. It also gives you much greater backup flexibility with VCB (or whatever they call it today). You can snapshot the LUN over the SAN during off peak hours to a backup server and do the actual backup to the library at your leisure. Forgive me if the software names have changed as I've not used VMware since ESX3 back in 07.
VCB as it was back in the day is dead. But yes, one of the reasons to use a VM was to be able to easily back up the whole shebang.
But then I thought about the problems with VMs this size and got to the idea with the distributed setup, splitting the one server into 4 or 6 backend servers.
Not sure what you mean by "VMs this size". Do you mean memory requirements or filesystem size? If the nodes have enough RAM that's no issue.
Memory size. I am a bit hesitant to deploy a VM with 16GB of RAM. My cluster nodes each have 48GB, so no problem on this side though.
And surely you're not thinking of using a .vmdk for the mailbox storage. You'd use an RDM SAN LUN.
No, I was not planning to use a VMDK backed disk for this.
In fact you should be able to map in the existing XFS storage LUN and use it as is. Assuming it's not going into retirement as well.
It is going to be retired as well, as it is as old as the server.
It is also not connected to any SAN, only attached locally to the backend server.
And our VMware SAN is iSCSI based, so no way to plug a FC-based storage into it.
If an individual VMware node doesn't have sufficient RAM you could build a VM based Dovecot cluster, run these two VMs on separate nodes, and thin out the other VMs allowed to run on these nodes. Since you can't directly share XFS, build a tiny Debian NFS server VM and map the XFS LUN to it, export the filesystem to the two Dovecot VMs. You could install the Dovecot director on this NFS server VM as well. Converting from maildir to mdbox should help eliminate the NFS locking problems. I would do the conversion before migrating to this VM setup with NFS.
Also, run the NFS server VM on the same physical node as one of the Dovecot servers. The NFS traffic will be a memory-memory copy instead of going over the GbE wire, decreasing IO latency and increasing performance for that Dovecot server. If it's possible to have Dovecot director or your fav load balancer weight more connections to one Dovecot node, funnel 10-15% more connections to this one. (I'm no director guru, in fact haven't used it yet).
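To put a number on "10-15% more connections": the following is a minimal sketch of how relative weights translate into connection shares. The node names and the weights 115 and 100 are made up for illustration; this is plain arithmetic, not the director's actual balancing code.

```python
def connection_shares(weights):
    """Return each backend's expected share of connections for the given relative weights."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Hypothetical weights: the Dovecot VM that shares a physical node with the
# NFS server VM is weighted 15% higher than the other one.
weights = {"dovecot1_with_nfs": 115, "dovecot2": 100}
shares = connection_shares(weights)

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With two backends, a 15% higher weight moves the split only a few points away from 50/50, so the remote node still carries almost half of the load.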
So, this reads like my idea in the first place.
Only you place all the mails on the NFS server, whereas my idea was to just share the shared folders from a central point and keep the normal user dirs local to the different nodes, thus reducing network impact for the way more common user access.
Assuming the CPUs in the VMware cluster nodes are clocked a decent amount higher than 1.8GHz I wouldn't monkey with configuring virtual smp for these two VMs, as they'll be IO bound not CPU bound.
2.3GHz for most VMware nodes.
Ideas? Suggestions? Nudges in the right direction?
Yes. We need more real information. Please provide:
- Mailbox count, total maildir file count and size
about 10,000 Maildir++ boxes
900GB for 1300GB used, "df -i" says 11 million inodes used
Converting to mdbox will take a large burden off your storage, as you've seen. With ~1.3TB consumed of ~15TB you should have plenty of space to convert to mdbox while avoiding filesystem fragmentation.
You got the numbers wrong. And I got a word wrong ;)
Should have read "900GB _of_ 1300GB used".
I am using 900GB of 1300GB. The disks are SATA1.5 (not SATA3 or SATA6) as in data transfer rate. The disks each are 150GB in size, so my maximum storage size of my underlying VG is 1500GB.
root@ms1:~# vgs
  VG   #PV #LV #SN Attr   VSize  VFree
  vg01   1   6   0 wz--n- 70.80G  40.80G
  vg02   1   1   0 wz--n-  1.45T 265.00G
  vg03   1   1   0 wz--n-  1.09T      0
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/vg02-home_lv     1.2T  867G  357G  71% /home
/dev/mapper/vg03-backup_lv   1.1T  996G  122G  90% /backup
So not much wiggle room left.
But modifications to our systems are made, which allow me to temp-disable a user, convert and move his mailbox and re-enable him, which allows me to move them one at a time from the old system to the new one, without losing a mail or disrupting service too long or too often.
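The per-user move described above (disable, convert and move, re-enable) could be driven by a loop like the one below. This is only a sketch: all four hooks are hypothetical placeholders for whatever the site-specific tooling does, and the rollback to the old system on failure is an assumption, not something stated in the thread.

```python
def migrate_user(user, disable, convert_and_move, enable_old, enable_new):
    """Move one mailbox. All four callables are hypothetical site-specific hooks."""
    disable(user)                  # stop logins and delivery for this user only
    try:
        convert_and_move(user)     # e.g. Maildir++ on the old box to mdbox on the new one
    except Exception:
        enable_old(user)           # assumed rollback: user keeps working on the old system
        return False
    enable_new(user)               # success: user is now served by the new system
    return True

def migrate_all(users, **hooks):
    """Migrate users one at a time and return the list of users that failed."""
    return [user for user in users if not migrate_user(user, **hooks)]

# Dry run with fake hooks that only record what would happen.
log = []

def fake_convert(user):
    if user == "bob":
        raise RuntimeError("conversion failed")
    log.append(("convert", user))

failed = migrate_all(
    ["alice", "bob"],
    disable=lambda u: log.append(("disable", u)),
    convert_and_move=fake_convert,
    enable_old=lambda u: log.append(("enable_old", u)),
    enable_new=lambda u: log.append(("enable_new", u)),
)
print(failed)
```

Because only one user is disabled at any moment, a failed conversion affects that single mailbox and can be retried later.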
Right now, in the middle of the night (2:30 AM here) on a Sunday, thus a low point in the usage pattern:
             total       used       free     shared    buffers     cached
Mem:      12335820    9720252    2615568          0      53112     680424
-/+ buffers/cache:    8986716    3349104
Swap:      5855676      10916    5844760
Ugh... "-m" and "-g" options exist for a reason. :) So this box has 12GB RAM, currently ~2.5GB free during off peak hours. It would be interesting to see free RAM and swap usage values during peak. That would tell us whether we're CPU or RAM starved. If both turned up clean then we'd need to look at iowait. If you're not RAM starved then moving to VMware nodes with 16/24/32GB RAM should work fine, as long as you don't stack many other VMs on top. Enabling memory dedup may help a little.
Well, peak hours are somewhere between 10:00 and 14:00. Will check then.
The system reaches its 7th year this summer, which is the end of its service contract.
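For readers who don't want to convert the KiB figures in their head, a small sketch that does it. The numbers are copied from the "free" output quoted above; nothing else is assumed.

```python
KIB_PER_GIB = 1024 * 1024

# Values in KiB, copied from the "free" output in this thread.
mem_total = 12335820
mem_free = 2615568
buffers = 53112
cached = 680424
free_without_cache = 3349104   # free column of the "-/+ buffers/cache" row
swap_used = 10916

def gib(kib):
    """Convert KiB to GiB."""
    return kib / KIB_PER_GIB

# The "-/+ buffers/cache" free value is simply free + buffers + cached.
assert mem_free + buffers + cached == free_without_cache

print(f"total RAM:               {gib(mem_total):.2f} GiB")
print(f"free:                    {gib(mem_free):.2f} GiB")
print(f"free incl. buffers/cache: {gib(free_without_cache):.2f} GiB")
print(f"swap used:               {swap_used / 1024:.1f} MiB")
```

The figures line up with the reading given above: close to 12GB installed and about 2.5GB free, with swap barely touched at this hour.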
Enjoy your retirement old workhorse. :)
- Storage configuration--total spindles, RAID level, hard or soft RAID
RAID 6 with 12 SATA1.5 disks, external 4Gbit FC
I assume this means a LUN on a SAN array somewhere on the other end of that multi-mode cable, yes? Can you tell us what brand/model the box is?
This is a Transtec Provigo 610. This is a 24 disk enclosure, 12 disks with 150GB (7,200 rpm) each for the main mail storage in RAID6 and another 10 disks with 150GB (5,400 rpm) for a backup LUN. I daily rsnapshot my /home onto this local backup (20 days of retention), because it is easier to restore from than firing up Bacula, which has the long retention time of 90 days. But most users need a restore of mails from $yesterday or $the_day_before.
Back in 2005, a SAS enclosure was way too expensive for us to afford.
How one affords an FC SAN array but not a less expensive direct attach SAS enclosure is a mystery... :)
Well, it was either Parallel-SCSI or FC back then, as far as I can remember. The price difference between the U320 version and the FC one was not so big and I wanted to avoid having to route those big SCSI-U320 cables through my racks.
- Filesystem type
XFS in a LVM to allow snapshots for backup
XFS is the only way to fly, IMNSHO.
I of course aligned the partitions on the RAID correctly and of course created a filesystem with the correct parameters wrt. spindles, chunk size, etc.
Which is critical for mitigating the RMW penalty of parity RAID. Speaking of which, why RAID6 for maildir? Given that your array is 90% vacant, why didn't you go with RAID10 for 3-5 times the random write performance?
See above, not 1500GB disks, but 150GB ones. RAID6, because I wanted the double security. I have been kind of burned by the previous system and I tend to get nervous while thinking about data loss in my mail storage, because I know my users _will_ give me hell if that happens.
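The RAID6 versus RAID10 trade-off on this array can be put into rough numbers. The sketch below uses the 12 x 150GB disks from the thread and the usual textbook write penalties (6 back-end I/Os per small random write for RAID6, 2 for RAID10). The per-disk figure of 75 IOPS is an assumption for illustration only, and controller write cache or full-stripe writes will shift real results.

```python
DISKS = 12
DISK_GB = 150

# Textbook back-end I/Os needed per small random write.
WRITE_PENALTY = {"raid6": 6, "raid10": 2}

def usable_gb(level, disks=DISKS, disk_gb=DISK_GB):
    """Usable capacity: RAID6 loses two disks to parity, RAID10 loses half to mirrors."""
    if level == "raid6":
        return (disks - 2) * disk_gb
    if level == "raid10":
        return disks // 2 * disk_gb
    raise ValueError(f"unknown RAID level: {level}")

def random_write_iops(level, per_disk_iops, disks=DISKS):
    """Rule-of-thumb random write IOPS for the whole array."""
    return disks * per_disk_iops / WRITE_PENALTY[level]

PER_DISK_IOPS = 75  # assumed value for a 7,200 rpm SATA disk, not measured

for level in ("raid6", "raid10"):
    print(level, usable_gb(level), "GB usable,",
          random_write_iops(level, PER_DISK_IOPS), "random write IOPS")
```

Under these assumptions RAID10 gives about three times the random write rate, but only 900GB of usable space instead of 1500GB, which is roughly what the mail store already occupies.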
- Backup software/method
Full backup with Bacula, taking about 24 hours right now. Because of this, I switched to virtual full backups, only ever doing incremental and differential backups off of the real system and creating synthetic full backups inside Bacula. Works fine though, incremental taking 2 hours, differential about 4 hours.
Move to VMware and use VCB. You'll fall in love.
The main problem of the backup time is Maildir++. During a test, I copied the mail storage to a spare box, converted it to mdbox (50MB file size) and the backup was lightning fast compared to the Maildir++ format.
Well of course. You were surprised by this?
No, I was not surprised by the speedup, I _knew_ mdbox would back up faster. Just by how big it was. That a backup of 100 big files is faster than a backup of 100,000 little files is not exactly rocket science.
How long has it been since you used mbox? mbox backs up even faster than mdbox. Why? Larger files and fewer of them. Which means the disks can actually do streaming reads, and don't have to beat their heads to death jumping all over the platters to read maildir files, which are scattered all over the place when created. Which is why maildir is described as a "random" IO workload.
I never used mbox as an admin. The box before the box before this one used uw-imapd with mbox and I experienced the system as a user and it was horrific. Most users back then never heard of IMAP folders and just stored their mails inside of INBOX, which of course got huge. If one of those users with a big mbox then deleted mails, it would literally lock the box up for everyone, as uw-imapd was copying (for example) a 600MB mbox file around to delete one mail.
Of course, this was mostly because of the crappy uw-imapd and secondly because of some poor design choices in the server itself (underpowered RAID controller, too small a cache and a RAID5 setup, low RAM in the server).
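A back-of-the-envelope sketch of why the file count matters, using only figures from this thread: 11 million inodes from "df -i" (which also counts directories), 867G used on /home, and the 50MB mdbox file size from the test conversion. The mdbox count is a lower bound, since it assumes every file is filled completely; with about 10,000 mailboxes the real number will be higher.

```python
import math

maildir_inodes = 11_000_000   # "df -i" figure, includes directories
used_gib = 867                # used space on /home from the df output
mdbox_file_mib = 50           # mdbox file size used in the test conversion

# Lower bound: every mdbox file is assumed to be completely full.
mdbox_files_min = math.ceil(used_gib * 1024 / mdbox_file_mib)

# Average space per inode in the current Maildir++ store.
avg_maildir_kib = used_gib * 1024 * 1024 / maildir_inodes

print(f"maildir inodes:         {maildir_inodes:,}")
print(f"mdbox files (at least): {mdbox_files_min:,}")
print(f"avg size per inode:     {avg_maildir_kib:.1f} KiB")
```

Even allowing for partially filled files, the backup has to open a few orders of magnitude fewer files after the conversion, and each one can be read sequentially.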
So the first thing we did back then, in 2004, was to change to Courier and convert from mbox to maildir, which made the mailsystem fly again, even on the same hardware, only the disk setup changed to RAID10.
Then we bought new hardware (the one previous to the current one), this time with more RAM, better RAID controller, smarter disk setup. We outgrew this one really fast and a disk upgrade was not possible; it lasted only 2 years.
So the next one got this external 24 disk array with 12 disks used at deployment.
But Courier is showing its age and things like Sieve are only possible with great pain, so I want to avoid it.
So this is the way to go, I think, regardless of which way I implement the backend mail server.
Which is why I asked my questions. :) mdbox would have been one of my recommendations, but you already discovered it.
And this is why I kind of hold this upgrade back until dovecot 2.1 is released, as it has some optimizations here.
- Operating system
Debian Linux Lenny, currently with kernel 2.6.39
:) Debian, XFS, Dovecot, FC SAN storage--I like your style. Lenny with 2.6.39? Is that a backport or rolled kernel? Not Squeeze?
That is a BPO-kernel. Not-yet Squeeze. I admin over 150 different systems here, plus I am the main VMware and SAN admin. So upgrades take some time until I grow an extra pair of eyes and arms. ;)
And since I have been planning to re-implement the mailsystem for some time now, I held the update to the storage backends back. No use in disrupting service for the end user if I'm going to replace the whole thing with a new one in the end.
Instead of telling us what you think the solution to your unidentified bottleneck is and then asking "yeah or nay", tell us what the problem is and allow us to recommend solutions.
I am not asking for "yay or nay", I just pointed out my idea, but I am open to other suggestions.
I think you've already discovered the best suggestions on your own.
If the general idea is to buy a new big single storage system, I am more than happy to do just this, because this will prevent any problems I might have with a distributed one before they even can occur.
One box is definitely easier to administer and troubleshoot. Though I must say that even though it's more complex, I think the VM architecture I described is worth a serious look. If your current 12x1.5TB SAN array is being retired as well, you could piggy back onto the array(s) feeding the VMware farm, or expand them if necessary/possible. Adding drives is usually much cheaper than buying a new populated array chassis. Given your service contract comments it's unlikely you're the type to build your own servers. Being a hardwarefreak, I nearly always build my servers and storage from scratch.
Naa, I have been doing this for too long. While I am perfectly capable of building such a server myself, I am now the kind of guy who wants to "yell" at a vendor, when their hardware fails.
Which does not mean I am using any "Express" package or preconfigured server, I still read the specs and pick the parts which make the most sense for a job and then have that one custom built by HP or IBM or Dell or ...
Personally built PCs and servers assembled from individual parts have been nothing but a nightmare for me. And: my coworkers need to be able to service them as well while I am not available and they are not as much of a hardware aficionado as I am.
So "professional" hardware with a 5 to 7 year support contract is the way to go for me.
If you're hooked on 1U chassis (I hate em) go with the DL165 G7. If not I'd go 2U, the DL385 G7. Magny Cours gives you more bang for the buck in this class of machines.
I have plenty of space for 2U systems and already use DL385 G7s. I am not fixed on Intel or AMD; I'll gladly use whichever is the best fit for a given job.
Grüße, Sven
-- Sigmentation fault. Core dumped.