[Dovecot] Providing shared folders with multiple backend servers

Sven Hartge sven at svenhartge.de
Sun Jan 8 17:39:45 EET 2012


Stan Hoeppner <stan at hardwarefreak.com> wrote:
> On 1/7/2012 7:55 PM, Sven Hartge wrote:
>> Stan Hoeppner <stan at hardwarefreak.com> wrote:
>> 
>>> It's highly likely your problems can be solved without the drastic
>>> architecture change, and new problems it will introduce, that you
>>> describe below.
>> 
>> The main reason is I need to replace the hardware as its service
>> contract ends this year and I am not able to extend it further.
>> 
>> The box so far is fine, there are normally no problems during normal
>> operations with speed or responsiveness towards the end-user.
>> 
>> Sometimes, higher peak loads tend to strain the system a bit and this is
>> starting to occur more often.
> ...
>> First thought was to move this setup into our VMware cluster (yeah, I
>> know, spare me the screams), since the hardware used there is way more
>> powerfull than the hardware used now and I wouldn't have to buy new
>> servers for my mail system (which is kind of painful to do in an
>> universitary environment, especially in Germany, if you want to invest
>> an amount of money above a certain amount).

> What's wrong with moving it onto VMware?  This actually seems like a
> smart move given your description of the node hardware.  It also gives
> you much greater backup flexibility with VCB (or whatever they call it
> today).  You can snapshot the LUN over the SAN during off peak hours to
> a backup server and do the actual backup to the library at your leisure.
> Forgive me if the software names have changed as I've not used VMware
> since ESX3 back in 07.

VCB as it was back in the day is dead. But yes, one of the reasons to
use a VM was to be able to easily back up the whole shebang.

>> But then I thought about the problems with VMs this size and got to the
>> idea with the distributed setup, splitting the one server into 4 or 6
>> backend servers.

> Not sure what you mean by "VMs this size".  Do you mean memory
> requirements or filesystem size?  If the nodes have enough RAM that's no
> issue.

Memory size. I am a bit hesitant to deploy a VM with 16GB of RAM. My
cluster nodes each have 48GB, though, so there is no problem on that
side.

> And surely you're not thinking of using a .vmdk for the mailbox
> storage.  You'd use an RDM SAN LUN.

No, I was not planning to use a VMDK backed disk for this.

> In fact you should be able to map in the existing XFS storage LUN and
> use it as is.  Assuming it's not going into retirement as well.

It is going to be retired as well, as it is as old as the server.

It also is not connected to any SAN, only locally attached to the
backend server.

And our VMware SAN is iSCSI based, so there is no way to plug FC-based
storage into it.

> If an individual VMware node doesn't have sufficient RAM you could build a
> VM based Dovecot cluster, run these two VMs on separate nodes, and thin
> out the other VMs allowed to run on these nodes.  Since you can't
> directly share XFS, build a tiny Debian NFS server VM and map the XFS
> LUN to it, export the filesystem to the two Dovecot VMs.  You could
> install the Dovecot director on this NFS server VM as well.  Converting
> from maildir to mdbox should help eliminate the NFS locking problems.  I
> would do the conversion before migrating to this VM setup with NFS.

> Also, run the NFS server VM on the same physical node as one of the
> Dovecot servers.  The NFS traffic will be a memory-memory copy instead
> of going over the GbE wire, decreasing IO latency and increasing
> performance for that Dovecot server.  If it's possible to have Dovecot
> director or your fav load balancer weight more connections to one
> Dovecot node, funnel 10-15% more connections to this one.  (I'm no
> director guru, in fact haven't used it yet).

So, this reads a lot like my idea in the first place.

Only you place all the mails on the NFS server, whereas my idea was to
serve just the shared folders from a central point and keep the normal
user dirs local to the different nodes, thus reducing the network impact
for the far more common accesses to a user's own mail.
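
To make that concrete, the plan boils down to a public namespace pointing
at the central share while the per-user namespace stays on local storage.
Roughly (just a sketch for the future Dovecot setup; the paths and the
prefix are made up):

  # dovecot.conf sketch -- per-user mail local, shared folders central
  mail_location = maildir:~/Maildir

  namespace {
    type = public
    prefix = Shared/
    separator = /
    location = maildir:/srv/shared-mail/public
    subscriptions = no
  }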

> Assuming the CPUs in the VMware cluster nodes are clocked a decent
> amount higher than 1.8GHz I wouldn't monkey with configuring virtual smp
> for these two VMs, as they'll be IO bound not CPU bound.

2.3GHz for most VMware nodes.

>>>> Ideas? Suggestions? Nudges in the right direction?
>> 
>>> Yes.  We need more real information.  Please provide:
>> 
>>> 1.  Mailbox count, total maildir file count and size
>> 
>> about 10,000 Maildir++ boxes
>> 
>> 900GB for 1300GB used, "df -i" says 11 million inodes used

> Converting to mdbox will take a large burden off your storage, as you've
> seen.  With ~1.3TB consumed of ~15TB you should have plenty of space to
> convert to mdbox while avoiding filesystem fragmentation.

You got the numbers wrong, and I got a word wrong ;)

It should have read "900GB _of_ 1300GB used".

I am using 900GB of 1300GB. The "1.5" in SATA1.5 refers to the transfer
rate (1.5 Gbit/s, as opposed to SATA3 or SATA6), not the disk size. The
disks are 150GB each, so 12 of them in RAID6 leave 10 data disks, which
makes the maximum size of my underlying VG roughly 1500GB.

root at ms1:~# vgs
  VG   #PV #LV #SN Attr   VSize  VFree  
  vg01   1   6   0 wz--n- 70.80G  40.80G
  vg02   1   1   0 wz--n-  1.45T 265.00G
  vg03   1   1   0 wz--n-  1.09T      0 

Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg02-home_lv
                      1.2T  867G  357G  71% /home
/dev/mapper/vg03-backup_lv
                      1.1T  996G  122G  90% /backup

So not much wiggle room left.

But modifications to our systems have been made which allow me to
temporarily disable a user, convert and move his mailbox, and re-enable
him. This lets me move users one at a time from the old system to the
new one, without losing a mail or disrupting service for too long or too
often.
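
In pseudo-shell the per-user move looks roughly like this (only a sketch:
disable_user/enable_user stand in for our site-specific provisioning
commands, and the dsync call assumes Dovecot 2.0 with mdbox configured as
the mail_location on the new box):

  #!/bin/sh
  # rough per-user migration sketch -- placeholders, not the real script
  user="$1"

  disable_user "$user"      # hypothetical: block IMAP/POP/delivery for this user
  rsync -a "oldbox:/home/$user/Maildir/" "/home/$user/Maildir/"
  # let dsync convert the copied Maildir++ into the configured mdbox location
  dsync -u "$user" mirror maildir:/home/"$user"/Maildir
  enable_user "$user"       # hypothetical: re-enable the account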

>> Right now, in the middle of the night (2:30 AM here) on a Sunday, thus a
>> low point in the usage pattern:
>> 
>>              total       used       free     shared    buffers     cached
>> Mem:      12335820    9720252    2615568          0      53112     680424
>> -/+ buffers/cache:    8986716    3349104
>> Swap:      5855676      10916    5844760

> Ugh...  "-m" and "-g" options exist for a reason. :)  So this box has
> 12GB RAM, currently ~2.5GB free during off peak hours.  It would be
> interesting to see free RAM and swap usage values during peak.  That
> would tell us whether we're CPU or RAM starved.  If both turned up
> clean then we'd need to look at iowait.  If you're not RAM starved then
> moving to VMware nodes with 16/24/32GB RAM should work fine, as long as
> you don't stack many other VMs on top.  Enabling memory dedup may help a
> little.

Well, peak hours are roughly between 10:00 and 14:00. I will check then.
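
Something simple like this during that window should show whether it is
RAM or iowait (quick sketch, run from a screen session):

  # sample memory and disk/CPU once a minute over the 4-hour peak
  vmstat -S M 60 240 > /tmp/peak-vmstat.log &
  iostat -x 60 240   > /tmp/peak-iostat.log &
  while sleep 300; do date; free -m; done > /tmp/peak-free.log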

>> The system reaches its 7th year this summer, which is the end of its
>> service contract.

> Enjoy your retirement old workhorse. :)

>>> 4.  Storage configuration--total spindles, RAID level, hard or soft RAID
>> 
>> RAID 6 with 12 SATA1.5 disks, external 4Gbit FC 

> I assume this means a LUN on a SAN array somewhere on the other end of
> that multi-mode cable, yes?  Can you tell us what brand/model the box is?

This is a Transtec Provigo 610, a 24-disk enclosure: 12 disks with 150GB
(7,200 rpm) each for the main mail storage in RAID6 and another 10 disks
with 150GB (5,400 rpm) for a backup LUN. I rsnapshot my /home onto this
local backup daily (20 days of retention), because it is easier to
restore from than firing up Bacula, which has the long retention time of
90 days. And most users only need a restore of mails from $yesterday or
$the_day_before.
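
The rsnapshot side of it is nothing fancy, roughly this (a sketch from
memory, the paths are made up; fields in the real rsnapshot.conf are
tab-separated):

  # /etc/rsnapshot.conf excerpt
  snapshot_root   /backup/rsnapshot/
  # older rsnapshot versions call this "interval" instead of "retain"
  retain          daily   20
  backup          /home/          localhost/

plus a nightly cron job calling "rsnapshot daily".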

>> Back in 2005, a SAS enclosure was way to expensive for us to afford.

> How one affords an FC SAN array but not a less expensive direct attach
> SAS enclosure is a mystery... :)

Well, it was either parallel SCSI or FC back then, as far as I can
remember. The price difference between the U320 version and the FC one
was not that big, and I wanted to avoid having to route those bulky
SCSI-U320 cables through my racks.

>>> 5.  Filesystem type
>> 
>> XFS in a LVM to allow snapshots for backup

> XFS is the only way to fly, IMNSHO.

>> I of course aligned the partitions on the RAID correctly and of course
>> created a filesystem with the correct parameters wrt. spindles, chunk
>> size, etc.

> Which is critical for mitigating the RMW penalty of parity RAID.
> Speaking of which, why RAID6 for maildir?  Given that your array is 90%
> vacant, why didn't you go with RAID10 for 3-5 times the random write
> performance?

See above: not 1500GB disks, but 150GB ones. RAID6, because I wanted the
double redundancy. I have been kind of burned by the previous system and
I tend to get nervous when thinking about data loss in my mail storage,
because I know my users _will_ give me hell if that happens.
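
For reference, the alignment part boils down to something like this
(sketch only; the 64k chunk size is an assumption, the 10 comes from 12
disks minus 2 for parity):

  # stripe unit = RAID chunk size, stripe width = number of data disks
  mkfs.xfs -d su=64k,sw=10 /dev/vg02/home_lv
  # afterwards, verify what the filesystem actually uses:
  xfs_info /home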

>>> 6.  Backup software/method
>> 
>> Full backup with Bacula, taking about 24 hours right now. Because of
>> this, I switched to virtual full backups, only ever doing incremental
>> and differential backups off of the real system and creating synthetic
>> full backups inside Bacula. Works fine though, incremental taking 2
>> hours, differential about 4 hours.

> Move to VMware and use VCB.  You'll fall in love.

>> The main problem of the backup time is Maildir++. During a test, I
>> copied the mail storage to a spare box, converted it to mdbox (50MB
>> file size) and the backup was lightning fast compared to the Maildir++
>> format.

> Well of course.  You were surprised by this?

No, I was not surprised by the speedup itself, I _knew_ mdbox would back
up faster; I was only surprised by how big the difference was. That a
backup of 100 big files is faster than a backup of 100,000 little files
is not exactly rocket science.

> How long has it been since you used mbox?  mbox backs up even faster
> than mdbox.  Why?  Larger files and fewer of them.  Which means the
> disks can actually do streaming reads, and don't have to beat their
> heads to death jumping all over the platters to read maildir files,
> which are scattered all over the place when created.  Which is why
> maildir is described as a "random" IO workload.

I never used mbox as an admin. The box before the box before this one
used uw-imapd with mbox, and I experienced that system as a user: it
was horrific. Most users back then had never heard of IMAP folders and
just stored their mails inside INBOX, which of course got huge. If one
of those users with a big mbox then deleted mails, it would literally
lock the box up for everyone, as uw-imapd copied (for example) a 600MB
mbox file around just to delete one mail.

Of course, this was mostly because of the crappy uw-imapd and secondly
because of some poor design choices in the server itself (underpowered
RAID controller, too small a cache, a RAID5 setup and too little RAM).

So the first thing we did back then, in 2004, was to change to Courier
and convert from mbox to maildir, which made the mail system fly again,
even on the same hardware; only the disk setup changed, to RAID10.

Then we bought new hardware (the one previous to the current one), this
time with more RAM, a better RAID controller and a smarter disk setup.
We outgrew this one really fast and a disk upgrade was not possible; it
lasted only 2 years.

So the next one got this external 24 disk array with 12 disks used at
deployment.

But Courier is showing its age and things like Sieve are only possible
with great pain, so I want to avoid it.

>> So this is the way to go, I think, regardless of which way I implement
>> the backend mail server.

> Which is why I asked my questions. :)  mdbox would have been one of my
> recommendations, but you already discovered it.

And this is why I am kind of holding this upgrade back until Dovecot 2.1
is released, as it has some optimizations here.
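
The mdbox side of the test was nothing more than this (sketch; the 50MB
matches the rotate size I tested with, the path is arbitrary):

  # dovecot.conf excerpt for the mdbox test
  mail_location = mdbox:~/mdbox
  mdbox_rotate_size = 50M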

>>> 7.  Operating system
>> 
>> Debian Linux Lenny, currently with kernel 2.6.39

> :) Debian, XFS, Dovecot, FC SAN storage--I like your style.  Lenny with
> 2.6.39?  Is that a backport or rolled kernel?  Not Squeeze?

That is a BPO kernel. Not yet Squeeze. I admin over 150 different
systems here, plus I am the main VMware and SAN admin, so upgrades take
some time, at least until I grow an extra pair of eyes and arms. ;)

And since I have been planning to re-implement the mail system for some
time now, I held the update of the storage backends back. No use in
disrupting service for the end users if I'm going to replace the whole
thing with a new one anyway.

>>> Instead of telling us what you think the solution to your unidentified
>>> bottleneck is and then asking "yeah or nay", tell us what the problem is
>>> and allow us to recommend solutions.
>> 
I am not asking for "yay or nay"; I just pointed out my idea, and I am
open to other suggestions.

> I think you've already discovered the best suggestions on your own.

>> If the general idea is to buy a new big single storage system, I am more
>> than happy to do just this, because this will prevent any problems I might
>> have with a distributed one before they even can occur.

> One box is definitely easier to administer and troubleshoot.  Though I
> must say that even though it's more complex, I think the VM architecture
> I described is worth a serious look.  If your current 12x1.5TB SAN array
> is being retired as well, you could piggy back onto the array(s) feeding
> the VMware farm, or expand them if necessary/possible.  Adding drives is
> usually much cheaper than buying a new populated array chassis.  Given
> your service contract comments it's unlikely you're the type to build
> your own servers.  Being a hardwarefreak, I nearly always build my
> servers and storage from scratch.

Naa, I have been doing this for too long. While I am perfectly capable
of building such a server myself, I am now the kind of guy who wants to
"yell" at a vendor when their hardware fails.

Which does not mean I am using any "Express" package or preconfigured
server; I still read the specs, pick the parts which make the most sense
for the job and then have that machine custom built by HP or IBM or
Dell or ...

Personally built PCs and servers assembled from single parts have been
nothing but a nightmare for me. And: my coworkers need to be able to
service them while I am not available, and they are not the hardware
aficionados that I am.

So "professional" hardware with a 5 to 7 year support contract is the
way to go for me.

> If you're hooked on 1U chassis (I hate em) go with the DL165 G7.  If not
> I'd go 2U, the DL385 G7.  Magny Cours gives you more bang for the buck
> in this class of machines.

I have plenty of space for 2U systems and already use DL385 G7s. I am
not fixed on Intel or AMD; I'll gladly use whichever is the better fit
for a given job.

Grüße,
Sven

-- 
Sigmentation fault. Core dumped.



