[Dovecot] Providing shared folders with multiple backend servers

Stan Hoeppner stan at hardwarefreak.com
Sun Jan 8 15:09:00 EET 2012


On 1/7/2012 7:55 PM, Sven Hartge wrote:
> Stan Hoeppner <stan at hardwarefreak.com> wrote:
> 
>> It's highly likely your problems can be solved without the drastic
>> architecture change, and new problems it will introduce, that you
>> describe below.
> 
> The main reason is I need to replace the hardware as its service
> contract ends this year and I am not able to extend it further.
> 
> The box so far is fine, there are normally no problems during normal
> operations with speed or responsiveness towards the end-user.
> 
> Sometimes, higher peak loads tend to strain the system a bit and this is
> starting to occur more often.
...
> First thought was to move this setup into our VMware cluster (yeah, I
> know, spare me the screams), since the hardware used there is way more
> powerful than the hardware used now and I wouldn't have to buy new
> servers for my mail system (which is kind of painful to do in a
> university environment, especially in Germany, if you want to invest
> more than a certain amount of money).

What's wrong with moving it onto VMware?  This actually seems like a
smart move given your description of the node hardware.  It also gives
you much greater backup flexibility with VCB (or whatever they call it
today).  You can snapshot the LUN over the SAN during off peak hours to
a backup server and do the actual backup to the library at your leisure.
Forgive me if the software names have changed, as I've not used VMware
since ESX3 back in '07.

> But then I thought about the problems with VMs this size and got to the
> idea with the distributed setup, splitting the one server into 4 or 6
> backend servers.

Not sure what you mean by "VMs this size".  Do you mean memory
requirements or filesystem size?  If the nodes have enough RAM that's no
issue.  And surely you're not thinking of using a .vmdk for the mailbox
storage.  You'd use an RDM SAN LUN.  In fact you should be able to map
in the existing XFS storage LUN and use it as is.  Assuming it's not
going into retirement as well.

If an individual VMware node doesn't have sufficient RAM you could
build a VM-based Dovecot cluster: run two Dovecot VMs on separate nodes
and thin out the other VMs allowed to run on those nodes.  Since you
can't directly share XFS, build a tiny Debian NFS server VM, map the
XFS LUN to it, and export the filesystem to the two Dovecot VMs.  You
could install the Dovecot director on this NFS server VM as well.
Converting from maildir to mdbox should help eliminate the NFS locking
problems.  I would do the conversion before migrating to this VM setup
with NFS.
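
If you go that route, the export side is trivial.  A rough sketch,
assuming the XFS LUN ends up mounted at /srv/mail on the NFS VM and the
two Dovecot VMs sit at 10.0.0.11/.12 (paths and addresses are just
placeholders):

# /etc/exports on the NFS server VM
/srv/mail  10.0.0.11(rw,sync,no_subtree_check) 10.0.0.12(rw,sync,no_subtree_check)

$ exportfs -ra    # activate the new export table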

Also, run the NFS server VM on the same physical node as one of the
Dovecot servers.  The NFS traffic will then be a memory-to-memory copy
instead of going over the GbE wire, decreasing IO latency and
increasing performance for that Dovecot server.  If it's possible to
have Dovecot director or your favorite load balancer weight more
connections to one Dovecot node, funnel 10-15% more connections to this
one.  (I'm no director guru; in fact I haven't used it yet.)
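
As I said I haven't driven director myself, but as I understand it the
weighting is done with the vhost count when you add a backend.
Something along these lines, with made-up IPs--check the doveadm man
page before trusting my syntax:

$ doveadm director add 10.0.0.11 110   # Dovecot VM sharing a node with the NFS server, ~10% more
$ doveadm director add 10.0.0.12 100   # the other Dovecot VM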

Assuming the CPUs in the VMware cluster nodes are clocked a decent
amount higher than 1.8GHz, I wouldn't monkey with configuring virtual
SMP for these two VMs, as they'll be IO bound, not CPU bound.

> As I said: "idea". Other ideas making my life easier are more than
> welcome.

I hope my suggestions contribute to doing so. :)

>>> Ideas? Suggestions? Nudges in the right direction?
> 
>> Yes.  We need more real information.  Please provide:
> 
>> 1.  Mailbox count, total maildir file count and size
> 
> about 10,000 Maildir++ boxes
> 
> 900GB for 1300GB used, "df -i" says 11 million inodes used

Converting to mdbox will take a large burden off your storage, as you've
seen.  With ~1.3TB consumed of ~15TB you should have plenty of space to
convert to mdbox while avoiding filesystem fragmentation.  With maildir
you likely didn't see heavy fragmentation due to small file sizes.  With
mdbox, especially at 50MB, you'll likely start seeing more
fragmentation.  Use this to periodically check the fragmentation level:

$ xfs_db -r -c frag [device]
e.g.
$ xfs_db -r -c frag /dev/sda7
actual 76109, ideal 75422, fragmentation factor 0.90%

I'd recommend running xfs_fsr when the frag factor exceeds ~20-30%.
The XFS developers recommend against running xfs_fsr too often, as it
can actually increase free space fragmentation while it decreases file
fragmentation, especially on filesystems that are relatively full.
Having heavily fragmented free space is worse than having fragmented
files, as newly created files will automatically be fragmented.
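
When you do need xfs_fsr, a bounded run is the safer way to go.
Something like this (device is just an example), which defragments for
at most two hours and reports what it moved:

$ xfs_fsr -v -t 7200 /dev/sda7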

> I know, this is very _tiny_ compared to the systems ISPs are using.

Not everyone is an ISP, including me. :)

>> 2.  Average/peak concurrent user connections
> 
> IMAP: Average 800 concurrent user connections, peaking at about 1400.
> POP3: Average 300 concurrent user connections, peaking at about 600.
> 
>> 3.  CPU type/speed/total core count, total RAM, free RAM (incl buffers)
> 
> Currently dual-core AMD Opteron 2210, 1.8GHz.

Heheh, yeah, a bit long in the tooth, but not horribly underpowered for
1100 concurrent POP/IMAP users.  Though this may be the reason for the
sluggishness when you hit that 2000 concurrent user peak.  Any chance
you have some top output for the peak period?

> Right now, in the middle of the night (2:30 AM here) on a Sunday, thus a
> low point in the usage pattern:
> 
>              total       used       free     shared    buffers     cached
> Mem:      12335820    9720252    2615568          0      53112     680424
> -/+ buffers/cache:    8986716    3349104
> Swap:      5855676      10916    5844760

Ugh...  "-m" and "-g" options exist for a reason. :)  So this box has
12GB RAM, currently ~2.5GB free during off-peak hours.  It would be
interesting to see free RAM and swap usage values during peak.  That
would tell us whether we're CPU or RAM starved.  If both turned up
clean then we'd need to look at iowait.  If you're not RAM starved then
moving to VMware nodes with 16/24/32GB RAM should work fine, as long as
you don't stack many other VMs on top.  Enabling memory dedup may help
a little.
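
If nobody is around to run free/top during the peak, a throwaway loop
on the box will capture the numbers (including the iowait column from
vmstat) for you.  A quick sketch--the log path is my own choice:

$ while true; do (date; free -m; vmstat 1 2) >> /var/log/peak-stats.log; sleep 300; done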

> System reaches its 7 year this summer which is the end of its service
> contract.

Enjoy your retirement old workhorse. :)

>> 4.  Storage configuration--total spindles, RAID level, hard or soft RAID
> 
> RAID 6 with 12 SATA1.5 disks, external 4Gbit FC 

I assume this means a LUN on a SAN array somewhere on the other end of
that multi-mode cable, yes?  Can you tell us what brand/model the box is?

> Back in 2005, a SAS enclosure was way too expensive for us to afford.

How one affords an FC SAN array but not a less expensive direct attach
SAS enclosure is a mystery... :)

>> 5.  Filesystem type
> 
> XFS in a LVM to allow snapshots for backup

XFS is the only way to fly, IMNSHO.
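
For the archives, the snapshot dance with XFS on LVM goes roughly like
this--volume names and mount points are made up, and newer kernels
should freeze the filesystem for you when the snapshot is taken:

$ xfs_freeze -f /srv/mail                            # quiesce XFS (optional on recent kernels)
$ lvcreate -s -L 20G -n mailsnap /dev/vg0/mail
$ xfs_freeze -u /srv/mail                            # thaw
$ mount -o ro,nouuid /dev/vg0/mailsnap /mnt/backup   # nouuid because the snapshot shares the UUID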

> I of course aligned the partitions on the RAID correctly and of course
> created a filesystem with the correct parameters wrt. spindles, chunk
> size, etc.

Which is critical for mitigating the RMW penalty of parity RAID.
Speaking of which, why RAID6 for maildir?  Given that your array is 90%
vacant, why didn't you go with RAID10 for 3-5 times the random write
performance?
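
For anyone following along at home, the alignment bits boil down to
something like this for a 12-drive RAID6 with a 64KB chunk (10 data
spindles--the chunk size and device name are assumptions, adjust to
the real array):

$ mkfs.xfs -d su=64k,sw=10 /dev/vg0/mail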

>> 6.  Backup software/method
> 
> Full backup with Bacula, taking about 24 hours right now. Because of
> this, I switched to virtual full backups, only ever doing incremental
> and differential backups off of the real system and creating synthetic
> full backups inside Bacula. Works fine though, incremental taking 2
> hours, differential about 4 hours.

Move to VMware and use VCB.  You'll fall in love.

> The main problem of the backup time is Maildir++. During a test, I
> copied the mail storage to a spare box, converted it to mdbox (50MB
> file size) and the backup was lightning fast compared to the Maildir++
> format.

Well of course.  You were surprised by this?  How long has it been
since you used mbox?  mbox backs up even faster than mdbox.  Why?
Larger files and fewer of them.  Which means the disks can actually do
streaming reads, and don't have to beat their heads to death jumping
all over the platters to read maildir files, which are scattered all
over the place when created.  Which is why maildir is described as a
"random" IO workload.

> Additionally, compressing the mails inside the mdbox and not having
> Bacula compress them for me reduces the backup time further (and
> speeds up access through IMAP and POP3).

Again, no surprise here.  When files exist on disk already compressed it
takes less IO bandwidth to read the file data for a given actual file
size.  So if you have say 10MB files that compress down to 5MB, you can
read twice as many files when the pipe is saturated, twice as much file
data.
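
For reference, turning on compressed mdbox writes in Dovecot 2.x is
roughly this in the config--the gz level is just my pick:

# dovecot.conf (or conf.d/10-mail.conf)
mail_plugins = $mail_plugins zlib

plugin {
  zlib_save = gz
  zlib_save_level = 6
}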

> So this is the way to go, I think, regardless of which way I implement
> the backend mail server.

Which is why I asked my questions. :)  mdbox would have been one of my
recommendations, but you already discovered it.
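
For anyone else reading along who hasn't done the conversion yet, the
per-user migration with Dovecot 2.x dsync looks roughly like this (user
and path are placeholders; afterwards point mail_location at the new
mdbox location):

$ dsync -u someuser mirror mdbox:~/mdbox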

>> 7.  Operating system
> 
> Debian Linux Lenny, currently with kernel 2.6.39

:) Debian, XFS, Dovecot, FC SAN storage--I like your style.  Lenny with
2.6.39?  Is that a backport or rolled kernel?  Not Squeeze?
Interesting.  I'm running Squeeze with rolled vanilla 2.6.38.6.  It's
been about 6 months so it's 'bout time I roll a new one. :)

>> Instead of telling us what you think the solution to your unidentified
>> bottleneck is and then asking "yeah or nay", tell us what the problem is
>> and allow us to recommend solutions.
> 
> I am not asking for "yay or nay", I just pointed out my idea, but I am
> open to other suggestions.

I think you've already discovered the best suggestions on your own.

> If the general idea is to buy a new big single storage system, I am more
> than happy to do just this, because this will prevent any problems I might
> have with a distributed one before they even can occur.

One box is definitely easier to administer and troubleshoot.  Though I
must say that even though it's more complex, I think the VM architecture
I described is worth a serious look.  If your current 12x1.5TB SAN array
is being retired as well, you could piggy back onto the array(s) feeding
the VMware farm, or expand them if necessary/possible.  Adding drives is
usually much cheaper than buying a new populated array chassis.  Given
your service contract comments it's unlikely you're the type to build
your own servers.  Being a hardwarefreak, I nearly always build my
servers and storage from scratch.  This may be worth a look merely for
educational purposes.  I just recently finished spec'ing out a new
high volume 20TB IMAP server which should handle 5000 concurrent users
without breaking a sweat, for only ~$7500 USD:

Full parts list:
http://secure.newegg.com/WishList/PublicWishDetail.aspx?WishListNumber=17069985

Summary:
2GHz 8-core 12MB L3 cache Magny Cours Opteron
SuperMicro MBD-H8SGL-O w/32GB qualified quad channel reg ECC DDR3/1333
  dual Intel 82574 GbE ports
LSI 512MB PCIe 2.0 x8 RAID, 24 port SAS expander, 20x1TB 7.2k WD RE4
20 bay SAS/SATA 6G hot swap Norco chassis

Create a RAID1 pair for /boot, the root filesystem, a swap partition of
say 8GB, and a 2GB partition for an external XFS log; that should leave
~900GB for utilitarian purposes.  Configure two spares.  Configure the
remaining 16 drives as RAID10 with a 64KB stripe size (8KB, 16 sector
strip size), yielding 8TB raw for the XFS-backed mdbox mail store.
Enable the BBWC write cache (dang, forgot the battery module, +$175).
This should yield approximately 8*150 = 1200 IOPS peak to/from disk,
many thousands to BBWC, more than plenty for 5000 concurrent users
given the IO behavior of most MUAs.  Channel bond the NICs to the
switch, or round-robin DNS the two IPs if you want path redundancy.
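
The filesystem creation for that layout would be something like the
following--device names are made up, and su/sw match the 8 effective
spindles with 8KB strips:

$ mkfs.xfs -d su=8k,sw=8 -l logdev=/dev/sda4,size=2000m /dev/sdb
$ mount -o logdev=/dev/sda4,noatime /dev/sdb /srv/mail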

What's that?  You want to support 10K users?  Simply drop in another 4
sticks of the 8GB Kingston Reg ECC RAM for 64GB total, and plug one of
these into the external SFF8088 port on the LSI card:
http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047
populated with 18 of the 1TB RE4 drives.  Configure 16 drives the same
as the primary array and grow it into your existing XFS.  Since you
have two identical arrays comprising the filesystem, the sunit/swidth
values are still valid so you don't need to add mount options.
Configure 2 drives as hot spares.  The additional 16-drive RAID10
doubles our disk IOPS to ~2400, maintaining our concurrent-user-to-IOPS
ratio at ~4:1, and doubles our mail storage to ~16TB.
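
Assuming the mail store sits on LVM like your current box, the growth
step is roughly this (volume and device names made up):

$ pvcreate /dev/sdc                        # the new 16-drive RAID10 LUN
$ vgextend vg_mail /dev/sdc
$ lvextend -l +100%FREE /dev/vg_mail/mail
$ xfs_growfs /srv/mail                     # XFS grows online while mounted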

This expansion hardware will run an additional ~$6200.  Grand total to
support ~10K concurrent users (maybe more) with a quality DIY build is
just over $14K USD, or ~$1.40 per mailbox.  Not too bad for an 8-core,
64GB server with 32TB of hardware RAID10 mailbox storage and 38 total
1TB disks.  I haven't run the numbers for a comparable HP system, but an
educated guess says it would be quite a bit more expensive, not the
server so much, but the storage.  HP's disk drive prices are outrageous,
though not approaching anywhere near the level of larceny EMC commits
with its drive sales.  $2400 for a $300 Seagate drive wearing an EMC
cape?  Please....

> Maybe two HP DL180s (one for production and one as test/standby system)
> with a SAS-attached enclosure for storage?

If you're hooked on 1U chassis (I hate 'em) go with the DL165 G7.  If
not, I'd go 2U, the DL385 G7.  Magny Cours gives you more bang for the
buck in this class of machines.  The performance is excellent, and, if
everybody buys Intel, AMD goes bankrupt, and then Chipzilla charges
whatever it desires.  They've already been sanctioned and fined by the
FTC at least twice.  They paid Intergraph $800 million in an antitrust
settlement in 2000 after forcing them out of the hardware business.
They recently paid AMD $1 billion in an antitrust settlement.  They're
just like Microsoft, putting competitors out of business by any and all
means necessary, even if their conduct is illegal.  Yes, I'd much
rather give AMD my business, given they had superior CPUs to Intel for
many years, and their current chips are still more than competitive.
/end rant. ;)

> Keeping in mind the new system has to work for some time (again 5 to 7
> years), I have to be able to extend the storage space without too much
> hassle.

Given you're currently only using ~1.3TB of ~15TB do you really see this
as an issue?  Will you be changing your policy or quotas?  Will the
university double its enrollment?  If not I would think a new 12-16TB
raw array would be more than plenty.

If you really want growth potential get a SATABeast and start with 14
2TB SATA drives.  You'll still have 28 empty SAS/SATA slots in the 4U
chassis, 42 total.  Max capacity is 84TB.  You get dual 8Gb/s FC LC
ports and dual GbE iSCSI ports per controller, all ports active, two
controllers max.  The really basic SKU runs about $20-25K USD with the
single controller and a few small drives, before
institutional/educational discounts.  www.nexsan.com/satabeast

I've used the SATABlade and SATABoy models (8 and 14 drives) and really
like the simplicity of design and the httpd management interface.  Good
products, and one of the least expensive and feature rich in this class.

Sorry this was so windy.  I am the hardwarefreak after all. :)

-- 
Stan


