[Dovecot] Better to use a single large storage server or multiple smaller for mdbox?

Sun Apr 8 21:21:47 EEST 2012

On 4/7/2012 9:43 AM, Emmanuel Noobadmin wrote:
> On 4/7/12, Stan Hoeppner <stan at hardwarefreak.com> wrote:
> 
> Firstly, thanks for the comprehensive reply. :)
> 
>> I'll assume "networked storage nodes" means NFS, not FC/iSCSI SAN, in
>> which case you'd have said "SAN".
> 
> I haven't decided on that but it would either be NFS or iSCSI over
> Gigabit. I don't exactly get a big budget for this. iSCSI because I
> planned to do md/mpath over two separate switches so that if one
> switch explodes, the email service would still work.

So it seems you have two courses of action:

1.  Identify individual current choke points and add individual systems
and storage to eliminate those choke points.

2.  Analyze your entire workflow and all systems, identifying all choke
points, then design a completely new well integrated storage
architecture that solves all current problems and addresses future needs.

Adding an NFS server and moving infrequently accessed (old) files to
alternate storage will alleviate your space problems.  But it will
probably not fix some of the other problems you mention, such as servers
bogging down and becoming unresponsive, as that's not a space issue.
The cause of that would likely be an IOPS issue, meaning you don't have
enough storage spindles to service requests in a timely manner.
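
To make the spindle arithmetic concrete, here is a minimal Python sketch.
The per-drive IOPS figures are assumed ballpark values for illustration
only, not measurements from your systems, and the function names are
hypothetical.  It ignores RAID write penalty and controller cache, so
treat the result as an upper bound.

```python
# Rough random-IOPS estimate for a set of spindles.
# The per-drive figures below are ASSUMED ballpark values, not measurements.
ASSUMED_IOPS_PER_DRIVE = {
    "7.2k_sata": 75,
    "10k_sas": 125,
    "15k_sas": 175,
}


def aggregate_iops(drive_type: str, drive_count: int) -> int:
    """Upper-bound random IOPS, ignoring RAID write penalty and caching."""
    return ASSUMED_IOPS_PER_DRIVE[drive_type] * drive_count


def spindles_needed(drive_type: str, required_iops: int) -> int:
    """Smallest number of drives whose aggregate IOPS meets the requirement."""
    per_drive = ASSUMED_IOPS_PER_DRIVE[drive_type]
    return -(-required_iops // per_drive)  # ceiling division on integers


if __name__ == "__main__":
    print(aggregate_iops("7.2k_sata", 4))    # a small 4 drive server
    print(aggregate_iops("7.2k_sata", 12))   # a 12 drive chassis
    print(spindles_needed("15k_sas", 1000))  # drives for a 1000 IOPS target
```

Once you have a measured IOPS figure from the current servers, plug it
into spindles_needed() to see roughly how many drives the workload wants.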

>> Less complexity and cost is always better.  CPU throughput isn't a
>> factor in mail workloads--it's all about IO latency.  A 1U NFS server
>> with 12 drive JBOD is faster, cheaper, easier to setup and manage, sucks
>> less juice and dissipates less heat than 4 1U servers each w/ 4 drives.
> 
> My worry is that if that one server dies, everything is dead. With at
> least a pair of servers, I could keep it running, or if necessary,
> restore the accounts on the dead servers from backup, make some config
> changes and have everything back running while waiting for replacement
> hardware.

You are a perfect candidate for VMware ESX.  The HA feature will do
exactly what you want.  If one physical node in the cluster dies, HA
automatically restarts the dead VMs on other nodes, transparently.
Clients will have to reestablish connections, but everything else
will pretty much be intact.  Worst case scenario will possibly be a few
corrupted mailboxes that were being written when the hardware crashed.

A SAN is required for such a setup.  I had extensive experience with ESX
and HA about 5 years ago and it works as advertised.  After 5 years it
can only have improved.  It's not "cheap" but usually pays for itself
due to being able to consolidate the workload of dozens of physical
servers into just 2 or 3 boxes.

>>  I don't recall seeing your user load or IOPS requirements so I'm making
>> some educated guesses WRT your required performance and total storage.
> 
> I'm embarrassed to admit I don't have hard numbers on the user load
> except the rapidly dwindling disk space count and the fact when the
> web-based mail application try to list and check disk quota, it can
> bring the servers to a crawl. 

Maybe just starting with a description of your current hardware setup
and number of total users/mailboxes would be a good starting point.  How
many servers do you have, what storage is connected to each, percent of
MUA POP/IMAP connections from user PCs versus those from webmail
applications, etc, etc.

Probably the single most important piece of information would be the
hardware specs of your current Dovecot server, CPUs/RAM/disk array, etc,
and what version of Dovecot you're running.

The focus of your email is building a storage server strictly to offload
old mail and free up space on the Dovecot server.  From the sound of
things, this may not be sufficient to solve all your problems.

> My lame excuse is that I'm just the web
> dev who got caught holding the server admin potato.

Baptism by fire.  Ouch.  What doesn't kill you makes you stronger. ;)

>> is nearly irrelevant for a mail workload, you can see it's much cheaper
>> to scale capacity and IOPS with a single node w/fat storage than with
>> skinny nodes w/thin storage.  Ok, so here's the baseline config I threw
>> together:
> 
> One of my concern is that heavy IO on the same server slow the overall
> performance even though the theoretical IOPS of the total drives are
> the same on 1 and on X servers. Right now, the servers are usually
> screeching to a halt, to the point of even locking out SSH access due
> to IOWait sending the load in top to triple digits.

If multiple servers are screeching to a halt due to iowait, either all
of your servers' individual disks are overloaded, or you already have
shared storage.  We really need more info on your current architecture.
Right now we don't know if we're talking about 4 servers or 40, 100
users or 10,000.
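
If you want a hard number for how much time each box spends in iowait,
something like the following Python sketch could help.  It assumes a
Linux host, where the first line of /proc/stat lists cumulative CPU time
as user, nice, system, idle, iowait, irq, softirq, steal (then guest
fields, which are already counted in user and are skipped here).  The
function names are my own, not from any tool.

```python
def cpu_fields(line: str) -> list:
    """Parse the aggregate 'cpu' line of /proc/stat into integer counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    # Keep user..steal only; guest time is already included in user.
    return [int(x) for x in parts[1:9]]


def iowait_fraction(before: str, after: str) -> float:
    """Fraction of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(cpu_fields(before), cpu_fields(after))]
    total = sum(deltas)
    # Index 4 is the iowait counter in the field order listed above.
    return deltas[4] / total if total > 0 else 0.0


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the aggregate 'cpu' line (Linux only)."""
    with open(path) as f:
        return f.readline()
```

Take one sample with read_cpu_line(), wait a few seconds, take another,
and pass both to iowait_fraction().  A value that stays high on every
server points at overloaded disks rather than a lack of space.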

>> Some host failure redundancy is about all you'd gain from the farm
>> setup.  Dovecot shouldn't barf due to one NFS node being down, only
>> hiccup.  I.e. only imap process accessing files on the downed node would
>> have trouble.
> 
> But if I only have one big storage node and that went down, Dovecot
> would barf wouldn't it?
> Or would the mdbox format mean Dovecot would still use the local
> storage, just that users can't access the offloaded messages?

If the big storage node is strictly alt storage, and it dies, Dovecot
will still access its main mdbox storage just fine.  It simply wouldn't
be able to access the alt storage and would log errors for those requests.

If you design a whole new architecture from scratch, going with ESX and
an iSCSI SAN, this whole line of thinking is moot as there is no SPOF.
This can be done with as little as two physical servers and one iSCSI
SAN array which has dual redundant controllers in the base config.
Depending on your actual IOPS needs, you could possibly consolidate
everything you have now into two physical servers and one iSCSI SAN
array, for between $30-40K USD in hardware and $8-10K in ESX licenses.
That's just a guess on ESX as I don't know the current pricing.  Even if
it's that "high" it's far more than worth the price due to the capability.

Such a setup allows you to run all of your Exim, webmail, Dovecot, etc
servers on two machines, and you usually get much better performance
than with individual boxes, especially if you manually place the VMs on
the nodes for lowest network latency.  For instance, if you place your
webmail server VM on the same host as the Dovecot VM, TCP packet latency
drops from the high micro/low millisecond range into the mid nanosecond
range--a 1000x decrease in latency.  Why?  The packet transfer is now a
memory-to-memory copy through the hypervisor.  The packets never reach a
physical network interface.  You can do some of these things with free
Linux hypervisors, but AFAIK the poor management interfaces for them
make the price of ESX seem like a bargain.

>>> Also, I could possibly arrange them in a sort
>>> of network raid 1 to gain redundancy over single machine failure.
>>
>> Now you're sounding like Charles Marcus, but worse. ;)  Stay where you
>> are, and brush your hair away from your forehead.  I'm coming over with
>> my branding iron that says "K.I.S.S"

> Lol, I have no idea who Charles is, but I always feel safer if there
> was some kind of backup. Especially since I don't have the time to
> dedicate myself to server administration, by the time I notice
> something is bad, it might be too late for anything but the backup.

Search the list archives for Charles' thread about bringing up a 2nd
office site.  His desire was/is to duplicate machines at the 2nd site
for redundancy, when the proper thing to do is duplicate them at the
primary site, and simply duplicate the network links between sites.  My
point to you and Charles is that you never add complexity for the sake
of adding complexity.

> Of course management and clients don't agree with me since
> backup/redundancy costs money. :)

So does gasoline, but even as the price has more than doubled in 3 years
in the States, people haven't stopped buying it.  Why?  They have to
have it.  The case is the same for certain levels of redundancy.  You
simply have to have it.  Your job is properly explaining that need.  Ask
the CEO/CFO how much money the company will lose in productivity if
nobody has email for 1 workday, as that is how long it will take to
rebuild it from scratch and restore all the data when it fails.  Then
ask what the cost is if all the email is completely lost because they
were too cheap to pay for a backup solution?

To executives, money in the bank is like the family jewels in their
trousers.  Kicking the family jewels and generating that level of pain
seriously gets their attention.  Likewise, if a failed server plus
rebuild/restore costs $50K in lost productivity, spending $20K on a
solution to prevent that from happening is a good investment.  Explain
it in terms execs understand.  Have industry data to back your position.
There's plenty of it available.
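
A back-of-the-envelope version of that argument, as a minimal Python
sketch.  The headcount and hourly cost are hypothetical numbers I picked
so that one lost workday comes out to the $50K figure above.  Substitute
your company's real numbers.

```python
def outage_cost(employees: int, hourly_cost: float, hours_down: float,
                productivity_loss: float = 1.0) -> float:
    """Lost productivity in dollars for an email outage.

    productivity_loss is the assumed fraction of work that stops
    while email is unavailable (1.0 means all of it).
    """
    return employees * hourly_cost * hours_down * productivity_loss


def worth_buying(solution_cost: float, avoided_loss: float) -> bool:
    """True if the solution costs less than the single outage it prevents."""
    return solution_cost < avoided_loss


if __name__ == "__main__":
    # HYPOTHETICAL inputs: 250 staff at $25/hour, email down one 8 hour day.
    loss = outage_cost(employees=250, hourly_cost=25.0, hours_down=8)
    print(f"Lost productivity: ${loss:,.0f}")
    print("Spend $20K on prevention?", worth_buying(20_000, loss))
```

This only counts one outage and ignores the cost of permanently lost
mail, so it understates the case for redundancy and backup.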

-- 
Stan