Re: [Dovecot] POP3 vs. IMAP Load/Memory usage in Dovecot 1.0.15

11 Jul 2011

On 7/11/2011 8:20 AM, lists@truthisfreedom.org.uk wrote:
...
Hi Stan,
Quoting Stan Hoeppner <stan@hardwarefreak.com>:
...
On 7/11/2011 4:28 AM, lists@truthisfreedom.org.uk wrote:
...
Quoting Stan Hoeppner <stan@hardwarefreak.com>:
This still doesn't provide us with the necessary information to give you
an intelligent answer to your question.
Sorry, I thought I'd given quite a large amount of detail so far.
To answer the questions I believe were in your analogy:

All the servers are made by the same manufacturer (Dell)
They are all the same model (R410)
They have the same engine (24 cores, 24G RAM, SAS Drives)

The R410 is a two socket Xeon box with max 2 x 6 core CPUs.  The 24 CPUs
you see is the result of HyperThreading being enabled.  I'd disable HT
if I were you, or if those boxen were mine.
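A quick way to confirm the 12-cores-versus-24-CPUs point is to count distinct (socket, core) pairs in /proc/cpuinfo. The following is a minimal sketch, assuming the x86 Linux layout where each logical CPU stanza carries "physical id" and "core id" fields; the function name and the sample text are illustrative and not taken from the poster's machines.

```python
def count_cpus(cpuinfo_text):
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo-style text."""
    logical = 0
    cores = set()
    # Each logical CPU is described by one blank-line-separated stanza.
    for stanza in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        # HyperThreading siblings share the same (socket, core) pair.
        # If the fields are missing (some VMs, non-x86), every CPU
        # collapses into a single (None, None) entry.
        cores.add((fields.get("physical id"), fields.get("core id")))
    return logical, len(cores)


# Hypothetical sample: one socket, two cores, HT enabled.
SAMPLE = """\
processor   : 0
physical id : 0
core id     : 0

processor   : 1
physical id : 0
core id     : 1

processor   : 2
physical id : 0
core id     : 0

processor   : 3
physical id : 0
core id     : 1
"""

print(count_cpus(SAMPLE))  # (4, 2): four logical CPUs on two real cores
```

On a real host, pass the contents of /proc/cpuinfo instead of SAMPLE. A logical count that is double the core count suggests HT is on.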
...

The motorway is exactly the same for all servers (NFS to a NetApp 6080
and a RAMSAN)
The weather is almost exactly the same (Same Datacentre, different
rooms/racks)
The Driver is exactly the same (Dovecot 1.0.15)

What operating system?  Linux or *BSD?  If Linux, what kernel version?
Given that you're running Dovecot 1.0.15 I'm guessing you're using
CentOS or RHEL 5.x and thus have kernel 2.6.18-xxx.  2.6.18 is 5 years
old now and not appropriate for a modern 2 socket, 6 core
HyperThreading box.  You need a much newer kernel, preferably in the
2.6.3x series.  2.6.18 could be reporting incorrect load numbers on
these machines.
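To check a fleet of hosts against that advice, something like the sketch below could compare a kernel release string with a minimum version. The threshold (2, 6, 30) is my reading of "the 2.6.3x series", and the release strings in the demo are examples only, not values reported in this thread.

```python
import re

# Assumed lower bound for "the 2.6.3x series" mentioned above.
MIN_KERNEL = (2, 6, 30)


def parse_kernel_version(release):
    """Turn a release string such as '2.6.18-238.el5' into (2, 6, 18)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_kernel_too_old(release, minimum=MIN_KERNEL):
    """True if the release sorts below the minimum version tuple."""
    return parse_kernel_version(release) < minimum


# Example release strings (illustrative).
for example in ("2.6.18-238.el5", "2.6.32-71.el6"):
    print(example, "too old" if is_kernel_too_old(example) else "ok")
```

On a live Linux box the release string can be obtained from platform.release() or `uname -r`.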
...
The vast majority of the RAM usage is cache, however there is still a
discrepancy between the IMAP servers and the POP3 servers.
It doesn't show in the top snapshots.
...
A discrepancy where?  RAM usage by the pop and imap processes?  Is there
any reason why you didn't post the actual data?
I thought I had explained this, but obviously not.
The discrepancies lie in two areas:

Load Average

On Linux, load average is a damped average of the number of tasks that
are runnable or in uninterruptible sleep (typically blocked on disk or
NFS I/O), sampled in intervals.  Memory and network usage do not feed
into it directly.  Thus, with a 12 core system, a load average below 12
that comes from runnable tasks is nothing to worry about, while a high
figure on mostly idle CPUs points at stalled I/O instead.  With
HT enabled load averages pretty much go out the window as half the
"CPUs" are merely glorified duplicate register file phantoms.
Given that all mail apps are 100% IO bound, never CPU or memory bound,
I'd guess you'll never see a load average over 4.00 on any of these
machines with fewer than 1000 concurrent connections.  This assumes you
run a newer kernel with HT disabled.  In other words, no more than 4
cores worth of CPU time will ever be eaten by your workload.  What
number do your Munin graphs show for load average for each set of boxes?
Do they even come close to 4?
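The comparison of load average against core count can be sketched in a few lines. This is only an illustration, assuming 12 physical cores as discussed above; os.getloadavg() and os.cpu_count() are standard library calls, while load_per_core and the PHYSICAL_CORES constant are names made up for the example.

```python
import os

# Assumption from the discussion above: 2 sockets x 6 cores, HT ignored.
PHYSICAL_CORES = 12


def load_per_core(load_avg, cores):
    """Return the load average divided by the number of cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


if __name__ == "__main__":
    one_min, five_min, fifteen_min = os.getloadavg()
    logical = os.cpu_count() or 1
    print("load averages: %.2f %.2f %.2f" % (one_min, five_min, fifteen_min))
    print("logical CPUs seen by the OS: %d" % logical)
    # A ratio under 1.0 means fewer waiting tasks than cores on average.
    print("5 min load per physical core: %.2f"
          % load_per_core(five_min, PHYSICAL_CORES))
```

A load of 4.00 on 12 cores gives a ratio of about 0.33, which is the kind of number being asked for from the Munin graphs.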
Also note that TCP stack processing on the pop nodes will be greater
than that of the imap boxes, eating more CPU cycles.  More data sent
over the wire means more packets, more packets means more CPU time in
both code/data processing and interrupts.  If you're running iptables
rules on each host, that bumps up network processing cycles a bit more yet.
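The "more data means more packets" point is simple arithmetic. Here is a rough sketch, assuming a 1460 byte TCP payload per segment (1500 byte Ethernet MTU minus 40 bytes of IPv4 and TCP headers, no options). The message counts and sizes are invented purely for illustration and say nothing about the actual workload on these nodes.

```python
import math

# Assumed payload per TCP segment on standard Ethernet.
MSS = 1460


def tcp_segments(payload_bytes, mss=MSS):
    """Minimum number of data segments needed to carry payload_bytes."""
    return math.ceil(payload_bytes / mss)


# Hypothetical session: a POP3 client downloads 50 full messages of
# 75,000 bytes, an IMAP client fetches 2,000 bytes of headers for each.
pop3_bytes = 50 * 75_000
imap_bytes = 50 * 2_000

print("POP3 segments:", tcp_segments(pop3_bytes))
print("IMAP segments:", tcp_segments(imap_bytes))
```

This ignores ACKs, retransmits and protocol chatter, so it only gives a lower bound on packets handled per session.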
...

RAM Usage (particularly in regard to cache)

...
In both cases, the value for each area is higher on the three nodes
running POP3 than the nodes running IMAP.
Almost all the memory consumption on both systems is buffer cache.  Thus
you don't have a memory issue on either host.  The kernel will free and
immediately reassign pages from cache to application processes as
needed.  I don't see evidence of the pop machine using more memory; in
fact the imap processes are using more.  Both boxes are just under 24GB
total usage and both are using right at 20GB of cache.  That looks like a
default config Linux kernel, given the ultra aggressive caching that eats
up nearly all memory.
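To separate what applications hold from what the kernel is merely caching, one can subtract free, buffer and cache pages from the total in /proc/meminfo. The sketch below assumes the usual "Key:   value kB" layout; the sample figures are hypothetical and only roughly shaped like the 24GB / 20GB numbers quoted above.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            info[key.strip()] = int(parts[0])  # most fields are in kB
    return info


def app_memory_kb(info):
    """Memory not accounted for by free pages, buffers or page cache."""
    return info["MemTotal"] - info["MemFree"] - info["Buffers"] - info["Cached"]


# Hypothetical sample values, not taken from the posted top output.
SAMPLE = """\
MemTotal:       24676000 kB
MemFree:          300000 kB
Buffers:          400000 kB
Cached:         20500000 kB
"""

info = parse_meminfo(SAMPLE)
print("used by applications: %d kB" % app_memory_kb(info))
```

With these sample numbers only about 3.3GB is really held by processes; the rest is reclaimable cache. On a real host, feed in the contents of /proc/meminfo.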
...
I guess all I'm really after knowing is if there is a reason why this is
the case so I can put my mind (and those of my team!) at ease before we
start making other changes to the infrastructure - the last thing I want
to do is increase the load on these nodes and watch them die because
they didn't have enough resources.
You still have not demonstrated what resources, if any, these nodes are
lacking.  The only thing you have mentioned is memory consumption.  All
Unices today will dump cache pages if a process needs memory space and
will instantly reallocate it.  If the bulk of the RAM on these systems
is consumed by disk cache, you don't have a problem.  If the "load" you
mentioned is caused by something other than memory usage, then can you
please show detail of such?  Could you at least provide a snapshot of
top output from one pop and one imap machine?
POP3: https://gist.github.com/1075816
IMAP: https://gist.github.com/1075821
Unfortunately I can't provide access to the Munin Graphs owing to
company policies, however I'm happy to post the output of pretty much
any command (except rm -rf ;) ) that you would like to see.
I hope that's enough detail, if not please let me know.
It may have been.  I'll know when you post your load numbers from those
top secret graphs. ;)
--
Stan