[Dovecot] POP3 vs. IMAP Load/Memory usage in Dovecot 1.0.15
Hi all,
We've just provisioned a new cluster of dovecot nodes running Centos
and Dovecot 1.0.15 (we needed to match the original configuration,
we're upgrading to 1.2 next week!).
The nodes are currently equally allocated (50/50 split) to IMAP and
POP3, with the intention to move them into a single cluster hosting
both services in the next month.
All the servers are of identical spec (24 cores, 24G RAM) and are
configured to load the indices, control files and maildirs via NFS.
We have noticed that the IMAP servers appear to be under much less
load and utilising drastically less RAM than the POP3 servers and I'm
wondering if there is a reason for this as we have seen some swapping
onto disk yet we are only handling 500 concurrent POP3 connections to
each server at any given time compared with over 600 IMAP connections.
I'm wondering if we've missed a config flag somewhere or (better
still!) this issue will go away when we upgrade to 1.2.
If anyone can shed any light on this, that would be much appreciated.
Thanks in advance,
Matt
On Fri, 2011-07-08 at 10:48 +0100, lists@truthisfreedom.org.uk wrote:
We have noticed that the IMAP servers appear to be under much less
load and utilising drastically less RAM than the POP3 servers and I'm
wondering if there is a reason for this as we have seen some swapping
onto disk yet we are only handling 500 concurrent POP3 connections to
each server at any given time compared with over 600 IMAP connections.
Am I to take it that this is expected behaviour?
If anyone can shed more light on this I'd be very grateful.
Thanks,
Matt
On 7/11/2011 1:24 AM, Matthew Macdonald-Wallace wrote:
On Fri, 2011-07-08 at 10:48 +0100, lists@truthisfreedom.org.uk wrote:
We have noticed that the IMAP servers appear to be under much less
load and utilising drastically less RAM than the POP3 servers and I'm
wondering if there is a reason for this as we have seen some swapping
onto disk yet we are only handling 500 concurrent POP3 connections to
each server at any given time compared with over 600 IMAP connections.
Am I to take it that this is expected behaviour?
If anyone can shed more light on this I'd be very grateful.
More specific information would be helpful. Load as shown through top doesn't really tell anything. Are you simply seeing memory pressure? Is all that RAM being used for block device cache or actually eaten by the pop servers?
-- Stan
Quoting Stan Hoeppner <stan@hardwarefreak.com>:
On 7/11/2011 1:24 AM, Matthew Macdonald-Wallace wrote:
On Fri, 2011-07-08 at 10:48 +0100, lists@truthisfreedom.org.uk wrote:
We have noticed that the IMAP servers appear to be under much less load and utilising drastically less RAM than the POP3 servers and I'm wondering if there is a reason for this as we have seen some swapping onto disk yet we are only handling 500 concurrent POP3 connections to each server at any given time compared with over 600 IMAP connections.
Am I to take it that this is expected behaviour?
If anyone can shed more light on this I'd be very grateful.
More specific information would be helpful. Load as shown through top doesn't really tell anything. Are you simply seeing memory pressure? Is all that RAM being used for block device cache or actually eaten by the pop servers?
Hi Stan,
Thanks for getting back to me.
The Load average comparisons are taken from Munin graphs and based
upon the servers being in production for five days between Monday and
Friday.
The vast majority of the RAM usage is cache, however there is still a
discrepancy between the IMAP servers and the POP3 servers.
I guess all I'm really after knowing is if there is a reason why this
is the case so I can put my mind (and those of my team!) at ease
before we start making other changes to the infrastructure - the last
thing I want to do is increase the load on these nodes and watch them
die because they didn't have enough resources.
Kind regards,
Matt
On 7/11/2011 4:28 AM, lists@truthisfreedom.org.uk wrote:
Quoting Stan Hoeppner <stan@hardwarefreak.com>:
On 7/11/2011 1:24 AM, Matthew Macdonald-Wallace wrote:
On Fri, 2011-07-08 at 10:48 +0100, lists@truthisfreedom.org.uk wrote:
We have noticed that the IMAP servers appear to be under much less load and utilising drastically less RAM than the POP3 servers and I'm wondering if there is a reason for this as we have seen some swapping onto disk yet we are only handling 500 concurrent POP3 connections to each server at any given time compared with over 600 IMAP connections.
Am I to take it that this is expected behaviour?
If anyone can shed more light on this I'd be very grateful.
More specific information would be helpful. Load as shown through top doesn't really tell anything. Are you simply seeing memory pressure? Is all that RAM being used for block device cache or actually eaten by the pop servers?
Hi Stan,
Thanks for getting back to me.
The Load average comparisons are taken from Munin graphs and based upon the servers being in production for five days between Monday and Friday.
This still doesn't provide us with the necessary information to give you an intelligent answer to your question. You've told us you have a Mustang and a Camaro and that one burned more gas in a week than the other. You didn't tell us the driving conditions of each, whether both city driving, or one city and one highway, winter or summer, or if grandma was driving one and Mario Andretti driving the other. The details matter.
The vast majority of the RAM usage is cache, however there is still a discrepancy between the IMAP servers and the POP3 servers.
A discrepancy where? RAM usage by the pop and imap processes? Is there any reason why you didn't post the actual data?
I guess all I'm really after knowing is if there is a reason why this is the case so I can put my mind (and those of my team!) at ease before we start making other changes to the infrastructure - the last thing I want to do is increase the load on these nodes and watch them die because they didn't have enough resources.
You still have not demonstrated what resources, if any, these nodes are lacking. The only thing you have mentioned is memory consumption. All Unices today will dump cache pages if a process needs memory space and will instantly reallocate it. If the bulk of the RAM on these systems is consumed by disk cache, you don't have a problem. If the "load" you mentioned is caused by something other than memory usage, then can you please show detail of such? Could you at least provide a snapshot of top output from one pop and one imap machine?
I feel like I'm pulling teeth here. You've made two posts about this issue and provided zero technical detail in either. Make this easier on both of us, and post some darn detail.
-- Stan
Hi Stan,
Quoting Stan Hoeppner <stan@hardwarefreak.com>:
On 7/11/2011 4:28 AM, lists@truthisfreedom.org.uk wrote:
Quoting Stan Hoeppner <stan@hardwarefreak.com>:
This still doesn't provide us with the necessary information to give you an intelligent answer to your question.
Sorry, I thought I'd given quite a large amount of detail so far.
To answer the questions I believe were in your analogy:
- All the servers are made by the same manufacturer (Dell)
- They are all the same model (R410)
- They have the same engine (24 cores, 24G RAM, SAS Drives)
- The motorway is exactly the same for all servers (NFS to a NetApp 6080 and a RAMSAN)
- The weather is almost exactly the same (Same Datacentre, different rooms/racks)
- The Driver is exactly the same (Dovecot 1.0.15)
The vast majority of the RAM usage is cache, however there is still a discrepancy between the IMAP servers and the POP3 servers.
A discrepancy where? RAM usage by the pop and imap processes? Is there any reason why you didn't post the actual data?
I thought I had explained this, but obviously not.
The discrepancies lie in two areas:
- Load Average
- RAM Usage (particularly in regard to cache)
In both cases, the value for each area is higher on the three nodes
running POP3 than the nodes running IMAP.
I guess all I'm really after knowing is if there is a reason why this is the case so I can put my mind (and those of my team!) at ease before we start making other changes to the infrastructure - the last thing I want to do is increase the load on these nodes and watch them die because they didn't have enough resources.
You still have not demonstrated what resources, if any, these nodes are lacking. The only thing you have mentioned is memory consumption. All Unices today will dump cache pages if a process needs memory space and will instantly reallocate it. If the bulk of the RAM on these systems is consumed by disk cache, you don't have a problem. If the "load" you mentioned is caused by something other than memory usage, then can you please show detail of such? Could you at least provide a snapshot of top output from one pop and one imap machine?
POP3: https://gist.github.com/1075816 IMAP: https://gist.github.com/1075821
Unfortunately I can't provide access to the Munin Graphs owing to
company policies, however I'm happy to post the output of pretty much
any command (except rm -rf ;) ) that you would like to see.
I hope that's enough detail, if not please let me know.
Thanks again,
Matt
On 7/11/2011 8:20 AM, lists@truthisfreedom.org.uk wrote:
Hi Stan,
Quoting Stan Hoeppner <stan@hardwarefreak.com>:
On 7/11/2011 4:28 AM, lists@truthisfreedom.org.uk wrote:
Quoting Stan Hoeppner <stan@hardwarefreak.com>:
This still doesn't provide us with the necessary information to give you an intelligent answer to your question.
Sorry, I thought I'd given quite a large amount of detail so far.
To answer the questions I believe were in your analogy:
- All the servers are made by the same manufacturer (Dell)
- They are all the same model (R410)
- They have the same engine (24 cores, 24G RAM, SAS Drives)
The R410 is a two socket Xeon box with max 2 x 6 core CPUs. The 24 CPUs you see are the result of HyperThreading being enabled. I'd disable HT if I were you, or if those boxen were mine.
- The motorway is exactly the same for all servers (NFS to a NetApp 6080 and a RAMSAN)
- The weather is almost exactly the same (Same Datacentre, different rooms/racks)
- The Driver is exactly the same (Dovecot 1.0.15)
What operating system? Linux or *BSD? If Linux, what kernel version? Given that you're running Dovecot 1.0.15 I'm guessing you're using CentOS or RHEL 5.x and thus have kernel 2.6.18-xxx. 2.6.18 is 5 years old now and not appropriate for a modern 2 socket, 6 core HyperThreading box. You need a much newer kernel, preferably in the 2.6.3x series. 2.6.18 could be reporting incorrect load numbers on these machines.
The vast majority of the RAM usage is cache, however there is still a discrepancy between the IMAP servers and the POP3 servers.
It doesn't show in the top snapshots.
A discrepancy where? RAM usage by the pop and imap processes? Is there any reason why you didn't post the actual data?
I thought I had explained this, but obviously not.
The discrepancies lie in two areas:
- Load Average
On Linux, load average strictly shows total system CPU usage in intervals, nothing else. Neither memory, disk, nor network or anything else affects load average. Thus, with a 12 core system, until you see a load average above 12 you have absolutely nothing to worry about. With HT enabled load averages pretty much go out the window as half the "CPUs" are merely glorified duplicate register file phantoms.
Given that all mail apps are 100% IO bound, never CPU or memory bound, I'd guess you'll never see a load average over 4.00 on any of these machines with less than 1000 concurrent connections. This assuming you run a newer kernel and with HT disabled. In other words, no more than 4 cores worth of CPU time will ever be eaten by your workload. What number do your Munin graphs show for load average for each set of boxes? Do they even come close to 4?
Also note that TCP stack processing on the pop nodes will be greater than that of the imap boxes, eating more CPU cycles. More data sent over the wire means more packets, more packets means more CPU time in both code/data processing and interrupts. If you're running iptables rules on each host that bumps up network processing cycles a bit more yet.
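A minimal sketch of how one might eyeball that network-processing overhead, assuming a Linux host and that the NIC is called eth0 (the interface name is just an example):

  # the si/hi fields on top's Cpu(s) line are softirq / hardware interrupt time
  top -b -n 1 | grep -i '^Cpu'
  # per-NIC interrupt counts since boot; eth0 is only an illustrative name
  grep eth0 /proc/interrupts

If si stays in the low single digits of percent, TCP processing isn't where the load is coming from.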
- RAM Usage (particularly in regard to cache)
In both cases, the value for each area is higher on the three nodes running POP3 than the nodes running IMAP.
Almost all the memory consumption on both systems is buffer cache. Thus you don't have a memory issue on either host. The kernel will free and immediately reassign pages from cache to application processes as needed. I don't see evidence of the pop machine using more memory, in fact the imap processes are using more. Both boxes are just under 24GB total usage and both using right at 20GB of cache. Looks like a default config Linux kernel based on the ultra aggressive caching and eating up nearly all memory.
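A quick way to confirm where the memory is actually going, a minimal sketch assuming the stock CentOS 5 tools:

  # the "-/+ buffers/cache" line shows memory use with the page cache
  # excluded, i.e. what the processes themselves actually hold
  free -m
  # or pull the raw counters straight from the kernel
  grep -E 'MemTotal|MemFree|Buffers|^Cached' /proc/meminfo

If the free figure on the buffers/cache line stays comfortably high, the cache is doing its job rather than starving the daemons.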
I guess all I'm really after knowing is if there is a reason why this is the case so I can put my mind (and those of my team!) at ease before we start making other changes to the infrastructure - the last thing I want to do is increase the load on these nodes and watch them die because they didn't have enough resources.
You still have not demonstrated what resources, if any, these nodes are lacking. The only thing you have mentioned is memory consumption. All Unices today will dump cache pages if a process needs memory space and will instantly reallocate it. If the bulk of the RAM on these systems is consumed by disk cache, you don't have a problem. If the "load" you mentioned is caused by something other than memory usage, then can you please show detail of such? Could you at least provide a snapshot of top output from one pop and one imap machine?
POP3: https://gist.github.com/1075816 IMAP: https://gist.github.com/1075821
Unfortunately I can't provide access to the Munin Graphs owing to company policies, however I'm happy to post the output of pretty much any command (except rm -rf ;) ) that you would like to see.
I hope that's enough detail, if not please let me know.
It may have been. I'll know when you post your load numbers from those top secret graphs. ;)
-- Stan
- All the servers are made by the same manufacturer (Dell)
- They are all the same model (R410)
- They have the same engine (24 cores, 24G RAM, SAS Drives)
The R410 is a two socket Xeon box with max 2 x 6 core CPUs. The 24 CPUs you see are the result of HyperThreading being enabled. I'd disable HT if I were you, or if those boxen were mine.
OK, I'll take a look at this, thanks.
- The motorway is exactly the same for all servers (NFS to a NetApp 6080 and a RAMSAN)
- The weather is almost exactly the same (Same Datacentre, different rooms/racks)
- The Driver is exactly the same (Dovecot 1.0.15)
What operating system? Linux or *BSD? If Linux, what kernel version? Given that you're running Dovecot 1.0.15 I'm guessing you're using CentOS or RHEL 5.x and thus have kernel 2.6.18-xxx. 2.6.18 is 5 years old now and not appropriate for a modern 2 socket, 6 core HyperThreading box. You need a much newer kernel, preferably in the 2.6.3x series. 2.6.18 could be reporting incorrect load numbers on these machines.
Linux, CentOS 5.6 and (yup, you've guessed it...) 2.6.18 again. I'll take a look at this, thanks.
- Load Average
On Linux, load average strictly shows total system CPU usage in intervals, nothing else. Neither memory, disk, nor network or anything else affects load average. Thus, with a 12 core system, until you see a load average above 12 you have absolutely nothing to worry about. With HT enabled load averages pretty much go out the window as half the "CPUs" are merely glorified duplicate register file phantoms.
Given that all mail apps are 100% IO bound, never CPU or memory bound, I'd guess you'll never see a load average over 4.00 on any of these machines with less than 1000 concurrent connections. This assuming you run a newer kernel and with HT disabled. In other words, no more than 4 cores worth of CPU time will ever be eaten by your workload. What number do your Munin graphs show for load average for each set of boxes? Do they even come close to 4?
They're showing as between 20 and 24 for the POP3 servers and 1.4 for
the IMAP servers.
Also note that TCP stack processing on the pop nodes will be greater than that of the imap boxes, eating more CPU cycles. More data sent over the wire means more packets, more packets means more CPU time in both code/data processing and interrupts. If you're running iptables rules on each host that bumps up network processing cycles a bit more yet.
OK, I'll take a look at that as well
- RAM Usage (particularly in regard to cache)
In both cases, the value for each area is higher on the three nodes running POP3 than the nodes running IMAP.
Almost all the memory consumption on both systems is buffer cache. Thus you don't have a memory issue on either host. The kernel will free and immediately reassign pages from cache to application processes as needed. I don't see evidence of the pop machine using more memory, in fact the imap processes are using more. Both boxes are just under 24GB total usage and both using right at 20GB of cache. Looks like a default config Linux kernel based on the ultra aggressive caching and eating up nearly all memory.
So a kernel update is more than sensible...
It may have been. I'll know when you post your load numbers from those top secret graphs. ;)
LOL, see above.
Thanks again,
Matt
On 7/11/2011 11:22 AM, lists@truthisfreedom.org.uk wrote:
They're showing as between 20 and 24 for the POP3 servers and 1.4 for the IMAP servers.
FULL STOP. Oh my lordy. Something is ridiculously wrong here. You have 12 physical cores with only ~600 simultaneous pop connections. That's only 50 per core. Even if those are the 'lowly' 2.4GHz 5645 chips each core should be able to handle a couple hundred pop connections. If you were truly hitting an actual load of 20-24, a single one of those boxes would be bringing your NetApp to its knees (assuming GbE) due to the amount of IO that would be taking place with the CPUs this busy.
So a kernel update is more than sensible...
Disable HT regardless of kernel upgrading. See if it helps the load issue with the current kernel. Then go ahead and upgrade the kernel. If the CentOS repos don't have anything in the 2.6.3x series grab: http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.39.3.tar.bz2
and roll your own. Though I'd guess since you're a CentOS user you probably don't have any experience rolling kernels.
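For what it's worth, a minimal sketch of how to confirm whether HT is currently on, assuming a stock /proc/cpuinfo layout:

  # if "siblings" is twice "cpu cores", HyperThreading is enabled
  grep -E 'siblings|cpu cores' /proc/cpuinfo | sort -u

On a dual 6-core box with HT on you would expect siblings 12 and cpu cores 6 for each physical package.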
Now, considering you're running a many years old version of Dovecot, which is no longer officially supported, you really need to upgrade. Safe bet is to grab the latest 1.2.x RPM you can get.
It may have been. I'll know when you post your load numbers from those top secret graphs. ;)
LOL, see above.
Thanks again,
You're welcome. Your problem isn't solved yet, but it soon will be. :)
-- Stan
On Mon, 2011-07-11 at 13:47 -0500, Stan Hoeppner wrote:
On 7/11/2011 11:22 AM, lists@truthisfreedom.org.uk wrote:
They're showing as between 20 and 24 for the POP3 servers and 1.4 for the IMAP servers.
FULL STOP. Oh my lordy. Something is ridiculously wrong here. You have 12 physical cores with only ~600 simultaneous pop connections. That's only 50 per core. Even if those are the 'lowly' 2.4GHz 5645 chips each core should be able to handle a couple hundred pop connections. If you were truly hitting an actual load of 20-24, a single one of those boxes would be bringing your NetApp to its knees (assuming GbE) due to the amount of IO that would be taking place with the CPUs this busy.
Good, so my assumption that something was wrong was correct and as the NetApp isn't on its knees...
So a kernel update is more than sensible...
Disable HT regardless of kernel upgrading. See if it helps the load issue with the current kernel. Then go ahead and upgrade the kernel. If the CentOS repos don't have anything in the 2.6.3x series grab: http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.39.3.tar.bz2
and roll your own. Though I'd guess since you're a CentOS user you probably don't have any experience rolling kernels.
LOL. I'm not a fan of Centos but it's what we've got to play with here - We'll be running Debian (or possibly even Gentoo if I have my way...) on the next load of servers and custom kernels aren't an issue.
/me misses stage one gentoo installs... :(
Now, considering you're running a many years old version of Dovecot, which is no longer officially supported, you really need to upgrade. Safe bet is to grab the latest 1.2.x RPM you can get.
We've built our own RPMS for 1.2 - we're upgrading these servers tomorrow... :)
Kind regards,
Matt
On 07/11/11 15:57, Matthew Macdonald-Wallace wrote:
LOL. I'm not a fan of Centos but it's what we've got to play with here - We'll be running Debian (or possibly even Gentoo if I have my way...) on the next load of servers and custom kernels aren't an issue.
Just don't tell anyone:
On Mon, 11 Jul 2011 20:57:36 +0100, Matthew Macdonald-Wallace wrote:
LOL. I'm not a fan of Centos but it's what we've got to play with here - We'll be running Debian (or possibly even Gentoo if I have my way...) on the next load of servers and custom kernels aren't an issue.
/me misses stage one gentoo installs... :(
Why? Note that the stage3 problems in current Gentoo are nearly solved now, and Funtoo uses RPMs.
But it has an openrc / baselayout 2.x where / is mounted twice in mtab; it seems not to accept ext4 and falls back to ext2 at runtime :(
I consider this a Gentoo bug, but the devs have another opinion :/
So as not to be completely offtopic: will Dovecot 2.x support dovecot -n > new-config.conf so it's easy to migrate over when the time comes?
On 7/11/2011 2:57 PM, Matthew Macdonald-Wallace wrote:
On Mon, 2011-07-11 at 13:47 -0500, Stan Hoeppner wrote:
On 7/11/2011 11:22 AM, lists@truthisfreedom.org.uk wrote:
They're showing as between 20 and 24 for the POP3 servers and 1.4 for the IMAP servers.
FULL STOP. Oh my lordy. Something is ridiculously wrong here. You have 12 physical cores with only ~600 simultaneous pop connections. That's only 50 per core. Even if those are the 'lowly' 2.4GHz 5645 chips each core should be able to handle a couple hundred pop connections. If you were truly hitting an actual load of 20-24, a single one of those boxes would be bringing your NetApp to its knees (assuming GbE) due to the amount of IO that would be taking place with the CPUs this busy.
Good, so my assumption that something was wrong was correct and as the NetApp isn't on its knees...
So a kernel update is more than sensible...
Disable HT regardless of kernel upgrading. See if it helps the load issue with the current kernel. Then go ahead and upgrade the kernel. If the CentOS repos don't have anything in the 2.6.3x series grab: http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.39.3.tar.bz2
Are these virtual machines? You didn't state so previously. Running 2.6.18 as a VM guest on these machines may also be part of the incorrect load reporting problem. If so, run the data collector daemon inside the hypervisor itself so you get actual load figures. You'll never get accurate performance metrics for a whole box from a kernel/daemon inside a VM guest.
-- Stan
On Tue, 2011-07-12 at 00:21 -0500, Stan Hoeppner wrote:
Are these virtual machines? You didn't state so previously. Running 2.6.18 as a VM guest on these machines may also be part of the incorrect load reporting problem. If so, run the data collector daemon inside the hypervisor itself so you get actual load figures. You'll never get accurate performance metrics for a whole box from a kernel/daemon inside a VM guest.
Nope, all on the bare metal in our own datacentre.
I'll let you know how the HT switchoff goes.
M.
On Tue, 2011-07-12 at 07:03 +0100, Matthew Macdonald-Wallace wrote:
On Tue, 2011-07-12 at 00:21 -0500, Stan Hoeppner wrote:
Are these virtual machines? You didn't state so previously. Running 2.6.18 as a VM guest on these machines may also be part of the incorrect load reporting problem. If so, run the data collector daemon inside the hypervisor itself so you get actual load figures. You'll never get accurate performance metrics for a whole box from a kernel/daemon inside a VM guest.
Nope, all on the bare metal in our own datacentre.
I'll let you know how the HT switchoff goes.
M.
Hi all,
Just to let you all know that once we upgraded to Dovecot 1.2 (and enabled attribute caching on the NFS devices!) the loads settled down.
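For anyone following along, a rough sketch of what "attribute caching enabled" means on the mount side; the filer name, mount point and the 30-second actimeo value below are purely illustrative, not our actual settings:

  # attribute caching is on by default; "noac" is what disables it
  filer:/vol/mail  /var/mail  nfs  rw,hard,intr,actimeo=30  0  0

and a sketch of the NFS-related knobs Dovecot 1.2 itself offers (again illustrative, not our exact dovecot.conf):

  mmap_disable = yes
  mail_nfs_storage = yes
  mail_nfs_index = yes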
We've seen a drop in 15 minute load average from 12 to 2 and a drop in NFS I/O from 30K I/OPS to 3K I/OPS whilst continuing to serve the same number of queries - quite a difference!
Thanks to all who helped,
Kind regards,
Matt
On 7/20/2011 1:27 PM, Matthew Macdonald-Wallace wrote:
On Tue, 2011-07-12 at 07:03 +0100, Matthew Macdonald-Wallace wrote:
On Tue, 2011-07-12 at 00:21 -0500, Stan Hoeppner wrote:
Are these virtual machines? You didn't state so previously. Running 2.6.18 as a VM guest on these machines may also be part of the incorrect load reporting problem. If so, run the data collector daemon inside the hypervisor itself so you get actual load figures. You'll never get accurate performance metrics for a whole box from a kernel/daemon inside a VM guest.
Nope, all on the bare metal in our own datacentre.
I'll let you know how the HT switchoff goes.
M.
Hi all,
Just to let you all know that once we upgraded to Dovecot 1.2 (and enabled attribute caching on the NFS devices!) the loads settled down.
If you're running with NFS caching enabled in v1.x you need to read:
We've seen a drop in 15 minute load average from 12 to 2 and a drop in NFS I/O from 30K I/OPS to 3K I/OPS whilst continuing to serve the same number of queries - quite a difference!
Great. Glad to see you're making some headway.
Worth noting, this is the first time in this thread that you've mentioned your NFS load. Up to now you mentioned only CPU and memory consumption as problem areas.
Thanks to all who helped,
The suggestion to upgrade to 1.2 was made very early on. Which helped more, v1.2 or enabling NFS caching?
Also, did you test any machines with hyper-threading disabled? If so, what effect did it have, if any?
-- Stan
On 7/12/2011 1:12 AM, Rainer Frey wrote:
On 11.07.2011, at 17:03, Stan Hoeppner wrote:
The R410 is a two socket Xeon box with max 2 x 6 core CPUs. The 24 CPUs you see are the result of HyperThreading being enabled. I'd disable HT if I were you, or if those boxen were mine.
Why?
It's a troubleshooting step. HT can cause weird problems with some systems/kernels. It can also decrease performance with some workloads. As with anything, if it doesn't provide benefit, turn if off to reduce complexity and potential problems.
-- Stan
On 11-07-11 5:03 PM, Stan Hoeppner wrote:
Given that you're running Dovecot 1.0.15 I'm guessing you're using CentOS or RHEL 5.x and thus have kernel 2.6.18-xxx. 2.6.18 is 5 years old now and not appropriate for a modern 2 socket, 6 core HyperThreading box. You need a much newer kernel, preferably in the 2.6.3x series. 2.6.18 could be reporting incorrect load numbers on these machines.
RHEL kernel version numbers do not say much. The Red Hat 2.6.18 is 2.6.18 plus a boatload of "enterprise load" patches and backports from 2.6.2x. OTOH, dovecot 1.0.15 is ancient indeed :)
The discrepancies lie in two areas:
- Load Average
On Linux, load average strictly shows total system CPU usage in intervals, nothing else.
That would be FreeBSD, AFAIK. On linux, I/O does add to the load average. A process in state 'D' (Disk wait, could be NFS wait too btw) adds '1' to the load. If you have a broken NFS server and 2000 processes waiting on I/O, the reported load will go over 2000.
You get a better impression of system load by running 'top' and paying attention to the numbers on the 'Cpu(s)' line: us = time spent in user processes, sy = kernel, id = idle, wa = I/O wait, si = soft interrupts.
Press '1' while in top to expand the view to show all CPUs separately. Quite enlightening.
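A minimal sketch of how to see whether that is what is inflating the load: count the processes currently in 'D' state, grouped by command name:

  ps -eo state,comm | awk '$1 == "D" {print $2}' | sort | uniq -c | sort -rn

On a box reporting a load average of 20+ while the CPUs sit mostly idle, you would expect a pile of pop3/dovecot processes in that list waiting on NFS.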
Given that all mail apps are 100% IO bound, never CPU or memory bound, I'd guess you'll never see a load average over 4.00 on any of these machines with less than 1000 concurrent connections.
Well, see above. Also, if you have SSL enabled, the crypto will actually eat quite a bit of CPU if you have a lot of network traffic.
Mike.
On Sat, 2011-07-16 at 00:40 +0200, Miquel van Smoorenburg wrote:
On 11-07-11 5:03 PM, Stan Hoeppner wrote:
On Linux, load average strictly shows total system CPU usage in intervals, nothing else.
That would be FreeBSD, AFAIK. On linux, I/O does add to the load
You're right Miquel, I/O adds to load in Linux, and has done for many years.
participants (8)
- Benny Pedersen
- lists@truthisfreedom.org.uk
- Matthew Macdonald-Wallace
- Michael Orlitzky
- Miquel van Smoorenburg
- Noel Butler
- Rainer Frey
- Stan Hoeppner