On 7/11/2011 8:20 AM, lists@truthisfreedom.org.uk wrote:
Hi Stan,
Quoting Stan Hoeppner <stan@hardwarefreak.com>:
On 7/11/2011 4:28 AM, lists@truthisfreedom.org.uk wrote:
Quoting Stan Hoeppner <stan@hardwarefreak.com>:
This still doesn't provide us with the necessary information to give you an intelligent answer to your question.
Sorry, I thought I'd given quite a large amount of detail so far.
To answer the questions I believe were in your analogy:
- All the servers are made by the same manufacturer (Dell)
- They are all the same model (R410)
- They have the same engine (24 cores, 24G RAM, SAS Drives)
The R410 is a two socket Xeon box with max 2 x 6 core CPUs. The 24 CPUs you see are the result of HyperThreading being enabled. I'd disable HT if I were you, or if those boxen were mine.
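A quick way to confirm the logical vs. physical count on a Linux box is to read /proc/cpuinfo. What follows is only a rough sketch: the function name count_cpus is made up for illustration, and it assumes the usual x86 Linux layout where each processor stanza carries "physical id" and "core id" fields.

```python
def count_cpus(cpuinfo_text):
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo-style text.

    Assumes x86 Linux formatting: one blank-line-separated stanza per
    logical CPU, each with 'processor', 'physical id' and 'core id' lines.
    """
    logical = 0
    cores = set()  # unique (physical id, core id) pairs
    physical_id = core_id = None

    # The extra "" makes sure the last stanza is flushed as well.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one processor stanza.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            physical_id = value
        elif key == "core id":
            core_id = value
    return logical, len(cores)
```

To try it on a live system, pass it the contents of /proc/cpuinfo, e.g. count_cpus(open("/proc/cpuinfo").read()). If the first number is twice the second, HyperThreading is most likely enabled.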
- The motorway is exactly the same for all servers (NFS to a NetApp 6080 and a RAMSAN)
- The weather is almost exactly the same (Same Datacentre, different rooms/racks)
- The Driver is exactly the same (Dovecot 1.0.15)
What operating system? Linux or *BSD? If Linux, what kernel version? Given that you're running Dovecot 1.0.15 I'm guessing you're using CentOS or RHEL 5.x and thus have kernel 2.6.18-xxx. 2.6.18 is 5 years old now and not appropriate for a modern 2 socket, 6 core HyperThreading box. You need a much newer kernel, preferably in the 2.6.3x series. 2.6.18 could be reporting incorrect load numbers on these machines.
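If it helps to check this across many nodes, something like the following could compare a kernel release string against a minimum version. It is a sketch only: the helper names are invented here, and the default floor of 2.6.30 is my own reading of "the 2.6.3x series", not a hard requirement.

```python
import re


def parse_kernel_release(release):
    """Turn a release string such as '2.6.18-238.el5' into (2, 6, 18)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    # A missing third component (e.g. '3.0') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def is_older_than(release, minimum=(2, 6, 30)):
    """True if the release is older than `minimum` (assumed floor: 2.6.30)."""
    return parse_kernel_release(release) < minimum
```

On a running system the release string is what uname -r prints; in Python, platform.release() returns the same value.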
The vast majority of the RAM usage is cache, however there is still a discrepancy between the IMAP servers and the POP3 servers.
It doesn't show in the top snapshots.
A discrepancy where? RAM usage by the pop and imap processes? Is there any reason why you didn't post the actual data?
I thought I had explained this, but obviously not.
The discrepancies lie in two areas:
- Load Average
On Linux, load average strictly shows total system CPU usage in intervals, nothing else. Neither memory, disk, nor network or anything else affects load average. Thus, with a 12 core system, until you see a load average above 12 you have absolutely nothing to worry about. With HT enabled load averages pretty much go out the window as half the "CPUs" are merely glorified duplicate register file phantoms.
Given that all mail apps are 100% IO bound, never CPU or memory bound, I'd guess you'll never see a load average over 4.00 on any of these machines with fewer than 1000 concurrent connections. This assumes you run a newer kernel with HT disabled. In other words, no more than 4 cores worth of CPU time will ever be eaten by your workload. What number do your Munin graphs show for load average for each set of boxes? Do they even come close to 4?
Also note that TCP stack processing on the pop nodes will be greater than that of the imap boxes, eating more CPU cycles. More data sent over the wire means more packets, more packets means more CPU time in both code/data processing and interrupts. If you're running iptables rules on each host that bumps up network processing cycles a bit more yet.
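The rule of thumb above (compare the load average with the number of physical cores) can be sketched in a few lines. The function names are hypothetical and the threshold of 1.0 per core is simply that rule restated. It only compares the two numbers; it does not say what is contributing to the load.

```python
import os


def load_per_core(load_average, physical_cores):
    """Return the load average divided by the number of physical cores."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return load_average / float(physical_cores)


def describe_load(load_average, physical_cores):
    """Summarise a load average against the one-task-per-core rule of thumb."""
    ratio = load_per_core(load_average, physical_cores)
    status = "below" if ratio < 1.0 else "at or above"
    return "load %.2f on %d cores is %s one task per core (ratio %.2f)" % (
        load_average, physical_cores, status, ratio)


def current_load_per_core(physical_cores):
    """Same ratio, using the live 1-minute load average (Unix only)."""
    one_minute = os.getloadavg()[0]
    return load_per_core(one_minute, physical_cores)
```

For the boxes discussed here, describe_load(4.0, 12) would be the case of a load of 4 on 12 physical cores, i.e. with HT left out of the count.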
- RAM Usage (particularly in regard to cache)
In both cases, the value for each area is higher on the three nodes running POP3 than the nodes running IMAP.
Almost all the memory consumption on both systems is buffer cache. Thus you don't have a memory issue on either host. The kernel will free and immediately reassign pages from cache to application processes as needed. I don't see evidence of the pop machine using more memory, in fact the imap processes are using more. Both boxes are just under 24GB total usage and both using right at 20GB of cache. Looks like a default config Linux kernel based on the ultra aggressive caching and eating up nearly all memory.
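To separate cache from what the processes themselves are using, the arithmetic is total minus free minus buffers and cache. A minimal sketch that does this from /proc/meminfo text follows. The function names and the returned dictionary keys are made up for illustration; MemTotal, MemFree, Buffers and Cached are the standard /proc/meminfo fields, reported in kB.

```python
def parse_meminfo(text):
    """Map /proc/meminfo field names to their integer values (kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip anything that is not 'Name:   <number> [unit]'.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_breakdown(fields):
    """Split memory into reclaimable cache and memory held by applications."""
    total = fields["MemTotal"]
    free = fields["MemFree"]
    cache = fields.get("Buffers", 0) + fields.get("Cached", 0)
    return {
        "total_kb": total,
        "cache_kb": cache,
        "apps_kb": total - free - cache,  # what processes actually hold
        "cache_share": cache / float(total),
    }
```

Run it as memory_breakdown(parse_meminfo(open("/proc/meminfo").read())) on one pop and one imap node. If apps_kb is small next to cache_kb on both, the difference you are seeing is cache, which the kernel gives back on demand.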
I guess all I'm really after knowing is if there is a reason why this is the case so I can put my mind (and those of my team!) at ease before we start making other changes to the infrastructure - the last thing I want to do is increase the load on these nodes and watch them die because they didn't have enough resources.
You still have not demonstrated what resources, if any, these nodes are lacking. The only thing you have mentioned is memory consumption. All Unices today will dump cache pages if a process needs memory space and will instantly reallocate it. If the bulk of the RAM on these systems is consumed by disk cache, you don't have a problem. If the "load" you mentioned is caused by something other than memory usage, then can you please show detail of such? Could you at least provide a snapshot of top output from one pop and one imap machine?
POP3: https://gist.github.com/1075816
IMAP: https://gist.github.com/1075821
Unfortunately I can't provide access to the Munin Graphs owing to company policies, however I'm happy to post the output of pretty much any command (except rm -rf ;) ) that you would like to see.
I hope that's enough detail, if not please let me know.
It may have been. I'll know when you post your load numbers from those top secret graphs. ;)
-- Stan