- All the servers are made by the same manufacturer (Dell)
- They are all the same model (R410)
- They have the same engine (24 cores, 24G RAM, SAS Drives)
The R410 is a two socket Xeon box with max 2 x 6 core CPUs. The 24 CPUs you see are the result of HyperThreading being enabled. I'd disable HT if I were you, or if those boxen were mine.
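A quick way to check the logical versus physical CPU count is to parse /proc/cpuinfo. The following is a minimal sketch, assuming an x86 Linux kernel that exposes the usual `processor`, `physical id` and `core id` fields; the function name `count_cpus` is made up for illustration.

```python
def count_cpus(cpuinfo_text):
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo-style text.

    Assumes the x86 layout, where each "processor" stanza lists
    "physical id" (the socket) before "core id" (the core in that socket).
    """
    logical = 0
    cores = set()
    package = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line between stanzas
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
            package = None  # new stanza, socket not seen yet
        elif key == "physical id":
            package = value
        elif key == "core id":
            # HyperThreading siblings share the same (socket, core) pair
            cores.add((package, value))
    return logical, len(cores)
```

Feeding it the contents of /proc/cpuinfo, for example `count_cpus(open("/proc/cpuinfo").read())`, returns both counts. If the first number is twice the second, HyperThreading is most likely enabled.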
OK, I'll take a look at this, thanks.
- The motorway is exactly the same for all servers (NFS to a NetApp 6080 and a RAMSAN)
- The weather is almost exactly the same (Same Datacentre, different rooms/racks)
- The Driver is exactly the same (Dovecot 1.0.15)
What operating system? Linux or *BSD? If Linux, what kernel version? Given that you're running Dovecot 1.0.15 I'm guessing you're using CentOS or RHEL 5.x and thus have kernel 2.6.18-xxx. 2.6.18 is 5 years old now and not appropriate for a modern 2 socket, 6 core HyperThreading box. You need a much newer kernel, preferably in the 2.6.3x series. 2.6.18 could be reporting incorrect load numbers on these machines.
Linux, CentOS 5.6 and (yup, you've guessed it...) 2.6.18 again. I'll
take a look at this, thanks.
- Load Average
On Linux, load average strictly shows total system CPU usage in intervals, nothing else. Neither memory, disk, nor network or anything else affects load average. Thus, with a 12 core system, until you see a load average above 12 you have absolutely nothing to worry about. With HT enabled load averages pretty much go out the window as half the "CPUs" are merely glorified duplicate register file phantoms.
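The rule of thumb above (compare the load average against the number of physical cores) can be sketched as follows. This is only an illustration of that comparison, assuming the standard /proc/loadavg layout where the first field is the 1-minute average; `load_per_core` is a hypothetical helper name and the sample string is invented.

```python
def load_per_core(loadavg_text, physical_cores):
    """Return the 1-minute load average divided by the physical core count.

    loadavg_text is the content of /proc/loadavg, e.g.
    "3.00 2.50 2.10 2/300 12345" (hypothetical sample values).
    """
    one_minute = float(loadavg_text.split()[0])
    return one_minute / physical_cores


# Hypothetical sample: a load of 3.00 on a 12 core box
ratio = load_per_core("3.00 2.50 2.10 2/300 12345", 12)
print("load per core: %.2f" % ratio)
```

By the reasoning in this thread, a ratio below 1.0 means there are still idle cores. On a live box the input would come from `open("/proc/loadavg").read()`.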
Given that all mail apps are 100% IO bound, never CPU or memory bound, I'd guess you'll never see a load average over 4.00 on any of these machines with fewer than 1000 concurrent connections. This assumes you run a newer kernel with HT disabled. In other words, no more than 4 cores worth of CPU time will ever be eaten by your workload. What number do your Munin graphs show for load average for each set of boxes? Do they even come close to 4?
They're showing as between 20 and 24 for the POP3 servers and 1.4 for
the IMAP servers.
Also note that TCP stack processing on the pop nodes will be greater than that of the imap boxes, eating more CPU cycles. More data sent over the wire means more packets, more packets means more CPU time in both code/data processing and interrupts. If you're running iptables rules on each host that bumps up network processing cycles a bit more yet.
OK, I'll take a look at that as well
- RAM Usage (particularly in regard to cache)
In both cases, the value for each area is higher on the three nodes running POP3 than the nodes running IMAP.
Almost all the memory consumption on both systems is buffer cache. Thus you don't have a memory issue on either host. The kernel will free and immediately reassign pages from cache to application processes as needed. I don't see evidence of the pop machine using more memory, in fact the imap processes are using more. Both boxes are just under 24GB total usage and both using right at 20GB of cache. Looks like a default config Linux kernel based on the ultra aggressive caching and eating up nearly all memory.
So a kernel update is more than sensible...
It may have been. I'll know when you post your load numbers from those top secret graphs. ;)
LOL, see above.
Thanks again,
Matt