Udo Wolter put forth on 11/8/2010 4:45 AM:
- Ralf Hildebrandt <Ralf.Hildebrandt@charite.de>:
And I'm guessing you're running a 32bit PAE kernel because VMWare ESX still doesn't officially support 64bit guests, correct?
No, it's supported, but I don't want to change the whole system.
That's right, we cannot switch without several hours of downtime. This is not acceptable. I'm thinking about a way to switch to 64 bit by exchanging disks etc. But I don't know if this will work, I have to test it first.
Does this machine have more than 4GB of RAM? You do realize that merely utilizing PAE will cause an increase in context switching, whether on bare metal or in a VM guest. It will probably actually be much higher with a VM guest running a PAE kernel. Also, please tell me the ESX kernel you're running is native 64 bit, not 32 bit. If the VMWare kernel itself is doing PAE, as well as the guest Linux kernel, this may fully explain the performance disaster you have on your hands, if it is indeed due to context switching.
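One rough way to check whether context switching really is the culprit is to sample the guest's counters under each Dovecot version and compare the rates. The following is only a minimal Python sketch, assuming a Linux guest where /proc/stat exposes the cumulative `ctxt` (context switches) and `intr` (interrupts) lines; the function names and the 5 second interval are illustrative choices, not anything from this thread.

```python
import time


def parse_proc_stat(text):
    """Return (context_switches, interrupts) totals from /proc/stat content."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] in ("ctxt", "intr"):
            # On the "intr" line the first number is the total of all interrupts.
            counters[fields[0]] = int(fields[1])
    return counters["ctxt"], counters["intr"]


def rates(before, after, seconds):
    """Convert two cumulative samples into per-second rates."""
    return tuple((a - b) / seconds for b, a in zip(before, after))


def sample(interval=5.0, path="/proc/stat"):
    """Read the counters twice, `interval` seconds apart (assumed path)."""
    with open(path) as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_proc_stat(f.read())
    return rates(before, after, interval)


if __name__ == "__main__":
    cs, irq = sample()
    print(f"context switches/s: {cs:.0f}  interrupts/s: {irq:.0f}")
```

Running it once while 1.2.x is serving and once under 2.0.x, at comparable user load, should show whether the two versions differ in context switch or interrupt rate, or whether the problem lies elsewhere.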
The bigger question is, why does this problem surface so readily while running Dovecot 2.0.x and not while running Dovecot 1.2.x? Is 1.2.x merely tickling the dragon's chin, whereas 2.0.x is sticking its head into the dragon's mouth?
Is this the only guest on this host or do you have others?
only guest
Yes, the VM-system has 8 CPUs and that's all the ESX has. Of course, there are times when the ESX isn't under much stress, so the DRS moves 1 or 2 other machines onto it. But since we got that high load, all the other machines have been moved off the ESX.
If this is the only guest, you have 2 dual core dies in that Xeon CPU, 4 cores total. I assume you've assigned 4 virtual CPUs to this Debian VM?
Yes, something like that
Ralf gave me the model number of that server and said it was a single CPU machine. I looked up the specs, and if that is the case, there are 4 cores total in that Xeon. And, IIRC, that Xeon does not have the HyperThreading circuitry. So, are there two physical CPUs in the machine with 4 cores each, or 1 CPU with 4 cores and HT, appearing as 8 cores? If it's one 4 core CPU with HT enabled, reboot the machine and disable HT in the BIOS. HT itself also contributes to high context switching. HT is more of a hindrance to ESX performance than a benefit.
www.vmware.com/pdf/vi_performance_tuning.pdf
You may want to run top in the hypervisor console itself (or an SSH session into the hypervisor) and watch the %CPU of the hypervisor's kernel threads. That might tell us something as well.
Udo has to answer that, but from what he told me it was fully using all cpus with 2.0, and now it's idling with 1.2
More details to follow (from him)
As I said in the other mail: as long as the load isn't high enough we cannot see any problems in the ESX. They only show up once we step over some specific threshold. I think that's the point where even the ESX runs out of ways to handle so many interrupts.
This very well may be the case. You need to also look at the CONFIG_HZ= value of the Linux kernel of the guest. If it's a tickless kernel you should be fine. If tickless, IIRC, you should see CONFIG_NO_HZ=y.
However, if CONFIG_HZ=1000 you're generating WAY too many timer interrupts/sec, ESPECIALLY on an 8 core machine. This will exacerbate the high context switching problem. On an 8 vCPU (and physical CPU) machine you should have CONFIG_HZ=100 or a tickless kernel. You may get by using 250, but anything higher than that is trouble.
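To check this on the guest, something like the following Python sketch could be used. It assumes the distribution ships the kernel build config as /boot/config-<release> (Debian normally does), and it treats HZ times the number of vCPUs as a rough upper bound on periodic timer ticks per second; both are assumptions made for illustration, and the function names are made up.

```python
import os
import re


def parse_kernel_config(text):
    """Return (CONFIG_HZ value or None, True if CONFIG_NO_HZ=y) from config text."""
    hz = None
    tickless = False
    for line in text.splitlines():
        line = line.strip()
        # Match only the numeric CONFIG_HZ line, not CONFIG_HZ_1000=y and friends.
        m = re.fullmatch(r"CONFIG_HZ=(\d+)", line)
        if m:
            hz = int(m.group(1))
        elif line == "CONFIG_NO_HZ=y":
            tickless = True
    return hz, tickless


def max_timer_interrupts_per_sec(hz, cpus):
    """Rough upper bound: one periodic tick stream of `hz` per CPU."""
    return hz * cpus


if __name__ == "__main__":
    # Assumed location of the running kernel's build config.
    path = "/boot/config-" + os.uname().release
    with open(path) as f:
        hz, tickless = parse_kernel_config(f.read())
    cpus = os.cpu_count() or 1
    print(f"CONFIG_HZ={hz} tickless={tickless} cpus={cpus}")
    if hz is not None:
        print("timer ticks/s upper bound:", max_timer_interrupts_per_sec(hz, cpus))
```

By that estimate, 8 vCPUs at CONFIG_HZ=1000 can mean up to 8000 timer ticks per second for the hypervisor to deliver, against 2000 at 250 and 800 at 100.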
-- Stan