kernel: testing NMI watchdog … <4>WARNING: CPU#0: NMI appears to be stuck
This is what I was greeted with this morning on a 64bit RHEL 5 VM. After looking around the internet a bit, I found that this was a guest OS problem that was caused but a number of things.
First, was the resource pool setup on this cluster, this VM was in the “Low” resource pool, having 2000 shares for CPU. It also shares this pool with 5 other VM’s and 12 VM’s in it’s cluster.
Second, the cluster, including two hosts, both with dual quad cores (16 total cores), has exactly that number of vCPU’s allocated (Knowing this is bad… but alas, you can’t win them all).
Third, over the weekend, the server was powered down, had it’s VCPUs changed down from 4 vCPUs to 2 vCPUs.
It seems that during this process, and the reboots of all of the 12 other VM’s on the cluster (they were all having CPU’s changed) caused some scheduling contention. This cluster is not in the clear yet, but coming from having 24 allocated vCPU’s down to 16 was a good start.
The VM, once the processor contention went away, started operating normally again. What’s the morale of the story? Friends don’t let friends over commit CPU.