ESXi PSOD caused by a PCPU becoming too busy

One of our ESXi hosts failed with a “Purple Screen Of Death” (PSOD). The analysis below describes what we found to be the root cause of the failure.

The host was running vSphere 5.5 at an older patch level (build 30xxxxx). We could not identify any hardware failure or any error related to the server hardware, and I can confirm that the host was configured with the correct drivers.

This is part of the error log we found on the failed ESXi host:

2017-09-12T05:25:54.232Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.
2017-09-12T05:25:54.433Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.
2017-09-12T05:25:54.633Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.
2017-09-12T05:25:54.832Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.
2017-09-12T05:25:55.034Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.
2017-09-12T05:25:55.235Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.
2017-09-12T05:25:55.434Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.
2017-09-12T05:25:55.634Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.
2017-09-12T05:25:55.833Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.
2017-09-12T05:25:56.032Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.
2017-09-12T05:25:56.231Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.
2017-09-12T05:25:56.429Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.
2017-09-12T05:25:56.628Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.
2017-09-12T05:25:56.629Z cpu37:66166510)MCE: 222: cpu37: bank7: status=0xcc000f4000010091: (VAL=1, OVFLW=1, UC=0, EN=0, PCC=0, S=0, AR=0), ECC=no, Addr:0x526e3600 (valid), Misc:0x390261e840 (valid)
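The status value in the last log line can be decoded by hand to double-check the flags ESXi printed. The sketch below assumes the bit positions of the IA32_MCi_STATUS register as documented in the Intel SDM; the log does not state the CPU vendor, so treat the layout as an assumption. The function name decode_mci_status is made up for this example.

```python
# Bit positions assumed from the Intel SDM layout of IA32_MCi_STATUS.
FLAG_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVFLW": 62,  # another error arrived before this one was read
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # Misc register is valid
    "ADDRV": 58,  # Addr register is valid
    "PCC": 57,    # processor context corrupt
    "S": 56,      # signaled as a machine check exception
    "AR": 55,     # recovery action required
}


def decode_mci_status(status: int) -> dict:
    """Return the flag bits and the low 16-bit MCA error code of a status value."""
    fields = {name: (status >> bit) & 1 for name, bit in FLAG_BITS.items()}
    fields["mca_error_code"] = status & 0xFFFF
    return fields


if __name__ == "__main__":
    # Status value copied from the bank7 line in the log excerpt.
    for name, value in decode_mci_status(0xCC000F4000010091).items():
        print(f"{name} = {value:#x}")
```

For the value in the log this gives VAL=1, OVFLW=1 and UC=0, which matches what ESXi printed: a corrected (not uncorrected) error, with the overflow bit showing that more errors arrived before the bank was read.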

 

This was identified as the root cause: the PCPU becomes so busy logging the correctable error messages that it cannot perform its routine background tasks, which leads ESXi to assume that the PCPU is unresponsive.
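To see how heavily a single PCPU is being hit, it can help to count the CMCI messages per CPU and the time span they cover. The following is a minimal sketch, not a VMware tool: the regular expression is written against the log lines shown above, and the function name summarize_cmci is made up for this example.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# 2017-09-12T05:25:54.232Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI ...
CMCI_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z "
    r"cpu\d+:\d+\)MCE: \d+: (?P<cpu>cpu\d+): MCA error detected via CMCI"
)


def summarize_cmci(lines):
    """Return {cpu: (message count, seconds between first and last message)}."""
    counts = Counter()
    first, last = {}, {}
    for line in lines:
        m = CMCI_LINE.match(line)
        if not m:
            continue  # ignore anything that is not a CMCI message
        cpu = m.group("cpu")
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        counts[cpu] += 1
        first.setdefault(cpu, ts)
        last[cpu] = ts
    return {
        cpu: (n, (last[cpu] - first[cpu]).total_seconds())
        for cpu, n in counts.items()
    }


# First three lines of the excerpt above, used as sample input.
SAMPLE = [
    "2017-09-12T05:25:54.232Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.",
    "2017-09-12T05:25:54.433Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.",
    "2017-09-12T05:25:54.633Z cpu37:66166510)MCE: 1118: cpu37: MCA error detected via CMCI (Gbl status=0x0): Restart IP: invalid, Error IP: invalid, MCE in progress: no.",
]

if __name__ == "__main__":
    for cpu, (count, seconds) in summarize_cmci(SAMPLE).items():
        print(f"{cpu}: {count} CMCI messages in {seconds:.3f} s")
```

In the excerpt above, every message comes from cpu37 and they arrive roughly every 200 ms, which fits the picture of one PCPU being kept busy with logging.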

How we corrected the error: to fix this PSOD we had to update the host to ESXi 5.5 patch build 3568722. Note that the latest patch build currently available for 5.5 is 5230635.
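If you want to check several hosts against the build that fixed the problem for us, a small helper like the one below may be useful. It is only a sketch: it assumes a version line in the form "VMware ESXi 5.5.0 build-NNNNNNN" (the format I would expect from `vmware -v` on the host), and the name needs_patch and the low build number in the example are made up for illustration.

```python
import re

FIXED_BUILD = 3568722  # ESXi 5.5 build that resolved the PSOD in our case


def needs_patch(version_line: str) -> bool:
    """Return True if the line describes an ESXi 5.5 host below FIXED_BUILD."""
    m = re.search(r"ESXi (\d+\.\d+)\.\d+ build-(\d+)", version_line)
    if m is None:
        raise ValueError(f"unrecognized version line: {version_line!r}")
    version, build = m.group(1), int(m.group(2))
    # Build numbers are only compared within the 5.5 release line.
    return version == "5.5" and build < FIXED_BUILD


if __name__ == "__main__":
    examples = [
        "VMware ESXi 5.5.0 build-1000000",  # hypothetical older build
        "VMware ESXi 5.5.0 build-3568722",  # build we updated to
        "VMware ESXi 5.5.0 build-5230635",  # latest 5.5 build mentioned above
    ]
    for line in examples:
        print(f"{line}: needs patch = {needs_patch(line)}")
```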

You can read more about this in the KB articles below: