Using AMD’s mcat.exe to Debug your PSOD MCE (Machine Check Exception)

Tamarian[1]: “Sokath, his eyes opened”, or roughly “Understanding”. So what does the Tamarian language have to do with PSODs or Machine Check Exceptions (MCEs)? Well, neither makes much sense at first glance, and both take some understanding to translate properly.

What is an MCE (Machine Check Exception)?

A machine check exception, or MCE, is the system’s way of passing a hardware error up through the operating system when the error is severe enough to warrant a system halt. On ESX, these look similar to:

VMware ESX Server [Releasebuild-113339]
Machine Check Exception: Unable to continue
frame=0x3ad3d2c ip=0x625eb0 cr2=0xff400000 cr3=0x3f737000 cr4=0x168
es=0xffffffff ds=0xffffffff fs=0xffffffff gs=0xffffffff
eax=0xffffffff ebx=0xffffffff ecx=0xffffffff edx=0xffffffff
ebp=0x3ad3e88 esi=0xffffffff edi=0xffffffff err=-1 eflags=0xffffffff
0:1024/console *1:1076/vmware-vm 2:1120/vmware-vm 3:1121/vmm0:2551
4:1128/mks:20480 5:1084/mks:20480 6:1126/vmm0:2048 7:1108/vmm0:2308
@BlueScreen: Machine Check Exception: Unable to continue
0x3ad3e88:[0x625eb0]Panic+0x17 stack: 0x8424f0, 0x3ad3ea4, 0x3ad3eb0
0x3ad3e98:[0x625eb0]Panic+0x17 stack: 0x8424f0, 0x0, 0x0
0x3ad3eb0:[0x6667f8]MCE_HandleException+0x6b stack: 0x3ad3ef8, 0xbf5feaeb, 0x3ad3f20
0x3ad3ec0:[0x62093d]Int18_MachineCheck+0x4c stack: 0x3ad3ef8, 0x4028, 0x4028
0x3ad3f20:[0x692cac]CommonTrap+0xb stack: 0x23, 0xbf5feaea, 0xc1e40ee
0x3ad3f3c:[0x7024a7]User_CopyOut+0x52 stack: 0xbf5feaea, 0xc1e40ee, 0x2
0x3ad3f74:[0x722975]LinuxFileDesc_Poll+0x120 stack: 0xbf5feae4, 0x10, 0x64
0x3ad3fa8:[0x70304b]User_LinuxSyscallHandler+0x6a stack: 0x3ad3fe0, 0x23, 0x23
0xbf5fda98:[0x692cac]CommonTrap+0xb stack: 0x0, 0x0, 0x0
VMK uptime: 169:17:26:00.887 TSC: 27792623958323601
169:17:26:00.884 cpu1:1076)MCE: 169: Machine Check Exception: General Status 0000000000000004
169:17:26:00.884 cpu1:1076)MCE: 193: Machine Check Exception: Bank 0, Status b673400000000145
169:17:26:00.884 cpu1:1076)MCE: 226: Machine Check Exception: Bank 0, Addr 00000000206e38e0, Valid TRUE

Now you start to see the similarities to the Tamarian language, no? Well perhaps the Star Trek metaphor doesn’t work for you, but you can agree that the above is a little obtuse. How do we read it then?

Translating Using mcat.exe

Glad you asked… about translation that is. For both AMD and Intel platforms, this VMware KB provides excellent detail and guidance. If you are running an AMD platform, you can use some additional toolage (yes, it is a word!). First, download and install mcat.exe from the AMD site. Running mcat.exe /? gives us quite a bit of output. I’ve included the bits that are relevant to us below:

Machine Check Analysis Tool (MCAT) Version 1.1.10
USAGE:
   mcat /? [/cmd] bank status address misc] |
where
   /cmd
      bank      MCA error bank number
      status    MCA error status register value (prefix with 0x for hex)
      address   MCA error address register value (prefix with 0x for hex)
      misc      MCA error misc register value (prefix with 0x for hex) 
/cmd   Decode bank, status, address, misc provided in command line

You can see we’re primarily concerned with the /cmd switch, but where do we get the parameters to feed it? They’re in our PSOD message… These lines specifically:

169:17:26:00.884 cpu1:1076)MCE: 193: Machine Check Exception: Bank 0, Status b673400000000145
169:17:26:00.884 cpu1:1076)MCE: 226: Machine Check Exception: Bank 0, Addr 00000000206e38e0, Valid TRUE
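If you have more than one PSOD to work through, pulling those values out by hand gets old quickly. Here is a minimal Python sketch that extracts the bank, status, and address from the MCE lines and assembles the arguments for mcat. The function name mcat_args and the regular expressions are my own, written against the two lines shown above; other ESX builds may format these lines differently. It also assumes 0 is acceptable for the address when no Addr line is present, and for misc, which this PSOD does not print.

```python
import re

# The two MCE lines copied from the PSOD above.
PSOD_LINES = """\
169:17:26:00.884 cpu1:1076)MCE: 193: Machine Check Exception: Bank 0, Status b673400000000145
169:17:26:00.884 cpu1:1076)MCE: 226: Machine Check Exception: Bank 0, Addr 00000000206e38e0, Valid TRUE
"""

# Patterns written against the sample lines only (an assumption, not a spec).
STATUS_RE = re.compile(r"Bank (\d+), Status ([0-9a-fA-F]+)")
ADDR_RE = re.compile(r"Bank (\d+), Addr ([0-9a-fA-F]+)")


def mcat_args(psod_text):
    """Return one 'mcat /cmd ...' string per bank that has a Status line."""
    status = {int(bank): value for bank, value in STATUS_RE.findall(psod_text)}
    addr = {int(bank): value for bank, value in ADDR_RE.findall(psod_text)}
    commands = []
    for bank in sorted(status):
        # Assumption: fall back to 0 when the PSOD has no Addr line for a bank.
        address = addr.get(bank, "0")
        # The PSOD prints no misc register value, so 0 is passed for it.
        commands.append(f"mcat /cmd {bank} 0x{status[bank]} 0x{address} 0")
    return commands


for command in mcat_args(PSOD_LINES):
    print(command)
```

Paste your own MCE lines into PSOD_LINES and run the printed command from the MCAT install directory.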

On the command line it looks like this (the PSOD does not print a misc value, so 0 is passed for it):

C:\Program Files\AMD\MCAT>mcat /cmd 0 0xb673400000000145 0x00000000206e38e0 0
Processor Number  : Unknown
Bank Number       : 0
Time Stamp    (0x): 00000000 00000000
Error Status  (0x): B6734000 00000145
Error Address (0x): 00000000 206E38E0
Error Misc.   (0x): 00000000 00000000
Status Bit Decode:
   Correctable ECC error
   Processor state corrupted by error
   Error address valid in MCi_ADDR
   Error reporting enabled
   Error not corrected
   Error valid
Memory Error Code:
   Memory transaction type: Data write (DWR)
   Transaction type: Data
   Cache level: Level 1 (L1)
Data Cache Error MC0:
   Data array Store (DWR) error on Level 1 (L1) data cache
   Syndrome: 0xE6

A bit less obtuse this time. Reading it over, we find that there was a “Correctable ECC error”, but the very next line reports “Processor state corrupted by error”, and further down we see “Error not corrected”, so the error was not recovered cleanly. The decode also places the error in the Level 1 (L1) data cache during a data write. At this point, your best bet is to schedule a maintenance window and run Memtest86 or a similar diagnostic to rule RAM out of the equation. Once RAM is ruled out, contact your hardware vendor for some replacement procs.
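If you are curious how mcat arrives at those lines, the status value is just a 64-bit bitfield. The Python sketch below is not mcat's implementation; it is a rough illustration that decodes the same status value by hand. The flag bits 57 through 63 follow the standard machine check status register layout. The correctable ECC bit (46), the syndrome field (bits 54:47), and the memory error code format (0000 0001 RRRR TTLL in the low 16 bits) are assumptions based on AMD's documentation for processors of this era, chosen because they reproduce the mcat output above; check the guide for your own processor family before relying on them.

```python
# Flag bits in the 64-bit status value. Bit 46 is an AMD-specific assumption.
STATUS_FLAGS = [
    (46, "Correctable ECC error"),
    (57, "Processor state corrupted by error"),
    (58, "Error address valid in MCi_ADDR"),
    (60, "Error reporting enabled"),
    (61, "Error not corrected"),
    (63, "Error valid"),
]

# Field encodings for a memory error code (assumed: 0000 0001 RRRR TTLL).
TRANSACTION = {0: "GEN", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
               5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TYPE = {0: "Instruction", 1: "Data", 2: "Generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def decode_status(status):
    """Decode a machine check status value into flags, syndrome, and memory fields."""
    result = {
        "flags": [name for bit, name in STATUS_FLAGS if (status >> bit) & 1],
        # Assumption: the ECC syndrome sits in bits 54:47.
        "syndrome": (status >> 47) & 0xFF,
    }
    code = status & 0xFFFF
    if code >> 8 == 0x01:  # upper byte 0000 0001 marks a memory error code
        result["memory"] = (
            TRANSACTION.get((code >> 4) & 0xF, "unknown"),
            TYPE.get((code >> 2) & 0x3, "unknown"),
            LEVEL[code & 0x3],
        )
    return result


decoded = decode_status(0xB673400000000145)
for flag in decoded["flags"]:
    print(flag)
print("Memory error:", decoded.get("memory"))
print("Syndrome: 0x%02X" % decoded["syndrome"])
```

For the status value from our PSOD, this lists the same six status bits, a data write (DWR) on Data at L1, and syndrome 0xE6, matching what mcat printed.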

Questions? Comments? Drop a line in the comments.
