Post-OS Hardware Troubleshooting Guide

Burn-in tests used and suggested

mprime: https://www.mersenne.org/download/

GPU standalone validation: GPU Validation Test
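A burn-in run is usually left going for a fixed time while someone watches for an early exit. The sketch below shows one way to wrap that in Python. It assumes mprime has been unpacked into the current directory and that its `-t` option starts the torture test directly; the helper name `run_burn_in` and the duration are illustrative only, not part of any tool listed above.

```python
import subprocess
import sys
import time


def run_burn_in(cmd, duration_s, poll_s=1.0):
    """Run cmd for up to duration_s seconds.

    Returns (survived, exit_code). survived is True when the process was
    still running at the end of the time budget (it is then stopped here);
    otherwise exit_code holds the code it exited with.
    """
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        code = proc.poll()
        if code is not None:
            return False, code  # the test tool exited before the budget ran out
        time.sleep(min(poll_s, max(0.0, deadline - time.monotonic())))

    code = proc.poll()
    if code is not None:
        return False, code

    # Time budget used up: stop the test ourselves.
    proc.terminate()
    try:
        proc.wait(timeout=10)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    return True, None


# Hypothetical usage (path and duration are assumptions):
#   survived, code = run_burn_in(["./mprime", "-t"], duration_s=8 * 3600)
#   if not survived:
#       print(f"mprime exited early with code {code}", file=sys.stderr)
```

An early exit is only one signal. If the machine itself reboots, this script dies with it, so the BMC or OS event logs remain the place to confirm what happened.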

Symptoms observed through hardware burn-in testing

Servers

Each component type below lists the test method, the fail condition, and other possible factors.
Processor
  • Method
    • mprime for Linux
    • prime95 for Windows
  • Fail condition
    • System unexpectedly reboots and reports faults on multiple memory DIMM slots
    • CPU overheats
      • Threshold: 80-90 °C for most Broadwell Xeons; 75-80 °C for most Skylake/Scalable Xeons
      • Idle: 45-50 °C is a high idle temperature
  • Other possible factors
    • Environment
    • Debris/buildup on chassis
    • Chassis fans
Memory
  • Method
    • mprime for Linux (monitored through EDAC/IPMI)
    • prime95 for Windows (monitored through 'system' event logs or IPMI)
  • Fail condition
    • System unexpectedly reboots, and the offending memory DIMM is logged in the BMC (or Windows) event log
Graphics Card
  • Method
    • Standalone Validation Test for Linux (log files with consistent/matching values)
    • Windows 'binom' tests (monitored through 'NVSMI' via CMD or event logs)
  • Fail condition
    • GPU falls off the bus or reports ERR! in 'nvidia-smi' output (Linux); the event may be reported in /var/log
    • One or more GPUs run significantly hotter than the others (92-95 °C)
    • PCIe device reported under 'warnings'/'errors' in the Windows 'system' event log (Windows)
  • Other possible factors
    • Motherboard
    • Processor
Power Supply
  • Method
    • Running CPU, memory, and GPU tests at the same time
  • Fail condition
    • (Assuming the correct outlet is being used) System unexpectedly powers off and does not reboot; logged in the BMC event log
Motherboard
  • Fail condition
    • With critical motherboard issues, the system will not POST or provide any display
    • No power LED lit on the board when a known-working power supply is plugged in
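For the Linux memory test, the EDAC monitoring mentioned above can be read from sysfs. The following is a minimal sketch, assuming the kernel's EDAC driver is loaded and exposes per-controller `ce_count` (corrected errors) and `ue_count` (uncorrected errors) files under `/sys/devices/system/edac/mc`. The function names are illustrative, and the root path is a parameter so the logic can be tried against a copy of the directory.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int | None, "ue": int | None}}.

    An empty dict means the EDAC directory is missing, for example
    because the driver is not loaded on this machine.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


def controllers_with_errors(counts):
    """List the controllers whose corrected or uncorrected count is non-zero."""
    return [
        name
        for name, c in counts.items()
        if (c["ce"] or 0) > 0 or (c["ue"] or 0) > 0
    ]


# Hypothetical usage during a burn-in run:
#   before = read_edac_counts()
#   ... run the memory test ...
#   after = read_edac_counts()
#   print(controllers_with_errors(after))
```

Comparing a snapshot taken before the test with one taken after shows whether the counters moved during the run. Mapping a controller back to a physical DIMM slot still depends on the board, so the BMC event log stays the reference for the offending slot.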

DevBox

Each component type below lists the test method, the fail condition, and other possible factors.
Processor
  • Method
    • mprime for Linux
    • prime95 for Windows
  • Fail condition
    • System unexpectedly reboots, even under a minimal POST configuration and after cycling through all memory DIMMs
    • CPU overheats
      • Threshold: 80-90 °C for most Broadwell Xeons; 75-80 °C for most Skylake/Scalable Xeons
      • Idle: 45-50 °C is a high idle temperature
  • Other possible factors
    • Environment
    • Debris/buildup on chassis
    • CPU liquid cooler
Memory
  • Method
    • mprime for Linux (monitored through EDAC/IPMI)
    • prime95 for Windows (monitored through 'system' event logs or IPMI)
  • Fail condition
    • System unexpectedly reboots, and the offending memory DIMM might be logged in /var/log or the Windows event logs
  • Other possible factors
    • Motherboard
    • Processor
Graphics Card
  • Method
    • Standalone Validation Test for Linux (log files with consistent/matching values)
    • Windows 'binom' tests (monitored through 'NVSMI' via CMD or event logs)
  • Fail condition
    • GPU falls off the bus or reports ERR! in 'nvidia-smi' output (Linux); the event may be reported in /var/log
    • One or more GPUs run significantly hotter than the others (92-95 °C)
    • PCIe device reported under 'warnings'/'errors' in the Windows 'system' event log (Windows)
  • Other possible factors
    • Motherboard
    • Processor
Power Supply
  • Method
    • Running CPU, memory, and GPU tests at the same time
  • Fail condition
    • (Assuming the correct outlet is being used) System unexpectedly powers off and does not reboot
Motherboard
  • Fail condition
    • No power LED lit on the board when a known-working power supply is plugged in
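The GPU fail conditions above (a GPU with no usable reading, or one running much hotter than its neighbors) can be checked from a script instead of by eye. This is a sketch, assuming `nvidia-smi` is on the PATH and supports the `--query-gpu=index,temperature.gpu --format=csv,noheader,nounits` query. The 92 °C limit comes from the table; the 15 °C gap from the median is an assumed value for "significantly higher", and the function names are illustrative.

```python
import statistics
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_temps(csv_text):
    """Parse 'index, temperature' lines into {index: int or None}.

    A temperature that is not a plain number (an error marker, for
    example) is stored as None.
    """
    temps = {}
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue
        index, temp = parts
        try:
            temps[index] = int(temp)
        except ValueError:
            temps[index] = None
    return temps


def flag_gpus(temps, hot_c=92, delta_c=15):
    """Return {index: reason} for GPUs that match a fail condition.

    hot_c follows the 92-95 C range in the table; delta_c is an assumption.
    """
    readable = [t for t in temps.values() if t is not None]
    median = statistics.median(readable) if readable else None
    flagged = {}
    for index, t in temps.items():
        if t is None:
            flagged[index] = "no temperature reading"
        elif t >= hot_c:
            flagged[index] = f"{t} C is at or above {hot_c} C"
        elif median is not None and t - median >= delta_c:
            flagged[index] = f"{t} C is {t - median:g} C above the median"
    return flagged


def query_gpu_temps():
    """Run nvidia-smi once; raises CalledProcessError on a non-zero exit."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_temps(out)


# Hypothetical usage while the GPU test is running:
#   print(flag_gpus(query_gpu_temps()))
```

If `nvidia-smi` itself exits with an error during the run, that is worth treating as a finding in its own right and following up in /var/log, as described for the fall-off-the-bus case.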