Burn-in tests used and suggested
mprime: https://www.mersenne.org/download/
GPU standalone validation: GPU Validation Test
Symptoms observed through hardware burn-in testing
Servers
Processor | - mprime for Linux
- prime95 for Windows
| - System unexpectedly reboots and reports errors on multiple memory DIMM slots
- CPU overheats
- Threshold: 80-90 °C for most Broadwell Xeons; 75-80 °C for most Skylake/Scalable
- Idle: 45-50 °C is a high idle temperature
| - Environment
- Debris/buildup on chassis
- Chassis fans
|
Memory | - mprime for Linux (monitored through EDAC/IPMI)
- prime95 for Windows (monitored through 'system' event logs or IPMI)
| - System unexpectedly reboots, and offending Memory DIMM is logged in BMC (or Windows) event log
|
|
Graphics Card | - Standalone Validation Test for Linux (log files with consistent/matching values)
- Windows 'binom' Tests (monitored through 'NVSMI' via CMD or event logs)
| - GPU falls off the bus or reports ERR! when monitored with 'nvidia-smi' (Linux); the event may be reported in /var/log
- One or more GPUs run significantly hotter than the others (92-95 °C)
- PCIe device reported under 'warnings'/'errors' in the Windows 'system' event log (Windows)
| |
Power Supply | - Running CPU, Memory, and GPU test at the same time
| - (Assuming correct outlet is being used) System unexpectedly powers off and does not reboot; logged in BMC event log
|
|
Motherboard | - With a critical motherboard issue, the system will not POST or provide any display
| - No power LED displaying on board when power supply is known-working and plugged in
|
|
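The processor thresholds in the table above can be turned into a simple check while mprime or prime95 is running. The following is a minimal sketch, not part of the original procedure: the function name, the family keys, and the "warning"/"overheating" split between the lower and upper bound of each range are assumptions made for illustration. The temperature itself would come from whatever sensor source is in use (IPMI, for example).

```python
# Load thresholds copied from the table above (degrees Celsius).
# The dictionary layout and key names are illustrative assumptions.
LOAD_THRESHOLDS_C = {
    "broadwell": (80, 90),  # most Broadwell Xeons
    "skylake": (75, 80),    # most Skylake/Scalable
}

# The table lists 45-50 C as a high idle temperature.
HIGH_IDLE_C = 45


def classify_cpu_temp(temp_c, family, under_load):
    """Return 'ok', 'high-idle', 'warning', or 'overheating'.

    Assumption: a reading inside the threshold range is a warning,
    and a reading at or above the upper bound is overheating.
    """
    if not under_load:
        return "high-idle" if temp_c >= HIGH_IDLE_C else "ok"
    low, high = LOAD_THRESHOLDS_C[family]
    if temp_c >= high:
        return "overheating"
    if temp_c >= low:
        return "warning"
    return "ok"
```

For example, `classify_cpu_temp(85, "broadwell", under_load=True)` falls inside the Broadwell range and is classified as a warning, which would be a cue to check the environment, chassis debris, and fans before suspecting the CPU.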
DevBox
Processor | - mprime for Linux
- prime95 for Windows
| - System unexpectedly reboots, even under a minimal POST configuration and after cycling through all memory DIMMs
- CPU overheats
- Threshold: 80-90 °C for most Broadwell Xeons; 75-80 °C for most Skylake/Scalable
- Idle: 45-50 °C is a high idle temperature
| - Environment
- Debris/buildup on chassis
- CPU liquid cooler
|
Memory | - mprime for Linux (monitored through EDAC/IPMI)
- prime95 for Windows (monitored through 'system' event logs or IPMI)
| - System unexpectedly reboots, and offending Memory DIMM might be logged in /var/log or Windows 'event logs'
| |
Graphics Card | - Standalone Validation Test for Linux (log files with consistent/matching values)
- Windows 'binom' Tests (monitored through 'NVSMI' via CMD or event logs)
| - GPU falls off the bus or reports ERR! when monitored with 'nvidia-smi' (Linux); the event may be reported in /var/log
- One or more GPUs run significantly hotter than the others (92-95 °C)
- PCIe device reported under 'warnings'/'errors' in the Windows 'system' event log (Windows)
| |
Power Supply | - Running CPU, Memory, and GPU test at the same time
| - (Assuming correct outlet is being used) System unexpectedly powers off and does not reboot
|
|
Motherboard |
| - No power LED displaying on board when power supply is known-working and plugged in
|
|
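The graphics card symptoms listed above (an ERR! reading, or one GPU running much hotter than the rest) can be spotted automatically during a burn-in run. The sketch below parses text in the shape produced by `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits`. The function name and the 15-degree outlier margin are assumptions for illustration, as is the idea that a failed GPU shows up as a non-numeric temperature field; the exact error string can vary by driver version.

```python
import statistics

HOT_GPU_C = 92         # from the table: 92-95 C on one GPU is suspicious
OUTLIER_MARGIN_C = 15  # assumption: gap above the median that counts as "significantly higher"


def check_gpus(csv_text):
    """Map GPU index -> problem description for lines like '0, 70'."""
    temps = {}
    problems = {}
    for line in csv_text.strip().splitlines():
        index, _, value = line.partition(",")
        index, value = index.strip(), value.strip()
        if value.isdigit():
            temps[index] = int(value)
        else:
            # Anything non-numeric (for example ERR!) is treated as a fault.
            problems[index] = f"unreadable temperature: {value}"
    if temps:
        median = statistics.median(temps.values())
        for index, temp in temps.items():
            if temp >= HOT_GPU_C or temp - median >= OUTLIER_MARGIN_C:
                problems[index] = f"{temp} C (median {median} C)"
    return problems


sample = "0, 70\n1, 71\n2, 94\n3, ERR!\n"
report = check_gpus(sample)
```

With the sample text, GPU 2 is flagged for running at 94 C against a median of 71 C, and GPU 3 is flagged for the unreadable reading. In practice the text could be captured with `subprocess.run(..., capture_output=True, text=True)` and polled at an interval while the validation test runs.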