In-depth Server Hardware Troubleshooting
Categorizes each hardware component's pass/fail conditions and how to test for them.
Component Type | Pass Condition | Used Condition | Fail Condition |
---|---|---|---|
Processor | | | |
Memory | | | |
Power Supply (non-redundant) | | | |
Redundant Power Supply (DO NOT TEST STANDALONE, ONLY IN COMPATIBLE BAREBONE) | | | |
Drive (HDD/SSD/U.2 SSD) | | | |
Drive (NVMe M.2/PCIe SSD) | | | |
Graphics Card (non-computational/has display ports) | | | |
Graphics Card (computational/passive/no display ports; e.g. Tesla series) | | | |
Network Card (NIC) | | | |
Chassis | | | |
Detailed steps and tests by component category
General Hardware testing
https://exxact.atlassian.net/wiki/spaces/WAR/pages/789384865
https://exxact.atlassian.net/wiki/spaces/WAR/pages/789417550
CPU
CPU/Memory
Graphics Card
<Need: article for FurMark>
<Need: article for outputting nvidia-bug-report; this is necessary for any RMAs with NVIDIA or Supermicro>
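Until that article exists, collecting the report can be sketched roughly as below. This is a minimal Python sketch, assuming the NVIDIA driver's `nvidia-bug-report.sh` script is on the PATH and writes `nvidia-bug-report.log.gz` into the directory it is run from. The script usually needs to be run as root. The function name, the `which` parameter, and the `rma-logs` directory are hypothetical and only here for illustration.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, Optional

TOOL = "nvidia-bug-report.sh"            # ships with the NVIDIA driver
DEFAULT_LOG = "nvidia-bug-report.log.gz"  # default output file name


def collect_bug_report(
    workdir: Path,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[Path]:
    """Run nvidia-bug-report.sh inside workdir and return the log path.

    Returns None if the script is not installed or no log was produced.
    The `which` parameter is a hypothetical hook so the lookup can be
    stubbed out on machines without the NVIDIA driver.
    """
    tool = which(TOOL)
    if tool is None:
        return None
    workdir.mkdir(parents=True, exist_ok=True)
    # The script writes its log into the current working directory.
    subprocess.run([tool], cwd=workdir, check=True)
    log = workdir / DEFAULT_LOG
    return log if log.exists() else None


if __name__ == "__main__":
    report = collect_bug_report(Path("rma-logs"))
    print(report if report else "nvidia-bug-report.sh not found or no log written")
```

The resulting `.log.gz` file is what would be attached to the RMA ticket.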
Drives
<Need: article for the Windows tool we use for drive testing>
Network Cards (NIC)
<Need: article for outputting NIC specs and explaining how to check negotiated speeds>
<Probably an ethtool guide for both CentOS and Ubuntu>
<IB commands for InfiniBand cards>
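As a rough illustration of the negotiated-speed check, the sketch below parses the text printed by `ethtool <interface>`. It assumes the usual output format with a `Speed: <n>Mb/s` line and a `Link detected: yes/no` line. The function names, the `eth0` interface, and the sample text are hypothetical, not output captured from a real system.

```python
import re
import subprocess
from typing import Optional

# Hypothetical excerpt of `ethtool eth0` output, for illustration only.
SAMPLE_OUTPUT = (
    "Settings for eth0:\n"
    "\tSpeed: 10000Mb/s\n"
    "\tDuplex: Full\n"
    "\tLink detected: yes\n"
)


def read_ethtool(interface: str) -> str:
    """Return the raw text of `ethtool <interface>` (may require root)."""
    result = subprocess.run(
        ["ethtool", interface], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_negotiated_speed(ethtool_output: str) -> Optional[int]:
    """Return the negotiated speed in Mb/s, or None if it is not reported."""
    match = re.search(r"^\s*Speed:\s*(\d+)Mb/s", ethtool_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def link_detected(ethtool_output: str) -> bool:
    """Return True only if ethtool reports 'Link detected: yes'."""
    match = re.search(r"^\s*Link detected:\s*(\w+)", ethtool_output, re.MULTILINE)
    return match is not None and match.group(1) == "yes"


def meets_expected_speed(ethtool_output: str, expected_mbps: int) -> bool:
    """Check that the link is up and negotiated at least expected_mbps."""
    speed = parse_negotiated_speed(ethtool_output)
    return link_detected(ethtool_output) and speed is not None and speed >= expected_mbps


if __name__ == "__main__":
    print(parse_negotiated_speed(SAMPLE_OUTPUT))
    print(meets_expected_speed(SAMPLE_OUTPUT, 10000))
```

A card that links at a lower speed than its rating (for example a 10 Gb/s NIC negotiating at 1000 Mb/s) would fail `meets_expected_speed`, which is a cue to check the cable, the switch port, and the card before deciding on its condition.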
Additional Notes
When pulling possibly sellable hardware from physically damaged system(s):
Physically inspecting damaged systems and parts
If we have to troubleshoot HGX-2 hardware again: