In-depth Server Hardware Troubleshooting

Categorizes the pass, used, and fail conditions for each hardware component, and how to test for them.

Component Type

Pass Condition

Used Condition

Fail Condition

Processor

  • Can POST to BIOS (provides display) with motherboard, memory, and power supply

  • Any visible scratches on any part of the CPU

  • Thermal paste inside vent hole (this can be cleaned out)

  • Does not allow system POST to BIOS

  • Causes topology discrepancy with known-working parts (e.g. missing memory DIMM slots when the memory is clearly installed and has tested good in other systems)

Memory

  • Can POST to BIOS (provides display) with motherboard, processor, and power supply

  • Any visible scratches on any part of the Memory DIMM

  • Does not allow system POST to BIOS

  • (If ECC/DDR5) ECC reports errors for the installed DIMM

Power Supply (non-redundant)

  • Can POST to BIOS (provides display) with motherboard, memory, and processor

  • Any visible scratches on any part of the power supply

  • Power cable connectors have worn corners (e.g. from someone jamming the cables in)

  • Does not allow system POST to BIOS

  • Voltage detector does not detect power while the power supply is plugged into an outlet and switched ‘on’

Redundant Power Supply

(DO NOT TEST STANDALONE, ONLY IN COMPATIBLE BAREBONE)

  • See MFR manual for ‘working’ LED activity (e.g. green LED/blinking)

  • Allows barebone to power on (fans spinning)

  • Any visible scratches or damage on any part of the power supply (e.g. bent guiding ‘wings’, scratches on the large flat gold contacts)

  • See MFR manual for ‘failed’ LED activity (e.g. amber LED/static)

  • Does not allow barebone to power on

Drive (HDD/SSD/U.2 SSD)

  • Recognized in BIOS, OS, and SMART status Good/OK

  • LIGHTLY USED

    • <1000 Power-on Hours

  • Anything above can be categorized as ‘used’, but other factors also matter, such as lifetime reads/writes (e.g. heavy write volume takes a toll on SSDs)

  • Raw errors (SMART status Caution/Bad)

Drive (NVMe M.2/PCIe SSD)

  • Recognized in BIOS, OS, and SMART status Good/OK

  • NVMe smart-log shows no error log entries:

    • num_err_log_entries is 0

  • LIGHTLY USED

    • <1000 Power-on Hours

  • Critical temperatures may be logged, but these can be caused by the system the drive was installed in rather than by the drive itself

  • Not recognized in BIOS or OS (make sure the BIOS is set correctly to see NVMe devices; e.g. ‘legacy’ or ‘ops/bifurcation’ BIOS settings may need to be toggled)

  • NVMe smart-log shows a non-zero num_err_log_entries (check nvme error-log /dev/nvmeX for specifics)

Graphics Card

(non-computational/has display ports)

  • Provides video display from each port (unless noted otherwise from MFR)

  • Passes Furmark 24h

  • Passes GPU Standalone Validation (10/5/3)

  • Any major scratches on plastic casing or gold contact pins

  • Missing emblems (I’d prefer them missing, but customers may see this as a ‘used’ product)

  • Bent video display bezel/bracket (this can be replaced if we have the correct one)

  • Does not provide video display

  • Does not show up in lspci (Linux) or Device Manager (Windows)

  • Falls off the bus during the GPU stress tests listed under ‘pass condition’

Graphics Card

(Computational/passive/no display ports; e.g. Tesla series)

  • Passes GPU Standalone Validation (10/5/5)

  • Can output nvidia-bug-report

  • Any major scratches on plastic casing or gold contact pins

  • Bent bezel/bracket (this can be replaced if we have the correct one)

  • Does not show up in lspci (Linux)

Network Card (NIC)

  • Obtains an IP address via DHCP

  • Negotiates network speeds as specified (e.g. ethtool <port> | grep -i speed)

  • The card is usually all board, so look for a bent bracket or a missing LP (low-profile) or FL (full-height) bracket; full kits are usually sold with both

  • Does not show up in lspci (Linux) or Device Manager (Windows)

  • Does not obtain an IP, or drops it, while the system is under full load (thermal test)

Chassis

  • Has all chassis screws and accessories

  • All jumper pins attached and working

  • Obvious scratches, bends, or wear on the chassis

  • Missing fans

  • Missing accessories

  • Plate for mounting the motherboard is warped or is missing stand-offs, which prevents secure motherboard installation (a motherboard touching aluminum/metal will short if improperly mounted)

Detailed steps and tests by component category

General Hardware testing

https://exxact.atlassian.net/wiki/spaces/WAR/pages/789384865

https://exxact.atlassian.net/wiki/spaces/WAR/pages/789417550

CPU

https://exxact.atlassian.net/wiki/spaces/ESKB/pages/2260533261

CPU/Memory

Graphics Card

<Need, article for Furmark>

<Need, article for outputting nvidia-bug-report; this is necessary for any RMAs with NVIDIA or Supermicro>
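The ‘does not show up in lspci’ fail condition for graphics cards can be checked with a short script. This is a minimal sketch, not an established procedure: it assumes a Linux host with lspci installed, and the helper names read_lspci and find_gpus are hypothetical. It keeps only the lspci lines whose device class is a VGA or 3D controller; a card under test that is absent from the result meets the fail condition.

```python
import subprocess

# PCI class names that lspci prints for display-capable and compute-only GPUs.
GPU_CLASSES = ("VGA compatible controller", "3D controller")


def read_lspci() -> str:
    """Run lspci and return its text output (requires lspci on the host)."""
    result = subprocess.run(["lspci"], capture_output=True, text=True, check=True)
    return result.stdout


def find_gpus(lspci_output: str) -> list[str]:
    """Return the lspci lines that describe a GPU (hypothetical helper)."""
    gpus = []
    for line in lspci_output.splitlines():
        # Each line has the form "<bus:dev.fn> <class>: <vendor and device>".
        _, _, rest = line.partition(" ")
        if rest.startswith(GPU_CLASSES):
            gpus.append(line.strip())
    return gpus
```

On the system under test, find_gpus(read_lspci()) returns one line per detected card; an empty list means no GPU is visible on the PCI bus.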

Drives

<Need, article for Windows tool we use for drive testing>
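For NVMe drives on Linux, the table’s conditions (error log entries and the <1000 power-on-hour ‘lightly used’ threshold) can be applied to smart-log output. The sketch below assumes the JSON printed by nvme smart-log /dev/nvmeX -o json contains the keys power_on_hours and num_err_log_entries; key names and value types can differ between nvme-cli versions, so verify them on the installed version. The function names and the returned labels are hypothetical.

```python
import json

LIGHTLY_USED_MAX_HOURS = 1000  # 'lightly used' threshold from the condition table


def grade_nvme(smart: dict) -> str:
    """Grade a drive as 'fail', 'lightly used', or 'used' (hypothetical labels)."""
    # Assumed key names from nvme-cli JSON output; missing keys default to 0.
    err_entries = int(smart.get("num_err_log_entries", 0))
    hours = int(smart.get("power_on_hours", 0))

    if err_entries != 0:
        # Any logged error is a fail condition; inspect nvme error-log for details.
        return "fail"
    if hours < LIGHTLY_USED_MAX_HOURS:
        return "lightly used"
    return "used"


def grade_nvme_json(text: str) -> str:
    """Parse smart-log JSON text and grade the drive."""
    return grade_nvme(json.loads(text))
```

This only covers the two numeric checks; whether the drive is recognized in BIOS and the OS still has to be confirmed by hand.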

Network Cards (NIC)

<Need, article for outputting NIC specs, and explaining how to check negotiating speeds>

<Probably ethtool guide for both CentOS and Ubuntu>

<IB commands for Infiniband cards>
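The NIC pass condition (negotiated speed matches the spec) can be checked by parsing ethtool <port> output. This is a minimal sketch that assumes ethtool prints a line of the form "Speed: 10000Mb/s" when a link is up; the function names are hypothetical, and the expected speed must be supplied by the tester from the card’s spec sheet.

```python
import re
from typing import Optional

# Matches the "Speed: <number>Mb/s" line in ethtool output.
SPEED_RE = re.compile(r"^\s*Speed:\s*(\d+)Mb/s", re.MULTILINE)


def negotiated_speed_mbps(ethtool_output: str) -> Optional[int]:
    """Return the negotiated link speed in Mb/s, or None if none is reported."""
    match = SPEED_RE.search(ethtool_output)
    return int(match.group(1)) if match else None


def speed_matches_spec(ethtool_output: str, expected_mbps: int) -> bool:
    """True only when a speed is reported and it equals the expected value."""
    speed = negotiated_speed_mbps(ethtool_output)
    return speed is not None and speed == expected_mbps
```

A port with no link typically reports no numeric speed, so speed_matches_spec returns False for it rather than raising an error.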

Additional Notes

When pulling possibly sellable hardware from physically damaged system(s):

If we have to troubleshoot HGX-2 hardware again: