Pre-OS Hardware Troubleshooting Guide
- Joey Houy
Hardware failure details decision tree
I understand this is not formatted as a typical decision tree; it was built with basic (expand) macros in the pre-2020 Confluence editor.
System does not power on when pressing the power button
POWER SUPPLY - Are there LED lights displaying on the power supplies while the system is powered off?
Yes
What color are they?
Green
Is it blinking or solid?
Blinking - (system powered off)
- Power supply is working and is on standby
Blinking - (system powered on; TYAN systems)
- PSU is working and is on standby for redundancy
Does system power on after checking the power supplies?
(END) Yes
- Move on to next troubleshooting tree if necessary
No
Does system power on after re-seating all memory DIMMs?
(END) Yes
- check topology in BIOS to make sure all installed memory is identified
No
CPU/Memory/Motherboard - Does it power on when system is brought down to 1x CPU and 1x Memory DIMM (on primary/first CPU/memory slot)?
- If they cannot perform this troubleshooting, they will need to ship this system back to Exxact for further troubleshooting; issue System RMA
- In red because this is the point where the hardware diagnostics become more involved/invasive and customers may damage internal components if they are not handled properly
- See the other option (POWER BUTTON) to quickly check whether the power button/ribbon cable is the root cause
Yes - swap CPU to see if issue persists; does system power on after swapping CPU?
(END) - Yes - Defective motherboard/slot; re-install memory and check topology in BIOS for CPU1 to make sure all installed memory is identified
- Could be a bad CPU pin/slot on the motherboard's secondary CPU slot
- Ask if they are okay with performing RMA on the chassis+motherboard (honestly if they got this far, I'm sure they can swap the barebone)
No - swap memory DIMMs; does system power on after swapping through all memory DIMMs that were uninstalled when CPU2 was removed?
- Still could be bad memory; try another memory DIMM to see if issue persists
(END) Yes - Defective memory DIMM; issue Component RMA for Memory
- try to have them re-create issue by re-installing suspected DIMM to see if system fails to power on/POST
- repopulate CPU2 and memory in pairs to ensure the rest of the memory DIMMs are allowing system to POST
- check topology in BIOS to make sure all installed memory is identified
(END) No - Defective CPU; issue Component RMA for CPU
- Most likely confirmed to be a bad CPU since:
- CPU1 slot works
- Installing either of the CPUs into secondary CPU slot does not allow system to power on
- Have them swap the memory DIMMs that were previously installed for CPU2's row into CPU1's to see if all memory is working properly
- System should still be able to power on with 1x CPU and its DIMMs, but they may lose half the PCIe slots on certain systems (typically older ones using 2011-v3/v4 CPUs)
(END) No - Defective motherboard; issue System RMA for confirmation of issue and repairs
- Could be bad primary CPU1 slot or bad motherboard entirely; issue System RMA for confirmation of issue and repairs
POWER BUTTON - Does pushing the power button fail to power on the system even though all of the board and PSU LED lights are on?
- In red because this is the point where the hardware diagnostics become more involved/invasive and customers may damage internal components if they are not handled properly
Does system power on after removing the ribbon cable and manually jumping the power pins?
If they are unable to do this, then issue System RMA
(END) Yes - Defective Power Button
- Have them re-seat the ribbon cable and try again; we can try to RMA the power button assembly if:
- They agree to perform the labor
- The barebone makes it accessible enough that we can provide a short guide (usually we don't replace this; we would just have the MFR send us the assembly)
- Ask whether they are okay with swapping the barebone, or make a judgement call on whether we should issue System RMA (do you trust them to perform the labor?)
- Make sure CPU/Memory are all identified properly in BIOS
No
- Go back and have them try the following tree
CPU/Memory/Motherboard - Does it power on when system is brought down to 1x CPU and 1x Memory DIMM (on primary/first CPU/memory slot)?
(END) - Solid
- it shouldn't be green and solid while system is powered off; power drain the system and see if issue persists
(END) - Amber - (system powered off, and all PSUs are amber)
- older systems use Amber for standby while system is powered off; try the power button and see if the LEDs turn solid green
(END) - One is Green, the other(s) are off / Amber / yellow / different (Defective PSU)
- most likely one of the PSUs is bad; try re-seating it and swapping locations with another PSU module. If the fault follows the PSU, then the PSU needs Component RMA. If it follows the slot/insert, then the barebone needs Component RMA (or System RMA if we need to swap components for customer and re-validate hardware)
(END) - No - (all PSUs off/no lights)
- try different power cables, check outlets, re-seat the PSU module; if there are still no lights/activity, the PDB (or barebone) needs Component RMA (or System RMA if we need to swap components for customer and re-validate hardware)
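The PSU LED branch of this tree boils down to a small lookup from what the customer observes (with the system powered off) to the next step. A minimal Python sketch of that mapping follows; the function name `psu_led_next_step` and the state labels such as `"green-blinking"` are illustrative assumptions, not part of any existing tooling.

```python
def psu_led_next_step(leds):
    """Map observed PSU LED states (system powered off) to the next step.

    `leds` holds one label per PSU module, e.g. ["green-blinking", "amber"].
    The labels are hypothetical shorthand for what the customer reports.
    """
    states = set(leds)
    # No lights at all on any PSU.
    if not leds or states == {"off"}:
        return ("Try different power cables/outlets and re-seat the PSU; "
                "if still dark, PDB (or barebone) needs RMA")
    # PSUs disagree with each other: one of them is suspect.
    if len(states) > 1:
        return ("Re-seat and swap PSU locations; RMA the PSU or the "
                "barebone, whichever the fault follows")
    state = leds[0]
    if state == "green-blinking":
        return "PSUs are on standby; continue with memory/CPU isolation"
    if state == "green-solid":
        return "Unexpected while powered off; power drain and re-check"
    if state == "amber":
        return ("Older systems use amber for standby; press the power "
                "button and look for solid green")
    return "Unrecognized LED state; refer to the MFR manual"


if __name__ == "__main__":
    print(psu_led_next_step(["green-blinking", "amber"]))
```

Keeping the mixed-state check ahead of the single-state checks mirrors the tree: a PSU that disagrees with its neighbours is handled as a likely defective PSU regardless of the exact colors.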
System powers on when pressing the power button, but there is no display
Make sure you go over the following display topics:
- Provide system info flier
- State which display port(s) or graphics card is used for primary display
- Have them double-check that the monitor is powered on and using the correct display source (VGA/DVI/HDMI/HDMI1/HDMI2/etc.)
- Try other cables/adapters; highly recommend NOT using adapters
Display issues are common for newly received systems or for new users of an older one. If customers fail to troubleshoot properly, make it very clear that they will need to pay for shipping if we deem the system NPF (No Problem Found).
Is correct display port/GPU being used?
Yes
Does display work after system was power drained completely, and powered on with ONLY the correct video display cable/port being used as noted in the system info flier?
(END) Yes - display works, move on to next tree
No - is there display after checking monitor settings and other cables?
- have them double-check that the monitor is powered on and using the correct display source (VGA/DVI/HDMI/HDMI1/HDMI2/etc.)
- try other cables/adapters that are known working for other systems at their site; highly recommend NOT using adapters
- try other display GPU ports and cables of different display interface types
(END) Yes - display works, move on to next tree
No - (if offboard display channel) is there display on the onboard channel (typically VGA or Motherboard display ports)?
Yes - does motherboard splash screen display? (offboard channel)
Yes - does the system have any codes unrelated to display being pushed to offboard channel? (see notes below)
- Check for any POST code displayed in addition to the board manufacturer logo
- Commonly observed codes
- Supermicro - 91 - display pushed to offboard channel
- Tyan - E3/AB - display pushed to offboard channel
- ASUS (on workstation motherboards or back of 2U chassis) - OS loaded properly
- B7/B9 (or B codes) typically point to a motherboard component (mostly memory) preventing the system from completing POST
- Refer to MFR manual for other uncommon codes
- The most important POST code is the one the system gets stuck at
(END) No - they are using ports meant for onboard display; make sure any display cables installed to the motherboard (onboard channel) are unplugged and reboot system
No - does monitor appear to be receiving activity when system boots up?
Yes - Is there display after either of the following? (see notes)
- (mostly single GPU systems) re-seating the GPU(s)
- (multi-GPU servers) swapping the GPUs to see if the one being used for display is not working properly
(END) Yes - display works, move on to next tree; issue Component RMA for GPU if necessary
- address the GPU if their system boots to OS; run commands to ensure it is being identified by drivers/system; issue Component RMA for the GPU if necessary
No - is there a known working GPU, for display, to test with the system?
Yes - does the display work after removing all GPUs and using only the known-working GPU for display?
(END) Yes - (multi-GPU servers) install the other GPUs and run commands to ensure they are being identified by drivers/system; any GPU not being identified is possibly defective
(END) Yes - (single GPU systems) issue Component RMA for defective GPU
(END) No - issue System RMA
(END) No - Issue system RMA
- If we're at this point, please be sure they have tried everything above...
- Cables
- Monitors
- Re-seating display ports/connectors
- Send fair warning they will need to pay shipping if we deem system NPF
(END) No - (if onboard display channel) issue system RMA
- If we're at this point, please be sure they have tried everything above...
- Cables
- Monitors
- Re-seating display ports/connectors
- Send fair warning they will need to pay shipping if we deem system NPF
No - make sure the correct port is used; see the other options in case CMOS/BIOS was reset/updated and the system was initially set to the offboard channel in BIOS
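The commonly observed POST codes noted in this tree can be kept in a small lookup table for quick triage. The Python sketch below encodes only the codes named in this guide; `POST_CODE_NOTES` and `describe_post_code` are hypothetical names, and anything not listed still needs to be checked against the MFR manual.

```python
# Codes taken from the notes in this guide; not an exhaustive vendor list.
POST_CODE_NOTES = {
    ("supermicro", "91"): "display pushed to offboard channel",
    ("tyan", "E3"): "display pushed to offboard channel",
    ("tyan", "AB"): "display pushed to offboard channel",
}


def describe_post_code(vendor, code):
    """Return the guide's note for a POST code, or a fallback hint."""
    vendor = vendor.strip().lower()
    code = code.strip().upper()
    note = POST_CODE_NOTES.get((vendor, code))
    if note:
        return note
    # The guide treats B7/B9 (and B codes in general) as a likely
    # motherboard component problem, mostly memory.
    if code.startswith("B"):
        return "likely motherboard component (mostly memory) blocking POST"
    return "uncommon code; refer to the MFR manual"


if __name__ == "__main__":
    print(describe_post_code("Supermicro", "91"))
    print(describe_post_code("Tyan", "b9"))
```

Whatever the table says, the code worth recording in the ticket is the one the system is stuck at.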
System powers on when pressing the power button, and displays, but does not boot to OS
Is system display stuck at splash screen?
Yes - does the system have any codes unrelated to display being pushed to offboard channel? (see notes below)
- Check for any POST code displayed in addition to the board manufacturer logo
- Commonly observed codes
- Supermicro - 91 - display pushed to offboard channel
- Tyan - E3 - display pushed to offboard channel
- ASUS (on workstation motherboards or back of 2U chassis) - OS loaded properly
- B7/B9 (or B codes) typically point to a motherboard component (mostly memory) preventing the system from completing POST
- Refer to MFR manual for other uncommon codes
- The most important POST code is the one the system gets stuck at
Yes - please continue to see if system fails to POST due to CPU/Memory/Motherboard
Does system POST after re-seating all memory DIMMs?
(END) Yes
- check topology in BIOS to make sure all installed memory is identified
No
CPU/Memory/Motherboard - Does it POST when system is brought down to 1x CPU and 1x Memory DIMM (on primary/first CPU/memory slot)?
- If they cannot perform this troubleshooting, they will need to ship this system back to Exxact for further troubleshooting; issue System RMA
- In red because this is the point where the hardware diagnostics become more involved/invasive and customers may damage internal components if they are not handled properly
- See the other option to quickly check whether the power button/ribbon cable is the root cause
Yes - swap CPU to see if issue persists; does system power on after swapping CPU?
(END) - Yes - Defective motherboard/slot; re-install memory and check topology in BIOS for CPU1 to make sure all installed memory is identified
- Could be a bad CPU pin/slot on the motherboard's secondary CPU slot
- Ask if they are okay with performing RMA on the chassis+motherboard (honestly if they got this far, I'm sure they can swap the barebone)
No - swap memory DIMMs; does system power on after swapping through all memory DIMMs that were uninstalled when CPU2 was removed?
- Still could be bad memory; try another memory DIMM to see if issue persists
(END) Yes - Defective memory DIMM; issue Component RMA for Memory
- try to have them re-create issue by re-installing suspected DIMM to see if system fails to power on/POST
- repopulate CPU2 and memory in pairs to ensure the rest of the memory DIMMs are allowing system to POST
- check topology in BIOS to make sure all installed memory is identified
(END) No - Defective CPU; issue Component RMA for CPU
- Most likely confirmed to be a bad CPU since:
- CPU1 slot works
- Installing either of the CPUs into secondary CPU slot does not allow system to power on
- Have them swap the memory DIMMs that were previously installed for CPU2's row into CPU1's to see if all memory is working properly
- System should still be able to power on with 1x CPU and its DIMMs, but they may lose half the PCIe slots on certain systems (typically older ones using 2011-v3/v4 CPUs)
(END) No - Defective motherboard; issue System RMA for confirmation of issue and repairs
- Could be bad primary CPU1 slot or bad motherboard entirely; issue System RMA for confirmation of issue and repairs
(END) No - they are using ports meant for onboard display; make sure any display cables installed to the motherboard (onboard channel) are unplugged and reboot system
No - can BIOS be accessed using the 'del' key while system powers on?
Yes - does system boot to OS after verifying boot priority is set correctly to first scan the disks containing the OS?
- make sure all installed drives are identified in the 'boot' tab in BIOS
(END) Yes - Cool
No - does OS boot after physically re-seating the drives?
(END) Yes - Cool
No - (if system was set to use offboard originally) can you get to the OS if you change primary display setting to 'onboard'?
(END) Yes - boots to OS, see next tree related to OS issues
No - Does system boot up using another (separate) boot drive installed?
(END) Yes - Corrupted boot drive/OS; possible Component RMA, or escalation (see notes)
- Multiple boot drives set up as RAID1 - it is unlikely both drives were corrupted at the same time unless the OS/kernel was altered
- escalate to management to have them review scenario or to quote options for drive/OS/SW
- Single boot drive - issue Component RMA
- we need to pre-load the OS/SW at a loss
(END) No - (unlikely) Motherboard/SATA port issue with board; issue System RMA
No - are you receiving any activity lights on the keyboard while system powers on?
- strike 'caps lock' or 'scroll lock' keys to see if the keyboard LEDs (if applicable) react
Yes - can system reboot by using 'ctrl+alt+del'?
Yes - (if stuck at a blank screen after splash screens load) can the OS/kernel selection screen be accessed by using 'up+down arrow keys' after splash screen passes during boot?
- Proceed to OS issues tree if able to get to OS/kernel selection screen
(END) Yes - Proceed to OS issues tree if able to get to OS/kernel selection screen
No - Does system boot up using another (separate) boot drive installed?
(END) Yes - Corrupted boot drive/OS; possible Component RMA, or escalation (see notes)
- Multiple boot drives set up as RAID1 - it is unlikely both drives were corrupted at the same time unless the OS/kernel was altered
- escalate to management to have them review scenario or to quote options for drive/OS/SW
- Single boot drive - issue Component RMA
- we need to pre-load the OS/SW at a loss
(END) No - (unlikely) Motherboard/SATA port issue with board; issue System RMA
(END) No - Proceed to OS issues tree if able to get to OS/kernel selection screen
No - can you get to BIOS using another keyboard and/or USB port after restarting the system?
Yes - go back to "No - can BIOS be accessed using the 'del' key while system powers on?"
No - (see notes) reboot system and try to get BIOS or OS/kernel selection screen; can you access either?
- If the OS boots to a certain point, and display driver or core packages are corrupted, you may lose all keyboard/mouse activity; rebooting and trying to boot from a different point/kernel may help proceed with troubleshooting
- (If cannot get to BIOS) Proceed to OS issues tree if able to get to OS/kernel selection screen
Yes - if BIOS, go back up to "No - can BIOS be accessed using the 'del' key while system powers on?"
No - can you get to BIOS by removing all drives and then trying the 'del' key again?
- Make sure system is powered off before removing/installing drives
Yes - are onboard/offboard settings correct?
- This can impact display for the OS, and it also serves as a sanity check that they are using the correct display configuration to access the OS
Yes - Does inserting the drive back in cause the same issue?
- This can impact display for the OS, and it also serves as a sanity check that they are using the correct display configuration to access the OS
Yes - Does system boot up using another (separate) boot drive installed?
(END) Yes - Corrupted boot drive/OS; possible Component RMA, or escalation (see notes)
- Multiple boot drives set up as RAID1 - it is unlikely both drives were corrupted at the same time unless the OS/kernel was altered
- escalate to management to have them review scenario or to quote options for drive/OS/SW
- Single boot drive - issue Component RMA
- we need to pre-load the OS/SW at a loss
(END) No - (unlikely) Motherboard/SATA port issue with board; issue System RMA
(END) No - (unlikely) Motherboard/SATA port issue with board; issue System RMA
No - (if system was set to use offboard originally) can you get to the OS if you change primary display setting to 'onboard'?
(END) Yes - boots to OS, see next tree related to OS issues
No - Does system boot up using another (separate) boot drive installed?
(END) Yes - Corrupted boot drive/OS; possible Component RMA, or escalation (see notes)
- Multiple boot drives set up as RAID1 - it is unlikely both drives were corrupted at the same time unless the OS/kernel was altered
- escalate to management to have them review scenario or to quote options for drive/OS/SW
- Single boot drive - issue Component RMA
- we need to pre-load the OS/SW at a loss
(END) No - (unlikely) Motherboard/SATA port issue with board; issue System RMA
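The recurring "separate boot drive" branch in this tree always ends in one of three outcomes. As a rough Python sketch of that logic (the function name `boot_drive_outcome` and its two boolean parameters are assumptions made for illustration):

```python
def boot_drive_outcome(boots_with_other_drive, raid1_boot_pair):
    """Return the action for the 'separate boot drive' test.

    boots_with_other_drive: system boots from a different, separate drive.
    raid1_boot_pair: original boot volume was multiple drives in RAID1.
    """
    if not boots_with_other_drive:
        # A different drive does not help, so the drive is not the cause.
        return "System RMA (unlikely motherboard/SATA port issue)"
    if raid1_boot_pair:
        # Both mirrored drives failing together is unlikely unless the
        # OS/kernel was altered, so this goes to management for review.
        return "Escalate to management to review or quote drive/OS/SW options"
    # Single boot drive that is corrupted.
    return "Component RMA for the boot drive (OS/SW pre-loaded at a loss)"


if __name__ == "__main__":
    print(boot_drive_outcome(True, False))
```

The RAID1 case is checked before the single-drive case because it changes who makes the decision, not just which part gets replaced.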
Hardware failure details
Immediately observed symptoms
Servers
Component Type | Method | Fail condition | Other possible factors |
---|---|---|---|
Processor | | | |
Memory | | | |
Graphics Card | | | |
Power Supply | | | |
Power Distribution Board | | | |
Motherboard | | | |
Chassis | | | |
Drives | | | |
DevBox
Component Type | Method | Fail condition | Other possible factors |
---|---|---|---|
Processor | | | |
Memory | | | |
Graphics Card | | | |
Power Supply | | | |
Motherboard | | | |
Chassis | | | |