Table of Contents


Hardware failure details decision tree

I understand this is not formatted as a typical decision tree; I was using basic (expand) macros in the pre-2020 Confluence editor.

System does not power on when pressing the power button

POWER SUPPLY - Are any LED lights displayed on the power supplies while the system is powered off?

Yes

What color are they?

Green

Is it blinking or solid?

Blinking - (system powered off)

  • Power supply is working and is on standby

Blinking - (system powered on; TYAN systems) 

  • Power supply is working and is on standby for redundancy

Does system power on yet after checking the power supplies?

(END) Yes 

  • Move on to next troubleshooting tree if necessary


No

Does system power on after re-seating all Memory DIMM's?

(END) Yes

  • check topology in BIOS to make sure all installed memory is identified


No

CPU/Memory/Motherboard - Does it power on when system is brought down to 1x CPU and 1x Memory DIMM (on primary/first CPU/memory slot)?

  • If they cannot perform this troubleshooting, they will need to ship this system back to Exxact for further troubleshooting; issue System RMA
  • In red, because this is the point where hardware diagnostics become more involved/invasive and customers may damage internal components if they are not handled properly
  • See other option to see if they can quickly check if the power button/ribbon cable is the root cause

Yes - swap CPU to see if issue persists; does system power on after swapping CPU?

(END) - Yes - Defective motherboard/slot; re-install memory and check topology in BIOS for CPU1 to make sure all installed memory is identified

  • Could be bad CPU pin/slot on the motherboard on secondary CPU slot
  • Ask if they are okay with performing RMA on the chassis+motherboard (honestly if they got this far, I'm sure they can swap the barebone)


No - swap memory DIMM's; does system power on after swapping through all memory DIMM's that were uninstalled when CPU2 was removed?

  • Still could be bad memory, try another memory DIMM to see if issue persists

(END) Yes - Defective memory DIMM; issue component RMA for Memory

  • try to have them re-create issue by re-installing suspected DIMM to see if system fails to power on/POST
  • repopulate CPU2 and memory in pairs to ensure the rest of the memory DIMM's are allowing system to POST
  • check topology in BIOS to make sure all installed memory is identified


(END) No - Defective CPU; issue component RMA for CPU

  • Most likely confirmed to be bad CPU since:
    • CPU1 slot works
    • Installing either of the CPU's into secondary CPU slot does not allow system to power on
  • Have them swap the memory DIMM's that were previously installed for CPU2's row into CPU1's to see if all memory is working properly
  • System should still be able to power on with 1x CPU and DIMM's but they may lose half the PCI-e slots on certain systems (typically older ones using 2011-v3/v4 CPU's)




(END) No - Defective motherboard; issue System RMA for confirmation of issue and repairs

  • Could be bad primary CPU1 slot or bad motherboard entirely; issue System RMA for confirmation of issue and repairs
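
The CPU/Memory/Motherboard branch above can be sketched as a small function that takes the answer to each question and returns the suggested outcome. This is only an illustration of the branch order; the function name `isolate_cpu_memory` and its parameter names are hypothetical and not part of any Exxact tooling.

```python
from typing import Optional


def isolate_cpu_memory(
    powers_on_minimal: bool,
    powers_on_after_cpu_swap: Optional[bool] = None,
    powers_on_after_dimm_swap: Optional[bool] = None,
) -> str:
    """Walk the CPU/Memory/Motherboard branch and return the suggested outcome.

    Each argument is the answer to one question in the tree; None means the
    question has not been asked yet.
    """
    # Question 1: does it power on with 1x CPU and 1x DIMM in the primary slots?
    if not powers_on_minimal:
        return "Defective motherboard; issue System RMA"
    # Question 2: does it power on after swapping the CPU?
    if powers_on_after_cpu_swap is None:
        return "Next step: swap CPU and retest"
    if powers_on_after_cpu_swap:
        return "Defective motherboard/slot; re-install memory and check BIOS topology"
    # Question 3: does it power on after swapping through the removed DIMMs?
    if powers_on_after_dimm_swap is None:
        return "Next step: swap through the memory DIMMs and retest"
    if powers_on_after_dimm_swap:
        return "Defective memory DIMM; issue component RMA for Memory"
    return "Defective CPU; issue component RMA for CPU"
```

Calling `isolate_cpu_memory(True, False, True)` returns the memory RMA outcome, and leaving later answers as `None` returns the next step to ask the customer to perform.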


POWER BUTTON - Does pushing the power button fail to power on the system when all of the board and PSU LED lights are on?

  • In red, because this is the point where hardware diagnostics become more involved/invasive and customers may damage internal components if they are not handled properly

Does system power on after removing the ribbon cable to manually jump the power pins?

  • If they are unable to do this, then issue System RMA

(END) Yes - Defective Power Button

  • Have them re-seat the ribbon cable and try again; we can try to RMA the power button assembly if:
    • They agree to perform the labor
    • The barebone makes it easily accessible that we can provide a short guide (usually we don't replace this, we would just have MFR send us the assembly)
  • Ask if they are okay with swapping the barebone, or make a judgement call on whether we should issue System RMA (do you trust them to perform the labor?)
  • Make sure CPU/Memory are all identified properly in BIOS


No

  • Go back and have them try the following tree:

    CPU/Memory/Motherboard - Does it power on when system is brought down to 1x CPU and 1x Memory DIMM (on primary/first CPU/memory slot)?

(END) - Solid 

  • it shouldn't be green and solid while system is powered off; power drain the system and see if issue persists




(END) - Amber - (system powered off, and all PSU's are amber)

  • older systems use Amber for standby while system is powered off; try power button and see if they turn to solid green LED


(END) - One is Green, the other(s) is off / Amber / yellow / different (Defective PSU)

  • most likely one of the PSU's is bad; try re-seating it and swapping locations with another PSU module. If it follows PSU, then PSU needs Component RMA. If it follows the slot/insert, then barebone needs Component RMA (or System RMA if we need to swap components for customer and re-validate hardware)
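
The swap test above comes down to one comparison: after the two PSU modules exchange slots, did the abnormal LED stay in the same slot or move with the module? A minimal sketch, assuming slots are identified by number; the function name `psu_swap_verdict` is hypothetical.

```python
def psu_swap_verdict(bad_slot_before: int, bad_slot_after: int) -> str:
    """Interpret a PSU swap test.

    bad_slot_before: slot showing the abnormal LED before swapping modules.
    bad_slot_after: slot showing the abnormal LED after the two modules
    have exchanged slots.
    """
    if bad_slot_after == bad_slot_before:
        # The fault stayed with the slot, so the module itself is not at fault.
        return "Fault follows slot: barebone needs Component RMA (or System RMA)"
    # The fault moved together with the module.
    return "Fault follows PSU: PSU needs Component RMA"
```

For example, if slot 2 was abnormal before the swap and slot 1 is abnormal afterwards, the fault moved with the module and the PSU is the component to RMA.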




(END) - No - (all PSU's off/no lights)

  • try different power cables, check outlets, re-seat the PSU module; if no lights/activity, PDB (or barebone) needs Component RMA (or System RMA if we need to swap components for customer and re-validate hardware)
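
The PSU LED outcomes above (with the system powered off) can be summarized as a lookup on the set of LED states seen across all PSU modules. This is a rough sketch of that triage only; the `Led` enum and `triage_psu_leds` function are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import List


class Led(Enum):
    OFF = "off"
    AMBER = "amber"
    GREEN_BLINKING = "green_blinking"
    GREEN_SOLID = "green_solid"


def triage_psu_leds(leds: List[Led]) -> str:
    """Suggest the next step from PSU LED states while the system is powered off."""
    if not leds:
        raise ValueError("at least one PSU LED state is required")
    states = set(leds)
    if len(states) > 1:
        # One module differs from the others: most likely a defective PSU.
        return "Re-seat the odd PSU and swap slots to see what the fault follows"
    state = leds[0]
    if state is Led.OFF:
        return "Try different power cables, check outlets, re-seat the PSU module"
    if state is Led.AMBER:
        return "Older systems use amber for standby; press the power button"
    if state is Led.GREEN_BLINKING:
        return "PSU is on standby; continue with the power-on checks"
    return "Solid green while powered off is unexpected; power drain the system"
```

A mixed list such as `[Led.GREEN_BLINKING, Led.AMBER]` leads to the swap test, while a uniform list leads to the matching branch of the tree.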



...

System powers on when pressing the power button, and displays, but does not boot to OS


Is system display stuck at splash screen?

Yes - does the system have any codes unrelated to display being pushed to offboard channel? (see notes below)

  • Check for any POST code displayed in addition to the board manufacturer logo
    • Commonly observed codes
      • Supermicro - 91 - display pushed to offboard channel
      • Tyan - E3 - display pushed to offboard channel
      • ASUS (on workstation motherboards or back of 2U chassis) - OS loaded properly
      • B7/B9 (or B codes) typically point to a motherboard component (mostly memory) preventing the system from completing POST
      • Refer to MFR manual for other uncommon codes
      • Most important POST code is where the system gets stuck at
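
The commonly observed codes above can be kept in a small lookup so a support engineer can check a reported code quickly. A minimal sketch, containing only the codes listed in the notes; the names `KNOWN_POST_CODES` and `describe_post_code` are hypothetical, and anything not listed falls back to the manufacturer manual.

```python
# Only the vendor/code pairs named in the notes above are included.
KNOWN_POST_CODES = {
    ("supermicro", "91"): "display pushed to offboard channel",
    ("tyan", "E3"): "display pushed to offboard channel",
}


def describe_post_code(vendor: str, code: str) -> str:
    """Return the note for a POST code, or point to the manufacturer manual."""
    key = (vendor.strip().lower(), code.strip().upper())
    if key in KNOWN_POST_CODES:
        return KNOWN_POST_CODES[key]
    if key[1].startswith("B"):
        # B7/B9 and other B codes: usually memory keeping the system from POST.
        return "typically a motherboard component (mostly memory) preventing POST"
    return "refer to manufacturer manual"
```

`describe_post_code("Supermicro", "91")` returns the offboard-display note, and `describe_post_code("tyan", "b7")` returns the memory-related note.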

Yes - please continue to see if system fails to POST due to CPU/Memory/Motherboard

Does system POST after re-seating all Memory DIMM's?

(END) Yes

  • check topology in BIOS to make sure all installed memory is identified


No

CPU/Memory/Motherboard - Does it POST when system is brought down to 1x CPU and 1x Memory DIMM (on primary/first CPU/memory slot)?

  • If they cannot perform this troubleshooting, they will need to ship this system back to Exxact for further troubleshooting; issue System RMA
  • In red, because this is the point where hardware diagnostics become more involved/invasive and customers may damage internal components if they are not handled properly
  • See other option to see if they can quickly check if the power button/ribbon cable is the root cause

Yes - swap CPU to see if issue persists; does system power on after swapping CPU?

(END) - Yes - Defective motherboard/slot; re-install memory and check topology in BIOS for CPU1 to make sure all installed memory is identified

  • Could be bad CPU pin/slot on the motherboard on secondary CPU slot
  • Ask if they are okay with performing RMA on the chassis+motherboard (honestly if they got this far, I'm sure they can swap the barebone)


No - swap memory DIMM's; does system power on after swapping through all memory DIMM's that were uninstalled when CPU2 was removed?

  • Still could be bad memory, try another memory DIMM to see if issue persists

(END) Yes - Defective memory DIMM; issue component RMA for Memory

  • try to have them re-create issue by re-installing suspected DIMM to see if system fails to power on/POST
  • repopulate CPU2 and memory in pairs to ensure the rest of the memory DIMM's are allowing system to POST
  • check topology in BIOS to make sure all installed memory is identified


(END) No - Defective CPU; issue component RMA for CPU

  • Most likely confirmed to be bad CPU since:
    • CPU1 slot works
    • Installing either of the CPU's into secondary CPU slot does not allow system to power on
  • Have them swap the memory DIMM's that were previously installed for CPU2's row into CPU1's to see if all memory is working properly
  • System should still be able to power on with 1x CPU and DIMM's but they may lose half the PCI-e slots on certain systems (typically older ones using 2011-v3/v4 CPU's)




(END) No - Defective motherboard; issue System RMA for confirmation of issue and repairs

  • Could be bad primary CPU1 slot or bad motherboard entirely; issue System RMA for confirmation of issue and repairs








(END) No - they are using ports meant for onboard display; make sure any display cables installed to the motherboard (onboard channel) are unplugged and reboot system



No - can BIOS be accessed using the 'del' key while system powers on?

Yes - does system boot to OS after verifying boot priority is set correctly to first scan the disks containing the OS?

  • make sure to check all installed drives are identified in the 'boot' tab in BIOS

(END) Yes - Cool


No - does OS boot after physically re-seating the drives?

(END) Yes - Cool


No - (if system was set to use offboard originally) can you get OS if you change primary display setting to 'onboard'?

(END) Yes - boots to OS, see next tree related to OS issues


No - Does system boot up using another (separate) boot drive installed?

(END) Yes - Corrupted boot drive/OS; possible Component RMA, or escalation (see notes)

  • Multiple boot drives set up as RAID1 - it is unlikely that both drives were corrupted at the same time unless the OS/kernel was altered
    • escalate to management to have them review scenario or to quote options for drive/OS/SW
  • Single boot drive - issue Component RMA
    • we need to pre-load the OS/SW at a loss
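
The notes above split the corrupted boot drive outcome by drive configuration. A minimal sketch of that split, assuming the configuration is known; `boot_drive_action` is a hypothetical name, and the case of multiple drives without RAID1 is not covered by the notes, so the sketch says so rather than guessing.

```python
def boot_drive_action(boot_drive_count: int, raid1: bool = False) -> str:
    """Suggest the follow-up for a corrupted boot drive/OS."""
    if boot_drive_count < 1:
        raise ValueError("system must have at least one boot drive")
    if boot_drive_count == 1:
        # Single boot drive: replace it and pre-load the OS/SW.
        return "Issue Component RMA (OS/SW must be pre-loaded)"
    if raid1:
        # Both mirrors failing together suggests the OS/kernel was altered.
        return "Escalate to management to review scenario or quote options"
    return "Not covered by the notes; review case by case"
```

`boot_drive_action(2, raid1=True)` returns the escalation path, and `boot_drive_action(1)` returns the Component RMA path.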


(END) No - (unlikely) Motherboard/SATA port issue with board; issue System RMA






No - are you receiving any activity lights on the keyboard while system powers on?

  • strike 'caps lock' or 'scroll lock' keys to see if the keyboard LED (if applicable) reacts

Yes - can system reboot by using 'ctrl+alt+del'?

Yes - (if stuck at a blank screen after splash screens load) can the OS/kernel selection screen be accessed by using 'up+down arrow keys' after splash screen passes during boot?

  • Proceed to OS issues tree if able to get to OS/kernel selection screen

(END) Yes - Proceed to OS issues tree if able to get to OS/kernel selection screen


No - Does system boot up using another (separate) boot drive installed?

(END) Yes - Corrupted boot drive/OS; possible Component RMA, or escalation (see notes)

  • Multiple boot drives set up as RAID1 - it is unlikely that both drives were corrupted at the same time unless the OS/kernel was altered
    • escalate to management to have them review scenario or to quote options for drive/OS/SW
  • Single boot drive - issue Component RMA
    • we need to pre-load the OS/SW at a loss


(END) No - (unlikely) Motherboard/SATA port issue with board; issue System RMA



(END) No - Proceed to OS issues tree if able to get to OS/kernel selection screen




No - can you get to BIOS using another keyboard and/or USB port after restarting the system?

Yes - go back to "No - can BIOS be accessed using the 'del' key while system powers on?"


No - (see notes) reboot system and try to get BIOS or OS/kernel selection screen; can you access either?

  • If the OS boots to a certain point, and display driver or core packages are corrupted, you may lose all keyboard/mouse activity; rebooting and trying to boot from a different point/kernel may help proceed with troubleshooting
  • (If cannot get to BIOS) Proceed to OS issues tree if able to get to OS/kernel selection screen

Yes - if BIOS, go back up to "No - can BIOS be accessed using the 'del' key while system powers on?"


No - can you get to BIOS by removing all drives and then trying the 'del' key again?

  • Make sure system is powered off before removing/installing drives

Yes - are onboard/offboard settings correct?

  • This can impact display for the OS, and possibly a sanity check to ensure they are using the correct display configuration to access OS

Yes - Does inserting the drive back in cause the same issue?

  • This can impact display for the OS, and possibly a sanity check to ensure they are using the correct display configuration to access OS

Yes - Does system boot up using another (separate) boot drive installed?

(END) Yes - Corrupted boot drive/OS; possible Component RMA, or escalation (see notes)

  • Multiple boot drives set up as RAID1 - it is unlikely that both drives were corrupted at the same time unless the OS/kernel was altered
    • escalate to management to have them review scenario or to quote options for drive/OS/SW
  • Single boot drive - issue Component RMA
    • we need to pre-load the OS/SW at a loss


(END) No - (unlikely) Motherboard/SATA port issue with board; issue System RMA



(END) No - (unlikely) Motherboard/SATA port issue with board; issue System RMA



No - (if system was set to use offboard originally) can you get OS if you change primary display setting to 'onboard'?

(END) Yes - boots to OS, see next tree related to OS issues


No - Does system boot up using another (separate) boot drive installed?

(END) Yes - Corrupted boot drive/OS; possible Component RMA, or escalation (see notes)

  • Multiple boot drives set up as RAID1 - it is unlikely that both drives were corrupted at the same time unless the OS/kernel was altered
    • escalate to management to have them review scenario or to quote options for drive/OS/SW
  • Single boot drive - issue Component RMA
    • we need to pre-load the OS/SW at a loss


(END) No - (unlikely) Motherboard/SATA port issue with board; issue System RMA














Hardware failure details

Immediately observed symptoms

Servers


Processor
  • Method: BIOS CPU topology
  • Fail condition: Does not appear for one slot (if dual-processor); a different memory DIMM fails to appear on each reboot
  • Other possible factors: Bad seating; bent CPU pins on MB
Memory
  • Method: BIOS memory topology
  • Fail condition: Does not appear, or one reads differently/incorrectly from the others
  • Other possible factors: Bad seating; failing CPU
Graphics Card
  • Method: lspci; nvidia-smi; device manager
  • Fail condition: OS does not identify GPU
  • Other possible factors: Bad seating; damaged GPU pins
Power Supply
  • Method: Check LED status lights for redundant PSU modules; re-seat redundant PSU
  • Fail condition: Depends on barebone, but typically Amber or No-LED is a failed PSU
  • Other possible factors: PDB
Power Distribution Board
  • Method: Re-seat redundant PSU
  • Fail condition: Power supply LED indicator remains Amber or No-LED in a single slot; BMC event log shows multiple PSU's as failing
  • Other possible factors: Power Supply
Motherboard
  • Method: Physical inspection of board
  • Fail condition: No LED when Power Supplies are plugged in; blown capacitors
  • Other possible factors: Bad power button
Chassis
  • Method: Press power button
  • Fail condition: Does not power on when pressing power button
  • Other possible factors: Power Supply; PDB; bent pins on MB on primary CPU slot
Drives
  • Method: Check boot options/priority
  • Fail condition: Does not appear in boot options/priority; OS/RAID Controller identifies bad SMART status or failed drive
  • Other possible factors: Bad seating; bad SATA backplane; improper SATA cabling/seating
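
The Graphics Card checks in the table above (lspci, nvidia-smi) can be compared by counting GPUs in each tool's output on a Linux system. The sketch below is an assumption-laden illustration: the helper names are hypothetical, it only counts NVIDIA devices, and treating "on the PCI bus but missing from the driver" as a sign of a problem is an interpretation, not something the table states. `nvidia-smi -L` is the tool's list-GPUs option.

```python
import shutil
import subprocess
from typing import List

# lspci device classes that indicate a GPU rather than, e.g., its audio function.
GPU_CLASSES = ("VGA compatible controller", "3D controller")


def count_nvidia_gpus_lspci(lspci_output: str) -> int:
    """Count NVIDIA GPU entries in the text printed by `lspci`."""
    return sum(
        1
        for line in lspci_output.splitlines()
        if "NVIDIA" in line and any(cls in line for cls in GPU_CLASSES)
    )


def count_gpus_nvidia_smi(list_output: str) -> int:
    """Count GPU entries in the text printed by `nvidia-smi -L`."""
    return sum(1 for line in list_output.splitlines() if line.startswith("GPU "))


def run(cmd: List[str]) -> str:
    """Return a command's stdout, or an empty string if the tool is missing."""
    if shutil.which(cmd[0]) is None:
        return ""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


def gpus_missing_from_driver() -> int:
    """GPUs visible on the PCI bus but not reported by the NVIDIA driver."""
    on_bus = count_nvidia_gpus_lspci(run(["lspci"]))
    in_driver = count_gpus_nvidia_smi(run(["nvidia-smi", "-L"]))
    return on_bus - in_driver
```

A positive result from `gpus_missing_from_driver()` would suggest checking seating or the driver before assuming a failed card; a GPU missing from `lspci` entirely is closer to the table's "OS does not identify GPU" fail condition.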


DevBox


Processor
  • Method: BIOS CPU topology
  • Fail condition: Does not appear for one slot (if dual-processor); different memory DIMM failing to appear each reboot
  • Other possible factors: Bad seating; bent CPU pins on MB
Memory
  • Method: BIOS memory topology
  • Fail condition: Does not appear, or one reads differently/incorrect from the others
  • Other possible factors: Bad seating; failing CPU
Graphics Card
  • Method: lspci; nvidia-smi; device manager
  • Fail condition: OS does not identify GPU
Power Supply
  • Method: PSU LED (if applicable)
  • Fail condition: No power LED on motherboard; does not allow system to power on when chassis or motherboard power-on options are used
Motherboard
  • Method: Power LED on motherboard
  • Fail condition: No power LED displaying on board when power supply is known-working and plugged in
Chassis
  • Method: Press power button on chassis; pressing/jumping power on motherboard
  • Fail condition: No power when chassis power button is pressed