Hardware troubleshooting decision tree

Purpose

  • Note commonly used system diagnostic terms that help suggest the next troubleshooting steps
  • Separate troubleshooting procedure decision trees based on system symptoms
  • Point of execution for Component or System RMAs

Basic power/display issue for servers (1U/2U/4U or Enclosure-Servers)

System does not power on when pressing the power button

POWER SUPPLY - Are any LED lights displaying on the power supplies while the system is powered off?

Yes

What color are they?

Green

Is it blinking or solid?

Blinking - (system powered off)

  • Power supply is working and is on standby

Blinking - (system powered on; TYAN systems)

  • PSU is working and is on standby for redundancy

Does system power on after checking the power supplies?

(END) Yes

  • Move on to next troubleshooting tree if necessary

No

Does system power on after re-seating all memory DIMMs?

(END) Yes

  • Check topology in BIOS to make sure all installed memory is identified

No

CPU/Memory/Motherboard - Does it power on when system is brought down to 1x CPU and 1x memory DIMM (on primary/first CPU/memory slot)?

  • If they cannot perform this troubleshooting, they will need to ship this system back to Exxact for further troubleshooting; issue System RMA
  • Marked in red because this is the point where hardware diagnostics become more involved/invasive, and customers may damage internal components if these are not handled properly
  • See the POWER BUTTON option to quickly check whether the power button/ribbon cable is the root cause

Yes - swap CPU to see if issue persists; does system power on after swapping CPU?

(END) - Yes - Defective motherboard/slot; re-install memory and check topology in BIOS for CPU1 to make sure all installed memory is identified

  • Could be bad CPU pin/slot on the motherboard on secondary CPU slot
  • Ask if they are okay with performing RMA on the chassis+motherboard (if they got this far, they can most likely swap the barebone themselves)

No - swap memory DIMMs; does system power on after swapping through all memory DIMMs that were uninstalled when CPU2 was removed?

  • Still could be bad memory, try another memory DIMM to see if issue persists

(END) Yes - Defective memory DIMM; issue Component RMA for memory

  • Try to have them re-create the issue by re-installing the suspected DIMM to see if system fails to power on/POST
  • Repopulate CPU2 and memory in pairs to ensure the rest of the memory DIMMs allow the system to POST
  • Check topology in BIOS to make sure all installed memory is identified

(END) No - Defective CPU; issue Component RMA for CPU

  • Most likely confirmed to be bad CPU since:
    • CPU1 slot works
    • Installing either of the CPUs into secondary CPU slot does not allow system to power on
  • Have them swap the memory DIMMs that were previously installed for CPU2's row into CPU1's to see if all memory is working properly
  • System should still be able to power on with 1x CPU and DIMMs, but they may lose half the PCI-e slots on certain systems (typically older ones using 2011-v3/v4 CPUs)

(END) No - Defective motherboard; issue System RMA for confirmation of issue and repairs

  • Could be bad primary CPU1 slot or bad motherboard entirely
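
The CPU/Memory/Motherboard branch above can be written down as a small lookup, which may help when logging a case. This is only a sketch of the tree as written: the function name, argument names, and return strings are illustrative, and None stands for a step that has not been tried yet.

```python
from typing import Optional


def isolate_fault(minimal_config_ok: bool,
                  cpu_swap_ok: Optional[bool] = None,
                  dimm_swap_ok: Optional[bool] = None) -> str:
    """Return the tree's outcome (or the next step) for the minimal-config test.

    minimal_config_ok: system powers on with 1x CPU + 1x DIMM in the primary slots.
    cpu_swap_ok:       system powers on after swapping the CPU (None = not tried).
    dimm_swap_ok:      system powers on after swapping through the DIMMs (None = not tried).
    """
    if not minimal_config_ok:
        return "Defective motherboard: issue System RMA"
    if cpu_swap_ok is None:
        return "Next: swap CPU and retest"
    if cpu_swap_ok:
        return "Defective motherboard/slot: RMA chassis+motherboard"
    if dimm_swap_ok is None:
        return "Next: swap through the DIMMs removed with CPU2 and retest"
    if dimm_swap_ok:
        return "Defective memory DIMM: issue Component RMA for memory"
    return "Defective CPU: issue Component RMA for CPU"
```

For example, isolate_fault(True, cpu_swap_ok=False, dimm_swap_ok=True) lands on the memory RMA outcome.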
POWER BUTTON - Does pushing the power button fail to power on the system when all of the board and PSU LED lights are on?

  • Marked in red because this is the point where hardware diagnostics become more involved/invasive, and customers may damage internal components if these are not handled properly

Does system power on after removing the ribbon cable to manually jump the power pins?

  • If they are unable to do this, then issue System RMA

(END) Yes - Defective Power Button

  • Have them re-seat the ribbon cable and try again; we can try to RMA the power button assembly if:
    • They agree to perform the labor
    • The barebone makes it accessible enough that we can provide a short guide (usually we don't replace this; we would just have the MFR send us the assembly)
  • Ask if they are okay with swapping the barebone, or make a judgement call on whether we should issue System RMA (do you trust them to perform the labor?)
  • Make sure CPU/memory are all identified properly in BIOS

No

  • Go back and have them try the following tree:
    CPU/Memory/Motherboard - Does it power on when system is brought down to 1x CPU and 1x Memory DIMM (on primary/first CPU/memory slot)?

(END) - Solid

  • It shouldn't be green and solid while system is powered off; power drain the system and see if issue persists

(END) - Amber - (system powered off, and all PSUs are amber)

  • Older systems use amber for standby while system is powered off; try power button and see if they turn to solid green LED

(END) - One is Green, the other(s) is off / Amber / yellow / different (Defective PSU)

  • Most likely one of the PSUs is bad; try re-seating it and swapping locations with another PSU module. If the issue follows the PSU, then the PSU needs Component RMA. If it follows the slot/insert, then the barebone needs Component RMA (or System RMA if we need to swap components for customer and re-validate hardware)

(END) - No - (all PSUs off/no lights)

  • Try different power cables, check outlets, re-seat the PSU module; if there are still no lights/activity, the PDB (or barebone) needs Component RMA (or System RMA if we need to swap components for customer and re-validate hardware)
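
As a quick reference, the PSU LED outcomes above can be collapsed into one function. This is a minimal sketch only: the state names ("off", "green_blinking", "green_solid", "amber") and the function name are made up for illustration and are not vendor terminology.

```python
from typing import List


def psu_next_step(led_states: List[str], system_powered_on: bool = False) -> str:
    """Map the LED state of each PSU to the next step from the tree above.

    led_states holds one entry per PSU, using the illustrative labels
    "off", "green_blinking", "green_solid" or "amber".
    """
    states = set(led_states)
    if not states:
        raise ValueError("need at least one PSU LED observation")
    if states == {"off"}:
        return "Try other power cables/outlets and re-seat PSUs; if still dark, RMA PDB/barebone"
    if len(states) > 1:
        # PSUs disagree with each other: one of them is most likely bad.
        return "Likely one bad PSU: re-seat and swap slots to see if the fault follows the PSU or the slot"
    state = states.pop()
    if state == "green_blinking":
        return "PSU is on standby; continue with the power-on checks"
    if state == "green_solid" and not system_powered_on:
        return "Unexpected while powered off: power drain the system and re-check"
    if state == "amber" and not system_powered_on:
        return "Older systems use amber for standby: press power button and look for solid green"
    return "No PSU fault indicated; move on to the next tree"
```

Usage: psu_next_step(["green_blinking", "amber"]) points at the swap test for a suspected bad PSU.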

System powers on when pressing the power button, but there is no display

Make sure you go over the following display topics:

  • Provide system info flier
  • State which display port(s) or graphics card is used for primary display
  • Have them double-check that the monitor is powered on and using the correct display source (VGA/DVI/HDMI/HDMI1/HDMI2/etc.)
  • Try other cables/adapters; highly recommend NOT using adapters

Display issues are common for newly received systems or for new users of an older one. If customers fail to troubleshoot properly, make it very clear that they will need to pay for shipping if we deem the system NPF (No Problem Found).

Is correct display port/GPU being used?

Yes

Does display work after the system was power drained completely and powered on with ONLY the correct video display cable/port being used, as noted in the system info flier?

(END) Yes - display works, move on to next tree

No - is there display after checking monitor settings and other cables?

  • Have them double-check that the monitor is powered on and using the correct display source (VGA/DVI/HDMI/HDMI1/HDMI2/etc.)
  • Try other cables/adapters that are known to work with other systems at their site; highly recommend NOT using adapters

(END) Yes - display works, move on to next tree

No - (if offboard display channel) is there display on the onboard channel (typically VGA or motherboard display ports)?

Yes - does motherboard splash screen display? (offboard channel)

Yes - does the system have any codes unrelated to display being pushed to offboard channel? (see notes below)

  • Check for any POST code displayed in addition to the board manufacturer logo
    • Commonly observed codes
      • Supermicro - 91 - display pushed to offboard channel
      • Tyan - E3 - display pushed to offboard channel
      • ASUS (on workstation motherboards or back of 2U chassis) - OS loaded properly
      • B7/B9 (or other B codes) typically point to a motherboard component (mostly memory) preventing the system from completing POST
      • Refer to MFR manual for other uncommon codes
      • The most important POST code is the one the system gets stuck at

(END) Yes - display works, move on to next tree that is more related to systems that do not complete POST or boot to OS (offboard channel)

(END) No - they are using ports meant for onboard display; make sure any display cables installed to the motherboard (onboard channel) are unplugged and reboot system

No - does monitor appear to be receiving activity when system boots up?

Yes - Is there display after either of the following? (see notes)

  • (mostly single GPU systems) re-seating the GPU(s)
  • (multi-GPU servers) swapping the GPUs to see whether the one being used for display is not working properly

(END) Yes - display works, move on to next tree; issue Component RMA for GPU if necessary

  • address the GPU if their system boots to OS; run commands to ensure it is being identified by drivers/system; issue Component RMA for the GPU if necessary
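
The note above does not say which commands to run. One possible system-level check, assuming a Linux OS with the pciutils package (lspci) installed, is to list the PCI display-class devices and compare the count against the number of installed GPUs. The helper names below are hypothetical; a driver-level check would depend on the GPU vendor's own tool.

```python
import subprocess
from typing import List

# PCI class names that lspci prints for graphics devices.
DISPLAY_CLASSES = ("VGA compatible controller", "3D controller", "Display controller")


def display_devices(lspci_output: str) -> List[str]:
    """Return the lspci lines that describe display-class PCI devices."""
    return [line.strip() for line in lspci_output.splitlines()
            if any(cls in line for cls in DISPLAY_CLASSES)]


def detected_display_devices() -> List[str]:
    """Run lspci on the local system (assumes Linux with pciutils installed)."""
    result = subprocess.run(["lspci"], capture_output=True, text=True, check=True)
    return display_devices(result.stdout)
```

On server boards the onboard (BMC) video controller will normally appear in this list as well, so expect one entry more than the number of add-in GPUs.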
No - is there a known working GPU, for display, to test with the system?

Yes - does the display work after removing all GPUs and using only the known-working GPU for display?

(END) Yes - (multi-GPU servers) install other GPUs and run commands to ensure they are being identified by drivers/system; the one GPU not being identified is possibly defective

(END) Yes - (single GPU systems) issue Component RMA for defective GPU

(END) No - issue System RMA

(END) No - issue System RMA

  • If we're at this point, please be sure they have tried everything above...
    • Cables
    • Monitors
    • Re-seating display ports/connectors
  • Send fair warning they will need to pay shipping if we deem system NPF
(END) No - (if onboard display channel) issue System RMA

  • If we're at this point, please be sure they have tried everything above...
    • Cables
    • Monitors
    • Re-seating display ports/connectors
  • Send fair warning they will need to pay shipping if we deem system NPF
No - make sure correct port is used; see other options in case the CMOS/BIOS was reset/updated and the system was initially set to offboard channel in BIOS

System powers on when pressing the power button, and displays, but does not boot to OS

Is system display stuck at splash screen?

Yes - does the system have any codes unrelated to display being pushed to offboard channel? (see notes below)

  • Check for any POST code displayed in addition to the board manufacturer logo
    • Commonly observed codes
      • Supermicro - 91 - display pushed to offboard channel
      • Tyan - E3/AB - display pushed to offboard channel
      • ASUS (on workstation motherboards or back of 2U chassis) - OS loaded properly
      • B7/B9 (or other B codes) typically point to a motherboard component (mostly memory) preventing the system from completing POST
      • Refer to MFR manual for other uncommon codes
      • The most important POST code is the one the system gets stuck at
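
The commonly observed codes listed above can be kept as a small lookup for triage. This is a sketch that contains only the codes named in these notes; the function and table names are illustrative, and the ASUS entry is left out because the notes do not state its code.

```python
from typing import Dict, Tuple

# (vendor, code) -> meaning, taken from the notes above.
KNOWN_POST_CODES: Dict[Tuple[str, str], str] = {
    ("supermicro", "91"): "display pushed to offboard channel",
    ("tyan", "E3"): "display pushed to offboard channel",
    ("tyan", "AB"): "display pushed to offboard channel",
}


def describe_post_code(vendor: str, code: str) -> str:
    """Return what a POST code most likely means according to the notes."""
    key = (vendor.strip().lower(), code.strip().upper())
    if key in KNOWN_POST_CODES:
        return KNOWN_POST_CODES[key]
    if key[1].startswith("B"):
        return "B code: typically a motherboard component (mostly memory) blocking POST"
    return "uncommon code: refer to the MFR manual"
```

Usage: describe_post_code("Supermicro", "91") reports that the display was pushed to the offboard channel, which is not a fault by itself.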
Yes - please continue to see if system fails to POST due to CPU/Memory/Motherboard

Does system POST after re-seating all memory DIMMs?

(END) Yes

  • Check topology in BIOS to make sure all installed memory is identified

No

CPU/Memory/Motherboard - Does it POST when system is brought down to 1x CPU and 1x memory DIMM (on primary/first CPU/memory slot)?

  • If they cannot perform this troubleshooting, they will need to ship this system back to Exxact for further troubleshooting; issue System RMA
  • Marked in red because this is the point where hardware diagnostics become more involved/invasive, and customers may damage internal components if these are not handled properly
  • See the POWER BUTTON option to quickly check whether the power button/ribbon cable is the root cause

Yes - swap CPU to see if issue persists; does system power on after swapping CPU?

(END) - Yes - Defective motherboard/slot; re-install memory and check topology in BIOS for CPU1 to make sure all installed memory is identified

  • Could be bad CPU pin/slot on the motherboard on secondary CPU slot
  • Ask if they are okay with performing RMA on the chassis+motherboard (if they got this far, they can most likely swap the barebone themselves)

No - swap memory DIMMs; does system power on after swapping through all memory DIMMs that were uninstalled when CPU2 was removed?

  • Still could be bad memory, try another memory DIMM to see if issue persists

(END) Yes - Defective memory DIMM; issue Component RMA for memory

  • Try to have them re-create the issue by re-installing the suspected DIMM to see if system fails to power on/POST
  • Repopulate CPU2 and memory in pairs to ensure the rest of the memory DIMMs allow the system to POST
  • Check topology in BIOS to make sure all installed memory is identified

(END) No - Defective CPU; issue Component RMA for CPU

  • Most likely confirmed to be bad CPU since:
    • CPU1 slot works
    • Installing either of the CPUs into secondary CPU slot does not allow system to power on
  • Have them swap the memory DIMMs that were previously installed for CPU2's row into CPU1's to see if all memory is working properly
  • System should still be able to power on with 1x CPU and DIMMs, but they may lose half the PCI-e slots on certain systems (typically older ones using 2011-v3/v4 CPUs)

(END) No - Defective motherboard; issue System RMA for confirmation of issue and repairs

  • Could be bad primary CPU1 slot or bad motherboard entirely


(END) No - they are using ports meant for onboard display; make sure any display cables installed to the motherboard (onboard channel) are unplugged and reboot system

No - can BIOS be accessed using the 'del' key while system powers on?

Yes - does system boot to OS after verifying that boot priority is set correctly to first scan the disks containing the OS?

  • Make sure all installed drives are identified in the 'boot' tab in BIOS

(END) Yes - issue resolved

No - does OS boot after physically re-seating the drives?

(END) Yes - issue resolved

No - (if system was set to use offboard originally) can you get to the OS if you change primary display setting to 'onboard'?

(END) Yes - boots to OS, see next tree related to OS issues

No - Does system boot up using another (separate) boot drive installed?

(END) Yes - Corrupted boot drive/OS; possible Component RMA, or escalation (see notes)

  • Multiple boot drives set up as RAID1 - it is unlikely that both drives were corrupted at the same time unless the OS/kernel was altered
    • Escalate to management to have them review the scenario or to quote options for drive/OS/SW
  • Single boot drive - issue Component RMA
    • We need to pre-load the OS/SW at a loss

(END) No - (unlikely) Motherboard/SATA port issue with board; issue System RMA
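
The boot-drive outcome above recurs in several branches of this tree, so it may be handy to state it once as a function. This is a sketch of the notes as written; the function and argument names are illustrative.

```python
def boot_drive_action(boots_from_other_drive: bool, raid1_boot_pair: bool) -> str:
    """Return the action for the 'boot from a separate drive' test.

    boots_from_other_drive: system boots using another (separate) boot drive.
    raid1_boot_pair:        the original boot drives are set up as RAID1.
    """
    if not boots_from_other_drive:
        return "Motherboard/SATA port issue (unlikely): issue System RMA"
    if raid1_boot_pair:
        # Both mirrors failing together suggests the OS/kernel was altered.
        return "Escalate to management: review scenario or quote drive/OS/SW options"
    return "Corrupted boot drive/OS: issue Component RMA (OS/SW pre-load at a loss)"
```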



No - are there any activity lights on the keyboard while system powers on?

  • Press the 'caps lock' or 'scroll lock' key to see if the keyboard LEDs (if applicable) react

Yes - can system reboot by using 'ctrl+alt+del'?

Yes - (if stuck at a blank screen after splash screens load) can the OS/kernel selection screen be accessed by using the up/down arrow keys after the splash screen passes during boot?

  • Proceed to OS issues tree if able to get to OS/kernel selection screen
(END) Yes - Proceed to OS issues tree if able to get to OS/kernel selection screen

No - Does system boot up using another (separate) boot drive installed?

(END) Yes - Corrupted boot drive/OS; possible Component RMA, or escalation (see notes)

  • Multiple boot drives set up as RAID1 - it is unlikely that both drives were corrupted at the same time unless the OS/kernel was altered
    • Escalate to management to have them review the scenario or to quote options for drive/OS/SW
  • Single boot drive - issue Component RMA
    • We need to pre-load the OS/SW at a loss

(END) No - (unlikely) Motherboard/SATA port issue with board; issue System RMA


(END) No - Proceed to OS issues tree if able to get to OS/kernel selection screen

No - can you get to BIOS using another keyboard and/or USB port after restarting the system?

Yes - go back to "No - can BIOS be accessed using the 'del' key while system powers on?"

No - (see notes) reboot system and try to get to BIOS or OS/kernel selection screen; can you access either?

  • If the OS boots to a certain point and the display driver or core packages are corrupted, you may lose all keyboard/mouse activity; rebooting and trying to boot from a different point/kernel may help proceed with troubleshooting
  • (If cannot get to BIOS) Proceed to OS issues tree if able to get to OS/kernel selection screen

Yes - if BIOS, go back up to "No - can BIOS be accessed using the 'del' key while system powers on?"

No - can you get to BIOS by removing all drives and then trying the 'del' key again?

  • Make sure system is powered off before removing/installing drives

Yes - are onboard/offboard settings correct?

  • This can impact display for the OS, and also serves as a sanity check that they are using the correct display configuration to access the OS

Yes - Does inserting the drive back in cause the same issue?

  • This can impact display for the OS, and also serves as a sanity check that they are using the correct display configuration to access the OS

Yes - Does system boot up using another (separate) boot drive installed?

(END) Yes - Corrupted boot drive/OS; possible Component RMA, or escalation (see notes)

  • Multiple boot drives set up as RAID1 - it is unlikely that both drives were corrupted at the same time unless the OS/kernel was altered
    • Escalate to management to have them review the scenario or to quote options for drive/OS/SW
  • Single boot drive - issue Component RMA
    • We need to pre-load the OS/SW at a loss

(END) No - (unlikely) Motherboard/SATA port issue with board; issue System RMA

(END) No - (unlikely) Motherboard/SATA port issue with board; issue System RMA

No - (if system was set to use offboard originally) can you get to the OS if you change primary display setting to 'onboard'?

(END) Yes - boots to OS, see next tree related to OS issues

No - Does system boot up using another (separate) boot drive installed?

(END) Yes - Corrupted boot drive/OS; possible Component RMA, or escalation (see notes)

  • Multiple boot drives set up as RAID1 - it is unlikely that both drives were corrupted at the same time unless the OS/kernel was altered
    • Escalate to management to have them review the scenario or to quote options for drive/OS/SW
  • Single boot drive - issue Component RMA
    • We need to pre-load the OS/SW at a loss

(END) No - (unlikely) Motherboard/SATA port issue with board; issue System RMA


Need to add

  • Constant reboot (for server chassis)
    1. Reseat power components (redundant PSUs, cables)
    2. Reduce to minimal POST config
    3. If none of the above resolves the issue, the cause is most likely the PDB or MB (a toss-up between the two)

Basic power/display issue for workstations (Dev Box or systems with single-display option)

Check DevBox sections under:

/wiki/spaces/WAR/pages/789384865
/wiki/spaces/WAR/pages/789417550

System hangs during POST

Will elaborate later, but this is to generally troubleshoot BMC issues with server motherboards.

Issue: System hangs at BIOS:

IF:

- PSUs are properly plugged in and known-working (correct LED activity)

- All pre-installed [from Vendor] MB components required for POST have been reseated


THEN:

- Configure system for minimal POST configuration [dual-CPU systems reduced to a single CPU installed in the first CPU MB socket; single memory DIMM installed in the first DIMM slot for that first CPU MB socket]

- If dual-CPU system, swap CPU and DIMM to confirm whether root cause is the CPU or the DIMM [need to cycle through both CPUs; it is less likely that 2x DIMMs fail simultaneously]


IF MINIMAL POST CONFIGURATION STILL DOES NOT ALLOW POST, THEN:

- Reset BMC via IPMI [if accessible]

- Reset BMC via CMOS battery [power drain system, remove CMOS battery, press+hold power button for 30 seconds]
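
The IPMI reset step does not name a tool. One common option, assuming the ipmitool CLI is installed, is its "mc reset cold" command, run either locally on the host or over the network with the lanplus interface. The sketch below only builds and runs that command; the function names, host, and credentials are placeholders.

```python
import subprocess
from typing import List, Optional


def bmc_cold_reset_cmd(host: Optional[str] = None,
                       user: Optional[str] = None,
                       password: Optional[str] = None) -> List[str]:
    """Build the ipmitool command line for a BMC cold reset.

    With no host the command targets the local BMC (in-band); with a host it
    goes over the network using the lanplus interface.
    """
    cmd = ["ipmitool"]
    if host is not None:
        cmd += ["-I", "lanplus", "-H", host]
        if user is not None:
            cmd += ["-U", user]
        if password is not None:
            cmd += ["-P", password]
    return cmd + ["mc", "reset", "cold"]


def reset_bmc(host: Optional[str] = None,
              user: Optional[str] = None,
              password: Optional[str] = None) -> None:
    """Run the reset; raises CalledProcessError if ipmitool reports a failure."""
    subprocess.run(bmc_cold_reset_cmd(host, user, password), check=True)
```

The BMC typically takes a short while to come back after a cold reset, so wait before retrying POST or reconnecting to the BMC.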