[INTERNAL USE]

...


HOW TO INSTALL TOOL

Tool File Name:
Hopper (H100 GPU): 629-24287-XXXX-FLD-38780.tgz
Ampere (A100 GPU): 629-23587-XX86-FLD-38782.tgz

Download Location (INTERNAL QA SERVER):
Hopper (H100 GPU): scp root@172.25.10.35:/root/629-24287-XXXX-FLD-38780.tgz .
Ampere (A100 GPU): scp root@172.25.10.35:/root/629-23587-XX86-FLD-38782.tgz .

The tool is expected to be placed in the /var/diags folder. Create this folder if it does not exist.
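
The placement step can be sketched as a small shell helper. This is a minimal sketch, not part of the tool: install_tool is a hypothetical name, and it assumes the archive unpacks into a folder named after the archive, as the Hopper listing suggests. It creates the destination folder (default /var/diags) if it is missing and extracts the archive there.

```shell
# Sketch: create the diags folder if missing and unpack the tool archive into it.
# install_tool is a hypothetical helper name, not something shipped with the tool.
install_tool() {
    archive="$1"
    dest="${2:-/var/diags}"   # default destination taken from the text above

    # Fail early if the archive was not downloaded.
    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }

    # mkdir -p is a no-op when the folder already exists.
    mkdir -p "$dest" || return 1

    # Extract the gzip-compressed tarball into the destination folder.
    tar -xzf "$archive" -C "$dest"
}

# Example (run as root, after the scp download):
# install_tool 629-24287-XXXX-FLD-38780.tgz
```

Passing a second argument overrides the destination, which is useful for a dry run in a scratch folder.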

Extracted folder (Hopper) content:

Code Block
root@rdlab:/var/diags/629-24287-XXXX-FLD-38780# ll
total 473124
drwxr-xr-x 2 root root      4096 Oct  4 10:55 ./
drwxr-xr-x 3 exx  exx       4096 Oct  4 10:55 ../
-rwxr-xr-x 1 exx  exx      17895 Sep 11 09:47 fdmain.sh*
-rwxr-xr-x 1 exx  exx      32888 Sep 11 09:47 fieldiag.sh*
-r-xr-xr-x 1 exx  exx   11194648 Sep 11 09:47 nvflash*
-rwxr-xr-x 1 exx  exx  473142559 Sep 11 09:47 onediagfield.r6.252.tgz*
-r-xr-xr-x 1 exx  exx       2906 Sep 11 09:47 README.txt*
-r-xr-xr-x 1 exx  exx       1702 Sep 11 09:47 relnotes.txt*
-r-xr-xr-x 1 exx  exx      18541 Sep 11 09:47 sku_hopper-hgx-8-gpu.json*
-r-xr-xr-x 1 exx  exx      18477 Sep 11 09:47 sku_hopper-hgx-8-gpu_tpol.json*
-rw-rw-r-- 1 exx  exx       3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level1_field.json
-rw-rw-r-- 1 exx  exx       3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level2_field.json
-rw-rw-r-- 1 exx  exx       2312 Sep 11 09:47 spec_hopper-hgx-8-gpu_sit_field.json
-r-xr-xr-x 1 exx  exx       6832 Sep 11 09:47 testargs_hopper-hgx-8-gpu.json*

PROBLEM SITUATION

Supermicro provided this tool to diagnose HGX H100 GPU issues. It relates to ZD-6179 / SMC CRM Case: SM2310022368.

...

View file: RMA28206-fabricmanager.log
View file: RMA28206-NVSwitch Detection..txt

TOOL USAGE

Review the README.txt for details on usage and options.

...

INVESTIGATION DETAILS

View file: README.txt

If the following error appears when running fieldiag.sh, uninstall the existing NVIDIA driver from the system. The installed driver conflicts with the tool.

Code Block
root@rdlab:/home/exx/smc_fieldiag/629-24287-XXXX-FLD-38780# ./fieldiag.sh
Unpacking onediag...
Could not determine HGX baseboard SKU
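
Before running the tool, it may help to check whether an NVIDIA kernel module is currently loaded. The following is a minimal sketch under stated assumptions: nvidia_module_loaded is a hypothetical helper, it only inspects /proc/modules, and a module that is not loaded does not prove the driver package is uninstalled. The uninstall procedure itself depends on how the driver was installed and is not shown.

```shell
# Sketch: report whether an NVIDIA kernel module is currently loaded.
# nvidia_module_loaded is a hypothetical helper; it reads /proc/modules by default.
nvidia_module_loaded() {
    modules_file="${1:-/proc/modules}"
    # Module names (nvidia, nvidia_uvm, nvidia_drm, ...) are in the first column.
    # Exit status 0 means at least one matching module was found.
    awk '$1 ~ /^nvidia/ { found = 1 } END { exit !found }' "$modules_file"
}

if nvidia_module_loaded; then
    echo "NVIDIA driver module is loaded; uninstall the driver before running fieldiag.sh"
else
    echo "No NVIDIA driver module loaded"
fi
```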