[INTERNAL USE]

HOW TO INSTALL TOOL

Tool File Name:
Hopper (H100 GPU): 629-24287-XXXX-FLD-38780.tgz
Ampere (A100 GPU): 629-23587-XX86-FLD-38782.tgz

Download Location (INTERNAL QA SERVER):
Hopper (H100 GPU): scp root@172.25.10.35:/root/HGX_Tool/629-24287-XXXX-FLD-38780.tgz .
Ampere (A100 GPU): scp root@172.25.10.35:/root/HGX_Tool/629-23587-XX86-FLD-38782.tgz .
Unload Nvidia Driver: scp root@172.25.10.35:/root/HGX_Tool/unload_nvidia_driver.sh .

The tool is expected to be placed in the /var/diags folder. Create this folder if it does not exist.
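
The placement step can be sketched as a small shell helper. This is only an illustration: `install_tool` is a hypothetical name, the tarball is assumed to have already been copied into the current directory, and whether the archive carries its own top-level folder is not confirmed here.

```shell
# Sketch: place a field diag tarball under /var/diags and extract it.
# install_tool is a hypothetical helper name, not part of the tool itself.
install_tool() {
    local tarball="$1"
    local dest="${2:-/var/diags}"   # default location named in this page

    # Create the destination folder if it does not exist.
    mkdir -p "$dest" || return 1

    # Extract the gzip-compressed archive into the destination folder.
    tar -xzf "$tarball" -C "$dest"
}

# Example (Hopper package, run as root):
#   install_tool 629-24287-XXXX-FLD-38780.tgz
```

Passing a second argument overrides the destination, which is useful for a dry run in a scratch directory before touching /var/diags.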

...

root@rdlab:/var/diags/629-24287-XXXX-FLD-38780# ll
total 473124
drwxr-xr-x 2 root root      4096 Oct  4 10:55 ./
drwxr-xr-x 3 exx  exx       4096 Oct  4 10:55 ../
-rwxr-xr-x 1 exx  exx      17895 Sep 11 09:47 fdmain.sh*
-rwxr-xr-x 1 exx  exx      32888 Sep 11 09:47 fieldiag.sh*
-r-xr-xr-x 1 exx  exx   11194648 Sep 11 09:47 nvflash*
-rwxr-xr-x 1 exx  exx  473142559 Sep 11 09:47 onediagfield.r6.252.tgz*
-r-xr-xr-x 1 exx  exx       2906 Sep 11 09:47 README.txt*
-r-xr-xr-x 1 exx  exx       1702 Sep 11 09:47 relnotes.txt*
-r-xr-xr-x 1 exx  exx      18541 Sep 11 09:47 sku_hopper-hgx-8-gpu.json*
-r-xr-xr-x 1 exx  exx      18477 Sep 11 09:47 sku_hopper-hgx-8-gpu_tpol.json*
-rw-rw-r-- 1 exx  exx       3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level1_field.json
-rw-rw-r-- 1 exx  exx       3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level2_field.json
-rw-rw-r-- 1 exx  exx       2312 Sep 11 09:47 spec_hopper-hgx-8-gpu_sit_field.json
-r-xr-xr-x 1 exx  exx       6832 Sep 11 09:47 testargs_hopper-hgx-8-gpu.json*

Extracted folder (Ampere) content:

root@rdlab:/var/diags/629-23587-XX86-FLD-38782# ll
total 243940
drwxr-xr-x 4 root root      4096 Mar  4 22:55 ./
drwxr-xr-x 4 root root      4096 Mar  5 17:03 ../
drwxr-xr-x 8 root root      4096 Mar  4 22:55 dgx/
-rw-r--r-- 1 root root         0 Mar  4 22:55 dgx_log_creation_lock
-rw-r--r-- 1 root root         0 Mar  4 22:55 dgx_unpack_package_lock
-rw-r--r-- 1 root root     26360 Mar  5 00:43 fieldiag.log
-rwxr-xr-x 1 exx  exx      31232 Sep 19 20:53 fieldiag.sh*
-rwxr-xr-x 1 exx  exx  238456629 Sep 19 20:53 hgxfieldiag.r3.102*
drwxr-xr-x 2 root root      4096 Mar  5 00:43 logs/
-r-xr-xr-x 1 exx  exx   11104504 Sep 19 20:53 nvflash*
-rwxr-xr-x 1 exx  exx       2773 Sep 19 20:53 README.txt*
-rwxr-xr-x 1 exx  exx       4497 Sep 19 20:53 relnotes.txt*
-rwxr-xr-x 1 exx  exx       1823 Sep 19 20:53 sku_hgx-a100-8-gpu_40g_aircooled.json*
-rwxr-xr-x 1 exx  exx       1482 Sep 19 20:53 sku_hgx-a100-8-gpu_40g_hybrid.json*
-rwxr-xr-x 1 exx  exx       3787 Sep 19 20:53 sku_hgx-a100-8-gpu_80g_aircooled.json*
-rwxr-xr-x 1 exx  exx       3611 Sep 19 20:53 sku_hgx-a100-8-gpu_80g_hybrid.json*
-rwxr-xr-x 1 exx  exx      25076 Sep 19 20:53 testargs_hgx-a100-8-gpu_2tray.json*
-rwxr-xr-x 1 exx  exx      24734 Sep 19 20:53 testargs_hgx-a100-8-gpu_d00_2tray.json*
-rwxr-xr-x 1 exx  exx      14310 Sep 19 20:53 testargs_hgx-a100-8-gpu_d00.json*
-rwxr-xr-x 1 exx  exx      14652 Sep 19 20:53 testargs_hgx-a100-8-gpu.json*

PROBLEM SITUATION

Supermicro provided this file to diagnose HGX H100 GPU issues (related to ZD-6179 / SMC CRM Case: SM2310022368).

...

Attached file: RMA28206-fabricmanager.log
Attached file: RMA28206-NVSwitch Detection..txt

FIELDIAG TOOL USAGE

Review the README.txt for details on usage and options.

...

Attached file: unload_nvidia_driver.sh

If the script does not correct the issue, then uninstall the existing Nvidia driver on the system. The existing Nvidia driver could be conflicting with this tool.
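
Before uninstalling the driver, it may help to confirm whether any NVIDIA kernel modules are still loaded. The sketch below filters `lsmod`-style text for module names beginning with `nvidia`. `nvidia_modules` is a hypothetical helper name, and this check is an assumption about what is useful here, not a description of what unload_nvidia_driver.sh does.

```shell
# Sketch: list NVIDIA kernel modules from lsmod-style text on stdin.
# nvidia_modules is a hypothetical helper name.
nvidia_modules() {
    # Skip the lsmod header line; print module names that start with "nvidia".
    awk 'NR > 1 && $1 ~ /^nvidia/ { print $1 }'
}

# Example:
#   lsmod | nvidia_modules
# No output suggests that no NVIDIA modules are currently loaded.
```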

GPU Field Diagnostic test failure example. Only 2 of 8 GPUs were properly identified for testing.

...
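
When the tool identifies fewer GPUs than expected, one possible cross-check is to count how many GPUs the PCI bus itself enumerates. The sketch below counts "3D controller" entries in `lspci`-style text. It assumes that the data-center GPUs enumerate under that class and uses `10de`, the NVIDIA PCI vendor ID, as the filter. `count_nvidia_gpus` is a hypothetical helper name.

```shell
# Sketch: count GPU entries in lspci-style text on stdin.
# count_nvidia_gpus is a hypothetical helper name.
# Assumption: the GPUs appear with the "3D controller" class string.
count_nvidia_gpus() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status clean when the count is zero.
    grep -c '3D controller' || true
}

# Example:
#   lspci -d 10de: | count_nvidia_gpus
# On an 8-GPU HGX baseboard the expected count is 8.
```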

Example of running the Field Test, showing the log output location.

...

Failure example from ZD-12288: fieldiag.log

Attached file: S434992x3806626-fieldiag-FAILED-8xA100.log