[INTERNAL USE]

HOW TO INSTALL TOOL

Tool File Name:
Hopper (H100 GPU): 629-24287-XXXX-FLD-38780.tgz
Ampere (A100 GPU): 629-23587-XX86-FLD-38782.tgz

Download Location (INTERNAL QA SERVER):
Hopper (H100 GPU): scp root@172.25.10.35:/root/HGX_Tool/629-24287-XXXX-FLD-38780.tgz .
Ampere (A100 GPU): scp root@172.25.10.35:/root/HGX_Tool/629-23587-XX86-FLD-38782.tgz .
Unload Nvidia Driver: scp root@172.25.10.35:/root/HGX_Tool/unload_nvidia_driver.sh .

The tool is expected to be placed in the /var/diags folder. Create this folder if it does not exist.
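The staging step can be sketched as a small shell helper. This is a minimal sketch, not the vendor's procedure: stage_tool and DIAG_DIR are hypothetical names, and it assumes the archive unpacks into its own top-level folder, as the extracted-folder listings suggest.

```shell
# Sketch: stage a fieldiag tarball under the diags folder.
# DIAG_DIR defaults to /var/diags; stage_tool is a hypothetical helper name.
DIAG_DIR="${DIAG_DIR:-/var/diags}"

stage_tool() {
    tarball="$1"
    # Create the diags folder if it does not exist yet.
    mkdir -p "$DIAG_DIR" || return 1
    # Assumption: the archive contains its own top-level folder.
    tar -xzf "$tarball" -C "$DIAG_DIR"
}

# Example (Hopper package already copied into the current directory):
# stage_tool 629-24287-XXXX-FLD-38780.tgz
```

If the archive turns out not to contain a top-level folder, create the target folder first and pass it to tar with -C instead.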

...

Code Block
root@rdlab:/var/diags/629-24287-XXXX-FLD-38780# ll
total 473124
drwxr-xr-x 2 root root      4096 Oct  4 10:55 ./
drwxr-xr-x 3 exx  exx       4096 Oct  4 10:55 ../
-rwxr-xr-x 1 exx  exx      17895 Sep 11 09:47 fdmain.sh*
-rwxr-xr-x 1 exx  exx      32888 Sep 11 09:47 fieldiag.sh*
-r-xr-xr-x 1 exx  exx   11194648 Sep 11 09:47 nvflash*
-rwxr-xr-x 1 exx  exx  473142559 Sep 11 09:47 onediagfield.r6.252.tgz*
-r-xr-xr-x 1 exx  exx       2906 Sep 11 09:47 README.txt*
-r-xr-xr-x 1 exx  exx       1702 Sep 11 09:47 relnotes.txt*
-r-xr-xr-x 1 exx  exx      18541 Sep 11 09:47 sku_hopper-hgx-8-gpu.json*
-r-xr-xr-x 1 exx  exx      18477 Sep 11 09:47 sku_hopper-hgx-8-gpu_tpol.json*
-rw-rw-r-- 1 exx  exx       3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level1_field.json
-rw-rw-r-- 1 exx  exx       3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level2_field.json
-rw-rw-r-- 1 exx  exx       2312 Sep 11 09:47 spec_hopper-hgx-8-gpu_sit_field.json
-r-xr-xr-x 1 exx  exx       6832 Sep 11 09:47 testargs_hopper-hgx-8-gpu.json*

Extracted folder (Ampere) content:

Code Block
root@rdlab:/var/diags/629-23587-XX86-FLD-38782# ll
total 243940
drwxr-xr-x 4 root root      4096 Mar  4 22:55 ./
drwxr-xr-x 4 root root      4096 Feb 25 17:03 ../
drwxr-xr-x 8 root root      4096 Mar  4 22:55 dgx/
-rw-r--r-- 1 root root         0 Mar  4 22:55 dgx_log_creation_lock
-rw-r--r-- 1 root root         0 Mar  4 22:55 dgx_unpack_package_lock
-rw-r--r-- 1 root root     26360 Mar  5 00:43 fieldiag.log
-rwxr-xr-x 1 exx  exx      31232 Sep 19 20:53 fieldiag.sh*
-rwxr-xr-x 1 exx  exx  238456629 Sep 19 20:53 hgxfieldiag.r3.102*
drwxr-xr-x 2 root root      4096 Mar  5 00:43 logs/
-r-xr-xr-x 1 exx  exx   11104504 Sep 19 20:53 nvflash*
-rwxr-xr-x 1 exx  exx       2773 Sep 19 20:53 README.txt*
-rwxr-xr-x 1 exx  exx       4497 Sep 19 20:53 relnotes.txt*
-rwxr-xr-x 1 exx  exx       1823 Sep 19 20:53 sku_hgx-a100-8-gpu_40g_aircooled.json*
-rwxr-xr-x 1 exx  exx       1482 Sep 19 20:53 sku_hgx-a100-8-gpu_40g_hybrid.json*
-rwxr-xr-x 1 exx  exx       3787 Sep 19 20:53 sku_hgx-a100-8-gpu_80g_aircooled.json*
-rwxr-xr-x 1 exx  exx       3611 Sep 19 20:53 sku_hgx-a100-8-gpu_80g_hybrid.json*
-rwxr-xr-x 1 exx  exx       1876 Sep 19 20:53 testargs_hgx-a100-8-gpu_2tray.json*
-rwxr-xr-x 1 exx  exx       1798 Sep 19 20:53 testargs_hgx-a100-8-gpu_d00_2tray.json*
-rwxr-xr-x 1 exx  exx      14310 Sep 19 20:53 testargs_hgx-a100-8-gpu_d00.json*
-rwxr-xr-x 1 exx  exx       8618 Sep 19 20:53 testargs_hgx-a100-8-gpu.json*

PROBLEM SITUATION

Supermicro provided this file to diagnose HGX H100 GPU issues. Related to ZD-6179 / SMC CRM Case: SM2310022368.

...

View file: RMA28206-fabricmanager.log
View file: RMA28206-NVSwitch Detection..txt

FIELDIAG TOOL USAGE

Review the README.txt for details on usage and options.

...

View file: unload_nvidia_driver.sh
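Since an unload script is provided alongside the tool, the diagnostic presumably expects the NVIDIA kernel driver to be unloaded before it runs. The sketch below is one way to check that state. It is not taken from unload_nvidia_driver.sh; nvidia_modules_loaded is a hypothetical helper that reads lsmod-style text on stdin.

```shell
# Sketch: succeed (exit 0) if any nvidia* kernel module appears in
# lsmod-style input on stdin. The helper name is hypothetical.
nvidia_modules_loaded() {
    # Skip the lsmod header row, then match module names starting with "nvidia".
    awk 'NR > 1 && $1 ~ /^nvidia/ { found = 1 } END { exit !found }'
}

# Example:
# if lsmod | nvidia_modules_loaded; then
#     echo "NVIDIA driver still loaded; run unload_nvidia_driver.sh first"
# fi
```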

...

GPU Field Diagnostic test failure example. Only 2 of 8 GPUs were properly identified for testing.
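A quick way to cross-check how many GPUs the host itself enumerates on the PCI bus is sketched below. It assumes data-center GPUs appear in lspci output as "3D controller" devices from NVIDIA (PCI vendor ID 10de); count_gpus is a hypothetical helper name, and this check is independent of how the fieldiag tool detects GPUs.

```shell
# Sketch: count NVIDIA "3D controller" entries in lspci-style text on stdin.
# Assumption: A100/H100 GPUs enumerate as "3D controller" devices.
count_gpus() {
    grep -c '3D controller.*NVIDIA'
}

# Example: list only NVIDIA devices (vendor ID 10de) and count the GPUs.
# lspci -d 10de: | count_gpus
```

On an 8-GPU HGX baseboard, a count below 8 here would suggest the problem sits at the PCIe enumeration level rather than in the diagnostic itself.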

...

Example of running the Field Test, showing the log output location.

...

Failure example from ZD-12288: fieldiag.log

View file: S434992x3806626-fieldiag-FAILED-8xA100.log