[INTERNAL USE]
HOW TO INSTALL TOOL
Tool File Name:
Hopper (H100 GPU): 629-24287-XXXX-FLD-38780.tgz
Ampere (A100 GPU): 629-23587-XX86-FLD-38782.tgz
Download Location (INTERNAL QA SERVER):
Hopper (H100 GPU): scp root@172.25.10.35:/root/HGX_Tool/629-24287-XXXX-FLD-38780.tgz .
Ampere (A100 GPU): scp root@172.25.10.35:/root/HGX_Tool/629-23587-XX86-FLD-38782.tgz .
Unload Nvidia Driver: scp root@172.25.10.35:/root/HGX_Tool/unload_nvidia_driver.sh .
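The archive names above can be captured in a small helper so that download commands do not hard-code them. This is only a sketch: the function name diag_archive_for and the family keywords it accepts are assumptions made for illustration and are not part of the Supermicro package.

```shell
#!/bin/sh
# Hypothetical helper: map a GPU family keyword to the archive name listed above.
diag_archive_for() {
    case "$1" in
        hopper|h100) echo "629-24287-XXXX-FLD-38780.tgz" ;;
        ampere|a100) echo "629-23587-XX86-FLD-38782.tgz" ;;
        *) echo "unknown GPU family: $1" >&2; return 1 ;;
    esac
}

# Example (not executed here):
#   scp "root@172.25.10.35:/root/HGX_Tool/$(diag_archive_for hopper)" .
```

An unrecognized keyword returns a non-zero status, so a calling script can stop before attempting the copy.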
The tool is expected to be placed in the /var/diags
folder. Create this folder if it does not exist.
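A minimal sketch of the placement step follows. It assumes the archive is a standard gzip-compressed tarball; whether it unpacks into its own subfolder (as the folder names in the listings on this page suggest) has not been verified here. The function name install_diag_tool and its optional second argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: create the target folder if needed, then unpack the archive into it.
# Usage: install_diag_tool <archive.tgz> [target_dir]   (target_dir defaults to /var/diags)
install_diag_tool() {
    archive=$1
    target=${2:-/var/diags}
    # Refuse to continue if the archive was not downloaded.
    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
    # mkdir -p succeeds whether or not the folder already exists.
    mkdir -p "$target" || return 1
    tar -xzf "$archive" -C "$target"
}

# Example (not executed here):
#   install_diag_tool 629-24287-XXXX-FLD-38780.tgz
```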
...
    root@rdlab:/var/diags/629-24287-XXXX-FLD-38780# ll
    total 473124
    drwxr-xr-x 2 root root      4096 Oct  4 10:55 ./
    drwxr-xr-x 3 exx  exx       4096 Oct  4 10:55 ../
    -rwxr-xr-x 1 exx  exx      17895 Sep 11 09:47 fdmain.sh*
    -rwxr-xr-x 1 exx  exx      32888 Sep 11 09:47 fieldiag.sh*
    -r-xr-xr-x 1 exx  exx   11194648 Sep 11 09:47 nvflash*
    -rwxr-xr-x 1 exx  exx  473142559 Sep 11 09:47 onediagfield.r6.252.tgz*
    -r-xr-xr-x 1 exx  exx       2906 Sep 11 09:47 README.txt*
    -r-xr-xr-x 1 exx  exx       1702 Sep 11 09:47 relnotes.txt*
    -r-xr-xr-x 1 exx  exx      18541 Sep 11 09:47 sku_hopper-hgx-8-gpu.json*
    -r-xr-xr-x 1 exx  exx      18477 Sep 11 09:47 sku_hopper-hgx-8-gpu_tpol.json*
    -rw-rw-r-- 1 exx  exx       3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level1_field.json
    -rw-rw-r-- 1 exx  exx       3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level2_field.json
    -rw-rw-r-- 1 exx  exx       2312 Sep 11 09:47 spec_hopper-hgx-8-gpu_sit_field.json
    -r-xr-xr-x 1 exx  exx       6832 Sep 11 09:47 testargs_hopper-hgx-8-gpu.json*
Extracted folder (Ampere) content:
    root@rdlab:/var/diags/629-23587-XX86-FLD-38782# ll
    total 243940
    drwxr-xr-x 4 root root      4096 Mar  4 22:55 ./
    drwxr-xr-x 4 root root      4096 Mar  5 17:03 ../
    drwxr-xr-x 8 root root      4096 Mar  4 22:55 dgx/
    -rw-r--r-- 1 root root         0 Mar  4 22:55 dgx_log_creation_lock
    -rw-r--r-- 1 root root         0 Mar  4 22:55 dgx_unpack_package_lock
    -rw-r--r-- 1 root root     26360 Mar  5 00:43 fieldiag.log
    -rwxr-xr-x 1 exx  exx      31232 Sep 19 20:53 fieldiag.sh*
    -rwxr-xr-x 1 exx  exx  238456629 Sep 19 20:53 hgxfieldiag.r3.102*
    drwxr-xr-x 2 root root      4096 Mar  5 00:43 logs/
    -r-xr-xr-x 1 exx  exx   11104504 Sep 19 20:53 nvflash*
    -rwxr-xr-x 1 exx  exx       2773 Sep 19 20:53 README.txt*
    -rwxr-xr-x 1 exx  exx       4497 Sep 19 20:53 relnotes.txt*
    -rwxr-xr-x 1 exx  exx       1823 Sep 19 20:53 sku_hgx-a100-8-gpu_40g_aircooled.json*
    -rwxr-xr-x 1 exx  exx       1482 Sep 19 20:53 sku_hgx-a100-8-gpu_40g_hybrid.json*
    -rwxr-xr-x 1 exx  exx       3787 Sep 19 20:53 sku_hgx-a100-8-gpu_80g_aircooled.json*
    -rwxr-xr-x 1 exx  exx       3611 Sep 19 20:53 sku_hgx-a100-8-gpu_80g_hybrid.json*
    -rwxr-xr-x 1 exx  exx      25076 Sep 19 20:53 testargs_hgx-a100-8-gpu_2tray.json*
    -rwxr-xr-x 1 exx  exx      24734 Sep 19 20:53 testargs_hgx-a100-8-gpu_d00_2tray.json*
    -rwxr-xr-x 1 exx  exx      14310 Sep 19 20:53 testargs_hgx-a100-8-gpu_d00.json*
    -rwxr-xr-x 1 exx  exx      14652 Sep 19 20:53 testargs_hgx-a100-8-gpu.json*
PROBLEM SITUATION
Supermicro provided this file to diagnose HGX H100 GPU issues. Related to ZD-6179 / SMC CRM Case: SM2310022368.
...
FIELDIAG TOOL USAGE
Review the README.txt for details on usage and options.
...
If the following error is encountered when running fieldiag.sh, the existing Nvidia driver is conflicting with the tool. Run the unload_nvidia_driver.sh script to stop the services.
    root@rdlab:/home/exx/smc_fieldiag/629-24287-XXXX-FLD-38780# ./fieldiag.sh
    Unpacking onediag...
    Could not determine HGX baseboard SKU
SMC provided a script to unload Nvidia drivers.
File Name: unload_nvidia_driver.sh
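Before re-running the diagnostic, it can help to confirm that no Nvidia kernel modules remain loaded. The check below is a sketch that is independent of the SMC script, whose contents are not reproduced here. The function name nvidia_modules_loaded and its optional file argument are assumptions added for illustration; by default it reads the standard Linux /proc/modules list.

```shell
#!/bin/sh
# Hypothetical check: exit 0 if any loaded kernel module name starts with "nvidia",
# exit 1 otherwise. Reads /proc/modules unless another file is given.
nvidia_modules_loaded() {
    modules_file=${1:-/proc/modules}
    # The first field of each /proc/modules line is the module name.
    awk '$1 ~ /^nvidia/ { found = 1 } END { exit !found }' "$modules_file"
}

# Example (not executed here):
#   if nvidia_modules_loaded; then
#       echo "Nvidia modules are still loaded"
#   fi
```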
GPU Field Diagnostic test failure example. Only 2 of 8 GPUs were properly identified for testing.
...
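When fewer GPUs are identified than expected, a quick cross-check of how many GPUs the operating system enumerates on the PCI bus can help separate a tool problem from a hardware detection problem. The sketch below is not part of the fieldiag package. It counts lines of lspci -nn output that carry the NVIDIA PCI vendor ID (10de) and a "3D controller" class, which is how data-center GPUs are commonly reported; that class match is an assumption, and NVSwitch bridge devices are deliberately not counted. The function name count_nvidia_gpus is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count NVIDIA "3D controller" entries in `lspci -nn`
# output read from standard input. Prints the count (0 if none match).
count_nvidia_gpus() {
    # grep -c prints 0 but exits non-zero when nothing matches; "|| true"
    # keeps the function's exit status at 0 in that case.
    grep -c -i '3D controller.*\[10de:' || true
}

# Example (not executed here):
#   lspci -nn | count_nvidia_gpus
```

On an 8-GPU HGX baseboard the expected count would be 8; a lower number suggests the missing GPUs are not visible to the operating system at all.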
Example of running the Field Test, showing the log output location.
...
Failure example from ZD-12288: fieldiag.log