HGX Field Diagnostic Tool Usage/Examples

[INTERNAL USE]

HOW TO INSTALL TOOL

Tool File Name:
Hopper (H100 GPU): 629-24287-XXXX-FLD-38780.tgz
Ampere (A100 GPU): 629-23587-XX86-FLD-38782.tgz

Download Location (INTERNAL QA SERVER):
Hopper (H100 GPU): scp root@172.25.10.35:/root/HGX_Tool/629-24287-XXXX-FLD-38780.tgz .
Ampere (A100 GPU): scp root@172.25.10.35:/root/HGX_Tool/629-23587-XX86-FLD-38782.tgz .
Unload Nvidia Driver: scp root@172.25.10.35:/root/HGX_Tool/unload_nvidia_driver.sh .

The tool is expected to be placed in the /var/diags folder. Create this folder if it does not exist.
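The placement step can be sketched as a small shell helper. This is a minimal sketch, not part of the vendor package: the function name install_diag is hypothetical, and it assumes the archive unpacks into its own top-level folder, as the listings in this document suggest.

```shell
#!/usr/bin/env bash
# install_diag ARCHIVE [DEST]
# Hypothetical helper: create DEST (default /var/diags) if it is missing,
# extract the downloaded .tgz into it, then list the result.
install_diag() {
    local archive="$1"
    local dest="${2:-/var/diags}"

    mkdir -p "$dest" || return 1                  # no-op when the folder already exists
    tar -xzf "$archive" -C "$dest" || return 1    # unpack the tool under DEST
    ls -l "$dest"
}

# Example (run as root on the system under test):
#   install_diag 629-24287-XXXX-FLD-38780.tgz
```

Passing a second argument extracts somewhere other than /var/diags, which is useful for a dry run on a workstation.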

Extracted folder (Hopper) content:

root@rdlab:/var/diags/629-24287-XXXX-FLD-38780# ll
total 473124
drwxr-xr-x 2 root root 4096 Oct 4 10:55 ./
drwxr-xr-x 3 exx exx 4096 Oct 4 10:55 ../
-rwxr-xr-x 1 exx exx 17895 Sep 11 09:47 fdmain.sh*
-rwxr-xr-x 1 exx exx 32888 Sep 11 09:47 fieldiag.sh*
-r-xr-xr-x 1 exx exx 11194648 Sep 11 09:47 nvflash*
-rwxr-xr-x 1 exx exx 473142559 Sep 11 09:47 onediagfield.r6.252.tgz*
-r-xr-xr-x 1 exx exx 2906 Sep 11 09:47 README.txt*
-r-xr-xr-x 1 exx exx 1702 Sep 11 09:47 relnotes.txt*
-r-xr-xr-x 1 exx exx 18541 Sep 11 09:47 sku_hopper-hgx-8-gpu.json*
-r-xr-xr-x 1 exx exx 18477 Sep 11 09:47 sku_hopper-hgx-8-gpu_tpol.json*
-rw-rw-r-- 1 exx exx 3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level1_field.json
-rw-rw-r-- 1 exx exx 3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level2_field.json
-rw-rw-r-- 1 exx exx 2312 Sep 11 09:47 spec_hopper-hgx-8-gpu_sit_field.json
-r-xr-xr-x 1 exx exx 6832 Sep 11 09:47 testargs_hopper-hgx-8-gpu.json*

Extracted folder (Ampere) content:

root@rdlab:/var/diags/629-23587-XX86-FLD-38782# ll
total 243940
drwxr-xr-x 4 root root 4096 Mar 4 22:55 ./
drwxr-xr-x 4 root root 4096 Mar 5 17:03 ../
drwxr-xr-x 8 root root 4096 Mar 4 22:55 dgx/
-rw-r--r-- 1 root root 0 Mar 4 22:55 dgx_log_creation_lock
-rw-r--r-- 1 root root 0 Mar 4 22:55 dgx_unpack_package_lock
-rw-r--r-- 1 root root 26360 Mar 5 00:43 fieldiag.log
-rwxr-xr-x 1 exx exx 31232 Sep 19 20:53 fieldiag.sh*
-rwxr-xr-x 1 exx exx 238456629 Sep 19 20:53 hgxfieldiag.r3.102*
drwxr-xr-x 2 root root 4096 Mar 5 00:43 logs/
-r-xr-xr-x 1 exx exx 11104504 Sep 19 20:53 nvflash*
-rwxr-xr-x 1 exx exx 2773 Sep 19 20:53 README.txt*
-rwxr-xr-x 1 exx exx 4497 Sep 19 20:53 relnotes.txt*
-rwxr-xr-x 1 exx exx 1823 Sep 19 20:53 sku_hgx-a100-8-gpu_40g_aircooled.json*
-rwxr-xr-x 1 exx exx 1482 Sep 19 20:53 sku_hgx-a100-8-gpu_40g_hybrid.json*
-rwxr-xr-x 1 exx exx 3787 Sep 19 20:53 sku_hgx-a100-8-gpu_80g_aircooled.json*
-rwxr-xr-x 1 exx exx 3611 Sep 19 20:53 sku_hgx-a100-8-gpu_80g_hybrid.json*
-rwxr-xr-x 1 exx exx 25076 Sep 19 20:53 testargs_hgx-a100-8-gpu_2tray.json*
-rwxr-xr-x 1 exx exx 24734 Sep 19 20:53 testargs_hgx-a100-8-gpu_d00_2tray.json*
-rwxr-xr-x 1 exx exx 14310 Sep 19 20:53 testargs_hgx-a100-8-gpu_d00.json*
-rwxr-xr-x 1 exx exx 14652 Sep 19 20:53 testargs_hgx-a100-8-gpu.json*

PROBLEM SITUATION

Supermicro provided this file to diagnose HGX H100 GPU issues, in relation to ZD-6179 / SMC CRM Case: SM2310022368.

The reported issue involved the 4x NVSwitch used by the 8x H100 GPUs: only 3 of the 4 NVSwitches were recognized. As a result, the Fabric Manager service was unable to run.
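One way to cross-check how many NVSwitches the host enumerates is to count NVIDIA bridge-class devices in the lspci output. This is a rough sketch under a stated assumption: that NVSwitches appear in lspci as NVIDIA (vendor ID 10de) devices labeled "Bridge". The function name count_nvidia_bridges is hypothetical; verify the device class on the actual system before relying on the count.

```shell
#!/usr/bin/env bash
# count_nvidia_bridges
# Hypothetical helper: read lspci text on stdin and print the number of
# lines that describe an NVIDIA device of class "Bridge".
# Assumption: NVSwitches enumerate as NVIDIA bridge-class PCI devices.
count_nvidia_bridges() {
    # grep -c prints the match count; "|| true" keeps the function's exit
    # status at 0 even when the count is 0.
    grep -c -i 'bridge.*nvidia' || true
}

# Example on the affected host (4 would be expected on a healthy
# 8-GPU baseboard according to the issue description above):
#   lspci -d 10de: | count_nvidia_bridges
```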

Unable to start Fabric Manager Service.

root@rdlab:/home/exx# systemctl enable nvidia-fabricmanager.service
root@rdlab:/home/exx# systemctl start nvidia-fabricmanager.service
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.

 

Checked status.

It looks like there was one success, but the rest were failures.

Log Files:

 

FIELDIAG TOOL USAGE

Review the README.txt for details on usage and options.

If the following error is encountered when running fieldiag.sh, run the unload_nvidia_driver.sh script to stop the services.

SMC provided a script to unload Nvidia drivers.

File Name: unload_nvidia_driver.sh
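The contents of unload_nvidia_driver.sh are not reproduced in this document. As a follow-up check after running it, the sketch below lists any kernel modules whose names still start with "nvidia". The function name list_nvidia_modules is hypothetical, and the check assumes the standard /proc/modules layout with the module name in the first column.

```shell
#!/usr/bin/env bash
# list_nvidia_modules [FILE]
# Hypothetical helper: print the names of loaded kernel modules that start
# with "nvidia". FILE defaults to /proc/modules (module name is column 1).
list_nvidia_modules() {
    local modules_file="${1:-/proc/modules}"
    awk '$1 ~ /^nvidia/ { print $1 }' "$modules_file"
}

# Example: report whether the driver still appears to be loaded.
#   if [ -n "$(list_nvidia_modules)" ]; then
#       echo "NVIDIA modules still loaded:"; list_nvidia_modules
#   else
#       echo "No NVIDIA modules loaded"
#   fi
```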

 

GPU Field Diagnostic test failure example. Only 2 of 8 GPUs were properly identified for testing.

 

Supermicro requested the following test method in the CRM ticket to focus on the NVSwitch issue.

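Run the following command with the test speed value in the connectivity section of the spec JSON file changed to Gen 3 - 8000. The spec file names for the --sit, --level1, and --level2 modes are listed below the command.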

./fieldiag.sh --no_bmc --sit # or --level1 or --level2

spec_hopper-hgx-8-gpu_sit_field.json
spec_hopper-hgx-8-gpu_level1_field.json
spec_hopper-hgx-8-gpu_level2_field.json

Section to alter speed value.
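Before editing a spec file by hand, it can help to keep a copy and locate the section first. The sketch below is only an illustration: the function name backup_and_locate is hypothetical, and it assumes the section can be found by searching for the word "connectivity". The exact key names inside the JSON are not confirmed here, so the edit itself is left to a text editor.

```shell
#!/usr/bin/env bash
# backup_and_locate SPEC_FILE
# Hypothetical helper: save a copy of the spec file as SPEC_FILE.bak, then
# print each line mentioning "connectivity" plus the 10 lines after it,
# with line numbers, so the speed value can be found and edited manually.
backup_and_locate() {
    local spec="$1"
    cp -p "$spec" "$spec.bak" || return 1
    grep -n -i -A 10 'connectivity' "$spec"
}

# Example:
#   backup_and_locate spec_hopper-hgx-8-gpu_sit_field.json
# To restore the original afterwards:
#   cp -p spec_hopper-hgx-8-gpu_sit_field.json.bak spec_hopper-hgx-8-gpu_sit_field.json
```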

Example of running Field Test showing logs output location.

Failure example from ZD-12288: fieldiag.log
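For a quick first pass over a fieldiag.log, the failing lines can be pulled out with grep. This is a sketch only: the function name scan_diag_log is hypothetical, and the search terms "fail" and "error" are an assumption about how failures are worded, since the log format is not described in this document.

```shell
#!/usr/bin/env bash
# scan_diag_log [LOG_FILE]
# Hypothetical helper: print lines (with line numbers) that contain "fail"
# or "error", case-insensitively. LOG_FILE defaults to ./fieldiag.log.
scan_diag_log() {
    local log="${1:-fieldiag.log}"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 1
    fi
    grep -n -i -E 'fail|error' "$log" || echo "no matching lines in $log"
}

# Example, run from the tool folder:
#   scan_diag_log fieldiag.log
```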