[INTERNAL USE]
Contents
HOW TO INSTALL TOOL
Tool File Name:
Hopper (H100 GPU): 629-24287-XXXX-FLD-38780.tgz
Ampere (A100 GPU): 629-23587-XX86-FLD-38782.tgz
Download Location (INTERNAL QA SERVER):
Hopper (H100 GPU): scp root@172.25.10.35:/root/HGX_Tool/629-24287-XXXX-FLD-38780.tgz .
Ampere (A100 GPU): scp root@172.25.10.35:/root/HGX_Tool/629-23587-XX86-FLD-38782.tgz .
Unload Nvidia Driver: scp root@172.25.10.35:/root/HGX_Tool/unload_nvidia_driver.sh .
The tool is expected to be placed in the /var/diags folder. Create this folder if it does not exist.
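A minimal install sequence for the Hopper package (swap in the Ampere filename as needed):

# Create the expected folder, fetch the package, and extract it
mkdir -p /var/diags
cd /var/diags
scp root@172.25.10.35:/root/HGX_Tool/629-24287-XXXX-FLD-38780.tgz .
tar -xzf 629-24287-XXXX-FLD-38780.tgz
cd 629-24287-XXXX-FLD-38780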
Extracted folder (Hopper) content:
root@rdlab:/var/diags/629-24287-XXXX-FLD-38780# ll
total 473124
drwxr-xr-x 2 root root      4096 Oct  4 10:55 ./
drwxr-xr-x 3 exx  exx       4096 Oct  4 10:55 ../
-rwxr-xr-x 1 exx  exx      17895 Sep 11 09:47 fdmain.sh*
-rwxr-xr-x 1 exx  exx      32888 Sep 11 09:47 fieldiag.sh*
-r-xr-xr-x 1 exx  exx   11194648 Sep 11 09:47 nvflash*
-rwxr-xr-x 1 exx  exx  473142559 Sep 11 09:47 onediagfield.r6.252.tgz*
-r-xr-xr-x 1 exx  exx       2906 Sep 11 09:47 README.txt*
-r-xr-xr-x 1 exx  exx       1702 Sep 11 09:47 relnotes.txt*
-r-xr-xr-x 1 exx  exx      18541 Sep 11 09:47 sku_hopper-hgx-8-gpu.json*
-r-xr-xr-x 1 exx  exx      18477 Sep 11 09:47 sku_hopper-hgx-8-gpu_tpol.json*
-rw-rw-r-- 1 exx  exx       3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level1_field.json
-rw-rw-r-- 1 exx  exx       3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level2_field.json
-rw-rw-r-- 1 exx  exx       2312 Sep 11 09:47 spec_hopper-hgx-8-gpu_sit_field.json
-r-xr-xr-x 1 exx  exx       6832 Sep 11 09:47 testargs_hopper-hgx-8-gpu.json*
Extracted folder (Ampere) content:
root@rdlab:/var/diags/629-22687-XX86-FLD-38225# ll
total 243812
drwxr-xr-x 2 root root      4096 Feb 12 22:27 ./
drwxr-xr-x 3 root root      4096 Feb 12 22:27 ../
-rwxr-xr-x 1 exx  exx      31232 Sep 13 22:39 fieldiag.sh*
-rwxr-xr-x 1 exx  exx  238455527 Sep 13 22:39 hgxfieldiag.r3.100*
-r-xr-xr-x 1 exx  exx   11115992 Sep 13 22:39 nvflash*
-rwxr-xr-x 1 exx  exx       2719 Sep 13 22:39 README.txt*
-rwxr-xr-x 1 exx  exx       1650 Sep 13 22:39 relnotes.txt*
-rwxr-xr-x 1 exx  exx       1046 Sep 13 22:39 sku_hgx-a100-4-gpu_40g_aircooled.json*
-rwxr-xr-x 1 exx  exx       1046 Sep 13 22:39 sku_hgx-a100-4-gpu_40g_liquidcooled.json*
-rwxr-xr-x 1 exx  exx       1676 Sep 13 22:39 sku_hgx-a100-4-gpu_64g.json*
-rwxr-xr-x 1 exx  exx       1876 Sep 13 22:39 sku_hgx-a100-4-gpu_80g_aircooled.json*
-rwxr-xr-x 1 exx  exx       1798 Sep 13 22:39 sku_hgx-a100-4-gpu_80g_liquidcooled.json*
-rwxr-xr-x 1 exx  exx       1048 Sep 13 22:39 sku_hgx-a100-4-gpu_96g.json*
-rwxr-xr-x 1 exx  exx       8618 Sep 13 22:39 testargs_hgx-a100-4-gpu.json*
PROBLEM SITUATION
Supermicro provided this tool to diagnose HGX H100 GPU issues. Related to ZD-6179 / SMC CRM Case SM2310022368.
The reported issue involved the 4x NVSwitches serving the 8x H100 GPUs: only 3 of the 4 NVSwitches were recognized, which prevented the Fabric Manager service from running.
Attempting to start the Fabric Manager service fails:
root@rdlab:/home/exx# systemctl enable nvidia-fabricmanager.service
root@rdlab:/home/exx# systemctl start nvidia-fabricmanager.service
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.
Checking the service status:
root@rdlab:/home/exx# systemctl status nvidia-fabricmanager.service
× nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Fri 2023-09-29 15:18:28 PDT; 1min 41s ago
    Process: 16773 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/FAILURE)
        CPU: 16ms

Sep 29 15:18:27 rdlab systemd[1]: Starting NVIDIA fabric manager service...
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: Connected to 1 node.
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Sep 29 15:18:28 rdlab systemd[1]: Failed to start NVIDIA fabric manager service.
Every start attempt in the journal follows the same pattern: the process connects to the node, then aborts because the detected number of NVSwitches does not match any supported system topology.
root@rdlab:/home/exx# journalctl -xeu nvidia-fabricmanager.service
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit nvidia-fabricmanager.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit nvidia-fabricmanager.service has entered the 'failed' state with result 'exit-code'.
Sep 29 15:18:28 rdlab systemd[1]: Failed to start NVIDIA fabric manager service.
░░ Subject: A start job for unit nvidia-fabricmanager.service has failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nvidia-fabricmanager.service has finished with a failure.
░░
░░ The job identifier is 10553 and the job result is failed.
Sep 29 15:23:52 rdlab systemd[1]: Starting NVIDIA fabric manager service...
░░ Subject: A start job for unit nvidia-fabricmanager.service has begun execution
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nvidia-fabricmanager.service has begun execution.
░░
░░ The job identifier is 10924.
Sep 29 15:23:53 rdlab nv-fabricmanager[16995]: Connected to 1 node.
Sep 29 15:23:53 rdlab nv-fabricmanager[16995]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:23:53 rdlab nv-fabricmanager[16995]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:23:53 rdlab systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit nvidia-fabricmanager.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Sep 29 15:23:53 rdlab systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit nvidia-fabricmanager.service has entered the 'failed' state with result 'exit-code'.
Sep 29 15:23:53 rdlab systemd[1]: Failed to start NVIDIA fabric manager service.
░░ Subject: A start job for unit nvidia-fabricmanager.service has failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nvidia-fabricmanager.service has finished with a failure.
░░
░░ The job identifier is 10924 and the job result is failed.
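Before rerunning Fabric Manager, it can help to confirm how many NVSwitches the OS actually enumerates (an HGX H100 8-GPU baseboard carries 4). Two quick checks using standard Linux tooling:

# NVSwitches enumerate as NVIDIA PCI bridge devices; count the matches
lspci | grep -i nvidia | grep -i bridge
# With the NVSwitch driver loaded, each switch also gets a device node
ls -l /dev/nvidia-nvswitch*

If only 3 bridge devices show up here, the missing switch is not visible at the PCI level, which matches the Fabric Manager topology error above.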
Log Files:
FIELDIAG TOOL USAGE
Review the README.txt for details on usage and options.
If the following error is encountered when running fieldiag.sh, run the unload_nvidia_driver.sh script to stop the services holding the driver.
root@rdlab:/home/exx/smc_fieldiag/629-24287-XXXX-FLD-38780# ./fieldiag.sh
Unpacking onediag...
Could not determine HGX baseboard SKU
SMC provided a script to unload the NVIDIA drivers.
File Name: unload_nvidia_driver.sh
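The script's contents are not reproduced here; a typical unload sequence looks like the sketch below (the service names and module list are assumptions, and the SMC script may differ):

# Stop services that hold the NVIDIA driver open
systemctl stop nvidia-fabricmanager.service
systemctl stop nvidia-persistenced.service
# Remove kernel modules in dependency order; rmmod fails if a module is still in use
rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia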
GPU Field Diagnostic test failure example. Only 2 of the 8 GPUs passed, and only those two are identified by serial number in the results summary.
root@rdlab:/var/diags/629-24287-XXXX-FLD-38780# ./fieldiag.sh --gpufielddiag /var/diags/629-24287-XXXX-FLD-38780
Warning: Stopping systemd-udevd.service, but it can still be activated by:
  systemd-udevd-kernel.socket
  systemd-udevd-control.socket
************************************************************
*                                                          *
*                   GPU FIELD DIAGNOSTIC                   *
*                                                          *
************************************************************
Version 629-24287-XXXX-FLD-38780
Logs /var/diags/629-24287-XXXX-FLD-38780/dgx/logs-20231002-150508/gpu_fd_logs
Running fieldiag...
GPU Devices Under Test:
0:04:00.0 0:23:00.0 0:43:00.0 0:64:00.0 0:84:00.0 0:a3:00.0 0:c3:00.0 0:e4:00.0
Running Parallel Tests
MODS start: Mon Oct  2 15:06:29 2023
GPU 0: RUNNING  GPU 1: RUNNING  GPU 2: RUNNING  GPU 3: RUNNING  GPU 4: RUNNING  GPU 5: RUNNING  GPU 6: RUNNING  GPU 7: RUNNING
GPU 0: RUNNING  GPU 1: RUNNING  GPU 2: FAIL  GPU 3: FAIL  GPU 4: FAIL  GPU 5: FAIL  GPU 6: FAIL  GPU 7: FAIL
Initializing... |====================| 100.0 %
Running test 489 on GPU 0 [7] [04:00.0] - 2 tests remaining |====================|  99.6 %
Done test 491 on GPU 0 [7] [04:00.0] - 0 tests remaining |====================| 100.0 %
FAIL
GPU 7: FAIL
Error Code = 000000000000 (ok)
[ASCII art result banner]
MODS end : Mon Oct  2 23:25:10 2023 [29921.052 seconds (08:18:41.052 h:m:s)]
MODS end : Mon Oct  2 23:25:10 2023 [29921.081 seconds (08:18:41.081 h:m:s)]
GPU 0: PENDING  GPU 1: PENDING  GPU 2: FAIL  GPU 3: FAIL  GPU 4: FAIL  GPU 5: FAIL  GPU 6: FAIL  GPU 7: FAIL
ls: cannot access '__fieldiag2/*.log': No such file or directory
ls: cannot access '__fieldiag3/*.log': No such file or directory
ls: cannot access '__fieldiag4/*.log': No such file or directory
ls: cannot access '__fieldiag5/*.log': No such file or directory
ls: cannot access '__fieldiag6/*.log': No such file or directory
ls: cannot access '__fieldiag7/*.log': No such file or directory
----------------------
Fieldiag Testing Completed
GPU 0: PASS  GPU 1: PASS  GPU 2: FAIL  GPU 3: FAIL  GPU 4: FAIL  GPU 5: FAIL  GPU 6: FAIL  GPU 7: FAIL

Results Summary
GPU ID | GPU SN#        | STATUS
===============================================
GPU0   | 1650723008686  | PASS
GPU1   | 1650723001355  | PASS
[ASCII art result banner]
Done
Failed to send reload request: No such file or directory
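To pull the per-GPU verdicts out of a completed run without scrolling the console output, the log directory printed in the tool's banner can be grepped. The directory below is the one from this run; the exact log layout is an assumption:

# Summarize PASS/FAIL lines from the fieldiag logs for this run
LOGDIR=/var/diags/629-24287-XXXX-FLD-38780/dgx/logs-20231002-150508/gpu_fd_logs
grep -rhE "GPU[0-7].*(PASS|FAIL)" "$LOGDIR" | sort -u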
Test methods requested by Supermicro via the CRM ticket, focused on the NVSwitch issue:
Change the test speed value in the connectivity section of the appropriate spec JSON file (listed below) to Gen 3 (8000), then run:
./fieldiag.sh --no_bmc --sit # or --level1 or --level2
spec_hopper-hgx-8-gpu_sit_field.json
spec_hopper-hgx-8-gpu_level1_field.json
spec_hopper-hgx-8-gpu_level2_field.json
Section of the spec file in which to alter the speed value:
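To locate the block before editing, grep the spec file. The key name "connectivity" follows the wording above but has not been verified against the file's actual schema:

# Show the connectivity section and its current speed setting (key name assumed)
grep -n -i -A 6 "connectivity" spec_hopper-hgx-8-gpu_sit_field.json
# Edit the speed value in that block to Gen 3 (8000) with any text editor, e.g.:
vi spec_hopper-hgx-8-gpu_sit_field.json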
Example of running the field test, showing the log output location.