[INTERNAL USE]
Contents
HOW TO INSTALL TOOL
Tool File Name:
Hopper (H100 GPU): 629-24287-XXXX-FLD-38780.tgz
Ampere (A100 GPU): 629-23587-XX86-FLD-38782.tgz
Download Location (INTERNAL QA SERVER):
Hopper (H100 GPU): scp root@172.25.10.35:/root/629-24287-XXXX-FLD-38780.tgz .
Ampere (A100 GPU): scp root@172.25.10.35:/root/629-23587-XX86-FLD-38782.tgz .
The tool is expected to be placed in the /var/diags folder. Create this folder if it does not exist.
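The placement step can be sketched as a small shell helper. This is a minimal sketch, not part of the tool: install_diag_tool is a hypothetical function name, and the default destination is the /var/diags folder named above.

```shell
# Sketch: create the target folder if needed, then unpack the tool archive into it.
# install_diag_tool is a hypothetical helper name, not something shipped with the tool.
install_diag_tool() {
    local archive="$1"
    local dest="${2:-/var/diags}"    # default is the folder named in this document

    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        return 1
    fi

    mkdir -p "$dest" || return 1     # no-op if the folder already exists
    tar -xzf "$archive" -C "$dest"   # extract the .tgz into the target folder
}

# Example (run as root on the system under test):
#   install_diag_tool 629-24287-XXXX-FLD-38780.tgz
```

Passing a second argument overrides the destination, which is useful for a dry run in a scratch directory before touching /var/diags.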
Extracted folder (Hopper) content:
root@rdlab:/var/diags/629-24287-XXXX-FLD-38780# ll
total 473124
drwxr-xr-x 2 root root 4096 Oct 4 10:55 ./
drwxr-xr-x 3 exx exx 4096 Oct 4 10:55 ../
-rwxr-xr-x 1 exx exx 17895 Sep 11 09:47 fdmain.sh*
-rwxr-xr-x 1 exx exx 32888 Sep 11 09:47 fieldiag.sh*
-r-xr-xr-x 1 exx exx 11194648 Sep 11 09:47 nvflash*
-rwxr-xr-x 1 exx exx 473142559 Sep 11 09:47 onediagfield.r6.252.tgz*
-r-xr-xr-x 1 exx exx 2906 Sep 11 09:47 README.txt*
-r-xr-xr-x 1 exx exx 1702 Sep 11 09:47 relnotes.txt*
-r-xr-xr-x 1 exx exx 18541 Sep 11 09:47 sku_hopper-hgx-8-gpu.json*
-r-xr-xr-x 1 exx exx 18477 Sep 11 09:47 sku_hopper-hgx-8-gpu_tpol.json*
-rw-rw-r-- 1 exx exx 3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level1_field.json
-rw-rw-r-- 1 exx exx 3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level2_field.json
-rw-rw-r-- 1 exx exx 2312 Sep 11 09:47 spec_hopper-hgx-8-gpu_sit_field.json
-r-xr-xr-x 1 exx exx 6832 Sep 11 09:47 testargs_hopper-hgx-8-gpu.json*
PROBLEM SITUATION
Supermicro provided this tool file to diagnose HGX H100 GPU issues, in relation to ZD-6179 / SMC CRM Case: SM2310022368.
The reported issue involved the 4x NVSwitch used by the 8x H100 GPUs: only 3 of the 4 NVSwitches were recognized. Because of this, the Fabric Manager Service was unable to run.
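One way to check the NVSwitch count from the OS is to count the matching entries in lspci output. The sketch below assumes that NVSwitch devices are listed by lspci as "Bridge: NVIDIA Corporation ..." entries; that pattern is an assumption to confirm on the target system, and both function names are hypothetical.

```shell
# Sketch: count NVSwitch-like entries in lspci-style text read from stdin.
# ASSUMPTION: NVSwitch devices show up as "Bridge: NVIDIA ..." lines in lspci.
count_nvswitch_bridges() {
    # grep -c prints 0 and exits non-zero when nothing matches; keep exit status 0.
    grep -c 'Bridge: NVIDIA' || true
}

# Compare the count against the expected number of NVSwitches (4 on this system).
# Returns 0 when the counts match and non-zero otherwise.
check_nvswitch_count() {
    local expected="$1"
    local found
    found=$(count_nvswitch_bridges)
    echo "NVSwitch bridges found: $found (expected $expected)"
    [ "$found" -eq "$expected" ]
}

# Example (run on the system under test):
#   lspci | check_nvswitch_count 4
```

A result of 3 against an expected 4 would be consistent with the problem described here, but the lspci listing is only a first check, not a replacement for the diagnostic tool.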
Unable to start Fabric Manager Service.
root@rdlab:/home/exx# systemctl enable nvidia-fabricmanager.service
root@rdlab:/home/exx# systemctl start nvidia-fabricmanager.service
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.
Checked status.
root@rdlab:/home/exx# systemctl status nvidia-fabricmanager.service
× nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Fri 2023-09-29 15:18:28 PDT; 1min 41s ago
    Process: 16773 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/FAILURE)
        CPU: 16ms

Sep 29 15:18:27 rdlab systemd[1]: Starting NVIDIA fabric manager service...
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: Connected to 1 node.
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Sep 29 15:18:28 rdlab systemd[1]: Failed to start NVIDIA fabric manager service.
It looks like there was one success, but the rest are failures.
root@rdlab:/home/exx# journalctl -xeu nvidia-fabricmanager.service
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit nvidia-fabricmanager.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit nvidia-fabricmanager.service has entered the 'failed' state with result 'exit-code'.
Sep 29 15:18:28 rdlab systemd[1]: Failed to start NVIDIA fabric manager service.
░░ Subject: A start job for unit nvidia-fabricmanager.service has failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nvidia-fabricmanager.service has finished with a failure.
░░
░░ The job identifier is 10553 and the job result is failed.
Sep 29 15:23:52 rdlab systemd[1]: Starting NVIDIA fabric manager service...
░░ Subject: A start job for unit nvidia-fabricmanager.service has begun execution
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nvidia-fabricmanager.service has begun execution.
░░
░░ The job identifier is 10924.
Sep 29 15:23:53 rdlab nv-fabricmanager[16995]: Connected to 1 node.
Sep 29 15:23:53 rdlab nv-fabricmanager[16995]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:23:53 rdlab nv-fabricmanager[16995]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:23:53 rdlab systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit nvidia-fabricmanager.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Sep 29 15:23:53 rdlab systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit nvidia-fabricmanager.service has entered the 'failed' state with result 'exit-code'.
Sep 29 15:23:53 rdlab systemd[1]: Failed to start NVIDIA fabric manager service.
░░ Subject: A start job for unit nvidia-fabricmanager.service has failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nvidia-fabricmanager.service has finished with a failure.
░░
░░ The job identifier is 10924 and the job result is failed.
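The journal output above is long, so a quick filter for the key messages can help when checking other systems. The sketch below searches journal text for the two message strings seen in this capture; the function names are hypothetical, and the strings are taken from the log above rather than from any NVIDIA documentation.

```shell
# Sketch: scan journal text (stdin) for the Fabric Manager messages seen above.
# Succeeds if the NVSwitch topology mismatch error is present.
has_topology_mismatch() {
    grep -q "detected number of NVSwitches don't match with any supported system topology"
}

# Prints how many failed start attempts appear in the journal text on stdin.
count_failed_starts() {
    # grep -c exits non-zero on zero matches; keep exit status 0 for callers.
    grep -c 'Failed to start NVIDIA fabric manager service' || true
}

# Example (run on the system under test):
#   journalctl -u nvidia-fabricmanager.service --no-pager | has_topology_mismatch \
#       && echo "Fabric Manager aborted: NVSwitch count matches no supported topology"
#   journalctl -u nvidia-fabricmanager.service --no-pager | count_failed_starts
```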
Log Files:
TOOL USAGE
Review the README.txt for details on usage and options.
If the following error is encountered when running fieldiag.sh, uninstall the existing NVIDIA driver on the system. The existing NVIDIA driver conflicts with the tool.
root@rdlab:/home/exx/smc_fieldiag/629-24287-XXXX-FLD-38780# ./fieldiag.sh
Unpacking onediag...
Could not determine HGX baseboard SKU
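Before running fieldiag.sh, it may help to check whether an NVIDIA kernel driver is currently loaded. The sketch below filters lsmod-style text for module names beginning with "nvidia"; the function names are hypothetical, and treating a loaded module as the cause of the error above is an assumption based on this case only.

```shell
# Sketch: list kernel modules whose name starts with "nvidia".
# Reads lsmod-style text on stdin (first column is the module name).
list_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }'
}

# Succeeds if at least one such module appears in the text on stdin.
nvidia_driver_loaded() {
    [ -n "$(list_nvidia_modules)" ]
}

# Example (run on the system under test):
#   if lsmod | nvidia_driver_loaded; then
#       echo "NVIDIA driver modules are loaded; uninstall the driver before running fieldiag.sh"
#   fi
```

The uninstall procedure itself depends on how the driver was installed (distribution package or NVIDIA installer), so it is not sketched here.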