[INTERNAL USE]
Supermicro provided this file to diagnose HGX H100 GPU issues. Related to ZD-6179 / SMC CRM Case: SM2310022368.
Tool File Name: 629-24287-XXXX-FLD-38780.tgz
Download Location: scp root@172.25.10.35:/root/629-24287-XXXX-FLD-38780.tgz .
The reported issue relates to the 4x NVSwitches used by the 8x H100 GPUs. Only 3 of the 4 NVSwitches were recognized, and because of this the Fabric Manager Service was unable to run.
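A quick way to confirm how many NVSwitches the host actually enumerates is to count them on the PCI bus. The sketch below is not part of the Supermicro tool or the case notes. `count_nvswitches` is a hypothetical helper, and it assumes NVSwitches appear in `lspci -n` output with NVIDIA's vendor ID (10de) and PCI class 0680 (bridge, other). Verify that assumption against the real `lspci` output on the system before relying on the count.

```shell
#!/bin/sh
# Hypothetical helper: count NVSwitch-like devices in `lspci -n` output read from stdin.
# Assumption (not from the case notes): NVSwitches are listed with vendor ID 10de
# (NVIDIA) and PCI class 0680 (bridge, other).
count_nvswitches() {
    # `lspci -n` lines look like: "<slot> <class>: <vendor>:<device> (rev xx)"
    awk '$2 == "0680:" && $3 ~ /^10de:/ { n++ } END { print n + 0 }'
}

# Usage on the affected host:
#   lspci -n | count_nvswitches
```

On this system the expected count is 4. A count of 3 would match the reported symptom at the PCI level, before Fabric Manager is involved.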
Unable to start Fabric Manager Service.
root@rdlab:/home/exx# systemctl enable nvidia-fabricmanager.service
root@rdlab:/home/exx# systemctl start nvidia-fabricmanager.service
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.
Checked status.
root@rdlab:/home/exx# systemctl status nvidia-fabricmanager.service
× nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Fri 2023-09-29 15:18:28 PDT; 1min 41s ago
    Process: 16773 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/FAILURE)
        CPU: 16ms

Sep 29 15:18:27 rdlab systemd[1]: Starting NVIDIA fabric manager service...
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: Connected to 1 node.
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Sep 29 15:18:28 rdlab systemd[1]: Failed to start NVIDIA fabric manager service.
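It looks like there was one success, but the rest are failures. The logs show two start attempts (15:18 and 15:23). In each one, Fabric Manager connects to 1 node and then aborts with the NVSwitch topology mismatch error.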
root@rdlab:/home/exx# journalctl -xeu nvidia-fabricmanager.service
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit nvidia-fabricmanager.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit nvidia-fabricmanager.service has entered the 'failed' state with result 'exit-code'.
Sep 29 15:18:28 rdlab systemd[1]: Failed to start NVIDIA fabric manager service.
░░ Subject: A start job for unit nvidia-fabricmanager.service has failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nvidia-fabricmanager.service has finished with a failure.
░░
░░ The job identifier is 10553 and the job result is failed.
Sep 29 15:23:52 rdlab systemd[1]: Starting NVIDIA fabric manager service...
░░ Subject: A start job for unit nvidia-fabricmanager.service has begun execution
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nvidia-fabricmanager.service has begun execution.
░░
░░ The job identifier is 10924.
Sep 29 15:23:53 rdlab nv-fabricmanager[16995]: Connected to 1 node.
Sep 29 15:23:53 rdlab nv-fabricmanager[16995]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:23:53 rdlab nv-fabricmanager[16995]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:23:53 rdlab systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit nvidia-fabricmanager.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Sep 29 15:23:53 rdlab systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit nvidia-fabricmanager.service has entered the 'failed' state with result 'exit-code'.
Sep 29 15:23:53 rdlab systemd[1]: Failed to start NVIDIA fabric manager service.
░░ Subject: A start job for unit nvidia-fabricmanager.service has failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nvidia-fabricmanager.service has finished with a failure.
░░
░░ The job identifier is 10924 and the job result is failed.
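The trailing `ma>` on the error lines appears to be the pager chopping long lines. The sketch below shows one way to capture the full messages and count the topology-mismatch aborts. `count_topology_aborts` is a hypothetical helper name, and the search string is taken from the log output in this note. Each failed start in this note logs the message twice, so the count is not the number of start attempts.

```shell
#!/bin/sh
# Hypothetical helper: count Fabric Manager topology-mismatch aborts
# in journal text read from stdin.
count_topology_aborts() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, so "|| true" keeps the helper's exit status at 0.
    grep -c "detected number of NVSwitches don't match" || true
}

# Usage on the affected host (--no-pager prints full lines instead of chopping them):
#   journalctl -u nvidia-fabricmanager.service --no-pager | count_topology_aborts
#   systemctl status nvidia-fabricmanager.service --no-pager --full
```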
Log Files: