[INTERNAL USE]

Supermicro provided this file to diagnose HGX H100 GPU issues. Related to ZD-6179 / SMC CRM Case: SM2310022368.

Tool File Name: 629-24287-XXXX-FLD-38780.tgz
Download Location: scp root@172.25.10.35:/root/629-24287-XXXX-FLD-38780.tgz .

The reported issue involved the 4x NVSwitch used by the 8x H100 GPUs: only 3 of the 4 NVSwitches were recognized. As a result, the Fabric Manager service could not run.
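
One way to confirm how many NVSwitches the host actually enumerates is to count the NVIDIA bridge-class devices on the PCIe bus. The sketch below is not part of the Supermicro tool. It assumes that NVSwitches appear in `lspci` output as `Bridge:` devices under NVIDIA's PCI vendor ID `10de`, and that 4 is the expected count for this system. The function names are hypothetical.

```shell
# Count NVSwitch-like devices in lspci output read from stdin.
# Assumption: NVSwitches are listed as "Bridge:" class devices from NVIDIA.
count_nvswitches() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'Bridge: NVIDIA' || true
}

# Compare the detected count with the expected count (assumed default: 4).
check_nvswitch_count() {
    expected="${1:-4}"
    found="$(count_nvswitches)"
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found of $expected NVSwitches detected"
    else
        echo "MISMATCH: $found of $expected NVSwitches detected"
        return 1
    fi
}

# Usage on the affected host (10de is NVIDIA's PCI vendor ID):
#   lspci -d 10de: | check_nvswitch_count 4
```

If the assumption holds, a host in the state described here would report a mismatch of 3 of 4, which is consistent with the Fabric Manager topology error.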

The Fabric Manager service failed to start:

root@rdlab:/home/exx# systemctl enable nvidia-fabricmanager.service
root@rdlab:/home/exx# systemctl start nvidia-fabricmanager.service
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.

Checked the service status:

root@rdlab:/home/exx# systemctl status nvidia-fabricmanager.service
× nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Fri 2023-09-29 15:18:28 PDT; 1min 41s ago
    Process: 16773 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/FAILURE)
        CPU: 16ms

Sep 29 15:18:27 rdlab systemd[1]: Starting NVIDIA fabric manager service...
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: Connected to 1 node.
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Sep 29 15:18:28 rdlab systemd[1]: Failed to start NVIDIA fabric manager service.

 
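In the journal, the only step that succeeds is the connection to the node ("Connected to 1 node."). Every start attempt then aborts with the same NVSwitch topology error, including a second attempt at 15:23:52.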

root@rdlab:/home/exx# journalctl -xeu nvidia-fabricmanager.service
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit nvidia-fabricmanager.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit nvidia-fabricmanager.service has entered the 'failed' state with result 'exit-code'.
Sep 29 15:18:28 rdlab systemd[1]: Failed to start NVIDIA fabric manager service.
░░ Subject: A start job for unit nvidia-fabricmanager.service has failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nvidia-fabricmanager.service has finished with a failure.
░░
░░ The job identifier is 10553 and the job result is failed.
Sep 29 15:23:52 rdlab systemd[1]: Starting NVIDIA fabric manager service...
░░ Subject: A start job for unit nvidia-fabricmanager.service has begun execution
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nvidia-fabricmanager.service has begun execution.
░░
░░ The job identifier is 10924.
Sep 29 15:23:53 rdlab nv-fabricmanager[16995]: Connected to 1 node.
Sep 29 15:23:53 rdlab nv-fabricmanager[16995]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:23:53 rdlab nv-fabricmanager[16995]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:23:53 rdlab systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit nvidia-fabricmanager.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Sep 29 15:23:53 rdlab systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit nvidia-fabricmanager.service has entered the 'failed' state with result 'exit-code'.
Sep 29 15:23:53 rdlab systemd[1]: Failed to start NVIDIA fabric manager service.
░░ Subject: A start job for unit nvidia-fabricmanager.service has failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit nvidia-fabricmanager.service has finished with a failure.
░░
░░ The job identifier is 10924 and the job result is failed.
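
The trailing `>` on the `nv-fabricmanager` lines above means the pager cut them off; `journalctl -u nvidia-fabricmanager.service --no-pager` prints them in full. As a rough sketch for summarizing such a journal, the helpers below read journal text from stdin. The function names are hypothetical, and the match strings are taken from the log lines shown above.

```shell
# Count failed start attempts in journal text read from stdin.
count_failed_starts() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'Failed to start NVIDIA fabric manager service' || true
}

# Print each distinct message logged by nv-fabricmanager itself,
# with the timestamp, host name, and PID prefix stripped.
fabricmanager_messages() {
    grep 'nv-fabricmanager\[' \
        | sed 's/^.*nv-fabricmanager\[[0-9]*]: //' \
        | sort -u
}

# Usage on the affected host:
#   journalctl -u nvidia-fabricmanager.service --no-pager | count_failed_starts
#   journalctl -u nvidia-fabricmanager.service --no-pager | fabricmanager_messages
```

Collapsing the repeated messages this way makes it easier to see whether every attempt fails with the same error.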

Log Files:
