[INTERNAL USE] Supermicro provided this file to diagnose HGX H100 GPU issues. Related to ZD-6179 / SMC CRM Case: SM2310022368.

HOW TO INSTALL TOOL

Tool File Name: 629-24287-XXXX-FLD-38780.tgz
Download Location: scp root@172.25.10.35:/root/629-24287-XXXX-FLD-38780.tgz .

PROBLEM SITUATION

Supermicro provided this file to diagnose HGX H100 GPU issues. Related to ZD-6179 / SMC CRM Case: SM2310022368.

The reported issue involved the 4x NVSwitches used by the 8x H100 GPUs. Only 3 of the 4 NVSwitches were recognized, and as a result the Fabric Manager service was unable to start.
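As a quick cross-check of the NVSwitch count, the sketch below counts NVSwitch-like entries in `lspci` output. It assumes that NVSwitches enumerate as NVIDIA devices with the "Bridge" class label, which should be confirmed on the target system. The function names `count_nvswitches` and `check_nvswitches` are hypothetical helpers and are not part of the Supermicro tool.

```shell
# Count NVSwitch-like devices in `lspci` output read from stdin.
# Assumption: NVSwitches are listed as NVIDIA devices of class "Bridge".
count_nvswitches() {
    # grep -c exits non-zero when the count is 0; keep the "0" output anyway.
    grep -c 'Bridge: NVIDIA' || true
}

# Compare the detected count against the expected count (4 on this system).
# Prints a one-line summary and returns 1 on a mismatch.
check_nvswitches() {
    local expected="$1" found
    found="$(count_nvswitches)"
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found of $expected NVSwitches detected"
    else
        echo "MISMATCH: $found of $expected NVSwitches detected"
        return 1
    fi
}
```

A possible invocation on the affected host would be `lspci | check_nvswitches 4`.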

...

Code Block
root@rdlab:/home/exx# systemctl enable nvidia-fabricmanager.service
root@rdlab:/home/exx# systemctl start nvidia-fabricmanager.service
Job for nvidia-fabricmanager.service failed because the control process exited with error code.
See "systemctl status nvidia-fabricmanager.service" and "journalctl -xeu nvidia-fabricmanager.service" for details.

The service status was then checked:

Code Block
root@rdlab:/home/exx# systemctl status nvidia-fabricmanager.service
× nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Fri 2023-09-29 15:18:28 PDT; 1min 41s ago
    Process: 16773 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/FAILURE)
        CPU: 16ms

Sep 29 15:18:27 rdlab systemd[1]: Starting NVIDIA fabric manager service...
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: Connected to 1 node.
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab nv-fabricmanager[16775]: detected number of NVSwitches don't match with any supported system topology, aborting fabric ma>
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Sep 29 15:18:28 rdlab systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Sep 29 15:18:28 rdlab systemd[1]: Failed to start NVIDIA fabric manager service.
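To pull the relevant Fabric Manager messages out of the journal, a small filter along the following lines may help. This is a minimal sketch: `fm_messages` and `has_topology_mismatch` are hypothetical helper names, and the match string is taken from the log output above.

```shell
# Print the distinct nv-fabricmanager messages found in journal text on stdin,
# with the timestamp/host/process prefix stripped.
fm_messages() {
    sed -n 's/^.*nv-fabricmanager\[[0-9]*]: //p' | sort -u
}

# Succeed (exit 0) if the journal text on stdin contains the
# NVSwitch topology mismatch error reported by Fabric Manager.
has_topology_mismatch() {
    grep -q "NVSwitches don't match with any supported system topology"
}
```

For example, `journalctl -u nvidia-fabricmanager.service --no-pager | fm_messages` lists each message once. Using `--no-pager` should also avoid the `>` truncation visible in the status output.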

...

Attached file: RMA28206-fabricmanager.log
Attached file: RMA28206-NVSwitch Detection..txt

INVESTIGATION DETAILS

The

Tool Installation