[INTERNAL USE]

...

Code Block
root@rdlab:/home/exx/smc_fieldiag/629-24287-XXXX-FLD-38780# ./fieldiag.sh
Unpacking onediag...
Could not determine HGX baseboard SKU

Example of a GPU Field Diagnostic test failure. Only 2 of the 8 GPUs were properly identified for testing.

Code Block
root@rdlab:/var/diags/629-24287-XXXX-FLD-38780# ./fieldiag.sh --gpufielddiag
/var/diags/629-24287-XXXX-FLD-38780
Warning: Stopping systemd-udevd.service, but it can still be activated by:
  systemd-udevd-kernel.socket
  systemd-udevd-control.socket
************************************************************
*                                                          *
*                   GPU FIELD DIAGNOSTIC                   *
*                                                          *
************************************************************
Version            629-24287-XXXX-FLD-38780
Logs               /var/diags/629-24287-XXXX-FLD-38780/dgx/logs-20231002-150508/gpu_fd_logs

Running fieldiag...
GPU Devices Under Test: 0:04:00.0 0:23:00.0 0:43:00.0 0:64:00.0 0:84:00.0 0:a3:00.0 0:c3:00.0 0:e4:00.0
Running Parallel Tests

MODS start: Mon Oct  2 15:06:29 2023
GPU 0: RUNNING  GPU 1: RUNNING  GPU 2: RUNNING  GPU 3: RUNNING  GPU 4: RUNNING  GPU 5: RUNNING  GPU 6: RUNNING  GPU 7: RUNNING
GPU 0: RUNNING  GPU 1: RUNNING  GPU 2: FAIL     GPU 3: FAIL     GPU 4: FAIL     GPU 5: FAIL     GPU 6: FAIL     GPU 7: FAIL
Initializing... |====================| 100.0 %
Running test 489 on            GPU 0 [7] [04:00.0] -   2 tests remaining |====================| 99.6 %
Done    test 491 on            GPU 0 [7] [04:00.0] -   0 tests remaining |====================| 100.0 %FAIL     GPU 7: FAIL

Error Code = 000000000000 (ok)

 #######     ####     ######    ######
 ########   ######   ########  ########
 ##    ##  ##    ##  ##     #  ##     #
 ##    ##  ##    ##   ###       ###
 ########  ########    ####      ####
 #######   ########      ###       ###
 ##        ##    ##  #     ##  #     ##
 ##        ##    ##  ########  ########
 ##        ##    ##   ######    ######

MODS end  : Mon Oct  2 23:25:10 2023  [29921.052 seconds (08:18:41.052 h:m:s)]
MODS end  : Mon Oct  2 23:25:10 2023  [29921.081 seconds (08:18:41.081 h:m:s)]
GPU 0: PENDING  GPU 1: PENDING  GPU 2: FAIL     GPU 3: FAIL     GPU 4: FAIL     GPU 5: FAIL     GPU 6: FAIL     GPU 7: FAIL
ls: cannot access '__fieldiag2/*.log': No such file or directory
ls: cannot access '__fieldiag3/*.log': No such file or directory
ls: cannot access '__fieldiag4/*.log': No such file or directory
ls: cannot access '__fieldiag5/*.log': No such file or directory
ls: cannot access '__fieldiag6/*.log': No such file or directory
ls: cannot access '__fieldiag7/*.log': No such file or directory
----------------------
Fieldiag Testing Completed
GPU 0: PASS     GPU 1: PASS     GPU 2: FAIL     GPU 3: FAIL     GPU 4: FAIL     GPU 5: FAIL     GPU 6: FAIL     GPU 7: FAIL

Results Summary
GPU ID    |       GPU SN#      |   STATUS
===============================================
GPU0      |   1650723008686    |    PASS
GPU1      |   1650723001355    |    PASS

 #######     ####     ######    ######
 ########   ######   ########  ########
 ##    ##  ##    ##  ##     #  ##     #
 ##    ##  ##    ##   ###       ###
 ########  ########    ####      ####
 #######   ########      ###       ###
 ##        ##    ##  #     ##  #     ##
 ##        ##    ##  ########  ########
 ##        ##    ##   ######    ######

Done
Failed to send reload request: No such file or directory
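
When reviewing a run like the one above, it can help to reduce a status line to pass/fail counts and a list of failing GPU indices. The following is a minimal sketch, not part of fieldiag. It assumes the "GPU N: STATUS" layout seen in the console output above, and the function names `summarize_gpu_status` and `failed_gpus` are hypothetical helpers introduced for illustration.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, and the "GPU N: STATUS"
# layout is assumed from the console output shown in this page.

summarize_gpu_status() {
    # $1: one status line, e.g. "GPU 0: PASS     GPU 1: PASS     GPU 2: FAIL"
    # grep -o prints each match on its own line, so wc -l counts matches.
    gs_pass=$(printf '%s\n' "$1" | grep -oE 'GPU [0-9]+: PASS' | wc -l | tr -d ' ')
    gs_fail=$(printf '%s\n' "$1" | grep -oE 'GPU [0-9]+: FAIL' | wc -l | tr -d ' ')
    echo "pass=$gs_pass fail=$gs_fail"
}

failed_gpus() {
    # $1: one status line. Prints the indices of GPUs marked FAIL,
    # space-separated on a single line (empty line if none failed).
    printf '%s\n' "$1" | grep -oE 'GPU [0-9]+: FAIL' |
        awk '{ sub(":", "", $2); printf "%s%s", sep, $2; sep = " " } END { print "" }'
}

# Final status line copied from the run above.
line='GPU 0: PASS     GPU 1: PASS     GPU 2: FAIL     GPU 3: FAIL     GPU 4: FAIL     GPU 5: FAIL     GPU 6: FAIL     GPU 7: FAIL'
summarize_gpu_status "$line"   # prints: pass=2 fail=6
failed_gpus "$line"            # prints: 2 3 4 5 6 7
```

The status line is passed in as a string because the source only shows terminal output. Reading it from a saved console capture would work the same way, if such a capture exists.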

Test methods requested by Supermicro in the CRM ticket to focus on the NVSwitch issue.

Run the following command, and change the test speed value in the connectivity section of the JSON spec file to Gen 3 (8000).

./fieldiag.sh --no_bmc --sit # or --level1 or --level2

spec_hopper-hgx-8-gpu_sit_field.json
spec_hopper-hgx-8-gpu_level1_field.json
spec_hopper-hgx-8-gpu_level2_field.json
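
Because the speed change is a manual edit to a spec file, keeping an untouched copy makes it easy to revert. The sketch below is an illustration under stated assumptions: the mapping from test mode to file name is inferred from the file names listed above, the `.orig` suffix is an arbitrary choice, and `spec_for_mode` and `backup_spec` are hypothetical helper names. The directory holding the spec files is not given in this page, so the caller supplies the path.

```shell
#!/bin/sh
# Sketch only: helper names and the .orig suffix are hypothetical.

spec_for_mode() {
    # $1: sit, level1, or level2 (assumed to match the fieldiag.sh flag names)
    case "$1" in
        sit|level1|level2) echo "spec_hopper-hgx-8-gpu_${1}_field.json" ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

backup_spec() {
    # $1: path to a spec file. Copies it to "$1.orig" unless a backup
    # already exists, so the pristine copy is never overwritten.
    [ -e "$1.orig" ] || cp -p "$1" "$1.orig"
}

spec_for_mode level1   # prints: spec_hopper-hgx-8-gpu_level1_field.json
```

After editing, `diff spec.json.orig spec.json` (with the real file name substituted) shows exactly what was changed.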

Section of the spec file where the speed value is changed.

...

Example of a Field Test run showing the log output location.

...
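
To find the logs of the most recent run without scrolling back through console output, the timestamped directory name can be sorted as plain text. This is a minimal sketch that assumes logs land under `dgx/logs-YYYYMMDD-HHMMSS` inside the unpacked diag directory, as in the GPU Field Diagnostic run earlier on this page. Other test modes may use a different layout. `latest_log_dir` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: assumes the dgx/logs-YYYYMMDD-HHMMSS layout seen earlier.

latest_log_dir() {
    # $1: unpacked diag directory, e.g. /var/diags/629-24287-XXXX-FLD-38780
    # The timestamp format sorts chronologically as plain text, so the
    # last entry after sort is the newest run. Prints nothing if no
    # matching directory exists.
    for d in "$1"/dgx/logs-*/; do
        [ -d "$d" ] && printf '%s\n' "${d%/}"
    done | sort | tail -n 1
}

# Usage: latest_log_dir /var/diags/629-24287-XXXX-FLD-38780
```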