[INTERNAL USE]
...
Code Block |
---|
root@rdlab:/home/exx/smc_fieldiag/629-24287-XXXX-FLD-38780# ./fieldiag.sh
Unpacking onediag...
Could not determine HGX baseboard SKU |
GPU Field Diagnostic test failure example. Only 2 of the 8 GPUs were properly identified for testing: GPU 0 and GPU 1 passed, while GPUs 2 through 7 reported FAIL and produced no log files.
Code Block |
---|
root@rdlab:/var/diags/629-24287-XXXX-FLD-38780# ./fieldiag.sh --gpufielddiag
/var/diags/629-24287-XXXX-FLD-38780
Warning: Stopping systemd-udevd.service, but it can still be activated by:
systemd-udevd-kernel.socket
systemd-udevd-control.socket
************************************************************
* *
* GPU FIELD DIAGNOSTIC *
* *
************************************************************
Version 629-24287-XXXX-FLD-38780
Logs /var/diags/629-24287-XXXX-FLD-38780/dgx/logs-20231002-150508/gpu_fd_logs
Running fieldiag...
GPU Devices Under Test: 0:04:00.0 0:23:00.0 0:43:00.0 0:64:00.0 0:84:00.0 0:a3:00.0 0:c3:00.0 0:e4:00.0
Running Parallel Tests
MODS start: Mon Oct 2 15:06:29 2023
GPU 0: RUNNING GPU 1: RUNNING GPU 2: RUNNING GPU 3: RUNNING GPU 4: RUNNING GPU 5: RUNNING GPU 6: RUNNING GPU 7: RUNNING
GPU 0: RUNNING GPU 1: RUNNING GPU 2: FAIL GPU 3: FAIL GPU 4: FAIL GPU 5: FAIL GPU 6: FAIL GPU 7: FAIL
Initializing... |====================| 100.0 %
Running test 489 on GPU 0 [7] [04:00.0] - 2 tests remaining |====================| 99.6 %
Done test 491 on GPU 0 [7] [04:00.0] - 0 tests remaining |====================| 100.0 %FAIL GPU 7: FAIL
Error Code = 000000000000 (ok)
#######     ####     ######    ######
########   ######   ########  ########
##    ##  ##    ##  ##     #  ##     #
##    ##  ##    ##   ###       ###
########  ########    ####      ####
#######   ########      ###       ###
##        ##    ##  #     ##  #     ##
##        ##    ##  ########  ########
##        ##    ##   ######    ######
MODS end : Mon Oct 2 23:25:10 2023 [29921.052 seconds (08:18:41.052 h:m:s)]
MODS end : Mon Oct 2 23:25:10 2023 [29921.081 seconds (08:18:41.081 h:m:s)]
GPU 0: PENDING GPU 1: PENDING GPU 2: FAIL GPU 3: FAIL GPU 4: FAIL GPU 5: FAIL GPU 6: FAIL GPU 7: FAIL
ls: cannot access '__fieldiag2/*.log': No such file or directory
ls: cannot access '__fieldiag3/*.log': No such file or directory
ls: cannot access '__fieldiag4/*.log': No such file or directory
ls: cannot access '__fieldiag5/*.log': No such file or directory
ls: cannot access '__fieldiag6/*.log': No such file or directory
ls: cannot access '__fieldiag7/*.log': No such file or directory
----------------------
Fieldiag Testing Completed
GPU 0: PASS GPU 1: PASS GPU 2: FAIL GPU 3: FAIL GPU 4: FAIL GPU 5: FAIL GPU 6: FAIL GPU 7: FAIL
Results Summary
GPU ID | GPU SN# | STATUS
===============================================
GPU0 | 1650723008686 | PASS
GPU1 | 1650723001355 | PASS
#######     ####     ######    ######
########   ######   ########  ########
##    ##  ##    ##  ##     #  ##     #
##    ##  ##    ##   ###       ###
########  ########    ####      ####
#######   ########      ###       ###
##        ##    ##  #     ##  #     ##
##        ##    ##  ########  ########
##        ##    ##   ######    ######
Done
Failed to send reload request: No such file or directory |
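The console output above ends with a PASS banner even though six GPUs failed, so the per-GPU status line and the Results Summary table are the parts worth checking. The following is a minimal sketch of how that check could be scripted, assuming the console output has been captured to text in the format shown above. The function name `summarize` and the embedded sample are illustrative only and are not part of the fieldiag package.

```python
import re

# Matches per-GPU states such as "GPU 2: FAIL" on the status lines.
STATUS_RE = re.compile(r"GPU (\d+): (PASS|FAIL|RUNNING|PENDING)")
# Matches Results Summary rows such as "GPU0 | 1650723008686 | PASS".
SUMMARY_RE = re.compile(r"^GPU(\d+)\s*\|\s*(\S+)\s*\|\s*(PASS|FAIL)\s*$")


def summarize(log_text):
    """Parse captured fieldiag console output (hypothetical helper).

    Returns (status, serials, failed, unlisted):
      status   - last state seen for each GPU index
      serials  - serial number for each GPU that has a Results Summary row
      failed   - GPU indices whose last state is not PASS
      unlisted - GPU indices with a state but no Results Summary row
    """
    status = {}
    serials = {}
    for raw in log_text.splitlines():
        line = raw.strip()
        # Later status lines overwrite earlier ones, so the final line wins.
        for gpu, state in STATUS_RE.findall(line):
            status[int(gpu)] = state
        row = SUMMARY_RE.match(line)
        if row:
            serials[int(row.group(1))] = row.group(2)
    failed = sorted(g for g, s in status.items() if s != "PASS")
    unlisted = sorted(set(status) - set(serials))
    return status, serials, failed, unlisted


# Excerpt copied from the console output above.
SAMPLE = """\
GPU 0: PENDING GPU 1: PENDING GPU 2: FAIL GPU 3: FAIL GPU 4: FAIL GPU 5: FAIL GPU 6: FAIL GPU 7: FAIL
Fieldiag Testing Completed
GPU 0: PASS GPU 1: PASS GPU 2: FAIL GPU 3: FAIL GPU 4: FAIL GPU 5: FAIL GPU 6: FAIL GPU 7: FAIL
Results Summary
GPU ID | GPU SN# | STATUS
===============================================
GPU0 | 1650723008686 | PASS
GPU1 | 1650723001355 | PASS
"""

if __name__ == "__main__":
    status, serials, failed, unlisted = summarize(SAMPLE)
    print("Failed GPUs:", failed)
    print("GPUs missing from Results Summary:", unlisted)
```

A non-empty `failed` or `unlisted` list is a reason to treat the run as failed regardless of the banner.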
Test methods requested by Supermicro in the CRM ticket to focus on the NVSwitch issue.
In the JSON spec file for the chosen test level (the files are listed below the command), change the test speed value in the connectivity section to Gen 3 (8000), then run the following command.
./fieldiag.sh --no_bmc --sit # or --level1 or --level2
spec_hopper-hgx-8-gpu_sit_field.json
spec_hopper-hgx-8-gpu_level1_field.json
spec_hopper-hgx-8-gpu_level2_field.json
JSON section in which to alter the speed value:
...
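Because the speed change is a manual edit to a shipped spec file, it may help to keep an untouched copy before editing. The sketch below pairs each test-level flag with one of the three spec files listed above and copies the selected file before it is modified. The flag-to-file pairing is inferred from the file names and is not stated in the source; `spec_dir`, `spec_file_for`, `backup_spec`, and the `.orig` suffix are hypothetical, and the directory that holds the spec files must be supplied by the user.

```python
import shutil
from pathlib import Path

# Pairing inferred from the file names (assumption, not confirmed by the source).
SPEC_FILES = {
    "--sit": "spec_hopper-hgx-8-gpu_sit_field.json",
    "--level1": "spec_hopper-hgx-8-gpu_level1_field.json",
    "--level2": "spec_hopper-hgx-8-gpu_level2_field.json",
}


def spec_file_for(flag):
    """Return the spec file name assumed to go with a test-level flag."""
    try:
        return SPEC_FILES[flag]
    except KeyError:
        raise ValueError(f"unknown test level flag: {flag!r}") from None


def backup_spec(spec_dir, flag):
    """Copy the spec file for `flag` to '<name>.orig' and return the backup path.

    `spec_dir` is a hypothetical parameter: the directory containing the spec
    files, which the source does not specify.
    """
    source = Path(spec_dir) / spec_file_for(flag)
    backup = source.with_name(source.name + ".orig")
    if not backup.exists():  # never overwrite an earlier pristine copy
        shutil.copy2(source, backup)
    return backup
```

Calling `backup_spec(spec_dir, "--sit")` before editing leaves a `.orig` copy next to the spec file, so the original speed value can be restored after the test.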
Example of a Field Test run showing the log output location.
...