[INTERNAL USE]
...
Tool File Name:
Hopper (H100 GPU): 629-24287-XXXX-FLD-38780.tgz
Ampere (A100 GPU): 629-23587-XX86-FLD-38782.tgz
Download Location (INTERNAL QA SERVER):
Hopper (H100 GPU): scp root@172.25.10.35:/root/HGX_Tool/629-24287-XXXX-FLD-38780.tgz .
Ampere (A100 GPU): scp root@172.25.10.35:/root/HGX_Tool/629-23587-XX86-FLD-38782.tgz .
Unload Nvidia Driver: scp root@172.25.10.35:/root/HGX_Tool/unload_nvidia_driver.sh .
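After the download, the archive has to be unpacked before the tool can be run. The following is a minimal sketch of that step, not part of the tool itself: `extract_tool` is a hypothetical helper name, the default destination `/var/diags` is taken from the directory listings on this page, and it assumes the archive carries its own top-level folder (check with `tar -tzf` first if unsure).

```shell
#!/usr/bin/env bash
# Hypothetical helper: unpack a field-diag tarball into a working directory.
# Assumption: the destination defaults to /var/diags, the path seen in the
# listings on this page. Pass a second argument to use another location.
extract_tool() {
    local tarball="$1"
    local dest="${2:-/var/diags}"

    # Refuse to continue if the download did not complete.
    if [ ! -f "$tarball" ]; then
        echo "missing tarball: $tarball" >&2
        return 1
    fi

    # Create the destination if needed, then extract the gzip-compressed tar.
    mkdir -p "$dest" || return 1
    tar -xzf "$tarball" -C "$dest"
}
```

Example use for the Hopper package: `extract_tool 629-24287-XXXX-FLD-38780.tgz`.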
...
Code Block |
---|
root@rdlab:/var/diags/629-24287-XXXX-FLD-38780# ll total 473124 drwxr-xr-x 2 root root 4096 Oct 4 10:55 ./ drwxr-xr-x 3 exx exx 4096 Oct 4 10:55 ../ -rwxr-xr-x 1 exx exx 17895 Sep 11 09:47 fdmain.sh* -rwxr-xr-x 1 exx exx 32888 Sep 11 09:47 fieldiag.sh* -r-xr-xr-x 1 exx exx 11194648 Sep 11 09:47 nvflash* -rwxr-xr-x 1 exx exx 473142559 Sep 11 09:47 onediagfield.r6.252.tgz* -r-xr-xr-x 1 exx exx 2906 Sep 11 09:47 README.txt* -r-xr-xr-x 1 exx exx 1702 Sep 11 09:47 relnotes.txt* -r-xr-xr-x 1 exx exx 18541 Sep 11 09:47 sku_hopper-hgx-8-gpu.json* -r-xr-xr-x 1 exx exx 18477 Sep 11 09:47 sku_hopper-hgx-8-gpu_tpol.json* -rw-rw-r-- 1 exx exx 3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level1_field.json -rw-rw-r-- 1 exx exx 3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level2_field.json -rw-rw-r-- 1 exx exx 2312 Sep 11 09:47 spec_hopper-hgx-8-gpu_sit_field.json -r-xr-xr-x 1 exx exx 6832 Sep 11 09:47 testargs_hopper-hgx-8-gpu.json* |
Extracted folder (Ampere) content:
Code Block |
---|
root@rdlab:/var/diags/629-23587-XX86-FLD-38782# ll total 243940 drwxr-xr-x 4 root root 4096 Mar 4 22:55 ./ drwxr-xr-x 4 root root 4096 Mar 5 17:03 ../ drwxr-xr-x 8 root root 4096 Mar 4 22:55 dgx/ -rw-r--r-- 1 root root 0 Mar 4 22:55 dgx_log_creation_lock -rw-r--r-- 1 root root 0 Mar 4 22:55 dgx_unpack_package_lock -rw-r--r-- 1 root root 26360 Mar 5 00:43 fieldiag.log -rwxr-xr-x 1 exx exx 31232 Sep 19 20:53 fieldiag.sh* -rwxr-xr-x 1 exx exx 238456629 Sep 19 20:53 hgxfieldiag.r3.102* drwxr-xr-x 2 root root 4096 Mar 5 00:43 logs/ -r-xr-xr-x 1 exx exx 11104504 Sep 19 20:53 nvflash* -rwxr-xr-x 1 exx exx 2773 Sep 19 20:53 README.txt* -rwxr-xr-x 1 exx exx 4497 Sep 19 20:53 relnotes.txt* -rwxr-xr-x 1 exx exx 1823 Sep 19 20:53 sku_hgx-a100-8-gpu_40g_aircooled.json* -rwxr-xr-x 1 exx exx 1482 Sep 19 20:53 sku_hgx-a100-8-gpu_40g_hybrid.json* -rwxr-xr-x 1 exx exx 3787 Sep 19 20:53 sku_hgx-a100-8-gpu_80g_aircooled.json* -rwxr-xr-x 1 exx exx 3611 Sep 19 20:53 sku_hgx-a100-8-gpu_80g_hybrid.json* -rwxr-xr-x 1 exx exx 1876 Sep 19 20:53 testargs_hgx-a100-8-gpu_2tray.json* -rwxr-xr-x 1 exx exx 1798 Sep 19 20:53 testargs_hgx-a100-8-gpu_d00_2tray.json* -rwxr-xr-x 1 exx exx 1048 Sep 19 20:53 testargs_hgx-a100-8-gpu_d00.json* -rwxr-xr-x 1 exx exx 8618 Sep 19 20:53 testargs_hgx-a100-8-gpu.json* |
PROBLEM SITUATION
...
If the script does not correct the issue, uninstall the existing Nvidia driver from the system, since it may be conflicting with this tool.
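One way to confirm whether an Nvidia driver is still loaded is to look for its kernel modules in the `lsmod` output. The sketch below is an illustration only: `loaded_nvidia_modules` is a hypothetical helper name, and it assumes the driver modules use the usual `nvidia` name prefix (for example `nvidia`, `nvidia_uvm`, `nvidia_drm`, `nvidia_modeset`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `lsmod` output on stdin and print the name of
# every loaded module whose name starts with "nvidia".
# Assumption: the driver modules follow the usual nvidia* naming.
loaded_nvidia_modules() {
    # NR > 1 skips the "Module  Size  Used by" header line.
    awk 'NR > 1 && $1 ~ /^nvidia/ { print $1 }'
}
```

Run it as `lsmod | loaded_nvidia_modules`. Empty output suggests that no Nvidia kernel module is loaded; any names printed are candidates for unloading before the tool is run again.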
GPU Field Diagnostic test failure example. Only 2 of 8 GPUs were properly identified for testing.
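To tell a tool-side detection problem apart from GPUs that are missing on the PCIe bus, the number of GPUs visible to the operating system can be compared with the expected count. This is a rough sketch under stated assumptions, not a step from the tool's documentation: the helper names are hypothetical, and it assumes the GPUs show up in `lspci` as NVIDIA "3D controller" devices, so other NVIDIA devices such as bridges are not counted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `lspci` output on stdin and count the lines
# that describe an NVIDIA 3D controller.
# Assumption: data-center GPUs are reported with the "3D controller" class.
count_nvidia_gpus() {
    grep -i 'nvidia' | grep -ci '3d controller'
}

# Hypothetical helper: compare the visible GPU count with the expected one.
# The default of 8 matches the 8-GPU systems described on this page.
check_gpu_count() {
    local expected="${1:-8}"
    local found
    # grep -c exits non-zero when it counts 0 matches, so ignore the status.
    found=$(count_nvidia_gpus) || true
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found of $expected GPUs visible on PCIe"
    else
        echo "MISMATCH: $found of $expected GPUs visible on PCIe"
        return 1
    fi
}
```

Run it as `lspci | check_gpu_count 8`. If all GPUs are visible here but the diagnostic still identifies fewer, the cause is more likely on the software side, such as the driver conflict described above.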
...
Example of running Field Test showing logs output location.
...
Failure example from ZD-12288: fieldiag.log