[INTERNAL USE]
...
Tool File Name:
Hopper (H100 GPU): 629-24287-XXXX-FLD-38780.tgz
Ampere (A100 GPU): 629-23587-XX86-FLD-38782.tgz
Download Location (INTERNAL QA SERVER):
Hopper (H100 GPU): scp root@172.25.10.35:/root/HGX_Tool/629-24287-XXXX-FLD-38780.tgz .
Ampere (A100 GPU): scp root@172.25.10.35:/root/HGX_Tool/629-23587-XX86-FLD-38782.tgz .
Unload Nvidia Driver: scp root@172.25.10.35:/root/HGX_Tool/unload_nvidia_driver.sh .
The tool is expected to be placed in the /var/diags
folder. Create this folder if it does not exist.
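The staging step above can be sketched as a small shell helper. This is a minimal sketch, not part of the tool: the function name stage_fieldiag is hypothetical, and it assumes the downloaded archive unpacks into a folder named after itself, which has not been verified for every release.

```shell
#!/usr/bin/env bash
# Sketch: stage a Field Diagnostic archive under the diags folder.
# stage_fieldiag is a hypothetical helper name, not part of the tool.
stage_fieldiag() {
    local archive="$1"                  # path to the downloaded .tgz
    local diag_root="${2:-/var/diags}"  # folder the tool is expected to live in

    # Refuse to continue if the archive was not downloaded.
    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }

    # Create the diags folder if it does not exist.
    mkdir -p "$diag_root" || return 1

    # Extract the archive into the diags folder and show the result.
    tar -xzf "$archive" -C "$diag_root" || return 1
    ls -l "$diag_root"
}
```

Example call for the Hopper package, run from the directory the archive was copied to: `stage_fieldiag 629-24287-XXXX-FLD-38780.tgz`. The second argument is optional and only needed if the tool must live somewhere other than /var/diags.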
...
    root@rdlab:/var/diags/629-24287-XXXX-FLD-38780# ll
    total 473124
    drwxr-xr-x 2 root root 4096 Oct 4 10:55 ./
    drwxr-xr-x 3 exx exx 4096 Oct 4 10:55 ../
    -rwxr-xr-x 1 exx exx 17895 Sep 11 09:47 fdmain.sh*
    -rwxr-xr-x 1 exx exx 32888 Sep 11 09:47 fieldiag.sh*
    -r-xr-xr-x 1 exx exx 11194648 Sep 11 09:47 nvflash*
    -rwxr-xr-x 1 exx exx 473142559 Sep 11 09:47 onediagfield.r6.252.tgz*
    -r-xr-xr-x 1 exx exx 2906 Sep 11 09:47 README.txt*
    -r-xr-xr-x 1 exx exx 1702 Sep 11 09:47 relnotes.txt*
    -r-xr-xr-x 1 exx exx 18541 Sep 11 09:47 sku_hopper-hgx-8-gpu.json*
    -r-xr-xr-x 1 exx exx 18477 Sep 11 09:47 sku_hopper-hgx-8-gpu_tpol.json*
    -rw-rw-r-- 1 exx exx 3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level1_field.json
    -rw-rw-r-- 1 exx exx 3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level2_field.json
    -rw-rw-r-- 1 exx exx 2312 Sep 11 09:47 spec_hopper-hgx-8-gpu_sit_field.json
    -r-xr-xr-x 1 exx exx 6832 Sep 11 09:47 testargs_hopper-hgx-8-gpu.json*
Extracted folder (Ampere) content:

    root@rdlab:/var/diags/629-23587-XX86-FLD-38782# ll
    total 243940
    drwxr-xr-x 4 root root 4096 Mar 4 22:55 ./
    drwxr-xr-x 4 root root 4096 Mar 5 17:03 ../
    drwxr-xr-x 8 root root 4096 Mar 4 22:55 dgx/
    -rw-r--r-- 1 root root 0 Mar 4 22:55 dgx_log_creation_lock
    -rw-r--r-- 1 root root 0 Mar 4 22:55 dgx_unpack_package_lock
    -rw-r--r-- 1 root root 26360 Mar 5 00:43 fieldiag.log
    -rwxr-xr-x 1 exx exx 31232 Sep 19 20:53 fieldiag.sh*
    -rwxr-xr-x 1 exx exx 238456629 Sep 19 20:53 hgxfieldiag.r3.102*
    drwxr-xr-x 2 root root 4096 Mar 5 00:43 logs/
    -r-xr-xr-x 1 exx exx 11104504 Sep 19 20:53 nvflash*
    -rwxr-xr-x 1 exx exx 2773 Sep 19 20:53 README.txt*
    -rwxr-xr-x 1 exx exx 4497 Sep 19 20:53 relnotes.txt*
    -rwxr-xr-x 1 exx exx 1823 Sep 19 20:53 sku_hgx-a100-8-gpu_40g_aircooled.json*
    -rwxr-xr-x 1 exx exx 1482 Sep 19 20:53 sku_hgx-a100-8-gpu_40g_hybrid.json*
    -rwxr-xr-x 1 exx exx 3787 Sep 19 20:53 sku_hgx-a100-8-gpu_80g_aircooled.json*
    -rwxr-xr-x 1 exx exx 3611 Sep 19 20:53 sku_hgx-a100-8-gpu_80g_hybrid.json*
    -rwxr-xr-x 1 exx exx 1876 Sep 19 20:53 testargs_hgx-a100-8-gpu_2tray.json*
    -rwxr-xr-x 1 exx exx 1798 Sep 19 20:53 testargs_hgx-a100-8-gpu_d00_2tray.json*
    -rwxr-xr-x 1 exx exx 14310 Sep 19 20:53 testargs_hgx-a100-8-gpu_d00.json*
    -rwxr-xr-x 1 exx exx 8618 Sep 19 20:53 testargs_hgx-a100-8-gpu.json*
PROBLEM SITUATION
...
...
GPU Field Diagnostic test failure example: only 2 of the 8 GPUs were properly identified for testing.
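Before rerunning the diagnostic after a failure like this, it can help to confirm how many GPUs the host itself enumerates on the PCI bus. The following is a minimal sketch, not part of the Field Diagnostic tool: count_nvidia_gpus and check_gpu_count are hypothetical helper names, and the sketch assumes the GPUs appear in `lspci -nn` output as class 0302 (3D controller) devices with the NVIDIA vendor ID 10de.

```shell
#!/usr/bin/env bash
# Sketch: count NVIDIA GPUs in `lspci -nn` output read from stdin.
# Both helpers are hypothetical and not part of the Field Diagnostic tool.
count_nvidia_gpus() {
    # Assumption: GPUs are listed as PCI class 0302 (3D controller)
    # with the NVIDIA vendor ID 10de; other NVIDIA devices are skipped.
    grep -E '\[0302\]' | grep -Eci '\[10de:[0-9a-f]{4}\]'
}

# check_gpu_count EXPECTED: compare EXPECTED with the count found on stdin.
check_gpu_count() {
    local expected="$1" found
    found=$(count_nvidia_gpus)
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found of $expected GPUs visible on the PCI bus"
    else
        echo "FAIL: $found of $expected GPUs visible on the PCI bus" >&2
        return 1
    fi
}
```

On an 8-GPU system this would be run as `lspci -nn | check_gpu_count 8`. A lower count here suggests the problem is in how the host enumerates the GPUs, not in the diagnostic package itself.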
...
Example of running the Field Diagnostic test, showing the log output location.
...
Failure example from ZD-12288: fieldiag.log