[INTERNAL USE]

...

Tool File Name:
Hopper (H100 GPU): 629-24287-XXXX-FLD-38780.tgz
Ampere (A100 GPU): 629-23587-XX86-FLD-38782.tgz

Download Location (INTERNAL QA SERVER):
Hopper (H100 GPU): scp root@172.25.10.35:/root/HGX_Tool/629-24287-XXXX-FLD-38780.tgz .
Ampere (A100 GPU): scp root@172.25.10.35:/root/HGX_Tool/629-23587-XX86-FLD-38782.tgz .
Unload Nvidia Driver: scp root@172.25.10.35:/root/HGX_Tool/unload_nvidia_driver.sh .
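
The download step above can be sketched as a small shell helper. This is only a sketch under assumptions: the function names tool_archive_for and print_fetch_commands are hypothetical, the archives are assumed to be ordinary gzip-compressed tar files that tar -xzf can unpack, and the commands are printed rather than executed so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: map a GPU family to the tool archive name listed above.
tool_archive_for() {
    case "$1" in
        hopper) echo "629-24287-XXXX-FLD-38780.tgz" ;;
        ampere) echo "629-23587-XX86-FLD-38782.tgz" ;;
        *) echo "unknown GPU family: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the download and extract commands for one GPU family.
# Assumption: the .tgz is a plain gzip-compressed tar archive.
print_fetch_commands() {
    archive=$(tool_archive_for "$1") || return 1
    echo "scp root@172.25.10.35:/root/HGX_Tool/${archive} ."
    echo "tar -xzf ${archive}"
}
```

For example, print_fetch_commands hopper prints the scp line for the H100 archive followed by the matching tar line.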

...

Code Block
root@rdlab:/var/diags/629-24287-XXXX-FLD-38780# ll
total 473124
drwxr-xr-x 2 root root      4096 Oct  4 10:55 ./
drwxr-xr-x 3 exx  exx       4096 Oct  4 10:55 ../
-rwxr-xr-x 1 exx  exx      17895 Sep 11 09:47 fdmain.sh*
-rwxr-xr-x 1 exx  exx      32888 Sep 11 09:47 fieldiag.sh*
-r-xr-xr-x 1 exx  exx   11194648 Sep 11 09:47 nvflash*
-rwxr-xr-x 1 exx  exx  473142559 Sep 11 09:47 onediagfield.r6.252.tgz*
-r-xr-xr-x 1 exx  exx       2906 Sep 11 09:47 README.txt*
-r-xr-xr-x 1 exx  exx       1702 Sep 11 09:47 relnotes.txt*
-r-xr-xr-x 1 exx  exx      18541 Sep 11 09:47 sku_hopper-hgx-8-gpu.json*
-r-xr-xr-x 1 exx  exx      18477 Sep 11 09:47 sku_hopper-hgx-8-gpu_tpol.json*
-rw-rw-r-- 1 exx  exx       3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level1_field.json
-rw-rw-r-- 1 exx  exx       3428 Sep 11 09:47 spec_hopper-hgx-8-gpu_level2_field.json
-rw-rw-r-- 1 exx  exx       2312 Sep 11 09:47 spec_hopper-hgx-8-gpu_sit_field.json
-r-xr-xr-x 1 exx  exx       6832 Sep 11 09:47 testargs_hopper-hgx-8-gpu.json*

Extracted folder (Ampere) content:

Code Block
root@rdlab:/var/diags/629-23587-XX86-FLD-38782# ll
total 243940
drwxr-xr-x 4 root root      4096 Mar  4 22:55 ./
drwxr-xr-x 4 root root      4096 Mar  5 17:03 ../
drwxr-xr-x 8 root root      4096 Mar  4 22:55 dgx/
-rw-r--r-- 1 root root         0 Mar  4 22:55 dgx_log_creation_lock
-rw-r--r-- 1 root root         0 Mar  4 22:55 dgx_unpack_package_lock
-rw-r--r-- 1 root root     26360 Mar  5 00:43 fieldiag.log
-rwxr-xr-x 1 exx  exx      31232 Sep 19 20:53 fieldiag.sh*
-rwxr-xr-x 1 exx  exx  238456629 Sep 19 20:53 hgxfieldiag.r3.102*
drwxr-xr-x 2 root root      4096 Mar  5 00:43 logs/
-r-xr-xr-x 1 exx  exx   11104504 Sep 19 20:53 nvflash*
-rwxr-xr-x 1 exx  exx       2773 Sep 19 20:53 README.txt*
-rwxr-xr-x 1 exx  exx       4497 Sep 19 20:53 relnotes.txt*
-rwxr-xr-x 1 exx  exx       1823 Sep 19 20:53 sku_hgx-a100-8-gpu_40g_aircooled.json*
-rwxr-xr-x 1 exx  exx       1482 Sep 19 20:53 sku_hgx-a100-8-gpu_40g_hybrid.json*
-rwxr-xr-x 1 exx  exx       3787 Sep 19 20:53 sku_hgx-a100-8-gpu_80g_aircooled.json*
-rwxr-xr-x 1 exx  exx       3611 Sep 19 20:53 sku_hgx-a100-8-gpu_80g_hybrid.json*
-rwxr-xr-x 1 exx  exx      25076 Sep 19 20:53 testargs_hgx-a100-8-gpu_2tray.json*
-rwxr-xr-x 1 exx  exx      24734 Sep 19 20:53 testargs_hgx-a100-8-gpu_d00_2tray.json*
-rwxr-xr-x 1 exx  exx      14310 Sep 19 20:53 testargs_hgx-a100-8-gpu_d00.json*
-rwxr-xr-x 1 exx  exx      14652 Sep 19 20:53 testargs_hgx-a100-8-gpu.json*

PROBLEM SITUATION

...

View file: unload_nvidia_driver.sh

If the script does not correct the issue, uninstall the existing NVIDIA driver from the system, since it may be conflicting with this tool.
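
Before uninstalling the driver, it may help to confirm whether any NVIDIA kernel modules are still loaded after the script has run. The sketch below is not taken from unload_nvidia_driver.sh, whose contents are not shown here. It assumes the standard lsmod output format (one header line, module name in the first column) and that driver modules have names beginning with nvidia. The function name loaded_nvidia_modules is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read lsmod output on stdin and print the names of
# loaded modules whose name starts with "nvidia" (assumed naming scheme).
loaded_nvidia_modules() {
    # NR > 1 skips the "Module  Size  Used by" header line.
    awk 'NR > 1 && $1 ~ /^nvidia/ { print $1 }'
}
```

Typical use would be lsmod | loaded_nvidia_modules. Empty output suggests that no NVIDIA modules remain loaded.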

GPU Field Diagnostic test failure example. Only 2 of 8 GPUs were properly identified for testing.
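
A quick cross-check for this kind of failure is to count how many GPUs are visible on the PCI bus and compare that with the expected number. The sketch below is an illustration, not part of the diagnostic tool. It assumes lspci -nn output in which each GPU is reported with class code [0302] (3D controller) and NVIDIA vendor ID 10de. The function names count_nvidia_gpus and check_gpu_count are hypothetical. PCI visibility is not the same as being identified by the Field Diagnostic, so treat the result only as a first hint.

```shell
#!/bin/sh
# Hypothetical helper: read `lspci -nn` output on stdin and count lines that
# carry class code 0302 (3D controller) and NVIDIA vendor ID 10de.
count_nvidia_gpus() {
    # grep -c exits non-zero when the count is 0; "|| true" hides that.
    grep -c '\[0302\].*\[10de:' || true
}

# Compare the number of visible GPUs against the expected count ($1).
check_gpu_count() {
    expected=$1
    found=$(count_nvidia_gpus)
    if [ "$found" -eq "$expected" ]; then
        echo "OK: $found of $expected GPUs visible"
    else
        echo "MISMATCH: $found of $expected GPUs visible"
        return 1
    fi
}
```

On an 8-GPU system this could be run as lspci -nn | check_gpu_count 8.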

...

Example of running the Field Test, showing the log output location.

...

Failure example from ZD-12288: fieldiag.log

View file: S434992x3806626-fieldiag-FAILED-8xA100.log