HGX-2 Field Diagnostics Tool
Document Scope
Support/RMA inquiries for HGX-2 systems with SXM4 GPUs require all of the following:
- Run “lspci | grep -i nvidia”
- Run “nvidia-smi”
- Run “nvidia-bug-report.sh” to generate nvidia-bug-report.log
- Provide the IPMI page for the GPU
- Run “fieldiag” (./fieldiag --no_bmc --skip_os_check), with IOMMU disabled in BIOS as well
  - See HGX-2 Fielddiag tool below
If the GPU is recognized and only reports ECC errors, also run 'nvidia-smi -i <GPU#> -q' for that GPU.
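The command-line items above can be gathered in one pass. The following is a minimal sketch, not an official collection script: the `collect` helper, the `OUT_DIR` directory, the output file names, and the `GPU_ID` variable are all illustrative assumptions. It does not cover the IPMI page or the fieldiag run.

```shell
#!/bin/sh
# Sketch: save the output of the requested commands into one directory.
# OUT_DIR, GPU_ID and the file names are illustrative, not part of any NVIDIA tool.
OUT_DIR="${OUT_DIR:-./hgx2-support-data}"
mkdir -p "$OUT_DIR"

# collect NAME COMMAND [ARGS...]
# Runs COMMAND and stores stdout+stderr in $OUT_DIR/NAME.txt.
# If COMMAND is not installed, the file records that instead.
collect() {
    name="$1"; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$OUT_DIR/$name.txt" 2>&1 ||
            echo "(exited non-zero: $*)" >>"$OUT_DIR/$name.txt"
    else
        echo "(not found: $1)" >"$OUT_DIR/$name.txt"
    fi
}

collect lspci-nvidia sh -c 'lspci | grep -i nvidia'
collect nvidia-smi nvidia-smi
# The bug report tool writes its own log file to the current directory;
# only its console output is captured here.
collect nvidia-bug-report nvidia-bug-report.sh

# Optional: detailed query for one GPU, e.g. when it only reports ECC errors.
if [ -n "${GPU_ID:-}" ]; then
    collect "nvidia-smi-gpu$GPU_ID" nvidia-smi -i "$GPU_ID" -q
fi

echo "Saved command output to $OUT_DIR"
```

For example, running it with `GPU_ID=3` set in the environment would add the per-GPU query for GPU 3 to the collected files.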
Instructions on how to obtain and run the HGX-2 platform field diagnostics tool are below.
Where to obtain HGX-2 Fielddiag tool
Latest:
https://box.supermicro.com/index.php/s/4YwW7MmNb5CXakW
Password: SImN4h51
How I loaded the tool onto a USB
- Format the target drive with GPT (GUID partition table) and add a FAT32
  partition. For USB drives, this can be done with
  [Rufus](http://rufus.akeo.ie).
  - When formatting a drive with Rufus, select `GPT partition scheme for UEFI
    computer` and leave the `Create a bootable disk` option unchecked. Also
    choose FAT32 file system type.
- Unpack the contents of the chosen profile package (e.g. full.zip) directly
  to the root directory of the formatted drive.
  - For example, right click on the zip file in Explorer, choose
    "Extract all..." and enter drive letter (e.g. e:) as the destination.
- When booting NVIDIA TinyLinux, make sure the "Secure boot" option is
  disabled in UEFI setup.
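If only a Linux workstation is available, the same layout (GPT, one FAT32 partition, package unpacked to the root) could be produced roughly as sketched below. This is an assumption-laden alternative to the Rufus steps, not something taken from the tool's documentation: `DEV`, `MNT` and `ZIP` are hypothetical placeholders, and the partition name `${DEV}1` assumes a `/dev/sdX`-style device. The script only prints the commands so they can be reviewed first, since relabeling erases the whole drive.

```shell
#!/bin/sh
# Sketch only: prints the commands, does not run them.
# DEV, MNT and ZIP are hypothetical placeholders; double-check DEV before
# running anything, because mklabel wipes the drive's partition table.
DEV="${DEV:-/dev/sdX}"
MNT="${MNT:-/mnt/usb}"
ZIP="${ZIP:-full.zip}"

usb_prep_commands() {
    cat <<EOF
parted -s $DEV mklabel gpt
parted -s $DEV mkpart primary fat32 1MiB 100%
mkfs.vfat -F 32 ${DEV}1
mkdir -p $MNT
mount ${DEV}1 $MNT
unzip $ZIP -d $MNT
umount $MNT
EOF
}

usb_prep_commands
```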
Instructions from readme
Installation
-------------
Please follow these steps:
1) cd /var/diags/ (If the current directory is not /var/diags/)
2) tar xfz 629-FLDXX-YYYY-ZZZ.tgz
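The two installation steps can be wrapped in a small helper that checks the archive exists before extracting. This is a minimal sketch; the function name `install_fieldiag` and its optional destination argument are illustrative and not part of the package.

```shell
#!/bin/sh
# install_fieldiag ARCHIVE [DEST]
# Sketch: extract the fieldiag package into DEST (default /var/diags),
# mirroring the readme's "cd" + "tar xfz" steps. The helper name is illustrative.
install_fieldiag() {
    archive="$1"
    dest="${2:-/var/diags}"

    # Make a relative archive path absolute, because we change directory below.
    case "$archive" in
        /*) ;;
        *) archive="$PWD/$archive" ;;
    esac

    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
    mkdir -p "$dest" || return 1

    # Subshell keeps the caller's working directory unchanged.
    ( cd "$dest" && tar xfz "$archive" ) || return 1
    echo "extracted $(basename "$archive" .tgz) into $dest"
}

# Example (placeholder file name from the readme):
#   install_fieldiag 629-FLDXX-YYYY-ZZZ.tgz
```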
Running field diagnostics
-------------------------
Make sure the IOMMU BIOS setting is disabled; otherwise the tool will not generate proper logs.
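As a quick sanity check before starting a long run, the running Linux kernel can be asked whether any IOMMU groups are populated. This is a heuristic sketch, and the BIOS setting remains the authoritative source. It assumes the standard sysfs location `/sys/kernel/iommu_groups`; the `IOMMU_SYSFS` variable and the helper name are illustrative.

```shell
#!/bin/sh
# Sketch: warn if the running kernel still exposes IOMMU groups.
# IOMMU_SYSFS is overridable only so the check can be tried on a test directory.
IOMMU_SYSFS="${IOMMU_SYSFS:-/sys/kernel/iommu_groups}"

# Print the number of entries in the given directory (0 if it does not exist).
iommu_group_count() {
    if [ -d "$1" ]; then
        ls -1 "$1" | wc -l | tr -d ' '
    else
        echo 0
    fi
}

count=$(iommu_group_count "$IOMMU_SYSFS")
if [ "$count" -gt 0 ]; then
    echo "WARNING: $count IOMMU groups present; IOMMU looks enabled. Disable it in BIOS before running fieldiag."
else
    echo "No IOMMU groups found; IOMMU appears to be disabled."
fi
```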
Please follow these steps:
1) cd 629-FLDXX-YYYY-ZZZ
2) ./fieldiag.sh <options>
3) PASS/FAIL/RETEST will be displayed when fieldiag.sh finishes execution.
Note: Full diag is expected to complete in ~5.5 hours.
Note: By default, all tests continue running even if some tests fail.
Alternatively, users can pass --fail_on_first_error to fail immediately.
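Because a full run takes hours, it can help to keep the console output in a file and pull the final result out of it afterwards. The wrapper below is a minimal sketch: `DIAG_DIR`, `LOG`, `run_fieldiag` and `last_verdict` are illustrative names, and scanning the console output for the last PASS/FAIL/RETEST word is a heuristic, not a documented interface of the tool. Only `fieldiag.sh` and the options shown come from the instructions above.

```shell
#!/bin/sh
# Sketch: run fieldiag.sh from its extracted directory and keep a console log.
# DIAG_DIR uses the placeholder directory name from the readme.
DIAG_DIR="${DIAG_DIR:-/var/diags/629-FLDXX-YYYY-ZZZ}"
LOG="${LOG:-fieldiag-console.log}"   # relative paths land inside DIAG_DIR

# Print the last PASS, FAIL or RETEST word found in a saved console log.
last_verdict() {
    grep -oE 'PASS|FAIL|RETEST' "$1" | tail -n 1
}

# run_fieldiag [OPTIONS...]: options are passed through to fieldiag.sh unchanged.
run_fieldiag() {
    (
        cd "$DIAG_DIR" || exit 1
        ./fieldiag.sh "$@" 2>&1 | tee "$LOG"
        echo "Verdict seen in console output: $(last_verdict "$LOG")"
    )
}

# Only start a run when the extracted package is actually present.
if [ -x "$DIAG_DIR/fieldiag.sh" ]; then
    run_fieldiag --no_bmc --skip_os_check
fi
```

To stop at the first failing test instead of running the full suite, the same wrapper could be called as `run_fieldiag --no_bmc --skip_os_check --fail_on_first_error`.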