HGX-2 Field Diagnostics Tool

Document Scope

A support/RMA inquiry for HGX-2 systems with SXM4 GPUs requires all of the following:

  1. Run "lspci | grep -i nvidia"
  2. Run "nvidia-smi"
  3. Run "nvidia-bug-report.sh" to generate nvidia-bug-report.log.gz
  4. IPMI page of the GPU
  5. Run the field diagnostics ("./fieldiag --no_bmc --skip_os_check"; IOMMU must also be disabled in the BIOS)
    • See HGX-2 Fielddiag tool below
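Items 1 to 3 above can be collected in one pass. The following is a minimal sketch, not part of the HGX-2 tooling: the OUT_DIR variable, the run_and_save helper, and the output file names are hypothetical choices made for illustration. Items 4 and 5 are not covered and still have to be gathered separately.

```shell
#!/bin/sh
# Sketch: gather the command output requested for a support/RMA inquiry.
# OUT_DIR and the .txt file names are hypothetical, chosen for this example.
OUT_DIR="${OUT_DIR:-./hgx2-support-$(date +%Y%m%d-%H%M%S)}"
mkdir -p "$OUT_DIR"

# run_and_save NAME CMD [ARGS...]
# Saves stdout and stderr of CMD to $OUT_DIR/NAME.txt. If CMD is missing,
# the file records that instead, so the gap is visible in the bundle.
run_and_save() {
    name="$1"; shift
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$OUT_DIR/$name.txt" 2>&1 || echo "exit status: $?" >>"$OUT_DIR/$name.txt"
    else
        echo "command not found: $1" >"$OUT_DIR/$name.txt"
    fi
}

# Item 1: PCI devices, filtered for NVIDIA entries.
run_and_save lspci_nvidia sh -c 'lspci | grep -i nvidia'

# Item 2: driver-level GPU summary.
run_and_save nvidia-smi nvidia-smi

# Item 3: the bug report script writes nvidia-bug-report.log.gz into the
# current directory and usually needs root; only its console output is saved here.
run_and_save nvidia-bug-report nvidia-bug-report.sh

echo "Collected output in $OUT_DIR"
```

Attach the contents of the output directory together with nvidia-bug-report.log.gz from the directory where the script was run.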

If the GPU is recognized and only reports ECC errors, run 'nvidia-smi -i <GPU#> -q' for that GPU.
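As a small sketch of that query, the hypothetical helper below (ecc_query is not an NVIDIA command) validates the GPU index, runs the full query named above, and then repeats it with the "-d ECC" display filter of nvidia-smi so that the ECC counters are easier to find.

```shell
#!/bin/sh
# ecc_query GPU_INDEX (hypothetical helper name)
# Prints the full query output for one GPU, followed by the ECC section only.
ecc_query() {
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: ecc_query <GPU index>" >&2
            return 2
            ;;
    esac
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found in PATH" >&2
        return 127
    fi
    nvidia-smi -i "$1" -q          # full query, as requested for the inquiry
    nvidia-smi -i "$1" -q -d ECC   # ECC counters only
}
```

For example, 'ecc_query 3 > gpu3-ecc.txt' would save the output for GPU 3 to a file that can be attached to the inquiry.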

Instructions for obtaining and running the HGX-2 platform field diagnostics tool follow.

Where to obtain HGX-2 Fielddiag tool

Latest:

https://box.supermicro.com/index.php/s/4YwW7MmNb5CXakW

Password: SImN4h51

How to load the tool onto a USB drive

  1. Format the target drive with GPT (GUID partition table) and add a FAT32
    partition. For USB drives, this can be done with
    [Rufus](http://rufus.akeo.ie).
    • When formatting a drive with Rufus, select `GPT partition scheme for UEFI
      computer` and leave the `Create a bootable disk` option unchecked. Also
      choose FAT32 file system type.
  2. Unpack the contents of the chosen profile package (e.g. full.zip) directly
    to the root directory of the formatted drive.
    • For example, right click on the zip file in Explorer, choose
      "Extract all..." and enter drive letter (e.g. e:) as the destination.
  3. When booting NVIDIA TinyLinux, make sure the "Secure boot" option is
    disabled in UEFI setup.
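The steps above use Rufus on Windows. If only a Linux workstation is available, roughly the same layout (GPT, one FAT32 partition, package unpacked to the root) can be produced with parted, mkfs.vfat, and unzip. This is a sketch under that assumption, not a procedure from the tool's documentation. The helper name usb_prep_plan is hypothetical, and it only prints the commands so they can be reviewed first, because running them erases the target drive. It also assumes sdX-style device names, where the first partition is the device name followed by "1".

```shell
#!/bin/sh
# usb_prep_plan DEVICE ZIPFILE [MOUNTPOINT] (hypothetical helper name)
# Prints, but does not run, the commands that would erase DEVICE, create a GPT
# label with one FAT32 partition, and unpack ZIPFILE to the partition root.
usb_prep_plan() {
    dev="$1"; zip="$2"; mnt="${3:-/mnt/usb}"
    [ -n "$dev" ] && [ -n "$zip" ] || {
        echo "usage: usb_prep_plan <device> <zipfile> [mountpoint]" >&2
        return 2
    }
    cat <<EOF
parted -s $dev mklabel gpt
parted -s $dev mkpart primary fat32 1MiB 100%
mkfs.vfat -F 32 ${dev}1
mkdir -p $mnt
mount ${dev}1 $mnt
unzip $zip -d $mnt
umount $mnt
EOF
}
```

For example, 'usb_prep_plan /dev/sdX full.zip' prints the plan for a drive at /dev/sdX. Confirm the device name with lsblk before running any of the printed commands as root.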

Instructions from readme


Installation
-------------

Please follow these steps:
1) cd /var/diags/ (If the current directory is not /var/diags/)
2) tar xfz 629-FLDXX-YYYY-ZZZ.tgz
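The two installation steps can be wrapped with a basic integrity check, so that a truncated download is caught before anything is unpacked. This is only a sketch: install_fieldiag is a hypothetical helper name, and the archive name stays whatever 629-FLDXX-YYYY-ZZZ.tgz package was actually downloaded.

```shell
#!/bin/sh
# install_fieldiag ARCHIVE [DEST] (hypothetical helper name)
# Unpacks ARCHIVE into DEST, which defaults to /var/diags as in the readme.
install_fieldiag() {
    archive="$1"
    dest="${2:-/var/diags}"
    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
    # Listing the archive reads it end to end, so corruption shows up here.
    tar -tzf "$archive" >/dev/null || { echo "archive is corrupt: $archive" >&2; return 1; }
    # Same effect as "cd DEST" followed by "tar xfz ARCHIVE".
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}
```

For example, 'install_fieldiag /root/629-FLDXX-YYYY-ZZZ.tgz' unpacks the package under /var/diags. Using an absolute path for the archive avoids any ambiguity about the working directory.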


Running field diagnostics
-------------------------

Note: Make sure the IOMMU BIOS setting is disabled; otherwise the tool will not
generate proper logs.
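One way to double-check the IOMMU setting from a running Linux system is to look for IOMMU groups in sysfs. This is a heuristic sketch, not a check performed by the diagnostics tool: it assumes /sys/kernel/iommu_groups is populated only when an IOMMU is active, and the helper name iommu_active is hypothetical. The BIOS setup screen remains the authoritative place to verify the setting.

```shell
#!/bin/sh
# iommu_active [GROUPS_DIR] (hypothetical helper name)
# Succeeds if the IOMMU groups directory exists and has at least one entry.
iommu_active() {
    groups_dir="${1:-/sys/kernel/iommu_groups}"
    [ -d "$groups_dir" ] && [ -n "$(ls -A "$groups_dir" 2>/dev/null)" ]
}

if iommu_active; then
    echo "IOMMU appears to be enabled: disable it in the BIOS before running the diagnostics"
else
    echo "No IOMMU groups found: IOMMU appears to be disabled"
fi
```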

Please follow these steps:
1) cd 629-FLDXX-YYYY-ZZZ
2) ./fieldiag.sh <options>
3) PASS/FAIL/RETEST will be displayed when fieldiag.sh finishes execution.
Note: Full diag is expected to complete in ~5.5 hours.
Note: By default, all tests continue running even if some tests fail.
Alternatively, use --fail_on_first_error to fail immediately on the first error.
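Because a full run takes hours, it can help to keep a copy of the console output and pull the final result out of it afterwards. The sketch below assumes the result is printed as the literal word PASS, FAIL, or RETEST, as the steps above describe. The helper names run_fieldiag and last_verdict and the console log file name are hypothetical.

```shell
#!/bin/sh
# last_verdict LOGFILE (hypothetical helper name)
# Prints the last whole-word PASS, FAIL or RETEST found in LOGFILE.
last_verdict() {
    grep -owE 'PASS|FAIL|RETEST' "$1" | tail -n 1
}

# run_fieldiag DIR [OPTIONS...] (hypothetical helper name)
# Runs ./fieldiag.sh from DIR with OPTIONS, mirrors the console output to a
# log file in the starting directory, then reports the last verdict seen.
run_fieldiag() {
    dir="$1"; shift
    log="$PWD/fieldiag-console-$(date +%Y%m%d-%H%M%S).log"
    ( cd "$dir" && ./fieldiag.sh "$@" ) 2>&1 | tee "$log"
    echo "Console log: $log"
    echo "Verdict: $(last_verdict "$log")"
}
```

For example, 'run_fieldiag /var/diags/629-FLDXX-YYYY-ZZZ --fail_on_first_error' runs the diagnostics so that they fail immediately on the first error. The console copy is an addition to, not a replacement for, the logs that the tool itself generates.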
 Example of complete logs