How to use the GPU validation test provided by Exxact. Tested with NVIDIA cards.

Pre-requisites:

  • NVIDIA drivers installed (run the 'nvidia-smi' command and confirm that it lists the NVIDIA hardware devices)
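A minimal sketch of that driver check, assuming a POSIX shell. `nvidia-smi -L` prints one line per detected GPU; the helper name `count_gpus` is hypothetical and simply counts those lines.

```shell
#!/bin/sh
# count_gpus (hypothetical helper): reads `nvidia-smi -L` style output on
# stdin and prints how many lines describe a GPU ("GPU <index>: ...").
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:' || true
}

if command -v nvidia-smi >/dev/null 2>&1; then
    echo "GPUs detected: $(nvidia-smi -L | count_gpus)"
else
    echo "nvidia-smi not found; install the NVIDIA driver first"
fi
```

A count of 0 on a machine that has NVIDIA cards installed suggests the driver is not loaded correctly.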

Instructions

  1. Download/unpack files into root directory

    Code Block
    wget https://exxact-support.s3.us-west-1.amazonaws.com/Test+Folder/Stand_Alone_Validation_v4.2.1.tar.gz --no-check-certificate
    tar -xvzf Stand_Alone_Validation_v4.2.1.tar.gz

  2. Change directory to unpacked folder

    Code Block
    cd Stand_Alone_Validation

  3. Set the number of GPUs and test cycles by editing the 'run_test.x' file

    Code Block
    nano run_test.x
    
    #How many GPUs in node
    gpu_count=4
    
    #How many tests to run of each type
    #Large test requires 5GB memory
    #Xlarge test requires 11GB memory
    small_test_count=20
    large_test_count=10
    xlarge_test_count=5

    Save changes using 'ctrl+x' and answering 'y' to the prompt. I typically like to set 5/5/2 tests; the default cycle counts are meant for overnight/long-duration testing.

    Info

    Test duration varies depending on the GPUs being used. If you are using a smaller GPU specifically for display, remove that GPU and run the test from a terminal-only session or over SSH.


  4. Run the test in the background (run as root)

    Code Block
    nohup ./run_test.x &


  5. Monitor GPU temperatures by opening another terminal and running 'nvidia-smi -l'. Once the 'standalone-test.bin' process no longer appears in the 'nvidia-smi' output, check the logs to confirm that the configured number of cycles completed.

    Code Block
    exx@ubuntu:~/Stand_Alone_Validation$ nvidia-smi -l
    Tue Jan 15 17:35:14 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 1080    On   | 00000000:05:00.0  On |                  N/A |
    | 78%   86C    P2   149W / 180W |   4767MiB /  8118MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 1080    On   | 00000000:06:00.0 Off |                  N/A |
    | 77%   86C    P2   155W / 180W |   4569MiB /  8119MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  GeForce GTX 1080    On   | 00000000:09:00.0 Off |                  N/A |
    | 72%   86C    P2   124W / 180W |   4569MiB /  8119MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  GeForce GTX 1080    On   | 00000000:0A:00.0 Off |                  N/A |
    | 59%   83C    P2   134W / 180W |   4569MiB /  8119MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0      1910      G   /usr/lib/xorg/Xorg                           157MiB |
    |    0      2889      G   compiz                                        40MiB |
    |    0      5848      C   ../standalone-test.bin                      4557MiB |
    |    1      5849      C   ../standalone-test.bin                      4557MiB |
    |    2      5850      C   ../standalone-test.bin                      4557MiB |
    |    3      5851      C   ../standalone-test.bin                      4557MiB |
    +-----------------------------------------------------------------------------+
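Rather than watching the output by hand, the wait can be scripted. This is a sketch under assumptions: it polls `nvidia-smi --query-compute-apps` for the 'standalone-test.bin' process name seen above, and the function names and the 60-second polling interval are hypothetical choices, not part of the validation suite.

```shell
#!/bin/sh
# test_is_running (hypothetical helper): succeeds if the process list on
# stdin mentions the validation binary.
test_is_running() {
    grep -qF 'standalone-test.bin'
}

# wait_for_test (hypothetical helper): polls nvidia-smi until the binary no
# longer appears among the compute processes, then prints a reminder.
wait_for_test() {
    interval=${1:-60}   # seconds between checks; an arbitrary default
    while nvidia-smi --query-compute-apps=process_name --format=csv,noheader \
            | test_is_running; do
        sleep "$interval"
    done
    echo "standalone-test.bin is no longer running; check the logs"
}
```

Calling `wait_for_test` in a second terminal returns once the last test process has exited.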


Code Block
exx@ubuntu:~/Stand_Alone_Validation$ ls
clean.x    GPU_1.log  GPU_3.log  lib      nohup.out     output_files_large  run_test.x           standalone-test_v3.bin
GPU_0.log  GPU_2.log  input      LICENSE  output_files  README              standalone-test.bin  standalone-test_v3_p2p.bin
exx@ubuntu:~/Stand_Alone_Validation$ cat *.log
0.0:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
0.1:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
0.2:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
0.3:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
0.4:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.0:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.1:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.2:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.3:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.4:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.0:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.1:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.2:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.3:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.4:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.0:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.1:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.2:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.3:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.4:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430

In the output above, each line starts with GPU.cycle (for example, 0.0 is GPU 0, cycle 0), followed by Etot, EKtot and EPtot. I have 4 GPUs that have each passed 5 cycles of the small test with matching results.

Run the validation script:

Code Block
./exx-getgpu-validation.sh

The test results will be saved in /tmp/<hostname>_Standard_GPU_validation.txt. View the file and copy the results to the Support Ticket if applicable.


Interpreting Results

Code Block
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-886818ec-0907-a70e-613c-9a34d1a3398f)


Validation Results:
./GPU_0.log        : 20, Etot   =    -58222.0688  EKtot   =     14396.2812  EPtot      =    -72618.3500 -> Passed
1 card(s) valided for Normal Test

./GPU.large_0.log  : 10, Etot   =  -2708653.0371  EKtot   =    662946.8750  EPtot      =  -3371599.9121 -> Passed
1 card(s) valided for Large Test

./GPU.xlarge_0.log :  5, Etot   =  -8862400.5831  EKtot   =   2171066.2500  EPtot      = -11033466.8331 -> Passed
1 card(s) valided for XLarge Test


Performance Results:
Location = .
GPU =      0
Normal=======
High= 506.41
Low = 502.46
Avg = 503.92
Diff=   3.95
Pts =   0.78
 Large=======
High=  24.71
Low =  24.62
Avg =  24.67
Diff=   0.09
Pts =   0.36
XLarge=======
High=  12.32
Low =  12.31
Avg =  12.31
Diff=   0.01
Pts =   0.08
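
A quick way to review a results file like the one above is to count the '-> Passed' markers. This is only a sketch: it assumes the file follows the /tmp/<hostname>_Standard_GPU_validation.txt pattern with <hostname> taken from `uname -n`, and that every validated log line ends in '-> Passed' as in the sample. How a failing line is reported is not shown here, so the script does not try to detect one. The helper name is hypothetical.

```shell
#!/bin/sh
# count_passed (hypothetical helper): prints how many lines in the given
# results file end with the "-> Passed" marker.
count_passed() {
    grep -c -- '-> Passed[[:space:]]*$' "$1" || true
}

# Assumption: <hostname> in the results path is the node name from uname -n.
results="/tmp/$(uname -n)_Standard_GPU_validation.txt"
if [ -r "$results" ]; then
    echo "Passed lines: $(count_passed "$results")"
else
    echo "results file not found: $results"
fi
```

With the sample output above (one GPU, three test sizes), the expected count would be 3.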

Validation Results

This section determines if the GPU is calculating results consistently.

The test logs the output of the normal, large and xlarge tests to GPU_N.log, GPU.large_N.log and GPU.xlarge_N.log, respectively. Every time the calculation is run, the resulting values should be the same. This section of the script confirms that the values are the same for the target GPU.
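
The same consistency check can be approximated by hand. The sketch below assumes each log line starts with a 'GPU.cycle:' prefix followed by the energies, as in lines such as '0.0:  Etot   = ...'; it strips that prefix and checks that only one distinct line remains. The helper name is hypothetical, and this is not the script's own implementation.

```shell
#!/bin/sh
# log_is_consistent (hypothetical helper): succeeds when every line of the
# given log reports the same values once the "GPU.cycle:" prefix is removed.
log_is_consistent() {
    distinct=$(sed 's/^[0-9][0-9]*\.[0-9][0-9]*:[[:space:]]*//' "$1" \
        | sort -u | wc -l | tr -d ' ')
    [ "$distinct" -eq 1 ]
}

for log in GPU_*.log; do
    [ -e "$log" ] || continue   # skip when no logs are present
    if log_is_consistent "$log"; then
        echo "$log: consistent"
    else
        echo "$log: MISMATCH"
    fi
done
```

Run it from the unpacked test folder; any log reported as a mismatch contains at least one cycle whose energies differ from the others.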

Performance Results

This section determines if the GPU is performing consistently.

The test calculates the average ns/day at which the GPU is performing. In molecular dynamics simulations, ns/day is the number of nanoseconds (ns) of simulation time that can be computed in a single day of real-world time. It is a useful metric for estimating how much simulation progress can be achieved within a given timeframe. The higher the better.

  • High - the highest metric observed
  • Low - the lowest metric observed
  • Avg - the average of metrics observed
  • Diff - the difference between the High and Low values
  • Pts - the Diff value as a percentage of the High value (Diff / High × 100)
    • This number should be low and not more than 5%.
    • A high value points to an issue.
      • Check GPU temp and ensure there is sufficient airflow to the GPU. Turn the fans up to full and retest. 
      • Swap GPUs and retest to see if the issue follows the GPU or the PCIe slot.
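
As a worked example of the Diff and Pts arithmetic, the sketch below recomputes both values from a High and Low pair using awk. The helper name `pts` is hypothetical; the formula (Diff = High - Low, Pts = Diff / High × 100) is inferred from the sample figures, which it reproduces.

```shell
#!/bin/sh
# pts (hypothetical helper): prints Diff and Pts for a High and Low value,
# where Diff = High - Low and Pts = Diff / High * 100.
pts() {
    awk -v high="$1" -v low="$2" 'BEGIN {
        diff = high - low
        printf "Diff=%.2f Pts=%.2f\n", diff, diff / high * 100
    }'
}

pts 506.41 502.46   # High and Low from the Normal test in the sample output
```

For the Normal test values this gives a Pts of 0.78, well under the 5% guideline.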




About Exxact's Standalone Validation Suite

Exxact's Standalone Validation Suite is a proprietary test adapted from the GPU engine within the AMBER Molecular Dynamics Software Suite. Developed by Ross Walker, the principal developer of the AMBER GPU software, the test works by repeatedly running all-atom molecular dynamics (MD) simulations of varying size. There are 3 test sizes, designed to stress both the GPU itself and the GPU memory. For each test size, a simulation is run that consists of millions of MD steps, each comprising a large combination of single- and double-precision floating-point calculations as well as fixed-precision integer arithmetic. The calculation includes pairwise electrostatic and van der Waals interactions, Fourier transforms, inverse R squared calculations, pair list sorts and integration. This computation pattern uses all parts of the GPU and also stresses the GPU memory.

At the end of a fixed number of steps for each run, which averages between 15 and 30 minutes, the final coordinates, energies and velocities of the atoms are recorded. The calculation is then repeated from the same input parameters, and again after a fixed number of steps the final coordinates, energies and velocities of the atoms are recorded. The AMBER GPU engine is designed to be bitwise reproducible, which means that a simulation started from identical conditions should give identical results. Any variation in the final results is thus an indication of either a bad GPU or bad GPU memory. The test is run for a total of 24 hours and is very effective at identifying faulty GPUs. So effective, in fact, that it is credited with identifying design flaws and insufficient frequency margins on 5 different NVIDIA GPU models, and NVIDIA now includes a variation of this code as part of their chip design testing process.

In addition to checking that all GPUs give consistent results, the performance of each GPU is tested using the same code. Performance between repeat runs and between GPUs is compared and determined to be within acceptable tolerances before a system is shipped. This approach effectively identifies both faulty GPUs, for example those with faulty power and temperature regulators, and any GPUs that might have insufficient cooling due to airflow restrictions, fan issues, etc.

