How to use the GPU validation test. Tested with NVIDIA cards.

Pre-requisites:

  • NVIDIA drivers installed (run the 'nvidia-smi' command and confirm that it lists the NVIDIA hardware devices in the system)
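
The pre-requisite check above can be scripted. This is a minimal sketch, not part of the validation package: the function names 'count_gpus' and 'driver_ok' are my own, and it assumes that 'nvidia-smi -L' prints one line per device starting with "GPU <index>:".

```shell
#!/usr/bin/env bash

# Count the devices in `nvidia-smi -L` style output read from stdin.
# Assumption: each device is listed on a line starting with "GPU <index>:".
count_gpus() {
    # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
    grep -c '^GPU [0-9][0-9]*:' || true
}

# Return 0 if nvidia-smi is installed and lists at least one GPU.
driver_ok() {
    command -v nvidia-smi >/dev/null 2>&1 || return 1
    local n
    n=$(nvidia-smi -L 2>/dev/null | count_gpus)
    [ "${n:-0}" -gt 0 ]
}

# Usage on the system under test:
#   driver_ok && echo "driver OK" || echo "driver missing or no GPU detected"
```

If 'driver_ok' fails, fix the driver installation before continuing with the instructions.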

Instructions

  1. Download/unpack files into root directory

    Use these commands if you have a 20XX series/Turing based GPU:

    wget https://s3-us-west-1.amazonaws.com/exxact-support/Test+Folder/Stand_Alone_Validation_v3.1.tar.gz --no-check-certificate
    tar -xvzf Stand_Alone_Validation_v3.1.tar.gz
    


    Use these commands if you have a 30XX series/Ampere based GPU:

    wget https://exxact-disk-images.s3-us-west-1.amazonaws.com/AMBER+Stand+Alone+Test/Stand_Alone_Validation_v4.0.tar.gz --no-check-certificate
    tar -xvzf Stand_Alone_Validation_v4.0.tar.gz
  2. Change directory to unpacked folder

    cd Stand_Alone_Validation
  3. Set the number of GPUs and test cycles desired by editing the 'run_test.x' file

    nano run_test.x
    
    #How many GPUs in node
    gpu_count=4
    
    #How many tests to run of each type
    #Large test requires 5GB memory
    #Xlarge test requires 11GB memory
    small_test_count=20
    large_test_count=10
    xlarge_test_count=5

    Note: Test duration varies with the GPUs being used. If the system has a smaller GPU that is used only for display, remove that GPU and run the test from a text-only terminal or over SSH.

  4. Save the changes by pressing 'ctrl+x' and answering 'y' to the prompt. I typically set 5/5/2 (small/large/xlarge) tests; the default cycle counts are meant for overnight/long-duration testing.

  5. Run the test in the background (as root):

    nohup ./run_test.x &
  6. Monitor GPU temperatures by opening another terminal and running 'nvidia-smi -l'. Once the 'standalone-test.bin' process no longer appears in the 'nvidia-smi' output, check the logs to confirm that the configured number of cycles completed.

    exx@ubuntu:~/Stand_Alone_Validation$ nvidia-smi -l
    Tue Jan 15 17:35:14 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 1080    On   | 00000000:05:00.0  On |                  N/A |
    | 78%   86C    P2   149W / 180W |   4767MiB /  8118MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 1080    On   | 00000000:06:00.0 Off |                  N/A |
    | 77%   86C    P2   155W / 180W |   4569MiB /  8119MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  GeForce GTX 1080    On   | 00000000:09:00.0 Off |                  N/A |
    | 72%   86C    P2   124W / 180W |   4569MiB /  8119MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  GeForce GTX 1080    On   | 00000000:0A:00.0 Off |                  N/A |
    | 59%   83C    P2   134W / 180W |   4569MiB /  8119MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0      1910      G   /usr/lib/xorg/Xorg                           157MiB |
    |    0      2889      G   compiz                                        40MiB |
    |    0      5848      C   ../standalone-test.bin                      4557MiB |
    |    1      5849      C   ../standalone-test.bin                      4557MiB |
    |    2      5850      C   ../standalone-test.bin                      4557MiB |
    |    3      5851      C   ../standalone-test.bin                      4557MiB |
    +-----------------------------------------------------------------------------+
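
Steps 3 and 4 can also be done without opening an editor. The sketch below is only an illustration: 'set_test_counts' is a helper name I made up, and it assumes the four variables shown in step 3 are assigned at the start of a line in 'run_test.x'. It keeps a backup copy as 'run_test.x.bak'.

```shell
#!/usr/bin/env bash

# Rewrite the GPU count and per-type test counts in a run_test.x style file.
# Arguments: <file> <gpu_count> <small> <large> <xlarge>
# Assumption: each variable is assigned at the start of its own line.
set_test_counts() {
    local file=$1 gpus=$2 small=$3 large=$4 xlarge=$5
    # -i.bak edits in place and leaves the original as <file>.bak
    sed -i.bak \
        -e "s/^gpu_count=.*/gpu_count=${gpus}/" \
        -e "s/^small_test_count=.*/small_test_count=${small}/" \
        -e "s/^large_test_count=.*/large_test_count=${large}/" \
        -e "s/^xlarge_test_count=.*/xlarge_test_count=${xlarge}/" \
        "$file"
}

# Usage: 4 GPUs with the 5/5/2 setting mentioned in step 4
#   set_test_counts run_test.x 4 5 5 2
```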

I have not yet measured how long a small, large, or xlarge cycle takes. I assume a 5/5/2 run will complete in 6-8 hours.
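
To get a real number instead of my estimate, you could poll 'nvidia-smi' until the test process disappears and print the elapsed time. This is a rough sketch: the helper names are hypothetical, and the time is measured from the moment 'wait_for_test' is started, so start it right after launching the test.

```shell
#!/usr/bin/env bash

# Succeed if the nvidia-smi output on stdin lists a standalone-test.bin process.
test_running() {
    grep -q 'standalone-test\.bin'
}

# Print a number of seconds as "<H>h <MM>m <SS>s".
format_duration() {
    local s=$1
    printf '%dh %02dm %02ds\n' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}

# Poll nvidia-smi every <interval> seconds (default 60) until the test
# process is no longer listed, then print how long we waited.
wait_for_test() {
    local interval=${1:-60} start=$SECONDS
    while nvidia-smi | test_running; do
        sleep "$interval"
    done
    format_duration $((SECONDS - start))
}

# Usage, right after step 5:
#   wait_for_test 60
```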

Checking results

View the output logs in the 'Stand_Alone_Validation' directory and make sure the results match for every cycle. In this example, I ran only 5 small tests on 4 GPUs. The large and xlarge tests write their own files per GPU_x.

Example:

exx@ubuntu:~/Stand_Alone_Validation$ ls
clean.x    GPU_1.log  GPU_3.log  lib      nohup.out     output_files_large  run_test.x           standalone-test_v3.bin
GPU_0.log  GPU_2.log  input      LICENSE  output_files  README              standalone-test.bin  standalone-test_v3_p2p.bin
exx@ubuntu:~/Stand_Alone_Validation$ cat *.log
0.0:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
0.1:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
0.2:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
0.3:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
0.4:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.0:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.1:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.2:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.3:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.4:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.0:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.1:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.2:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.3:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.4:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.0:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.1:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.2:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.3:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.4:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430


As you can see above, each line starts with a label of the form GPU.cycle (for example, 0.0 is GPU 0, cycle 0), followed by the Etot, EKtot, and EPtot values. All 4 GPUs have passed 5 cycles of the small test with matching results.
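
Comparing the lines by eye gets tedious with many cycles, so the check can be sketched with awk. 'check_results' is a name I picked, not something shipped in the archive. It assumes the input holds results from one test type only, in the line format shown above, and treats the first line as the reference.

```shell
#!/usr/bin/env bash

# Read log lines from stdin and verify that every result line carries the
# same Etot/EKtot/EPtot values. Exit status: 0 = all match, 1 = mismatch,
# 2 = no result lines found.
check_results() {
    awk '
        /Etot/ {
            label = $1                    # e.g. "0.3:" (GPU.cycle)
            sub(/^[^:]*:[ \t]*/, "")      # drop the label, keep the energies
            sub(/[ \t]+$/, "")            # drop trailing whitespace
            gsub(/[ \t]+/, " ")           # normalise spacing
            if (n++ == 0) ref = $0        # first line is the reference
            else if ($0 != ref) { print "MISMATCH at " label; bad = 1 }
        }
        END {
            if (n == 0) { print "no results found"; exit 2 }
            if (bad) exit 1
            print "OK: " n " results match"
        }
    '
}

# Usage in the Stand_Alone_Validation directory:
#   cat GPU_*.log | check_results
```

A "MISMATCH at" line names the GPU.cycle label whose values differ from the first line, which tells you which card to look at.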

FAQ

Is there any way to run the test on select cards without going through the trouble of opening the case and pulling out power cables or PCIe cards?


Yes.

This involves setting the environment variable manually and editing the script to comment out 'CUDA_VISIBLE_DEVICES', so that the script does not overwrite the UUID of the single GPU to be tested.
This solution suits a system admin who is comfortable working in the shell or CLI, with an Exxact GPU server or HPC system in a rack or data-center environment.



To run the GPU Stand Alone Validation tests against a single card, we customize the behavior of the script instead of pulling out the cards and rotating them manually.
This requires a manual change to the GPU validation script; I tested it in my lab and it worked as expected.

To run the test against one specific card, you will need to perform the following actions:

  1. Back up the existing "run_test.x" shell script (just to be safe; you can always re-download the entire tgz archive).

  2. Edit "run_test.x" using your favorite text editor (nano, vim, etc.).

  3. Comment out "CUDA_VISIBLE_DEVICES=$j". This appears three (3) times in the run_test script. We remove it here because we will define the variable directly in the bash shell, so the file does not need to be edited for each run.

  4. Run the command "nvidia-smi -L" to get a list of all GPU UUIDs.

  5. Before each run, set CUDA_VISIBLE_DEVICES to the UUID of the card you wish to test.

e.g.

export CUDA_VISIBLE_DEVICES=GPU-99135ce
nohup ./run_test.x &
    ... [Test completes]

export CUDA_VISIBLE_DEVICES=GPU-13599aa
    ... [Repeat as needed to isolate the faulty GPU]
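
To avoid copying UUIDs by hand, the export lines can be generated from the 'nvidia-smi -L' listing. This is a small sketch with made-up helper names; it assumes each line of the listing ends with "(UUID: GPU-...)". It only prints the commands, so you still run each test yourself as described above.

```shell
#!/usr/bin/env bash

# Extract GPU UUIDs from `nvidia-smi -L` output on stdin, one per line.
# Assumption: each device line contains "(UUID: GPU-...)".
list_gpu_uuids() {
    sed -n 's/.*(UUID: \(GPU-[^)]*\)).*/\1/p'
}

# Turn a list of UUIDs on stdin into ready-to-paste export commands.
print_exports() {
    local uuid
    while read -r uuid; do
        printf 'export CUDA_VISIBLE_DEVICES=%s\n' "$uuid"
    done
}

# Usage:
#   nvidia-smi -L | list_gpu_uuids | print_exports
```

Paste one export line, run the test, wait for it to complete, then move on to the next line.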




