How to use the GPU validation test provided by Exxact. Tested with NVIDIA cards.

Pre-requisites:

  • NVIDIA drivers installed (run 'nvidia-smi' to confirm that it lists the NVIDIA hardware devices)
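
The driver check can also be scripted. The sketch below is only an illustration: it counts the devices printed by 'nvidia-smi -L' (one "GPU N: ..." line per device), which is also the number to use for 'gpu_count' later on. The helper name 'count_gpus' is made up for this example and is not part of the Exxact package.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count GPUs from `nvidia-smi -L` output read on stdin.
# Each detected device is expected to appear as one line starting with "GPU <index>".
count_gpus() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    grep -c '^GPU [0-9]' || true
}

if command -v nvidia-smi >/dev/null 2>&1; then
    echo "Detected $(nvidia-smi -L | count_gpus) NVIDIA GPU(s)"
else
    echo "nvidia-smi not found; install the NVIDIA driver first" >&2
fi
```

If the reported count is 0 or lower than the number of cards installed, the driver should be fixed before running the validation test.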

Instructions

  1. Download/unpack files into root directory

    wget https://s3-us-west-1.amazonaws.com/exxact-support/Test+Folder/Stand_Alone_Validation_v3.1.tar.gz --no-check-certificate
    tar -xvzf Stand_Alone_Validation_v3.1.tar.gz
  2. Change directory to unpacked folder

    cd Stand_Alone_Validation
  3. Set the number of GPUs and test cycles by editing the 'run_test.x' file

    nano run_test.x
    
    #How many GPUs in node
    gpu_count=4
    
    #How many tests to run of each type
    #Large test requires 5GB memory
    #Xlarge test requires 11GB memory
    small_test_count=20
    large_test_count=10
    xlarge_test_count=5

    Note: Test duration varies with the GPUs being used. If a smaller GPU is installed only to drive a display, remove that GPU and run the test from a text-only terminal or over SSH.

  4. Save the changes by pressing 'ctrl+x' and answering 'y' at the prompt. I typically set 5/5/2 tests (small/large/xlarge); the default cycle counts are meant for overnight or other long-duration testing.

  5. Run the test in the background:

    nohup ./run_test.x &
  6. Monitor GPU temperatures by opening another terminal and running 'nvidia-smi -l'. Once the 'standalone-test.bin' process no longer appears in the 'nvidia-smi' output, check the logs to confirm that the configured number of cycles completed.

    exx@ubuntu:~/Stand_Alone_Validation$ nvidia-smi -l
    Tue Jan 15 17:35:14 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 1080    On   | 00000000:05:00.0  On |                  N/A |
    | 78%   86C    P2   149W / 180W |   4767MiB /  8118MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 1080    On   | 00000000:06:00.0 Off |                  N/A |
    | 77%   86C    P2   155W / 180W |   4569MiB /  8119MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  GeForce GTX 1080    On   | 00000000:09:00.0 Off |                  N/A |
    | 72%   86C    P2   124W / 180W |   4569MiB /  8119MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  GeForce GTX 1080    On   | 00000000:0A:00.0 Off |                  N/A |
    | 59%   83C    P2   134W / 180W |   4569MiB /  8119MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0      1910      G   /usr/lib/xorg/Xorg                           157MiB |
    |    0      2889      G   compiz                                        40MiB |
    |    0      5848      C   ../standalone-test.bin                      4557MiB |
    |    1      5849      C   ../standalone-test.bin                      4557MiB |
    |    2      5850      C   ../standalone-test.bin                      4557MiB |
    |    3      5851      C   ../standalone-test.bin                      4557MiB |
    +-----------------------------------------------------------------------------+
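
When the same settings have to be applied on several nodes, the edit from steps 3 and 4 can be done without opening nano. This is a minimal sketch, assuming the four variables sit at the start of a line in 'run_test.x' exactly as shown above and that GNU sed is available (as on Ubuntu). The helper name 'set_count' is invented for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite one "name=value" line of a file in place.
# Assumes the assignment starts at the beginning of a line (GNU sed -i syntax).
set_count() {
    local file=$1 name=$2 value=$3
    sed -i "s/^${name}=.*/${name}=${value}/" "$file"
}

# Only touch run_test.x when run from inside the unpacked folder.
if [ -f run_test.x ]; then
    cp run_test.x run_test.x.bak          # keep a backup of the original
    set_count run_test.x gpu_count 4
    set_count run_test.x small_test_count 5
    set_count run_test.x large_test_count 5
    set_count run_test.x xlarge_test_count 2
    # Print the resulting settings for a quick visual check
    grep -E '^(gpu|small_test|large_test|xlarge_test)_count=' run_test.x
fi
```

The values shown are the 4-GPU, 5/5/2 settings used in this page; adjust them to match the node being tested.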

I have not yet measured how long each small, large, or xlarge cycle takes. I expect a 5/5/2 run to complete in 6-8 hours.
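
One way to get an actual number is to wrap the launch in a small timing script. The sketch below is an assumption-laden illustration, not part of the Exxact package: it starts 'run_test.x', waits for it to exit, then keeps polling with 'pgrep' until no 'standalone-test' process is left, and prints the elapsed time. Whether 'run_test.x' itself blocks until every test is done has not been verified here, which is why both waits are used. The helper name 'format_duration' is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: format a number of seconds as HH:MM:SS.
format_duration() {
    local s=$1
    printf '%02d:%02d:%02d\n' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}

# Only launch when run from inside the unpacked folder.
if [ -x ./run_test.x ]; then
    start=$(date +%s)
    nohup ./run_test.x > nohup.out 2>&1 &
    wait "$!"                                   # wait for run_test.x itself
    # Then wait until no test binary is left running (assumption: the
    # process name contains "standalone-test", as seen in nvidia-smi).
    while pgrep -f standalone-test >/dev/null; do
        sleep 60
    done
    end=$(date +%s)
    echo "Run took $(format_duration $((end - start)))"
fi
```

Running this under 'nohup' or inside a 'screen'/'tmux' session keeps the timer alive if the SSH connection drops.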

Checking results

View the output logs in the 'Stand_Alone_Validation' directory and make sure the results match for each cycle. In this example, I ran only 5 small tests on 4 GPUs. The large and xlarge tests write their own files per GPU_x.

Example:

exx@ubuntu:~/Stand_Alone_Validation$ ls
clean.x    GPU_1.log  GPU_3.log  lib      nohup.out     output_files_large  run_test.x           standalone-test_v3.bin
GPU_0.log  GPU_2.log  input      LICENSE  output_files  README              standalone-test.bin  standalone-test_v3_p2p.bin
exx@ubuntu:~/Stand_Alone_Validation$ cat *.log
0.0:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
0.1:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
0.2:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
0.3:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
0.4:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.0:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.1:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.2:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.3:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.4:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.0:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.1:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.2:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.3:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.4:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.0:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.1:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.2:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.3:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.4:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430

In the output above, each line starts with the GPU index and cycle number (for example, 0.0 is GPU 0, cycle 0), followed by the Etot, EKtot, and EPtot values. All 4 GPUs passed 5 cycles of the small test with matching results.
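
Comparing the values by eye gets tedious with more cycles, so the check can be automated. The following is a rough sketch, assuming every result line has the "GPU.cycle:  Etot = ...  EKtot = ...  EPtot = ..." layout shown above. It takes the first line seen for each GPU as the reference and flags any later cycle on that GPU whose three energies differ. The function name 'check_logs' is invented for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that every cycle of a GPU reports the same
# Etot, EKtot and EPtot as that GPU's first cycle. Exits non-zero on mismatch.
check_logs() {
    awk '
        BEGIN { bad = 0 }
        /Etot/ {
            split($1, id, ".")               # "0.3:" -> id[1] is the GPU index
            gpu = id[1]
            energies = $4 " " $7 " " $10     # Etot, EKtot, EPtot values
            if (!(gpu in ref)) {
                ref[gpu] = energies          # first cycle is the reference
            } else if (energies != ref[gpu]) {
                print "MISMATCH: " $0
                bad = 1
            }
            n[gpu]++
        }
        END {
            for (g in n) print "GPU " g ": " n[g] " cycle(s) checked"
            exit bad
        }
    ' "$@"
}

# Only run when log files are present in the current directory.
if ls GPU_*.log >/dev/null 2>&1; then
    if check_logs GPU_*.log; then
        echo "All cycles match"
    else
        echo "At least one cycle differs; inspect the lines marked MISMATCH" >&2
    fi
fi
```

The comparison is done per GPU rather than across GPUs, since it has not been confirmed here that different GPU models produce identical energies.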


