How to use the GPU validation test provided by Exxact. Tested with NVIDIA cards.

Prerequisites:

  • NVIDIA drivers installed. To check, run 'nvidia-smi' and confirm that it lists the NVIDIA hardware devices in the system.
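
The driver check can also be scripted. The sketch below counts the devices reported by 'nvidia-smi -L', which prints one line per GPU beginning with "GPU <index>:". The function name count_listed_gpus is hypothetical and not part of the Exxact package; it reads the listing from standard input so it can be tried on saved output as well.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Exxact package).
# Reads `nvidia-smi -L` output on stdin and prints how many GPUs it lists.
# Assumes each GPU line starts with "GPU <index>:".
count_listed_gpus() {
    grep -c '^GPU [0-9][0-9]*:'
}
```

Usage: 'nvidia-smi -L | count_listed_gpus'. A result of 0 suggests the driver is not reporting any devices; otherwise the number is a reasonable starting value for gpu_count in 'run_test.x'.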

Instructions

  1. Download and unpack the files into your home directory

    wget https://s3-us-west-1.amazonaws.com/exxact-support/Test+Folder/Stand_Alone_Validation_v3.1.tar.gz
    tar -xvzf Stand_Alone_Validation_v3.1.tar.gz
  2. Change into the unpacked directory

    cd Stand_Alone_Validation
  3. Set the number of GPUs and test cycles by editing the 'run_test.x' file

    nano run_test.x
    
    #How many GPUs in node
    gpu_count=4
    
    #How many tests to run of each type
    #Large test requires 5GB memory
    #Xlarge test requires 11GB memory
    small_test_count=20
    large_test_count=10
    xlarge_test_count=5

    Note: Test duration varies with the GPUs being used. If a smaller GPU is installed only to drive the display, remove it and run the test from a text-only terminal or over SSH.

  4. Save the changes by pressing 'ctrl+x' and answering 'y' to the prompt. I typically set 5/5/2 tests (small/large/xlarge); the default cycle counts are meant for overnight or other long-duration testing

  5. Run the test in the background

    nohup ./run_test.x &
  6. Monitor GPU temperatures by opening another terminal and running

    nvidia-smi -l
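
For scripted setups, the edit in step 3 can be made without opening nano. The following is a minimal sketch, assuming GNU sed (as shipped with Ubuntu) and assuming each variable in 'run_test.x' sits at the start of its own line as name=value. The function name set_test_counts is hypothetical and not part of the Exxact package.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Exxact package).
# Rewrites the three test-count variables in place.
# Assumes GNU sed (-i) and that each variable starts its own line as name=value.
# Arguments: <file> <small count> <large count> <xlarge count>
set_test_counts() {
    file=$1
    small=$2
    large=$3
    xlarge=$4
    # The ^ anchor keeps the large_test_count rule from touching xlarge_test_count.
    sed -i \
        -e "s/^small_test_count=.*/small_test_count=${small}/" \
        -e "s/^large_test_count=.*/large_test_count=${large}/" \
        -e "s/^xlarge_test_count=.*/xlarge_test_count=${xlarge}/" \
        "$file"
}
```

Usage: 'set_test_counts run_test.x 5 5 2' applies the shorter 5/5/2 setting; gpu_count is left untouched and still needs to match the node.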

Checking results

View the output logs in the 'Stand_Alone_Validation' directory and make sure every cycle reports the same results.

Example:

exx@ubuntu:~/Stand_Alone_Validation$ ls
clean.x    GPU_1.log  GPU_3.log  lib      nohup.out     output_files_large  run_test.x           standalone-test_v3.bin
GPU_0.log  GPU_2.log  input      LICENSE  output_files  README              standalone-test.bin  standalone-test_v3_p2p.bin
exx@ubuntu:~/Stand_Alone_Validation$ cat *.log
0.0:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
0.1:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
0.2:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
0.3:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
0.4:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.0:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.1:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.2:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.3:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
1.4:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.0:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.1:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.2:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.3:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
2.4:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.0:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.1:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.2:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.3:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430
3.4:  Etot   =    -58216.8663  EKtot   =     14421.1768  EPtot      =    -72638.0430

In each line above, the prefix is GPU.cycle (for example, 0.0 is GPU 0, cycle 0), followed by the Etot, EKtot, and EPtot values. I have 4 GPUs, and each has passed 5 cycles of the small test with matching results.
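
With many GPUs or cycles, comparing the lines by eye gets tedious. The sketch below strips the GPU.cycle prefix from each log line and counts how many distinct result lines remain. It assumes the log format shown in the example and that the logs being compared come from a single test size, since different test sizes may well report different energies. The function name count_distinct_results is hypothetical and not part of the Exxact package.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Exxact package).
# Reads log lines on stdin, keeps only lines containing "Etot",
# drops the leading "GPU.cycle:" prefix, and prints the number of
# distinct result lines that remain.
count_distinct_results() {
    grep 'Etot' \
        | sed 's/^[^:]*:[[:space:]]*//' \
        | sort -u \
        | wc -l \
        | tr -d ' '
}
```

Usage: 'cat GPU_*.log | count_distinct_results'. Under the assumptions above, a result of 1 means every GPU and cycle reported identical values; anything higher means at least one line differs and the logs are worth reading individually.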


