How to use the GPU validation test provided by Exxact. Tested with NVIDIA cards.
Pre-requisites:
- NVIDIA drivers installed (verify by running 'nvidia-smi'; it should list the installed NVIDIA hardware devices)
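The prerequisite check can also be scripted. The sketch below assumes that 'nvidia-smi -L' prints one line per detected GPU starting with "GPU <index>:"; the helper name count_gpus is hypothetical and not part of the Exxact package. The number it reports is also a reasonable starting value for gpu_count in the later step.

```shell
# Hypothetical helper (not part of the validation package):
# counts GPU entries in `nvidia-smi -L` style output read from stdin.
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:'
}

# Only query the driver if nvidia-smi is actually on the PATH.
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "GPUs detected: $(nvidia-smi -L | count_gpus)"
else
    echo "nvidia-smi not found; install the NVIDIA driver first" >&2
fi
```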
Instructions
Download/unpack files into root directory
wget https://s3-us-west-1.amazonaws.com/exxact-support/Test+Folder/Stand_Alone_Validation_v3.1.tar.gz
tar -xvzf Stand_Alone_Validation_v3.1.tar.gz
Change directory to unpacked folder
cd Stand_Alone_Validation
Set the number of GPUs and test cycles by editing the 'run_test.x' file
nano run_test.x
#How many GPUs in node
gpu_count=4
#How many tests to run of each type
#Large test requires 5GB memory
#Xlarge test requires 11GB memory
small_test_count=20
large_test_count=10
xlarge_test_count=5
Note: Test duration varies with the GPUs being used. If a smaller GPU is installed specifically for display, remove that GPU and run the test from a text-only terminal or over SSH.
Save the changes by pressing 'ctrl+x' and answering 'y' at the prompt. I typically set the small/large/xlarge counts to 5/5/2; the default cycle counts are meant for overnight or other long-duration testing.
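If you would rather not edit the file by hand, the counts can be rewritten from the command line. This is a minimal sketch, assuming each setting in 'run_test.x' sits on its own line as name=value, as in the listing above. The helper name set_test_counts is hypothetical and not part of the package.

```shell
# Hypothetical helper: rewrite the *_test_count assignments in a
# run_test.x-style file. Arguments: file, small, large, xlarge counts.
set_test_counts() {
    file=$1
    small=$2
    large=$3
    xlarge=$4
    tmp=$(mktemp) || return 1
    # Write to a temp file, then copy the contents back so the original
    # file keeps its permissions (run_test.x must stay executable).
    sed -e "s/^small_test_count=.*/small_test_count=$small/" \
        -e "s/^large_test_count=.*/large_test_count=$large/" \
        -e "s/^xlarge_test_count=.*/xlarge_test_count=$xlarge/" \
        "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Example use (keep a backup first):
#   cp run_test.x run_test.x.bak
#   set_test_counts run_test.x 5 5 2
```

The gpu_count line is left untouched, so set it separately if the node does not have 4 GPUs.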
Run test in the background by using
nohup ./run_test.x &
Monitor GPU temps by opening another terminal and using
nvidia-smi -l
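For a long run it may be more convenient to log only the temperatures and summarize them afterwards. The sketch below assumes the 'nvidia-smi' options '--query-gpu=index,temperature.gpu' and '--format=csv,noheader,nounits', which produce rows such as "0, 45". The helper name peak_gpu_temps and the file name gpu_temps.csv are hypothetical.

```shell
# Hypothetical helper: read "index, temperature" CSV rows on stdin
# and print the highest temperature seen for each GPU index.
peak_gpu_temps() {
    awk -F', *' '
        NF >= 2 && (!($1 in peak) || $2 + 0 > peak[$1]) { peak[$1] = $2 + 0 }
        END { for (gpu in peak) print gpu, peak[gpu] }
    ' | sort -n
}

# Example use while the test is running (sample every 5 seconds):
#   nvidia-smi --query-gpu=index,temperature.gpu \
#       --format=csv,noheader,nounits -l 5 > gpu_temps.csv &
#   peak_gpu_temps < gpu_temps.csv
```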
Checking results
View the output logs in the 'Stand_Alone_Validation' directory and confirm that the results match across all cycles.
Example:
exx@ubuntu:~/Stand_Alone_Validation$ ls
clean.x GPU_1.log GPU_3.log lib nohup.out output_files_large run_test.x standalone-test_v3.bin
GPU_0.log GPU_2.log input LICENSE output_files README standalone-test.bin standalone-test_v3_p2p.bin
exx@ubuntu:~/Stand_Alone_Validation$ cat *.log
0.0: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
0.1: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
0.2: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
0.3: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
0.4: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
1.0: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
1.1: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
1.2: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
1.3: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
1.4: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
2.0: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
2.1: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
2.2: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
2.3: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
2.4: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
3.0: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
3.1: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
3.2: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
3.3: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
3.4: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
Each line starts with a label of the form GPU.cycle (for example, 0.0 is GPU 0, cycle 0), followed by the Etot, EKtot and EPtot values for that cycle. In this example, all 4 GPUs have passed 5 cycles of the small test with matching results.
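With more GPUs or cycles, comparing the lines by eye gets tedious, so the check can be scripted. This is a minimal sketch, assuming every log line has the "GPU.cycle: Etot = ... EKtot = ... EPtot = ..." layout shown in the example and that the logs hold results from a single test type. If several test types are logged together, their energies would be expected to differ and this check would need to group them first. The helper name check_energy_logs is hypothetical.

```shell
# Hypothetical helper: report whether every line in the given logs carries
# the same energies once the leading "GPU.cycle:" label is removed.
check_energy_logs() {
    distinct=$(cat "$@" \
        | sed 's/^[0-9][0-9]*\.[0-9][0-9]*:[[:space:]]*//' \
        | sort -u \
        | awk 'NF { n++ } END { print n + 0 }')
    if [ "$distinct" -eq 1 ]; then
        echo "PASS: all results match"
    else
        echo "FAIL: $distinct distinct results found"
        return 1
    fi
}

# Example use from inside the Stand_Alone_Validation directory:
#   check_energy_logs GPU_*.log
```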