How to use the GPU validation test. Tested with NVIDIA cards.
Pre-requisites:
- NVIDIA drivers installed (run 'nvidia-smi' and check that it lists the NVIDIA hardware devices)
Instructions
Download and unpack the files into the root directory
Use these commands if you have a 20XX series GPU:
wget https://exxact-disk-images.s3-us-west-1.amazonaws.com/AMBER+Stand+Alone+Test/Stand_Alone_Validation_v4.0.tar.gz --no-check-certificate
tar -xvzf Stand_Alone_Validation_v4.0.tar.gz
Use these commands if you have a 30XX series GPU:
wget https://exxact-disk-images.s3-us-west-1.amazonaws.com/AMBER+Stand+Alone+Test/Stand_Alone_Validation_v4.0.tar.gz --no-check-certificate
tar -xvzf Stand_Alone_Validation_v4.0.tar.gz
Change directory to unpacked folder
cd Stand_Alone_Validation
Set the number of GPUs and test cycles by editing the 'run_test.x' file
nano run_test.x
#How many GPUs in node
gpu_count=4
#How many tests to run of each type
#Large test requires 5GB memory
#Xlarge test requires 11GB memory
small_test_count=20
large_test_count=10
xlarge_test_count=5
Note: Test duration varies with the GPUs being used. If the system has a smaller GPU used only for display, remove that GPU and run the test from a text-only terminal or over SSH.
Save the changes by pressing 'ctrl+x' and answering 'y' to the prompt. I typically set 5/5/2 tests (small/large/xlarge). The default cycle counts are meant for overnight or other long-duration testing.
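If you would rather not open an editor, the counts can also be set from the command line. This is only a minimal sketch: the function name set_counts is made up for illustration, and it assumes each setting in 'run_test.x' sits on its own line as 'name=value', as in the excerpt above. It relies on GNU sed's in-place option.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rewrite the three *_test_count settings in place.
# Assumes plain "name=value" lines, as shown in the run_test.x excerpt.
set_counts() {
    local file="$1" small="$2" large="$3" xlarge="$4"
    sed -i \
        -e "s/^small_test_count=.*/small_test_count=${small}/" \
        -e "s/^large_test_count=.*/large_test_count=${large}/" \
        -e "s/^xlarge_test_count=.*/xlarge_test_count=${xlarge}/" \
        "$file"
}

# Example: 5 small, 5 large and 2 xlarge cycles
# set_counts run_test.x 5 5 2
```

Other lines, such as gpu_count, are left untouched, so check the file afterwards with 'cat run_test.x'.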
Run the test in the background (as root):
nohup ./run_test.x &
Monitor GPU temperatures by opening another terminal and running 'nvidia-smi -l'. Once the 'standalone-test.bin' process no longer appears in the 'nvidia-smi' output, check the logs to confirm that the configured number of cycles completed.
exx@ubuntu:~/Stand_Alone_Validation$ nvidia-smi -l
Tue Jan 15 17:35:14 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:05:00.0  On |                  N/A |
| 78%   86C    P2   149W / 180W |   4767MiB /  8118MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    On   | 00000000:06:00.0 Off |                  N/A |
| 77%   86C    P2   155W / 180W |   4569MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    On   | 00000000:09:00.0 Off |                  N/A |
| 72%   86C    P2   124W / 180W |   4569MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    On   | 00000000:0A:00.0 Off |                  N/A |
| 59%   83C    P2   134W / 180W |   4569MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1910      G   /usr/lib/xorg/Xorg                           157MiB |
|    0      2889      G   compiz                                        40MiB |
|    0      5848      C   ../standalone-test.bin                      4557MiB |
|    1      5849      C   ../standalone-test.bin                      4557MiB |
|    2      5850      C   ../standalone-test.bin                      4557MiB |
|    3      5851      C   ../standalone-test.bin                      4557MiB |
+-----------------------------------------------------------------------------+
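Instead of watching the output by hand, the check for the 'standalone-test.bin' process can be scripted. The sketch below is an illustration only: the function names count_test_procs and wait_for_tests are made up, and it assumes the process name appears in the 'nvidia-smi' process table exactly as in the output above. If the process briefly disappears between cycles, the loop could end early, so treat its message as a hint to look at the logs rather than as proof of completion.

```shell
#!/usr/bin/env bash
# Count lines mentioning standalone-test.bin in nvidia-smi output read from stdin.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_test_procs() {
    grep -c 'standalone-test\.bin' || true
}

# Poll nvidia-smi until the test process is no longer listed.
# The polling interval in seconds is the first argument (default 60).
wait_for_tests() {
    local interval="${1:-60}"
    while [ "$(nvidia-smi | count_test_procs)" -gt 0 ]; do
        sleep "$interval"
    done
    echo "standalone-test.bin is no longer listed; check the logs."
}

# Example:
# wait_for_tests 120
```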
I have not yet measured how long a small, large, or xlarge cycle takes. I expect a run with 5/5/2 cycles to complete in about 6-8 hours.
Checking results
View the output logs in the 'Stand_Alone_Validation' directory and make sure the results match for every cycle. In this example, I ran only 5 small tests on 4 GPUs. The large and xlarge tests write their own files for each GPU (GPU_x).
Example:
exx@ubuntu:~/Stand_Alone_Validation$ ls
clean.x    GPU_1.log  GPU_3.log  lib      nohup.out     output_files_large  run_test.x           standalone-test_v3.bin
GPU_0.log  GPU_2.log  input      LICENSE  output_files  README              standalone-test.bin  standalone-test_v3_p2p.bin
exx@ubuntu:~/Stand_Alone_Validation$ cat *.log
0.0: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
0.1: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
0.2: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
0.3: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
0.4: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
1.0: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
1.1: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
1.2: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
1.3: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
1.4: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
2.0: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
2.1: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
2.2: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
2.3: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
2.4: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
3.0: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
3.1: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
3.2: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
3.3: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
3.4: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
As you can see above, each line has the form 'GPU.cycle: Etot = ... EKtot = ... EPtot = ...', so '0.0' is GPU 0, cycle 0. All 4 GPUs passed 5 cycles of the small test with matching results.
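With many cycles, comparing the lines by eye gets tedious. The sketch below automates the comparison with awk. It is an illustration only: the function name check_logs is made up, it assumes every result line looks like the 'cat *.log' output above, and it assumes all logs passed to it come from the same test type on identical GPUs, so that every line should carry the same energies.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare the energies on every result line of the given logs.
# Assumes lines look like "0.0: Etot = ... EKtot = ... EPtot = ...".
check_logs() {
    awk '
        /Etot/ {
            line = $0
            $1 = ""                 # drop the GPU.cycle prefix; awk rebuilds $0 with single spaces
            seen++
            if (seen == 1) {
                ref = $0            # the first result becomes the reference
            } else if ($0 != ref) {
                print "MISMATCH: " line
                bad = 1
            }
        }
        END {
            if (seen == 0) { print "no result lines found"; exit 2 }
            if (bad) exit 1
            print "all " seen " results match"
        }
    ' "$@"
}

# Example, run inside the Stand_Alone_Validation directory:
# check_logs GPU_*.log
```

The function exits with status 0 when every line matches, 1 when at least one line differs, and 2 when no result lines were found.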
FAQ
Is there any way to run the test on selected cards without going through the trouble of opening the case and pulling out power cables or PCIe cards?
Yes, you can.
This involves declaring the environment variables manually and commenting out 'CUDA_VISIBLE_DEVICES' in the script, so that the script does not overwrite the UUID of the single GPU to be tested.
This solution suits a system administrator who is comfortable working in the shell or CLI and whose Exxact GPU server or HPC system sits in a rack or data-center environment.
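The approach can be sketched as follows. Treat it as an outline under stated assumptions rather than the exact procedure: the function name comment_out_cvd is made up, the UUID is a placeholder to be replaced with a value printed by 'nvidia-smi -L', and the sed pattern assumes 'run_test.x' assigns 'CUDA_VISIBLE_DEVICES' on lines of its own. If the script sets the variable inline in front of a command, edit that line by hand instead. You will probably also want gpu_count set to 1, which is likewise an assumption about how the script uses that value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: comment out lines that assign CUDA_VISIBLE_DEVICES.
# Keeps a backup copy of the original file with a .bak suffix (GNU sed).
comment_out_cvd() {
    sed -i.bak -E \
        's/^([[:space:]]*)((export[[:space:]]+)?CUDA_VISIBLE_DEVICES=)/\1#\2/' "$1"
}

# Example session (the UUID below is a placeholder, not a real device):
# nvidia-smi -L                                  # list GPUs with their UUIDs
# export CUDA_VISIBLE_DEVICES=GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
# comment_out_cvd run_test.x
# nohup ./run_test.x &
```

To undo the change, restore the backup with 'mv run_test.x.bak run_test.x'.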