How to use the GPU validation test. Tested with NVIDIA cards.
Prerequisites:
- NVIDIA drivers installed (run 'nvidia-smi' and check that it lists the NVIDIA hardware devices)
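The prerequisite check above can be scripted. This is a minimal sketch, not part of the test package: 'have_cmd' and 'check_nvidia_driver' are hypothetical helper names, and the only NVIDIA call used is 'nvidia-smi -L', which prints one line per detected GPU.

```shell
# Return success if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: confirm the driver tools are present, then list the GPUs.
check_nvidia_driver() {
    if ! have_cmd nvidia-smi; then
        echo "nvidia-smi not found: install the NVIDIA driver first" >&2
        return 1
    fi
    # 'nvidia-smi -L' prints one line per detected GPU, including its UUID.
    nvidia-smi -L
}

# Usage (on the machine under test):
#   check_nvidia_driver || exit 1
```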
Instructions
Download and unpack the files into the root directory.
Use these commands if you have a 20XX series GPU:
    wget https://s3-us-west-1.amazonaws.com/exxact-support/Test+Folder/Stand_Alone_Validation_v3.1.tar.gz --no-check-certificate
    tar -xvzf Stand_Alone_Validation_v3.1.tar.gz
Use these commands if you have a 30XX series GPU:
    wget https://exxact-disk-images.s3-us-west-1.amazonaws.com/AMBER+Stand+Alone+Test/Stand_Alone_Validation_v4.0.tar.gz --no-check-certificate
    tar -xvzf Stand_Alone_Validation_v4.0.tar.gz
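If you are unsure which archive applies, the choice above can be sketched as a lookup on the GPU name. This assumes that "20XX" and "30XX" mean the GeForce RTX 20 and 30 series and that the name comes from 'nvidia-smi --query-gpu=name'. 'pick_tarball' is a hypothetical helper, and any other GPU is reported as unknown rather than guessed.

```shell
# Map a GPU product name to the archive named in the download step.
# Assumption: "20XX"/"30XX" refer to GeForce RTX 20xx/30xx cards.
pick_tarball() {
    case "$1" in
        *"GeForce RTX 20"[0-9][0-9]*) echo "Stand_Alone_Validation_v3.1.tar.gz" ;;
        *"GeForce RTX 30"[0-9][0-9]*) echo "Stand_Alone_Validation_v4.0.tar.gz" ;;
        *) echo "unknown GPU series: $1" >&2; return 1 ;;
    esac
}

# Usage (on the machine under test):
#   name=$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)
#   pick_tarball "$name"
```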
Change directory to the unpacked folder:
    cd Stand_Alone_Validation
Set the number of GPUs and test cycles by editing the 'run_test.x' file:
    nano run_test.x
The relevant settings in the file look like this:
    #How many GPUs in node
    gpu_count=4
    #How many tests to run of each type
    #Large test requires 5GB memory
    #Xlarge test requires 11GB memory
    small_test_count=20
    large_test_count=10
    xlarge_test_count=5
Note: Test duration varies with the GPUs being used. If a smaller GPU is installed only for display, remove that GPU and run the test from a text-only terminal or over SSH.
Save the changes with 'ctrl+x' and answer 'y' at the prompt. I typically set 5/5/2 tests (small/large/xlarge); the default cycle counts are meant for overnight or other long-duration testing.
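As an alternative to editing by hand, the same settings can be changed from the command line. This is a sketch under one assumption: each variable sits at the start of its own line as 'name=value', as in the excerpt above. 'set_test_counts' is a hypothetical helper and does not ship with the test.

```shell
# Rewrite the count assignments in run_test.x.
# Usage: set_test_counts FILE GPU_COUNT SMALL LARGE XLARGE
set_test_counts() {
    file=$1
    # Accept only non-negative integers so nothing odd reaches sed.
    for n in "$2" "$3" "$4" "$5"; do
        case "$n" in
            ''|*[!0-9]*) echo "counts must be non-negative integers" >&2; return 1 ;;
        esac
    done
    tmp="$file.tmp.$$"
    sed -e "s/^gpu_count=.*/gpu_count=$2/" \
        -e "s/^small_test_count=.*/small_test_count=$3/" \
        -e "s/^large_test_count=.*/large_test_count=$4/" \
        -e "s/^xlarge_test_count=.*/xlarge_test_count=$5/" \
        "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Write back through the original path so the file keeps its executable bit.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Usage: 4 GPUs with the 5/5/2 cycle counts mentioned above.
#   set_test_counts run_test.x 4 5 5 2
```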
Run the test in the background (run as root):
    nohup ./run_test.x &
Monitor GPU temperatures by opening another terminal and running 'nvidia-smi -l'. Once the 'standalone-test.bin' process no longer appears in the 'nvidia-smi' output, check the logs to confirm that the configured number of cycles completed.
    exx@ubuntu:~/Stand_Alone_Validation$ nvidia-smi -l
    Tue Jan 15 17:35:14 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 1080    On   | 00000000:05:00.0  On |                  N/A |
    | 78%   86C    P2   149W / 180W |   4767MiB /  8118MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 1080    On   | 00000000:06:00.0 Off |                  N/A |
    | 77%   86C    P2   155W / 180W |   4569MiB /  8119MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  GeForce GTX 1080    On   | 00000000:09:00.0 Off |                  N/A |
    | 72%   86C    P2   124W / 180W |   4569MiB /  8119MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  GeForce GTX 1080    On   | 00000000:0A:00.0 Off |                  N/A |
    | 59%   83C    P2   134W / 180W |   4569MiB /  8119MiB |    100%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0      1910      G   /usr/lib/xorg/Xorg                           157MiB |
    |    0      2889      G   compiz                                        40MiB |
    |    0      5848      C   ../standalone-test.bin                      4557MiB |
    |    1      5849      C   ../standalone-test.bin                      4557MiB |
    |    2      5850      C   ../standalone-test.bin                      4557MiB |
    |    3      5851      C   ../standalone-test.bin                      4557MiB |
    +-----------------------------------------------------------------------------+
I have not yet measured the time per cycle for the small, large, or xlarge tests. I assume a 5/5/2 run will complete in 6-8 hours.
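Since a run can take hours, the monitoring step above can be left to a small polling loop instead of watching 'nvidia-smi -l' by eye. This is a sketch only: 'test_still_running' and 'wait_for_test' are hypothetical helpers, the 60-second interval is arbitrary, and the match is on the text 'standalone-test' in the plain 'nvidia-smi' process list.

```shell
# Succeeds if the text on stdin mentions the test binary.
# Matching on 'standalone-test' also covers versioned names such as
# standalone-test_v3.bin.
test_still_running() {
    grep -qF 'standalone-test'
}

# Poll nvidia-smi until the test binary disappears from the process list.
# Note: the loop also ends if nvidia-smi itself fails to run.
wait_for_test() {
    interval=${1:-60}   # seconds between polls (arbitrary default)
    while nvidia-smi | test_still_running; do
        sleep "$interval"
    done
    echo "standalone-test no longer listed; check the logs"
}

# Usage (in a second terminal, after starting run_test.x):
#   wait_for_test 120
```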
Checking results
View the output logs in the 'Stand_Alone_Validation' directory and make sure the results match for each cycle. In this example, I ran only 5 small tests on 4 GPUs. The large and xlarge tests write their own files per GPU_x.
Example:
    exx@ubuntu:~/Stand_Alone_Validation$ ls
    clean.x    GPU_1.log  GPU_3.log  lib      nohup.out     output_files_large  run_test.x           standalone-test_v3.bin
    GPU_0.log  GPU_2.log  input      LICENSE  output_files  README              standalone-test.bin  standalone-test_v3_p2p.bin
    exx@ubuntu:~/Stand_Alone_Validation$ cat *.log
    0.0: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    0.1: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    0.2: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    0.3: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    0.4: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    1.0: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    1.1: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    1.2: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    1.3: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    1.4: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    2.0: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    2.1: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    2.2: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    2.3: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    2.4: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    3.0: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    3.1: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    3.2: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    3.3: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
    3.4: Etot = -58216.8663 EKtot = 14421.1768 EPtot = -72638.0430
Info: Each line starts with the GPU index and the cycle number (for example, 0.0 is GPU 0, cycle 0), followed by the Etot, EKtot, and EPtot values. Here, all 4 GPUs passed 5 cycles of the small test with matching results.
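Comparing the values by eye gets tedious with more cycles, so the check above can be sketched with awk. This assumes every log line has the "GPU.cycle: Etot = ... EKtot = ... EPtot = ..." layout shown in the example and that a pass means all cycles on a given GPU report identical values. 'check_logs' is a hypothetical helper, not part of the test package.

```shell
# Read log lines on stdin and verify that, for each GPU, every cycle
# reports the same Etot/EKtot/EPtot values as that GPU's first cycle.
check_logs() {
    awk '
        /^[0-9]+\.[0-9]+:/ {
            split($1, id, ".")            # id[1] is the GPU index
            gpu = id[1]
            key = $4 " " $7 " " $10       # Etot, EKtot, EPtot values
            if (!(gpu in ref)) {
                ref[gpu] = key
            } else if (key != ref[gpu]) {
                printf "MISMATCH on GPU %s: %s\n", gpu, $0
                bad = 1
            }
            count[gpu]++
            n++
        }
        END {
            if (n == 0) { print "no log lines read"; exit 1 }
            for (g in count) printf "GPU %s: %d cycles read\n", g, count[g]
            exit (bad ? 1 : 0)
        }
    '
}

# Usage (in the Stand_Alone_Validation directory):
#   cat GPU_*.log | check_logs && echo "all cycles match"
```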
FAQ
Tip: Answer is yes. This involves manually declaring the environment variables and adjusting the script to comment out 'CUDA_VISIBLE_DEVICES', so that the script does not overwrite the UUID of the single GPU to be tested.
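The script adjustment mentioned in the FAQ answer might look like the sketch below. I have not confirmed how 'run_test.x' sets the variable, so this assumes the assignment sits on its own line (with or without 'export'); lines where it prefixes a command are left alone. 'comment_out_cuda_visible_devices' is a hypothetical helper that prints the modified script to stdout and leaves the original untouched.

```shell
# Print FILE with standalone CUDA_VISIBLE_DEVICES assignments commented out.
# Assumption: the assignment is on its own line, e.g.
#   export CUDA_VISIBLE_DEVICES=$i
comment_out_cuda_visible_devices() {
    awk '
        /^[ \t]*(export[ \t]+)?CUDA_VISIBLE_DEVICES=[^ \t]*[ \t]*$/ { print "#" $0; next }
        { print }
    ' "$1"
}

# Usage: write a modified copy, then select one GPU by UUID.
# 'nvidia-smi -L' lists each GPU with its UUID; the value below is a placeholder.
#   comment_out_cuda_visible_devices run_test.x > run_test_single.x
#   chmod +x run_test_single.x
#   export CUDA_VISIBLE_DEVICES=GPU-<uuid from nvidia-smi -L>
```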