GPU Validation Test
How to use the GPU validation test. Tested with NVIDIA cards.
Pre-requisites:
- NVIDIA drivers installed. Run 'nvidia-smi' and confirm that it lists the NVIDIA hardware devices.
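The driver check can also be scripted, for example before starting a long run. This is a minimal sketch, assuming a POSIX shell; the helper name `driver_ok` is illustrative and not part of the test package. It relies on `nvidia-smi -L`, which prints one line per detected GPU.

```shell
#!/bin/sh
# Hypothetical helper (not part of the test package): succeeds only if the
# given tool can be executed and lists at least one GPU.
driver_ok() {
    tool="${1:-nvidia-smi}"
    command -v "$tool" >/dev/null 2>&1 || return 1
    # `nvidia-smi -L` prints lines such as "GPU 0: <name> (UUID: ...)"
    "$tool" -L 2>/dev/null | grep -q '^GPU [0-9]'
}

if driver_ok; then
    echo "NVIDIA driver responds; GPUs detected."
else
    echo "nvidia-smi is missing or lists no GPUs; install the driver first."
fi
```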
Instructions
Download and unpack the files into the root directory
wget https://exxact-support.s3.us-west-1.amazonaws.com/Test+Folder/Stand_Alone_Validation_v4.2.1.tar.gz --no-check-certificate
tar -xvzf Stand_Alone_Validation_v4.2.1.tar.gz
Change directory to unpacked folder
cd Stand_Alone_Validation
Test duration varies with the GPUs being used. If a smaller GPU is installed only for display, remove it and run the test from a text-only terminal or over SSH.
Run the test in the background (run as root)
nohup ./run_test.x &
Monitor GPU temperatures by opening another terminal and running 'nvidia-smi -l'. Once the 'standalone-test.bin' process no longer appears in the 'nvidia-smi' output, check the logs to confirm that the configured number of cycles completed.
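Instead of watching the table by hand, the wait can be scripted. This is a sketch under assumptions: the function names `test_running` and `wait_for_test` are illustrative, and the 60-second default polling interval is arbitrary. It uses the `--query-compute-apps` option of `nvidia-smi` to list compute processes as CSV.

```shell
#!/bin/sh
# Succeeds if the process list read from stdin still mentions the test binary.
test_running() {
    grep -q 'standalone-test\.bin'
}

# Poll nvidia-smi until no compute process named standalone-test.bin remains.
# $1 is the polling interval in seconds (default 60, an arbitrary choice).
wait_for_test() {
    while nvidia-smi --query-compute-apps=pid,process_name \
            --format=csv,noheader | test_running; do
        sleep "${1:-60}"
    done
    echo "standalone-test.bin is no longer running; check the logs."
}
```

Call `wait_for_test` in a second terminal after starting the run. To log temperatures alongside it, `nvidia-smi --query-gpu=index,temperature.gpu --format=csv -l 5` prints one reading per GPU every 5 seconds.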
exx@ubuntu:~/Stand_Alone_Validation$ nvidia-smi -l
Tue Jan 15 17:35:14 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    On   | 00000000:05:00.0  On |                  N/A |
| 78%   86C    P2   149W / 180W |   4767MiB /  8118MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    On   | 00000000:06:00.0 Off |                  N/A |
| 77%   86C    P2   155W / 180W |   4569MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    On   | 00000000:09:00.0 Off |                  N/A |
| 72%   86C    P2   124W / 180W |   4569MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    On   | 00000000:0A:00.0 Off |                  N/A |
| 59%   83C    P2   134W / 180W |   4569MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1910      G   /usr/lib/xorg/Xorg                           157MiB |
|    0      2889      G   compiz                                        40MiB |
|    0      5848      C   ../standalone-test.bin                      4557MiB |
|    1      5849      C   ../standalone-test.bin                      4557MiB |
|    2      5850      C   ../standalone-test.bin                      4557MiB |
|    3      5851      C   ../standalone-test.bin                      4557MiB |
+-----------------------------------------------------------------------------+
I have not yet measured the time per cycle for the small, large, or xlarge tests. With 5/5/2 cycles, I expect the run to complete in roughly 6-8 hours.
Checking results
View the output logs in the 'Stand_Alone_Validation' directory and make sure the results match for each cycle. In this example, I ran only 5 small tests on 4 GPUs. The large and xlarge tests write their own files for each GPU_x.
Example:
exx@ubuntu:~/Stand_Alone_Validation$ ./exx-getgpu-validation.sh
The test results will be saved in /tmp/<hostname>_Standard_GPU_validation.txt. View the file and copy the results to the Support Ticket if applicable.
Interpreting Results
GPU 0: NVIDIA GeForce RTX 4090 (UUID: GPU-886818ec-0907-a70e-613c-9a34d1a3398f)
Validation Results:
./GPU_0.log : 20, Etot = -58222.0688 EKtot = 14396.2812 EPtot = -72618.3500 -> Passed
1 card(s) valided for Normal Test
./GPU.large_0.log : 10, Etot = -2708653.0371 EKtot = 662946.8750 EPtot = -3371599.9121 -> Passed
1 card(s) valided for Large Test
./GPU.xlarge_0.log : 5, Etot = -8862400.5831 EKtot = 2171066.2500 EPtot = -11033466.8331 -> Passed
1 card(s) valided for XLarge Test
Performance Results:
Location = . GPU = 0
Normal=======
High= 506.41 Low = 502.46 Avg = 503.92 Diff= 3.95 Pts = 0.78
Large=======
High= 24.71 Low = 24.62 Avg = 24.67 Diff= 0.09 Pts = 0.36
XLarge=======
High= 12.32 Low = 12.31 Avg = 12.31 Diff= 0.01 Pts = 0.08
Validation Results
This section determines if the GPU is calculating results consistently.
The normal, large, and xlarge tests log their output to GPU_N.log, GPU.large_N.log, and GPU.xlarge_N.log, respectively. Every time the calculation is run, the resulting value should be the same. This section of the script confirms that the values are the same for the target GPU.
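The same consistency idea can be spot-checked by hand. The sketch below is not the package's own check: it assumes that each cycle writes exactly one line containing "Etot" to the log, which the summary format suggests but the source does not confirm. The function names are illustrative.

```shell
#!/bin/sh
# Print the number of distinct "Etot" lines in a log file.
# A GPU that computes consistently yields exactly 1.
# Assumption: every cycle writes one line containing "Etot".
distinct_etot() {
    grep 'Etot' "$1" | sort -u | wc -l | tr -d ' '
}

# Report whether a single log is consistent; returns non-zero otherwise
# (including when the log contains no Etot line at all).
check_log() {
    n=$(distinct_etot "$1")
    if [ "$n" -eq 1 ]; then
        echo "$1: consistent"
    else
        echo "$1: $n distinct results"
        return 1
    fi
}
```

For example, `check_log GPU_0.log` run from the 'Stand_Alone_Validation' directory reports on the normal test log for GPU 0.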
Performance Results
This section determines if the GPU is performing consistently.
The test calculates the average ns/day at which the GPU is performing. In molecular dynamics simulations, ns/day is the number of nanoseconds (ns) of simulation time that you can compute in a single day of real-world time. It is a useful metric for estimating how much simulation progress you can achieve within a given timeframe. The higher the better.
- High - the highest metric observed
- Low - the lowest metric observed
- Avg - the average of metrics observed
- Diff - the difference between the High and Low values
- Pts - Diff as a percentage of High (Diff / High x 100)
  - This number should be low and not more than 5%.
  - A high value points to an issue.
    - Check the GPU temperature and ensure there is sufficient airflow to the GPU. Turn the fans up to full and retest.
    - Swap GPUs and retest to see if the issue follows the GPU or the PCIe slot.
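The Pts value can be recomputed from High and Low to double-check a report. This is a minimal sketch using awk for the floating-point arithmetic; the function names `pts` and `pts_ok` are illustrative and not part of the test package, and the 5% default limit is taken from the guidance above.

```shell
#!/bin/sh
# Pts = (High - Low) / High * 100, printed with two decimals.
pts() {
    awk -v hi="$1" -v lo="$2" 'BEGIN { printf "%.2f\n", (hi - lo) / hi * 100 }'
}

# Succeeds when the spread is within the limit ($3, default 5 percent).
pts_ok() {
    awk -v p="$(pts "$1" "$2")" -v max="${3:-5}" 'BEGIN { exit !(p <= max) }'
}

pts 506.41 502.46    # Normal test from the sample output, prints 0.78
```

With the sample values, `pts 24.71 24.62` gives 0.36 and `pts 12.32 12.31` gives 0.08, matching the Large and XLarge lines of the report.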