mprime for CPU/Memory
mprime is well suited to quickly isolating a bad individual memory DIMM, or to triggering a motherboard's memory DIMM slot to report errors.
This is how we use mprime for CPU/memory burn-in. The instructions are for Linux; on Windows, use Prime95 and follow the same steps below.
Download mprime
## Usually wget files into a new directory to keep extracted files in one place,
## otherwise you will get logs/readmes filling up wherever you currently are
mkdir mprime
cd mprime
wget http://www.mersenne.org/ftp_root/gimps/p95v303b6.linux64.tar.gz
tar -xf p95v303b6.linux64.tar.gz
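A quick sanity check after extracting, if you want to confirm the binary made it out of the tarball intact (nothing mprime-specific here, just standard shell tools):

ls -l mprime      ## the extracted binary should be present and executable
file mprime       ## should report a 64-bit ELF executable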
Do not use this on non-server single-processor systems with more than 40 cores/threads. mprime/Prime95's core routines are written in assembly, so it pushes the hardware far harder than most programs. Running it may be too extreme for such a system, and you may mistake the result (e.g. an abrupt shutdown) for a hardware fault, when it may just be a limitation of the system's ability to keep hardware temperatures under control.
Using mprime
I only use option 16 (Options/Torture Test), since we just need mprime for stress testing. It defaults to the number of detected threads across all CPUs, so I just hit 'Enter' when it asks 'Number of torture test threads to run (x):'.
## Assuming you're not already in the mprime directory you created, if you're doing this all in one go from this article
cd mprime
./mprime
## If this is your first time extracting/using, it will ask if you want to join GIMPS- I usually hit 'n' since this is for troubleshooting only anyways

[root@c103454 mprime]# ./mprime

Main Menu

     1.  Test/Primenet
     2.  Test/Worker threads
     3.  Test/Status
     4.  Test/Continue
     5.  Test/Exit
     6.  Advanced/Test
     7.  Advanced/Time
     8.  Advanced/P-1
     9.  Advanced/ECM
     10. Advanced/Manual Communication
     11. Advanced/Unreserve Exponent
     12. Advanced/Quit Gimps
     13. Options/CPU
     14. Options/Resource Limits
     15. Options/Preferences
     16. Options/Torture Test
     17. Options/Benchmark
     18. Help/About
     19. Help/About PrimeNet Server
Your choice:

Number of torture test threads to run (80):
Choose a type of torture test to run.
  1 = Smallest FFTs (tests L1/L2 caches, high power/heat/CPU stress).
  2 = Small FFTs (tests L1/L2/L3 caches, maximum power/heat/CPU stress).
  3 = Large FFTs (stresses memory controller and RAM).
  4 = Blend (tests all of the above).
Blend is the default.  NOTE: if you fail the blend test but pass the smaller
FFT tests then your problem is likely bad memory or bad memory controller.
Type of torture test to run (4):
I usually run test (2) for CPU only, (3) for memory only, and Blend (4) to test both.
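As an aside, mprime also has a -t switch that is supposed to jump straight into a torture test without the menu; I still go through option 16 so I can customize the settings, and the flag is worth verifying against the readme bundled with your version:

## Optional shortcut: start a default torture test directly (check your version's readme)
./mprime -t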
Your choice: 16

Number of torture test threads to run (80):
Choose a type of torture test to run.
  1 = Smallest FFTs (tests L1/L2 caches, high power/heat/CPU stress).
  2 = Small FFTs (tests L1/L2/L3 caches, maximum power/heat/CPU stress).
  3 = Large FFTs (stresses memory controller and RAM).
  4 = Blend (tests all of the above).
Blend is the default.  NOTE: if you fail the blend test but pass the smaller
FFT tests then your problem is likely bad memory or bad memory controller.
Type of torture test to run (4):
Customize settings (N): y
Min FFT size (in K) (4):
Max FFT size (in K) (8192):
Memory to use (in MB, 0 = in-place FFTs) (382415): 360000
Time to run each FFT size (in minutes) (6):
Run a weaker torture test (not recommended) (N):
Accept the answers above? (Y):
mprime's default settings will try to use ALL resources on the small FFTs, which makes it hit max TDP sharply and slows the system heavily during the early/smaller tests; the machine can become near-unusable if you leave everything at the defaults. To remedy this, I usually set the Min FFT size to 128 or 256, and leave 1-2 GB of memory free for system processes. Below is an example of the 'Memory to use' setting I use to make sure mprime does not freeze up the system when starting the test.
Customize settings (N): y
Min FFT size (in K) (4): 128
Max FFT size (in K) (8192):
Memory to use (in MB, 0 = in-place FFTs) (382415): 360000
Time to run each FFT size (in minutes) (6):
Run a weaker torture test (not recommended) (N):
Accept the answers above? (Y):
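If you would rather calculate the 'Memory to use' value than eyeball it, a one-liner like the following works; the 2048 is just the 'leave ~2 GB free for the OS' assumption from above:

## Total memory in MB minus ~2 GB of headroom for system processes
awk '/MemTotal/ {printf "%d\n", $2/1024 - 2048}' /proc/meminfo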
From here, use whatever GUI/display or command-line tools you prefer to monitor resource usage and temperatures.
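The exact tools are personal preference; a few common command-line options, assuming lm_sensors and ipmitool are installed:

watch -n 2 sensors                         ## CPU package/core temperatures
htop                                       ## per-thread load and memory pressure
ipmitool sensor | grep -i -E 'temp|fan'    ## BMC-side temperature and fan readings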
Expected outcomes (Pass/Fail)
Fail
If the system crashes/reboots during the memory test (3) or blend (4), it could mean a couple of things (or more, as we keep using this tool to troubleshoot systems with varying hardware). If you are not sure, keeping track of the system uptime or the IPMI event logs will help show the system's history.
- Memory is bad
  - You would need to check the IPMI/BMC logs to see whether the system flagged an error from one of the DIMM slots
    - 'ipmitool sel list' is another output you can view (see the example commands after this list). This requires 'ipmitool' to be installed. I personally use the web GUI, as it may contain additional information such as temperature sensors/flags, whereas the BMC event log may only report critical errors
    - You can also run 'edac-util -v' to report any errors from the system's EDAC files; it can point out which memory DIMM slot is reporting correctable/uncorrectable errors
      - Ideally you want 0 across all DIMMs, but fewer than 100 over a week is fairly negligible. If it reports THOUSANDS within a few hours (it may even overflow the EDAC counters), that is definitely a sign of a bad memory DIMM that needs to be replaced
      - You would want to re-seat the memory first; improperly seated memory, whatever the cause, can produce correctable/negligible errors
      - If they are uncorrectable errors, the system most likely would have triggered a reboot/shutdown already, since those are critical memory errors that affect the whole system
- CPU or Motherboard is bad
  - Both are highly unlikely since scalable processors were released. Damage to contacts/capacitors on the CPU, or bent pins on the motherboard's CPU socket, may cause misreads on memory or any other installed hardware
    - The only example I have seen previously is errors being reported on NUMEROUS DIFFERENT memory slots instead of just one; it may report that a whole row is bad (e.g. DIMM slots DIMM_D1, DIMM_E2, and DIMM_F1 all report bad, and swapping DIMMs within that row keeps repeating the same event log)
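For reference, these are the commands behind the checks in the list above ('ipmitool' and 'edac-util' need to be installed; the sysfs path at the end is an assumption and varies by kernel/EDAC driver):

uptime -s                                          ## last boot time - catches silent reboots
ipmitool sel list | grep -i -E 'dimm|memory|ecc'   ## BMC event log entries related to memory
edac-util -v                                       ## correctable/uncorrectable error summary per DIMM/controller
grep -H . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null   ## raw correctable-error counters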
Pass
The system will still have mprime running after 24 hours. I do not recall whether the tests complete after 24 hours, but I usually leave the test running over a weekend and it is still going/reporting when I am back the following Monday. You can determine your own duration depending on your site's expectations; I feel comfortable if a system can run a blend test for 2-3 days straight.
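Before calling it a pass, I run a few quick checks to confirm the run stayed healthy the whole time; the grep is just a rough filter, so read results.txt yourself if anything looks suspicious:

uptime                                 ## confirm the box has not silently rebooted mid-test
pgrep -a mprime                        ## torture test workers are still running
grep -i -E 'error|fatal' results.txt   ## mprime logs worker problems to results.txt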
Ongoing discovery
The procedure to run the mprime test above will not change much, but the outcomes will. For example, I try not to have any other processes running when using mprime. This was fine before scalable processors, when CPU cache sizes were significantly smaller, but now any other memory-heavy process may kill mprime without any specific reason logged in 'results.txt'. An example is trying to run the GPU standalone test from the GPU Validation Test article at the same time as mprime. To test high wattage/TDP, I will run mprime test 2 to max out the CPUs, and then the GPU Validation Test to (almost) max out TDP across ALL installed CUDA-capable GPUs (i.e. mprime will run two CPUs at 120-140 W each, and GPU Validation will run ~250-300 W on each GPU). I do not run the memory or blend test alongside GPU Validation, since that causes memory stability issues.
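When running the combined CPU + GPU load, I watch power and temperatures from another terminal; a rough sketch, assuming NVIDIA GPUs and lm_sensors:

## CPU package temperatures, refreshed every 5 seconds
watch -n 5 "sensors | grep -i -E 'package|tctl'"
## Per-GPU power draw and temperature, sampled every 5 seconds
nvidia-smi --query-gpu=index,power.draw,temperature.gpu --format=csv -l 5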
Other Linux CPU testing options:
https://linuxhint.com/useful_linux_stress_test_benchmark_cpu_perf/