mprime for CPU/Memory

mprime is a good tool for quickly isolating a bad individual memory DIMM, or for triggering a motherboard's DIMM slot to report errors.

This is how we use mprime for CPU/Memory burn-in. The instructions below are for Linux; for Windows, use Prime95 and follow the same steps.

Download mprime

## Usually wget the file into a new directory to keep the extracted files in one place; otherwise logs/readmes will fill up whatever directory you're currently in

mkdir mprime
cd mprime
wget http://www.mersenne.org/ftp_root/gimps/p95v303b6.linux64.tar.gz
tar -xf p95v303b6.linux64.tar.gz

Do not use this on non-server single-processor systems with more than 40 cores/threads. mprime/Prime95 is written in assembly language and pushes system resources much harder than most programs; running it may be too extreme for your system, and you may mistake the results (i.e. an abrupt shutdown) for a hardware issue when it may simply be a limitation of your system's ability to keep hardware temperatures under control.
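
If you are not sure what the system has, a quick look at the CPU model and socket/core/thread counts before deciding is harmless; lscpu ships with essentially every Linux distribution:

## Quick check of CPU model, socket count, and core/thread counts before running mprime
lscpu | grep -E 'Model name|Socket\(s\)|^CPU\(s\)|Thread\(s\) per core'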

Using mprime

I only use option 16 (Options/Torture Test), since we just need mprime for stress testing. It defaults to the number of detected threads across all CPUs, so I just hit 'enter' when it asks 'Number of torture test threads to run (x):'.

## Assuming you're not already in the mprime directory you created (if you're doing this all in one go from this article)

cd mprime
./mprime

## The first time you run it, it will ask if you want to join GIMPS - I usually hit 'n' since this is for troubleshooting only


[root@c103454 mprime]# ./mprime
             Main Menu

         1.  Test/Primenet
         2.  Test/Worker threads
         3.  Test/Status
         4.  Test/Continue
         5.  Test/Exit
         6.  Advanced/Test
         7.  Advanced/Time
         8.  Advanced/P-1
         9.  Advanced/ECM
        10.  Advanced/Manual Communication
        11.  Advanced/Unreserve Exponent
        12.  Advanced/Quit Gimps
        13.  Options/CPU
        14.  Options/Resource Limits
        15.  Options/Preferences
        16.  Options/Torture Test
        17.  Options/Benchmark
        18.  Help/About
        19.  Help/About PrimeNet Server
Your choice:


Usually run test (2) for CPU only, test (3) for memory only, or the blend (4) to stress both.

Your choice: 16

Number of torture test threads to run (80):
Choose a type of torture test to run.
  1 = Smallest FFTs (tests L1/L2 caches, high power/heat/CPU stress).
  2 = Small FFTs (tests L1/L2/L3 caches, maximum power/heat/CPU stress).
  3 = Large FFTs (stresses memory controller and RAM).
  4 = Blend (tests all of the above).
Blend is the default.  NOTE: if you fail the blend test but pass the
smaller FFT tests then your problem is likely bad memory or bad memory
controller.
Type of torture test to run (4):

mprime's default settings will use ALL resources on the small FFTs, which maxes out TDP sharply and slows the system heavily during the beginning/smaller tests; the system may become near-unusable if you leave everything at the defaults. To remedy this, I usually set the Min FFT size to 128/256 and leave 1-2 GB of memory free for system processes. Below is an example with the 'Memory to use' setting I use so that mprime does not freeze up the system when starting the test.

Customize settings (N): y
Min FFT size (in K) (4): 128
Max FFT size (in K) (8192):
Memory to use (in MB, 0 = in-place FFTs) (382415): 360000
Time to run each FFT size (in minutes) (6):
Run a weaker torture test (not recommended) (N):

Accept the answers above? (Y):
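
The 360000 MB figure above is just what fit that particular machine. If you want to compute a similar number for your own system (total RAM minus roughly 2 GB of headroom, in MB), something like this works:

## Total system memory in MB minus ~2 GB headroom for the OS and other processes
free -m | awk '/^Mem:/{print $2 - 2048}'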

From here, it is easiest to monitor resources/temperatures with whatever GUI, display, or command-line tools you prefer.
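
On a headless server, even something simple in a second terminal is enough to keep an eye on things; 'sensors' comes from the lm_sensors package, which may need to be installed first:

## Refresh temperature and memory readings every 5 seconds while the torture test runs
watch -n 5 'sensors; free -m'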

Expected outcomes (Pass/Fail)

Fail

If the system crashes/reboots on the memory test (3) or the blend test (4), it could mean a couple of things (or more, as we continue using this tool to troubleshoot systems with varying hardware). If you are not sure whether it rebooted, keeping track of the system uptime or the IPMI event logs will help show the system's history.

  1. Memory is bad
    • You would need to check the IPMI/BMC logs to see if the system flagged an error from one of the DIMM slots
      • 'ipmitool sel list' is another output you can view (this requires 'ipmitool' to be installed). I personally check the BMC web GUI, as it may contain additional information such as temperature sensors/flags, since the BMC event log may only report critical errors
    • You can also run 'edac-util -v' to report any errors from the system's EDAC counters; it can point out which memory DIMM slot is reporting correctable/uncorrectable errors (a combined sketch of these checks follows after this list)
      • Ideally you want 0 across all DIMMs, but fewer than 100 over a week is somewhat negligible. If it is reporting THOUSANDS within a few hours (it may even overflow the EDAC counters), then it is definitely a sign of a bad memory DIMM that needs to be replaced
      • You would want to re-seat the memory first; improperly seated memory, whatever the cause, can produce correctable/negligible errors
      • If the errors are uncorrected, they most likely would have triggered a reboot/shutdown, since those are critical memory errors that affect the whole system
      • edac-util example:
        [root@c103454 mprime]# edac-util -v
        mc0: 0 Uncorrected Errors with no DIMM info
        mc0: 0 Corrected Errors with no DIMM info
        mc0: csrow0: 0 Uncorrected Errors
        mc0: csrow0: CPU_SrcID#0_MC#0_Chan#0_DIMM#0: 0 Corrected Errors
        mc0: csrow0: CPU_SrcID#0_MC#0_Chan#1_DIMM#0: 0 Corrected Errors
        mc0: csrow0: CPU_SrcID#0_MC#0_Chan#2_DIMM#0: 0 Corrected Errors
        mc1: 0 Uncorrected Errors with no DIMM info
        mc1: 0 Corrected Errors with no DIMM info
        mc1: csrow0: 0 Uncorrected Errors
        mc1: csrow0: CPU_SrcID#0_MC#1_Chan#0_DIMM#0: 0 Corrected Errors
        mc1: csrow0: CPU_SrcID#0_MC#1_Chan#1_DIMM#0: 0 Corrected Errors
        mc1: csrow0: CPU_SrcID#0_MC#1_Chan#2_DIMM#0: 0 Corrected Errors
        mc2: 0 Uncorrected Errors with no DIMM info
        mc2: 0 Corrected Errors with no DIMM info
        mc2: csrow0: 0 Uncorrected Errors
        mc2: csrow0: CPU_SrcID#1_MC#0_Chan#0_DIMM#0: 0 Corrected Errors
        mc2: csrow0: CPU_SrcID#1_MC#0_Chan#1_DIMM#0: 0 Corrected Errors
        mc2: csrow0: CPU_SrcID#1_MC#0_Chan#2_DIMM#0: 0 Corrected Errors
        mc3: 0 Uncorrected Errors with no DIMM info
        mc3: 0 Corrected Errors with no DIMM info
        mc3: csrow0: 0 Uncorrected Errors
        mc3: csrow0: CPU_SrcID#1_MC#1_Chan#0_DIMM#0: 0 Corrected Errors
        mc3: csrow0: CPU_SrcID#1_MC#1_Chan#1_DIMM#0: 0 Corrected Errors
        mc3: csrow0: CPU_SrcID#1_MC#1_Chan#2_DIMM#0: 0 Corrected Errors
  2. CPU or Motherboard is bad
    • Both are highly unlikely since the release of scalable processors. Damage to contacts/capacitors on the CPU, or bent pins in the motherboard's CPU socket, may cause misreads on memory or other installed hardware.
      • The only example I have seen previously is errors reported on NUMEROUS DIFFERENT slots instead of just one; it may report a whole row as bad (i.e. DIMM slots DIMM_D1, DIMM_E2, DIMM_F1 all reporting bad, and swapping DIMMs within that row keeps repeating the same event log)
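
As a rough combined sketch of the checks above (it assumes 'ipmitool' and 'edac-util' are installed; both are optional packages on most distros):

## When did the system last boot, and have there been unexpected reboots?
uptime -s
last reboot | head

## Recent BMC event log entries, and the EDAC error counters per DIMM
ipmitool sel list | tail -n 20
edac-util -v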

Pass

The system should still have mprime running after 24 hours. I do not recall whether the tests ever complete on their own, but I usually leave the test running over a weekend and it is still going/reporting when I am back the following Monday. You can determine your own duration depending on your site's expectations; I feel comfortable if a system can run a blend test for 2-3 days straight.
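
If you would rather bound the run than stop it by hand, one option is to wrap mprime in 'timeout' (the 72 hours here is just an example matching the 2-3 days above; '-t' starts the torture test directly, though check the included readme for exact behavior on your version) and then look for errors in results.txt, where mprime logs torture test failures:

## Run the torture test for a fixed 72 hours, then stop it
timeout 72h ./mprime -t

## Torture test failures (e.g. rounding/sum errors) show up in results.txt
grep -iE 'error|warning' results.txt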

Ongoing discovery

The procedure to run the mprime test above will not change much, but the outcomes will. For example, I try not to have any other processes running when using mprime. That was fine before scalable processors, when CPU cache sizes were significantly smaller, but any other memory-heavy process may kill mprime without any specific reason logged in 'results.txt'. An example is trying to run the GPU standalone test from the GPU Validation Test article while mprime is running. To test high wattage/TDP, I will run mprime test 2 to max out the CPUs, and then the GPU Validation Test to (almost) max TDP across ALL installed CUDA-capable GPUs (i.e. mprime will run two CPUs at 120-140W each, and GPU Validation will run ~250-300W on each GPU). The memory/blend tests are not run alongside GPU Validation since that causes memory stability issues.
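
One way to watch the power draw while both loads are running is a second terminal with nvidia-smi for the GPU side (CPU package power needs something like turbostat, if it is installed):

## Per-GPU power draw and temperature, refreshed every 5 seconds
nvidia-smi --query-gpu=index,power.draw,temperature.gpu --format=csv -l 5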

Other Linux CPU testing options:

https://linuxhint.com/useful_linux_stress_test_benchmark_cpu_perf/