Hands-on: Hardware Performance Counters
Task: Explore the behavior of the MFPCG benchmark using likwid-perfctr
In this exercise you will analyze and predict the data access pattern of MFPCG and validate your prediction with likwid-perfctr measurements.
Preparation
You can find the benchmark code in the MFCG folder of the teacher account. Copy it again since there might have been updates.
$ cp -r ~x19a0001/MFCG .
Investigate the benchmark code
The MFPCG benchmark implements a matrix-free Conjugate Gradient (CG) solver with a Gauss-Seidel preconditioner. The main solver loop is in Solver.cpp, starting at line 30. Four functions are called in this loop: two of them are in Grid.cpp and the other two are in PDE.cpp.
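For orientation, one iteration of a preconditioned CG solver typically performs the following operations (a generic textbook sketch; the actual loop in Solver.cpp may differ in ordering and function names):

  v     = A * p                  matrix-free operator application
  alpha = (r, z) / (p, v)        dot products
  x     = x + alpha * p          axpby-type update
  r     = r - alpha * v          axpby-type update
  z     = M^(-1) * r             Gauss-Seidel preconditioner
  beta  = (r', z') / (r, z)      dot products (primes denote the updated vectors)
  p     = z' + beta * p          axpby-type update

Mapping these operations to the four functions called in the solver loop helps to predict the data access pattern, i.e., how many arrays are read and written per lattice site update, before you measure it.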
Run benchmark
Whole application measurements
Run on a single core and on a full socket with the MEM_DP group:
$ module load likwid intel
$ make
$ likwid-perfctr -g MEM_DP -C S0:0 ./perf 2500 40000
$ likwid-perfctr -g MEM_DP -C S0:0-15 ./perf 2500 40000
The benchmark itself prints a performance number in MLUP/s (million lattice site updates per second). Questions:
- What is the computational intensity of the code in flops/byte? (One way to estimate it from the counters is sketched after this list.)
- Does the performance scale from 1 core to the full socket?
- Is the clock speed constant and independent of the number of cores used?
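One way to estimate the computational intensity from these runs (a rough sketch; the exact metric names depend on the LIKWID version) is to divide the floating-point work by the memory traffic reported by the MEM_DP group:

  I [flops/byte] = DP FLOP count / memory data volume in bytes
                 = DP performance [MFLOP/s] / memory bandwidth [MByte/s]

Both quantities refer to the same runtime, so the ratio of the two rates gives the same number as the ratio of the totals.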
Analyze binary for the most time-consuming functions
The binary perf-serial is built along with the parallel binary, but with the options -qopenmp-stubs and -pg.
Run the binary and get the runtime profile:
$ ./perf-serial 2500 40000
$ gprof --flat-profile perf-serial gmon.out
What is the hot spot of the program, i.e., where is most of the runtime spent?
In-depth analysis
Instrument the binary yourself using the LIKWID Marker API (a sketch of the marker calls is shown after this list):
- one region around the whole PCG loop (src/Solver.cpp)
- one around each of the two loops in the preconditioner PDE::GSPreCon (src/PDE.cpp)
- one around the loop in Grid::axpby() (src/Grid.cpp)
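A minimal sketch of the Marker API usage is shown below; the region name and the measured loop are placeholders, and the actual instrumentation goes around the loops listed above. The LIKWID_MARKER_* macros expand to nothing unless -DLIKWID_PERFMON is defined at compile time.

  // Sketch only: region name and loop body are placeholders.
  #include <likwid-marker.h>   // older LIKWID versions provide the macros via <likwid.h>

  int main()
  {
      LIKWID_MARKER_INIT;                    // once, at program start

      #pragma omp parallel
      {
          LIKWID_MARKER_REGISTER("axpby");   // optional: register the region in each thread

          LIKWID_MARKER_START("axpby");
          // ... loop to be measured ...
          LIKWID_MARKER_STOP("axpby");
      }

      LIKWID_MARKER_CLOSE;                   // once, before the program exits
      return 0;
  }

Whatever names you pass to LIKWID_MARKER_START/STOP appear as separate regions in the likwid-perfctr -m output, so choose them such that the three regions are easy to tell apart.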
Add $LIKWID_INC, $LIKWID_LIB, -llikwid, and -DLIKWID_PERFMON to the Makefile. Note: these shell variables are set by our likwid module, so they are not portable to other systems.
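The variable names in the provided Makefile may differ; assuming conventional CXXFLAGS/LDFLAGS variables, the additions could look roughly like this (make picks up LIKWID_INC and LIKWID_LIB from the environment set by the likwid module):

  # Sketch only: adapt to the actual variable names used in the Makefile.
  CXXFLAGS += $(LIKWID_INC) -DLIKWID_PERFMON   # include path and marker activation
  LDFLAGS  += $(LIKWID_LIB) -llikwid           # library path and LIKWID library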
Compile the code with:
$ make
First run on a single core with the MEM_DP and FLOPS_DP groups:
$ likwid-perfctr -g <group> -C S0:0 -m ./perf 2500 40000
Look at the following derived metrics in the different regions:
- overall instructions executed
- floating-point performance
- vectorization ratio
- memory read/write data volume ratio
- memory bandwidth
How do the two parts of the preconditioner compare with each other and with axpby? Is the code vectorized? What is the SIMD width? What fraction of the arithmetic instructions are vector instructions? What fraction of the overall arithmetic work is done with vectorized instructions?
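Assuming the FLOPS_DP group reports separate counts of scalar and packed (128-, 256-, and 512-bit) double-precision arithmetic instructions, the two fractions asked for above can be estimated as

  vectorization ratio   = N_packed / (N_packed + N_scalar)
  vectorized work share = (2*N_128 + 4*N_256 + 8*N_512) / (N_scalar + 2*N_128 + 4*N_256 + 8*N_512)

where N_x is the number of retired DP instructions of the given SIMD width and the factors 2, 4, and 8 are the number of double-precision elements per 128-, 256-, and 512-bit instruction.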
Now execute the benchmark using all cores on one socket:
$ likwid-perfctr -g MEM_DP -C S0:0-15 -m ./perf 2500 40000
Questions:
- Do the two parts of the preconditioner scale across cores?
- Does the performance of the axpby loop scale?
- Any theories?
Optional task if you finish early: investigate how the behavior changes with the grid size. Run MFPCG with a 10000x10000 grid and compare the results with those of the 2500x40000 runs:
$ likwid-perfctr -g MEM_DP -C S0:0-15 -m ./perf 10000 10000
- How does the bandwidth change?
- Any other metric that changes significantly?
- What could be the reason in case of changes?