Hands-on: Hardware Performance Counters
Task: Explore the behavior of the MFPCG benchmark using likwid-perfctr
In this exercise you will analyze and predict the data access pattern of MFPCG and validate your prediction with likwid-perfctr measurements.
Preparation
You can find the benchmark code in the MFCG folder of the teacher account. Copy it again since there might have been updates.
$ cp -r ~x19a0001/MFCG .
Investigate the benchmark code
The MFPCG benchmark implements a matrix-free Conjugate Gradient (CG) solver with a Gauss-Seidel preconditioner. The main solver loop is in Solver.cpp, starting at line 30. Four functions are called in this loop: two of them are in Grid.cpp and the other two are in PDE.cpp.
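For orientation, one iteration of a preconditioned CG solver typically performs the following operations (a generic textbook sketch; the actual loop in Solver.cpp may differ in ordering and function names):

  v     = A * p                  matrix-free operator application
  alpha = (r, z) / (p, v)        dot products
  x     = x + alpha * p          axpby-type update
  r     = r - alpha * v          axpby-type update
  z     = M^(-1) * r             Gauss-Seidel preconditioner
  beta  = (r', z') / (r, z)      dot products (primes denote the updated vectors)
  p     = z' + beta * p          axpby-type update

Mapping these operations to the four functions called in the solver loop helps to predict the data access pattern, i.e., how many arrays are read and written per lattice site update, before you measure it.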
Run benchmark
Whole application measurements
Run on a single core and on a full socket with the MEM_DP group:
$ module load likwid intel
$ make
$ likwid-perfctr -g MEM_DP -C S0:0 ./perf 2500 40000
$ likwid-perfctr -g MEM_DP -C S0:0-15 ./perf 2500 40000
The benchmark itself prints a performance number in MLUP/s (million lattice site updates per second). Questions:
- What is the computational intensity of the code in flops/byte? (One way to estimate it from the counters is sketched after this list.)
- Does the performance scale from 1 core to the full socket?
- Is the clock speed constant and independent of the number of cores used?
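One way to estimate the computational intensity from these runs (a rough sketch; the exact metric names depend on the LIKWID version) is to divide the floating-point work by the memory traffic reported by the MEM_DP group:

  I [flops/byte] = DP FLOP count / memory data volume in bytes
                 = DP performance [MFLOP/s] / memory bandwidth [MByte/s]

Both quantities refer to the same runtime, so the ratio of the two rates gives the same number as the ratio of the totals.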
Analyze binary for the most time-consuming functions
The binary perf-serial is built along with the parallel binary, but with the options -qopenmp-stubs and -pg.
Run the binary and get the runtime profile:
$ ./perf-serial 2500 40000
$ gprof --flat-profile perf-serial gmon.out
What is the hot spot of the program, i.e., where is most of the runtime spent?
In-depth analysis
Instrument the binary yourself using the LIKWID Marker API (a sketch of the marker calls is shown after this list):
- one region around the whole PCG loop (src/Solver.cpp)
- one around each of the two loops in the preconditioner PDE::GSPreCon (src/PDE.cpp)
- one around the loop in Grid::axpby() (src/Grid.cpp)
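A minimal sketch of the Marker API usage is shown below; the region name and the measured loop are placeholders, and the actual instrumentation goes around the loops listed above. The LIKWID_MARKER_* macros expand to nothing unless -DLIKWID_PERFMON is defined at compile time.

  // Sketch only: region name and loop body are placeholders.
  #include <likwid-marker.h>   // older LIKWID versions provide the macros via <likwid.h>

  int main()
  {
      LIKWID_MARKER_INIT;                    // once, at program start

      #pragma omp parallel
      {
          LIKWID_MARKER_REGISTER("axpby");   // optional: register the region in each thread

          LIKWID_MARKER_START("axpby");
          // ... loop to be measured ...
          LIKWID_MARKER_STOP("axpby");
      }

      LIKWID_MARKER_CLOSE;                   // once, before the program exits
      return 0;
  }

Whatever names you pass to LIKWID_MARKER_START/STOP appear as separate regions in the likwid-perfctr -m output, so choose them such that the three regions are easy to tell apart.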
Add $LIKWID_INC, $LIKWID_LIB, -llikwid, and -DLIKWID_PERFMON to the Makefile. Note: these shell variables are set by our likwid module, so they are not portable to other systems.
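The variable names in the provided Makefile may differ; assuming conventional CXXFLAGS/LDFLAGS variables, the additions could look roughly like this (make picks up LIKWID_INC and LIKWID_LIB from the environment set by the likwid module):

  # Sketch only: adapt to the actual variable names used in the Makefile.
  CXXFLAGS += $(LIKWID_INC) -DLIKWID_PERFMON   # include path and marker activation
  LDFLAGS  += $(LIKWID_LIB) -llikwid           # library path and LIKWID library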
Compile the code with:
$ make
First run on a single core with the MEM_DP and FLOPS_DP groups:
$ likwid-perfctr -g <group> -C S0:0 -m ./perf 2500 40000
Look at the following derived metrics in the different regions:
- overall instructions executed
- floating-point performance
- vectorization ratio
- memory read/write data volume ratio
- memory bandwidth
How do the two parts of the preconditioner compare with each other and with axpby? Is the code vectorized? What is the SIMD width? What fraction of the arithmetic instructions are vector instructions? What fraction of the overall arithmetic work is done with vectorized instructions?
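Assuming the FLOPS_DP group reports separate counts of scalar and packed (128-, 256-, and 512-bit) double-precision arithmetic instructions, the two fractions asked for above can be estimated as

  vectorization ratio   = N_packed / (N_packed + N_scalar)
  vectorized work share = (2*N_128 + 4*N_256 + 8*N_512) / (N_scalar + 2*N_128 + 4*N_256 + 8*N_512)

where N_x is the number of retired DP instructions of the given SIMD width and the factors 2, 4, and 8 are the number of double-precision elements per 128-, 256-, and 512-bit instruction.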
Now execute the benchmark using all cores on one socket:
$ likwid-perfctr -g MEM_DP -C S0:0-15 -m ./perf 2500 40000
Questions:
- Do the two parts of the preconditioner scale across cores?
- Does the performance of the axpby loop scale?
- Any theories?
Optional task if you finish early: investigate how the behavior changes with the grid size. Run MFPCG with a 10000x10000 grid and compare the results with those of the 2500x40000 runs:
$ likwid-perfctr -g MEM_DP -C S0:0-15 -m ./perf 10000 10000
- How does the bandwidth change?
- Any other metric that changes significantly?
- What could be the reason in case of changes?