Task: Explore the behavior of the MFPCG benchmark using likwid-perfctr

In this exercise you will analyze and predict the data access pattern of MFPCG and validate your prediction with likwid-perfctr measurements.

Preparation

You can find the benchmark code in the MFCG folder of the teacher account. Copy it again since there might have been updates.

$ cp -r ~f51h0001/MFCG .

Investigate the benchmark code

The MFPCG benchmark implements a matrix-free Conjugate Gradient (CG) solver with a Gauss-Seidel preconditioner. The main solver loop starts at line 76 of Solver.cpp. Four functions are called in this loop: two of them are in Grid.cpp and the other two in PDE.cpp.

Run benchmark

Whole application measurements

Run on a single core and on a full socket with the MEM_DP group:

$ module load likwid intel
$ make
$ likwid-perfctr -g MEM_DP -C S0:0 ./perf 2500 40000
$ likwid-perfctr -g MEM_DP -C S0:0-15 ./perf 2500 40000

The benchmark itself prints a performance number in MLUP/s (million lattice site updates per second). Questions:

  1. What is the computational intensity of the code in flops/byte?
  2. Does the performance scale from 1 core to the full socket?  
  3. Is the clock speed constant and independent of the number of cores used?
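
For questions 1 and 2, the arithmetic can be sketched as follows. This is a minimal sketch, not part of the benchmark: the function names are hypothetical, and the inputs are the floating-point rate, the memory bandwidth, and the MLUP/s numbers you read off your own runs.

```cpp
// Computational intensity in flops/byte, computed from two rates that were
// measured over the same run (the runtime cancels out of the ratio).
// Inputs: floating-point rate in MFLOP/s and memory bandwidth in MByte/s.
double computational_intensity(double mflops_per_s, double mbytes_per_s) {
    return mflops_per_s / mbytes_per_s;
}

// Parallel efficiency: measured speedup divided by the ideal speedup n_cores.
// A value near 1.0 means the code scales; a value well below 1.0 means it
// runs into a shared bottleneck.
double parallel_efficiency(double perf_1core, double perf_ncores, int n_cores) {
    return perf_ncores / (perf_1core * n_cores);
}
```

For example, a (made-up) run at 1000 MFLOP/s and 8000 MByte/s would have an intensity of 0.125 flops/byte.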


Analyze binary for the most time-consuming functions

The binary perf-serial is built together with the parallel binary, but with the options -qopenmp-stubs (compile the OpenMP code sequentially) and -pg (add gprof profiling instrumentation).

Run the binary and get the runtime profile:

$ ./perf-serial 2500 40000
$ gprof --flat-profile perf-serial gmon.out

What is the hot spot of the program, i.e., where is most of the runtime spent?

In-depth analysis

Instrument the binary yourself using the LIKWID Marker API:

  • one region around the whole PCG loop (src/Solver.cpp)
  • one around each of the two loops in the preconditioner in PDE::GSPreCon (src/PDE.cpp)
  • one around the loop in Grid::axpby() (src/Grid.cpp)

Add $LIKWID_INC, $LIKWID_LIB, -llikwid and -DLIKWID_PERFMON to the Makefile. Note: These shell variables are set by our likwid module, so they are not portable to other systems.
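
A marker region could look roughly like the sketch below. It is a hedged illustration, not the benchmark's code: axpby_marked, marker_setup and marker_teardown are hypothetical stand-ins, and the region name "AXPBY" is arbitrary. The header name assumes LIKWID 5 (LIKWID 4 provides the same macros in likwid.h). The no-op fallbacks let the code build when -DLIKWID_PERFMON is not set.

```cpp
#include <cstddef>
#include <vector>

#ifdef LIKWID_PERFMON
#include <likwid-marker.h>  // assumption: LIKWID 5; LIKWID 4 uses <likwid.h>
#else
// No-op fallbacks so the code still compiles without LIKWID.
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_START(tag)
#define LIKWID_MARKER_STOP(tag)
#define LIKWID_MARKER_CLOSE
#endif

// Call once at program start and once at program end, e.g. in main().
void marker_setup()    { LIKWID_MARKER_INIT; }
void marker_teardown() { LIKWID_MARKER_CLOSE; }

// Hypothetical stand-in for the loop in Grid::axpby(): z = a*x + b*y.
void axpby_marked(std::vector<double>& z, double a, const std::vector<double>& x,
                  double b, const std::vector<double>& y) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(z.size());
#pragma omp parallel
    {
        // START/STOP sit inside the parallel region so that every thread
        // measures its own share of the loop.
        LIKWID_MARKER_START("AXPBY");
#pragma omp for
        for (std::ptrdiff_t i = 0; i < n; ++i) {
            z[i] = a * x[i] + b * y[i];
        }
        LIKWID_MARKER_STOP("AXPBY");
    }
}
```

If START and STOP are placed outside the parallel region, only the calling thread is measured, so the multi-core numbers would be misleading.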

Compile the code with:

$ make

First run on a single core with the MEM_DP and the FLOPS_DP groups:

$ likwid-perfctr -g <group> -C S0:0 -m ./perf 2500 40000

Look at the following derived metrics in the different regions:

  • overall instructions executed
  • floating-point performance
  • vectorization ratio
  • memory read/write data volume ratio
  • memory bandwidth

How do the two parts of the preconditioner compare with each other and with axpby? Is the code vectorized? What is the SIMD width? What fraction of the arithmetic instructions are vector instructions? What fraction of the overall arithmetic work is done with vectorized instructions?
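
The last two questions are not the same, because a vector instruction does more work than a scalar one. A minimal sketch of the difference, assuming you have the counts of scalar and packed double-precision arithmetic instructions per SIMD width from the FLOPS_DP output; the struct and function names are hypothetical, and widths your CPU does not support are simply set to zero:

```cpp
// Retired double-precision arithmetic instruction counts, split by SIMD width.
struct FpInstrCounts {
    double scalar;     // 1 double per instruction
    double packed128;  // 2 doubles per instruction
    double packed256;  // 4 doubles per instruction
    double packed512;  // 8 doubles per instruction
};

// Fraction of the arithmetic instructions that are vector (packed) instructions.
double vector_instruction_fraction(const FpInstrCounts& c) {
    const double packed = c.packed128 + c.packed256 + c.packed512;
    return packed / (c.scalar + packed);
}

// Fraction of the arithmetic work (element-wise operations) that is done by
// vector instructions: each packed instruction is weighted by its lane count.
double vector_work_fraction(const FpInstrCounts& c) {
    const double packed_work =
        2.0 * c.packed128 + 4.0 * c.packed256 + 8.0 * c.packed512;
    return packed_work / (c.scalar + packed_work);
}
```

With equal numbers of scalar and 256-bit instructions, for example, half of the instructions are vectorized but they perform four fifths of the work.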

Now execute the benchmark using all cores on one socket:

$ likwid-perfctr -g MEM_DP -C S0:0-15 -m ./perf 2500 40000

Questions:

  1. Do the two parts of the preconditioner scale across cores? 
  2. Does the performance of the axpby loop scale?
  3. Any theories?

Optional task if you finish early: Investigate how the results change with a different grid size. Run MFPCG with a grid size of 10000x10000 and compare the results with those of the 2500x40000 runs:

$ likwid-perfctr -g MEM_DP -C S0:0-15 -m ./perf 10000 10000

  • How does the bandwidth change?
  • Do any other metrics change significantly?
  • If so, what could be the reason?
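
When comparing the two configurations, it can help to put numbers on what stays the same and what differs. A minimal sketch, assuming the grid stores double-precision values (suggested by the _DP groups, but not confirmed here) and making no assumption about which of the two arguments is the contiguous dimension; the function names are hypothetical:

```cpp
#include <cstdint>

// Total number of grid points of an nx-by-ny grid.
std::int64_t grid_points(std::int64_t nx, std::int64_t ny) {
    return nx * ny;
}

// Bytes occupied by one grid line of n double-precision values.
std::int64_t line_bytes(std::int64_t n) {
    return n * static_cast<std::int64_t>(sizeof(double));
}
```

Both grids have the same number of points (grid_points(2500, 40000) equals grid_points(10000, 10000)), but the length of one grid line in bytes differs, which you can compare against the cache sizes of the machine.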

Last modified: Sunday, 23 July 2023, 7:02 PM