Analyzing a preconditioned CG solver

Preparation

$ module load likwid intel
$ make

Run benchmark

Whole application measurements

Running on a single core and on a full socket with the MEM_DP group:

$ likwid-perfctr -g MEM_DP -C S0:0 ./perf 2500 40000
$ likwid-perfctr -g MEM_DP -C S0:0-17 ./perf 2500 40000

The group or event set for the measurement is selected with -g. Where to run is specified with -c (only measure on these hardware threads, but do not pin the application to them) or -C (measure and pin, recommended).

The S0 is the so-called thread domain for socket 0; S1 would be socket 1. The notation S0:0 tells LIKWID to run on socket 0 on the first (index 0) hardware thread. S0:0-17 tells LIKWID to use socket 0 with the hardware threads indexed 0 to 17. There are other thread domains; run

$ likwid-pin -p

to see all thread domains with their lists of hardware threads. For the <thread domain>:<range> syntax, the list is sorted before selection so that the physical hardware threads come first. Other thread domains start with different characters: N for the node (without a number), M for a NUMA domain, D for a CPU die and C for an L3 cache.

You can also use a comma-separated list of hardware thread IDs to specify where to run, like 0,4,12 or 2-5,8,11-12. More information about the supported hardware thread selection syntax can be found [here].
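
As an illustration (these exact runs are not part of the exercise), the same benchmark could be pinned to the first eight hardware threads of NUMA domain 0, or to an explicit list of hardware thread IDs:

$ likwid-perfctr -g MEM_DP -C M0:0-7 ./perf 2500 40000
$ likwid-perfctr -g MEM_DP -C 2-5,8,11-12 ./perf 2500 40000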

The benchmark itself prints a performance number in MLUP/s (million lattice site updates per second).

The computational intensity of the code in flops/byte shows how many floating-point operations are performed per byte transferred to and from main memory. Independent of the number of threads used, it is around 0.25. Since this is a whole-application measurement, it covers the whole runtime and is therefore an average over all phases (allocation, initialization, computation, evaluation and cleanup).
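
For reference, the intensity can be cross-checked by hand; assuming the roughly 72 GByte/s NUMA-domain bandwidth quoted further below, it also gives a rough roofline-style ceiling for the achievable floating-point rate (this estimate is not part of the tool output):

  computational intensity = DP flops / bytes transferred to and from main memory
  0.25 flops/byte x 72 GByte/s ≈ 18 GFLOP/s upper bound in the memory-bound case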

Performance does not scale linearly from a single core to the full socket. The single-core run achieves 78 MLUP/s, while the run with 18 hardware threads reaches 420 MLUP/s. This is a speedup of only 5.4, so there is some performance-limiting factor in the code.

In-depth analysis

In the demo, we instrument the binary with the LIKWID Marker API in three files, based on a runtime profile obtained with gprof that revealed the most time-consuming functions:

  1. src/Solver.cpp (whole PCG loop)
  2. src/PDE.cpp (two loops in the preconditioner in PDE::GSPreCon())
  3. src/Grid.cpp (loop in Grid::axpby())

You can find the changes in the list of patch files (apply with patch -p1 < /path/to/patchfile). We add $LIKWID_INC, $LIKWID_LIB, -llikwid and -DLIKWID_PERFMON to the Makefile. Note: These shell variables are set by our likwid module, so they are not portable to other systems.
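
For reference, here is a minimal sketch of what such an instrumentation looks like; the region name and the loop body are placeholders, not the actual code from src/Solver.cpp:

#include <likwid-marker.h>   // Marker API macros; they expand to nothing unless -DLIKWID_PERFMON is set

int main()
{
    LIKWID_MARKER_INIT;                  // once, in the serial part of the program

    #pragma omp parallel
    {
        LIKWID_MARKER_REGISTER("pcg");   // optional: register the region name per thread

        LIKWID_MARKER_START("pcg");
        // ... instrumented loop, e.g. the whole PCG iteration ...
        LIKWID_MARKER_STOP("pcg");
    }

    LIKWID_MARKER_CLOSE;                 // once, before the program exits
    return 0;
}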

We recompile the code with:

$ make clean
$ make

First we run on a single core with the MEM_DP and the FLOPS_DP groups. The -m switch activates the Marker API so that results are reported per instrumented region:

$ likwid-perfctr -g <group> -C S0:0 -m ./perf 2500 40000

When looking at the derived metrics in the different regions, we look for:

  • Overall instructions executed
  • Floating-point performance
  • Vectorization ratio
  • Memory read/write data volume ratio
  • Memory bandwidth

How do the two parts of the preconditioner compare with each other and with axpby? axpby has a higher DP FP rate and all of its FP operations are vectorized. However, it does not use the widest SIMD registers (512 bit with AVX-512). The two parts of the preconditioner are not vectorized at all and therefore use only scalar FP instructions. While they both run for the same time, the forward substitution executes more instructions. The fraction of arithmetic instructions relative to all instructions shows that a high percentage of the instructions in the preconditioner are floating-point instructions. For axpby this fraction is lower, since one 256-bit vector instruction processes four double elements.
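
To illustrate why the compiler treats the kernels so differently, here is a simplified, hypothetical form of the two loop types (not the actual code from src/Grid.cpp or src/PDE.cpp):

// axpby-style loop: iterations are independent, so the compiler can vectorize it
// (with 256-bit registers, one instruction processes four doubles).
void axpby(double a, const double *x, double b, double *y, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + b * y[i];
}

// Gauss-Seidel-style forward substitution: x[i] depends on the freshly updated
// x[i-1]. This loop-carried dependency prevents vectorization, so only scalar
// FP instructions are generated.
void forward_sweep(const double *b, double w, double *x, int n)
{
    for (int i = 1; i < n; ++i)
        x[i] = b[i] - w * x[i - 1];
}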

Now execute the benchmark using all cores on one socket:

$ likwid-perfctr -g MEM_DP -C S0:0-17 -m ./perf 2500 40000

Note: The maximum memory bandwidth of a single NUMA domain of a Fritz node is around 72-74 GByte/s for read and write operations.

The two parts of the preconditioner do not scale with the number of hardware threads, but they are not limited by the memory bandwidth because there is still some headroom (forward 66 GByte/s, backward 55 GByte/s). When looking at the DP FP rates of the individual cores or the min/max/avg values in the statistics table, no load imbalance is visible for either part.

The performance of the axpby loop does not scale with the number of hardware threads, but it achieves a memory bandwidth of 72 GByte/s, which is close to the peak bandwidth of a NUMA domain (for read and write operations, as in a memory copy).

The lower performance of the preconditioner is caused by the OpenMP barriers inside the outer loop, which enforce the required synchronization among the threads. This synchronization is required by the algorithm used. The only option is to find a different algorithm for the preconditioner that does not have this performance limitation.
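
A hypothetical sketch of this pattern (again, not the actual code): only the inner loop can be a worksharing loop, and its implicit barrier synchronizes all threads in every iteration of the sequential outer loop.

// Hypothetical sketch: the outer loop is sequential because each step needs the
// result of the previous one; the implicit barrier of the inner worksharing
// loop is executed once per outer iteration.
void sweep(double **x, int nk, int ni)
{
    #pragma omp parallel
    for (int k = 1; k < nk; ++k) {
        #pragma omp for            // implicit barrier at the end of this loop
        for (int i = 0; i < ni; ++i)
            x[k][i] = 0.5 * (x[k][i] + x[k - 1][i]);
    }
}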
