NLPE_Durham: Hands-on: Matrix-free CG solver (job script)

In this hands-on we analyze a conjugate-gradient (CG) linear solver derived from the popular HPCG benchmark. It is "matrix-free" because it does not store the coefficient matrix; instead, the matrix is hard-coded so that the sparse matrix-vector multiplication (MVM) kernel becomes a Jacobi-like stencil update sweep.
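To illustrate what "matrix-free" means, here is a minimal sketch of a stencil sweep that acts as a matrix-vector product y = A*x without storing A. The five-point coefficients (4 on the diagonal, -1 for each neighbor), the zero boundary values, and the function name apply_stencil_sketch are assumptions for illustration only; the actual applyStencil() in the course code may use different coefficients and boundary handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical matrix-free operator: y = A*x for a 2D five-point stencil.
 * Assumed coefficients: 4 on the diagonal, -1 for each of the four
 * neighbors; values outside the grid are treated as zero.
 * x and y are row-major arrays of n_outer * n_inner doubles. */
void apply_stencil_sketch(size_t n_outer, size_t n_inner,
                          const double *x, double *y)
{
    assert(x != y); /* an in-place sweep would read already updated values */

    for (size_t i = 0; i < n_outer; ++i) {
        for (size_t j = 0; j < n_inner; ++j) {
            size_t k = i * n_inner + j;      /* linear index of site (i,j) */
            double sum = 4.0 * x[k];
            if (i > 0)           sum -= x[k - n_inner]; /* neighbor in row i-1 */
            if (i + 1 < n_outer) sum -= x[k + n_inner]; /* neighbor in row i+1 */
            if (j > 0)           sum -= x[k - 1];       /* left neighbor */
            if (j + 1 < n_inner) sum -= x[k + 1];       /* right neighbor */
            y[k] = sum;
        }
    }
}
```

Each row of the implicit matrix has at most five nonzero entries, and they are encoded in the loop body instead of being loaded from memory. This is why the data traffic of the sweep is dominated by the vectors x and y alone.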

Preparation

Copy the source files to your home directory (if you have not done so already) via

$ cp -a ~dc-grub1/NLPE-Durham $HOME
$ cd NLPE-Durham/MFCG

Read and edit the job script

The job script lets you decide whether you want to use C or Fortran and whether GCC or the Intel ICX compilers should be used.

$ vi compiler-and-language-selection.conf

Submit the job to get initial output for a single-core run

$ sbatch job-dine2-part1.sh

The problem size is specified as two numbers: the outer (first argument) and the inner (second argument) dimension of the grid. Performance-wise, this distinction matters only for the stencil update part of the algorithm. The code implements a standard conjugate-gradient (CG) algorithm without a stored matrix, i.e., the matrix-vector multiplication is a stencil update sweep. Performance is printed in millions of lattice site updates per second (MLUP/s). This number pertains to the whole algorithm, not just the stencil update.
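As a rough illustration of the performance metric, the following sketch computes millions of lattice site updates per second from the grid dimensions, an iteration count, and a wallclock time. It assumes that one "update" means one grid point processed in one CG iteration; the function name mlups and this accounting are assumptions, so check the source code for how the printed number is actually computed.

```c
#include <assert.h>

/* Millions of lattice site updates per second (MLUP/s).
 * Assumption: every CG iteration counts as one update of each of the
 * n_outer * n_inner grid points. */
double mlups(long n_outer, long n_inner, long iterations, double seconds)
{
    assert(seconds > 0.0);
    double updates = (double)n_outer * (double)n_inner * (double)iterations;
    return updates / seconds / 1.0e6;
}
```

For example, a hypothetical run on a 1000 x 1000 grid that completes 10 iterations in one second would yield mlups(1000, 1000, 10, 1.0), i.e., 10 MLUP/s.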

-- END of part 1 ----------------------------------------------------------------------------------------------------------------

Performance Engineering

Time profile

Take a look at job-dine2-part2.sh. It compiles the code with -pg to enable runtime profiling because we don't know yet which functions are the time-consuming ones.

$ sbatch job-dine2-part2.sh

If you read on in the job script, you will notice that the application is compiled and run twice. The reason is that in the first build all functions are inlined, because -Ofast is used without -fno-inline, so the profile cannot show which function took most of the time.

What is the "hot spot" of the program?

-- END of part 2 ----------------------------------------------------------------------------------------------------------------

OpenMP parallelization

The code already has OpenMP directives. You can submit job-dine2-part3.sh:

$ sbatch job-dine2-part3.sh

It adds the OpenMP option to the compiler command line to activate OpenMP. Run the code on the physical cores of one ccNUMA domain. What is the performance on the full domain as compared to the serial version? Do you have a hypothesis about what the bottleneck might be?

The script does an end-to-end performance counter measurement of the memory bandwidth of the program. Remembering what you learned about the memory bandwidth of a socket, is your hypothesis confirmed?

Make a Roofline model of the hot spot loop you have identified above and of the loop nest in the applyStencil() function!
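As a reminder of the arithmetic, the basic Roofline model predicts P = min(P_peak, I * b_S), where I is the computational intensity of the loop in flop/byte and b_S is the saturated memory bandwidth. A minimal sketch, with the function name and all parameter names being hypothetical:

```c
#include <assert.h>

/* Roofline performance limit in Gflop/s: min(P_peak, I * b_S).
 * peak_gflops   : applicable peak performance of the chip [Gflop/s]
 * bw_gbytes     : saturated memory bandwidth b_S [GB/s]
 * flops_per_it  : floating-point operations per loop iteration
 * bytes_per_it  : memory traffic per loop iteration [byte],
 *                 including write-allocate transfers where they occur */
double roofline_gflops(double peak_gflops, double bw_gbytes,
                       double flops_per_it, double bytes_per_it)
{
    assert(bytes_per_it > 0.0);
    double intensity = flops_per_it / bytes_per_it;  /* I in flop/byte */
    double mem_limit = intensity * bw_gbytes;        /* bandwidth ceiling */
    return mem_limit < peak_gflops ? mem_limit : peak_gflops;
}
```

For instance, a hypothetical loop with 2 flops and 16 bytes of traffic per iteration on a machine with 80 GB/s and 1000 Gflop/s peak gives roofline_gflops(1000.0, 80.0, 2.0, 16.0), i.e., 10 Gflop/s, so it is memory bound. These numbers are placeholders; use the bandwidth and peak values of the actual node and count the flops and bytes of your loops yourself.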

-- END of part 3 ----------------------------------------------------------------------------------------------------------------

Performance profiling

Instrument the hot spot loop and the applyStencil() loop in the source code with the LIKWID marker API. For threaded code, LIKWID_MARKER_START() must be called from every executing thread!
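A minimal sketch of how a threaded loop is typically instrumented is shown here. The kernel scaled_sum, the helper functions, and the region name are hypothetical and only stand in for the loops of the actual code. The header likwid-marker.h is the one shipped with recent LIKWID versions (older versions use likwid.h), and the macros are only active when compiling with -DLIKWID_PERFMON and linking against the LIKWID library.

```c
#include <assert.h>

#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
/* No-op fallbacks so that the code also builds without LIKWID */
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_REGISTER(tag)
#define LIKWID_MARKER_START(tag)
#define LIKWID_MARKER_STOP(tag)
#define LIKWID_MARKER_CLOSE
#endif

/* Call once from the serial part of the program before any region starts */
void profiling_begin(void) { LIKWID_MARKER_INIT; }

/* Call once from the serial part of the program after the last region */
void profiling_end(void) { LIKWID_MARKER_CLOSE; }

/* Hypothetical kernel standing in for one instrumented loop */
double scaled_sum(const double *x, long n, double a)
{
    assert(n >= 0);
    double sum = 0.0;
#pragma omp parallel
    {
        /* START and STOP are inside the parallel region,
         * so every thread executes them */
        LIKWID_MARKER_START("scaled_sum");
#pragma omp for reduction(+:sum)
        for (long i = 0; i < n; ++i)
            sum += a * x[i];
        LIKWID_MARKER_STOP("scaled_sum");
    }
    return sum;
}
```

Placing the start and stop calls inside the parallel region but outside the work-sharing loop keeps the marker overhead out of the loop body. The instrumented binary must then be run under likwid-perfctr with the -m option for the regions to be reported.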

Copy the original code for instrumentation:

$ cp C/mfcg.c C/mfcg-marker.c
# $ cp F90/mfcg.f90 F90/mfcg-marker.f90

Submit job-dine2-part4.sh

$ sbatch job-dine2-part4.sh

Do the hot spot loop and the applyStencil() loop perform according to your Roofline prediction? Validate your model by checking your data traffic volume prediction. Hint: The size of one vector in the code is the product of the two command line arguments; the number of times your hot spot loop is executed is given in the likwid-perfctr output. You can take it from there.
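The traffic prediction from the hint can be sketched as follows, assuming double-precision vectors. The function name and the parameter streams (the number of vectors the loop transfers to or from memory per execution, including any write-allocate transfer) are hypothetical; you have to determine the stream count from the loop body yourself.

```c
#include <assert.h>

/* Predicted memory data volume in GB for a loop that streams over
 * `streams` double-precision vectors of n_outer * n_inner elements
 * and is executed `calls` times. */
double predicted_traffic_gbytes(long n_outer, long n_inner,
                                int streams, long calls)
{
    assert(streams > 0 && calls >= 0);
    double vector_bytes = (double)n_outer * (double)n_inner * sizeof(double);
    return vector_bytes * (double)streams * (double)calls / 1.0e9;
}
```

With placeholder values of a 1000 x 1000 grid, 2 streams, and 100 calls, this predicts about 1.6 GB. Compare your own prediction with the memory data volume that likwid-perfctr reports for the region.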

If you have the time you can continue the analysis with the other important loops in the code. Does your code's performance scale from one socket to two?

Analysis, Optimization, and Validation

Can you think of an optimization that would improve the performance of the entire algorithm? Think about what the overall bottleneck is and how you can reduce the "pressure" on this bottleneck.

You don't have to go all the way with this. Implement a change that you know will improve performance, and try to validate the effect with measurements.

Last modified: Friday, 12 June 2026, 6:40 PM