Hands-on: Matrix-free CG solver
In this hands-on we analyze a conjugate-gradient (CG) linear solver derived from the popular HPCG benchmark. It is "matrix free" because the coefficient matrix is never stored; its entries are hard-coded, so the sparse matrix-vector multiplication (MVM) kernel is effectively a Jacobi-like stencil update scheme.
Note: Remember that you have to select the "SPR" node in the JupyterHub interface today!
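For orientation, here is a minimal sketch of what such a matrix-free "MVM" looks like, assuming a 2D five-point stencil with hard-coded coefficients; the actual coefficients, boundary handling, and array layout in mfcg.c / mfcg.f90 may differ.

void applyStencil_sketch(double *y, const double *x, int nx, int ny)
{
    /* y = A*x with the matrix entries hard-coded as stencil coefficients;
       interior points only, boundary handling omitted */
    for (int j = 1; j < ny - 1; ++j)
        for (int i = 1; i < nx - 1; ++i)
            y[j*nx + i] = 4.0 * x[j*nx + i]
                        - x[j*nx + i - 1] - x[j*nx + i + 1]
                        - x[(j-1)*nx + i] - x[(j+1)*nx + i];
}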
Preparation
Copy the source files to your home directory via
$ cp -a ~q26z0001/MFCG ~
Build and run
There are C and F90 source codes available in the C/ and F90/ folders. Build the executable from C with:
$ icx -Ofast -xHost -std=c99 -o ./mfcg ./mfcg.c
For Fortran you need the following (we use the older Intel Fortran compiler ifort here because the new one, ifx, does not support gprof profiling):
$ icx -c timingC.c
$ ifort -Ofast -xHost -o ./mfcg timingC.o ./mfcg.f90
Test if it is working:
$ likwid-pin -c S0:2 ./mfcg 8000 20000
The problem size is specified by two numbers: the outer dimension of the grid (first argument) and the inner dimension (second argument). Performance-wise, this distinction matters only for the stencil update part of the algorithm. The code implements a standard CG algorithm without a stored matrix, i.e., the matrix-vector multiplication is a stencil update sweep. Performance is printed in millions of lattice site updates per second (MLUP/s). This number pertains to the whole algorithm, not just the stencil update.
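As a hedged illustration of this metric (the variable names below are placeholders, not those used in the code), the reported number is essentially:

/* millions of lattice site updates per second (MLUP/s); placeholder names */
double mlups(double inner, double outer, double iterations, double walltime)
{
    return inner * outer * iterations / (walltime * 1.0e6);
}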
Performance Engineering
Time profile
Compile the code with the -pg switch to instrument with gprof (works fine with ifort, too):
$ icx -Ofast -xhost -std=c99 -pg -o ./mfcg ./mfcg.c
After running the code you end up with a gmon.out file, which can be converted to readable form with the gprof tool:
$ gprof ./mfcg
The result is a "flat profile" and a "butterfly graph" of the application. Look at the flat profile and think about what went wrong here. Fix it and redo the profile.
What is the "hot spot" of the program?
-- END of part 1 ----------------------------------------------------------------------------------------------------------------
OpenMP parallelization
The code already has OpenMP directives. Add the "-qopenmp" option to the compiler command line to activate OpenMP. Run the code on the physical cores of one ccNUMA domain. What is the performance on the full domain as compared to the serial version? Do you have a hypothesis about what the bottleneck might be?
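For reference, the directives in the code are of the usual worksharing kind; below is a hedged sketch on a CG-style dot product (the actual loops and clauses in mfcg.c may differ), followed by a possible pinned run.

/* Hedged sketch of a worksharing directive of the kind already present
   in the code, here on a CG-style dot product. */
double dotprod(const double *r, int n)
{
    double dot = 0.0;
    #pragma omp parallel for reduction(+:dot)
    for (int i = 0; i < n; ++i)
        dot += r[i] * r[i];
    return dot;
}

A run pinned to the 13 physical cores of one ccNUMA domain could look like this:
$ OMP_NUM_THREADS=13 likwid-pin -c S0:0-12 ./mfcg 8000 20000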
Do an end-to-end performance counter measurement of the memory bandwidth of the program on the first 13 cores of one SPR node (one ccNUMA domain):
$ likwid-perfctr -C S0:0-12 -g MEM ./mfcg 8000 20000
The memory bandwidth of a ccNUMA domain on this node is roughly 60-65 Gbyte/s (you can also measure it using BWBENCH). Is your hypothesis confirmed?
Make a Roofline model of the hot spot loop you have identified above and of the loop nest in the applyStencil() function!
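If you need a template for the bookkeeping, here is a hedged worked example that assumes the hot spot turns out to be a daxpy-like vector update; substitute the loop you actually found.

/* ASSUMPTION: hot spot looks like  a[i] = b[i] + s*c[i]  (daxpy-like).
 * Per iteration: 2 flops; traffic = load b[i] + load c[i] + store a[i]
 * + write-allocate load of a[i] = 4 x 8 B = 32 B.
 * Computational intensity:  I = 2 flop / 32 B = 1/16 flop/B.
 * With a ccNUMA domain bandwidth of b_S ~ 62 Gbyte/s:
 *   P = min(P_peak, I * b_S) ~ 62/16 GF/s ~ 3.9 GF/s  (memory bound).
 * Repeat the same bookkeeping for the applyStencil() loop nest, counting
 * only the arrays that must actually come from memory.                  */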
Performance profiling
Instrument the hot spot loop and the applyStencil() loop in the source code with the LIKWID marker API (there are some source files provided to help you). For threaded code, LIKWID_MARKER_START() must be called from every executing thread!
(Fortran programmers please look at https://github.com/RRZE-HPC/likwid/wiki/TutorialMarkerF90).
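A minimal sketch of the instrumentation in C is shown below; the region name and loop body are placeholders, the header is likwid-marker.h (older LIKWID installations ship likwid.h instead), and the macros expand to nothing unless -DLIKWID_PERFMON is set.

#include <likwid-marker.h>   /* older LIKWID installations: #include <likwid.h> */

void compute_region(double *a, const double *b, int n)
{
    #pragma omp parallel
    {
        /* every executing thread must call START and STOP itself */
        LIKWID_MARKER_START("HOTSPOT");
        #pragma omp for
        for (int i = 0; i < n; ++i)
            a[i] = a[i] + b[i];          /* placeholder loop body */
        LIKWID_MARKER_STOP("HOTSPOT");
    }
}

int main(void)
{
    LIKWID_MARKER_INIT;                  /* once, in serial code, at the start */
    static double a[1000], b[1000];
    compute_region(a, b, 1000);
    LIKWID_MARKER_CLOSE;                 /* once, at the end */
    return 0;
}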
Build the new version with:
$ icx -qopenmp -Ofast -xhost -std=c99 -DLIKWID_PERFMON -o ./mfcg $LIKWID_INC ./mfcg-marker.c $LIKWID_LIB -llikwid
or:
$ ifort -qopenmp -Ofast -xhost -o ./mfcg $LIKWID_INC ./mfcg-marker.f90 $LIKWID_LIB -llikwid timingC.o
Test your new version using:
$ likwid-perfctr -C S0:0-12 -g MEM_DP -m ./mfcg 8000 20000
Do the hot spot loop and the applyStencil() loop run according to your Roofline prediction? Validate your model by checking your data traffic volume prediction. Hint: the size of one vector in the code is the product of the two command line arguments; the number of times your hot spot loop is executed is given in the likwid-perfctr output. You can take it from there.
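A hedged worked example of that check (the number of streamed vectors is an assumption; use your own analysis of the loop):

/* one vector = inner * outer * 8 B = 20000 * 8000 * 8 B = 1.28 Gbyte
 * predicted traffic per region call
 *   = (number of vectors streamed, incl. write-allocate) * 1.28 Gbyte
 * total predicted traffic
 *   = predicted traffic per call * number of region executions
 *     (taken from the likwid-perfctr output)
 * Compare this with the measured memory data volume of the marked
 * region in the MEM_DP group output.                                */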
If you have the time you can continue the analysis with the other important loops in the code. Does your code's performance scale from one ccNUMA domain to two, four, and eight (a full node)?
Analysis, Optimization, and Validation
Can you think of an optimization that would improve the performance of the entire algorithm? Think about what the overall bottleneck is and how you can reduce the "pressure" on this bottleneck.
You don't have to go all the way with this. Implement something that you know will improve performance, and try to validate the effect with measurements.
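To make the idea concrete without giving away the answer, here is one generic direction (a hedged sketch, not necessarily the intended optimization for mfcg): if two consecutive sweeps stream the same vector, fusing them saves one full memory read of that vector.

/* before: r is written in the first sweep and streamed again in the second */
void update_then_dot(double *r, const double *q, double alpha,
                     double *rho, int n)
{
    for (int i = 0; i < n; ++i)
        r[i] -= alpha * q[i];
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += r[i] * r[i];
    *rho = s;
}

/* after: one sweep over r and q; with OpenMP, s needs a reduction clause */
void update_and_dot(double *r, const double *q, double alpha,
                    double *rho, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i) {
        r[i] -= alpha * q[i];
        s += r[i] * r[i];
    }
    *rho = s;
}

Whether and where something like this (or a different traffic-saving transformation) applies to mfcg is for you to find out; validate any change against the measured memory data volume.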