Hands-on: Matrix-free CG solver
In this hands-on we analyze a conjugate-gradient (CG) linear solver derived from the popular HPCG benchmark. It is "matrix free" because the coefficient matrix is never stored; its entries are hard-coded, so the sparse matrix-vector multiplication (MVM) kernel becomes a Jacobi-like stencil update sweep.
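To illustrate what "matrix free" means here, below is a minimal sketch of a generic Jacobi-like 5-point stencil sweep standing in for the sparse MVM y = A*x. The actual kernel in mfcg.c (coefficients, dimensionality, boundary handling) may differ; the point is that no matrix entries are loaded from memory, only the vectors are streamed.

/* Sketch only, not the kernel from mfcg.c. The OpenMP directive is
 * ignored unless the code is compiled with -qopenmp (see below). */
void stencil_mvm(long nouter, long ninner, const double *x, double *y)
{
#pragma omp parallel for schedule(static)
    for (long i = 1; i < nouter - 1; ++i) {
        for (long j = 1; j < ninner - 1; ++j) {
            long k = i * ninner + j;               /* row-major index */
            y[k] = 4.0 * x[k] - x[k - ninner] - x[k + ninner]
                              - x[k - 1]      - x[k + 1];
        }
    }
}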
Preparation
Copy the source files to your home directory via
$ cp -a ~g64g0000/MFCG ~
Build and run
There are C and F90 source codes available in the C/ and F90/ folders. Build the executable from C with:
$ icx -Ofast -xHost -std=c99 -o ./mfcg ./mfcg.c
For Fortran you need the following (we use the old Intel Fortran compiler [ifort] because the new one [ifx] does not support gprof profiling):
$ icx -c timingC.c
$ ifort -Ofast -xHost -o ./mfcg timingC.o ./mfcg.f90
Test if it is working:
$ srun -u --cpu-freq=2400000-2400000:performance likwid-pin -c S0:2 ./mfcg 8000 20000
(Note: The "-u" switch disables SLURM output buffering so that you can see each step of the algorithm right away.) The problem size is specified as two numbers: the outer (first argument) and the inner (second argument) dimension of the grid. Performance-wise, this distinction matters only for the stencil update part of the algorithm. The code implements a standard Conjugate-Gradient (CG) algorithm without a stored matrix, i.e., the matrix-vector multiplication is a stencil update sweep. Performance is printed in millions of lattice site updates per second; this number pertains to the whole algorithm, not just the stencil update.
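For orientation, here is a sketch of one standard (unpreconditioned) CG iteration built around such a stencil sweep; all names are illustrative and not taken from mfcg.c. Besides the stencil sweep, every iteration streams the full vectors through dot products and vector updates, so all of these loops contribute to the reported updates-per-second figure.

/* Sketch of one standard CG iteration. On entry, *rho must hold the
 * current residual norm squared r.r; it is updated for the next pass. */
void stencil_mvm(long nouter, long ninner, const double *x, double *y);

static double dot(long n, const double *a, const double *b)
{
    double s = 0.0;
    for (long i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

void cg_iteration(long nouter, long ninner, double *x, double *r,
                  double *p, double *v, double *rho)
{
    long n = nouter * ninner;
    stencil_mvm(nouter, ninner, p, v);                        /* v = A*p (stencil sweep) */
    double alpha = *rho / dot(n, p, v);
    for (long i = 0; i < n; ++i) x[i] += alpha * p[i];        /* x = x + alpha*p */
    for (long i = 0; i < n; ++i) r[i] -= alpha * v[i];        /* r = r - alpha*v */
    double rho_new = dot(n, r, r);
    double beta = rho_new / *rho;
    for (long i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];   /* p = r + beta*p */
    *rho = rho_new;
}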
Performance Engineering
Time profile
Compile the code with the -pg switch to instrument it for profiling with gprof (this works fine with ifort, too):
$ icx -Ofast -xHost -std=c99 -pg -o ./mfcg ./mfcg.c
After running the code you end up with a gmon.out file, which can be converted to readable form with the gprof tool:
$ gprof ./mfcg
The result is a "flat profile" and a "butterfly graph" of the application. Look at the flat profile and think about what went wrong here. Fix it and redo the profile.
What is the "hot spot" of the program?
-- END of part 1 ----------------------------------------------------------------------------------------------------------------
OpenMP parallelization
The code already has OpenMP directives. Add the "-qopenmp" option to the compiler command line to activate OpenMP. Run the code on the physical cores of one ccNUMA domain. What is the performance on the full domain as compared to the serial version? Do you have a hypothesis about what the bottleneck might be?
Do an end-to-end performance counter measurement of the memory bandwidth of the program:
$ srun -u --cpu-freq=2400000-2400000:performance likwid-perfctr -C S0:0-17 -g MEM ./mfcg 8000 20000
Remembering what you learned about the memory bandwidth of a socket in the hands-on about the BWBENCH code, is your hypothesis confirmed?
Make a Roofline model of the hot spot loop you have identified above.
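As a reminder of the generic model (nothing here is specific to this code): with P_peak the applicable peak performance of the cores used, b_S the saturated memory bandwidth of the ccNUMA domain (the number you measured in the BWBENCH hands-on), and I the computational intensity of the loop, the naive Roofline prediction is

$$
P = \min\left(P_{\mathrm{peak}},\; I \cdot b_S\right),
\qquad
I = \frac{\text{flops per lattice site update}}{\text{bytes transferred to/from memory per update}}\;.
$$

The byte count per update (which arrays are loaded and stored, whether write-allocate transfers apply, and which neighbor accesses come from cache) has to come from your own inspection of the loop body, so treat it as an assumption to be validated in the next step.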
Performance profiling
Instrument the hot spot loop in the source code with the LIKWID marker API (there are some source files there to help you). Note that the header file is called "likwid-marker.h" since LIKWID version 5. For threaded code, LIKWID_MARKER_START() must be called from every executing thread! (Fortran programmers please look at https://github.com/RRZE-HPC/likwid/wiki/TutorialMarkerF90.)
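Here is a minimal, self-contained sketch of where the marker calls go; the region name "STENCIL" and the dummy loop are illustrative only. The macros compile to no-ops unless -DLIKWID_PERFMON is defined.

#include <stdio.h>
#include <likwid-marker.h>

int main(void)
{
    double s = 0.0;

    LIKWID_MARKER_INIT;                      /* once, before any region */
#pragma omp parallel
    {
        LIKWID_MARKER_REGISTER("STENCIL");   /* optional; reduces first-call overhead */
    }

    /* ... inside the solver, around the hot loop: ... */
#pragma omp parallel
    {
        LIKWID_MARKER_START("STENCIL");      /* every executing thread! */
#pragma omp for schedule(static) reduction(+:s)
        for (long i = 0; i < 100000000; ++i)
            s += 1.0e-8 * (double)i;         /* stand-in for the hot loop body */
        LIKWID_MARKER_STOP("STENCIL");
    }

    LIKWID_MARKER_CLOSE;                     /* once, at the end */
    printf("checksum: %g\n", s);
    return 0;
}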
Build the new version with:
$ icx -qopenmp -Ofast -xHost -std=c99 -DLIKWID_PERFMON -o ./mfcg $LIKWID_INC ./mfcg-marker.c $LIKWID_LIB -llikwid
or:
$ ifort -qopenmp -Ofast -xHost -o ./mfcg $LIKWID_INC ./mfcg-marker.f90 $LIKWID_LIB -llikwid timingC.o
Test your new version using:
$ srun -u --cpu-freq=2400000-2400000:performance likwid-perfctr -C S0:0-17 -g MEM_DP -m ./mfcg 8000 20000
Does the hot spot loop run according to your Roofline prediction?
Validate your model by checking your data traffic volume prediction. Hint: The size of one vector in the code is the product of the two command line arguments; the number of times your hot spot loop is executed is given in the likwid-perfctr output. You can take it from there.
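For a quick sanity check of the arithmetic, assuming 8-byte double-precision vector elements (please confirm this in the source), one vector in the test case above occupies

$$
8000 \times 20000 \times 8\;\mathrm{B} = 1.28\;\mathrm{GB},
$$

so your predicted data volume is (bytes per lattice site update from your Roofline analysis) times 1.6e8 sites times the number of times the marked loop was executed, which you can compare directly with the memory data volume reported by likwid-perfctr.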
If you have the time you can continue the analysis with the other important loops in the code. Does your code's performance scale from one socket to two?
Analysis, Optimization, and Validation
Can you think of an optimization that would improve the performance of the entire algorithm? Think about what the overall bottleneck is and how you can reduce the "pressure" on this bottleneck.
You don't have to go all the way with this. Implement something that you know will improve performance, and try to validate the effect with measurements.