Exercise: Profiling a simple CG solver
In this hands-on exercise we analyze a conjugate-gradient (CG) linear solver derived from the popular HPCG benchmark. It is "matrix free" because it does not store the coefficient matrix; the matrix is hard-coded, so the sparse MVM kernel is in fact a Jacobi-like stencil update scheme. The source can be found in the folder MFCG.
Build and run
Build the executable from the C source with:
$ icx -Ofast -xHost -std=c99 -o ./mfcg ./mfcg.c
Test if it is working:
$ ./mfcg 8000 8000
The problem size is specified as two numbers: the outer (first argument) and the inner (second argument) dimension of the grid. The code implements a standard CG algorithm without a stored matrix, i.e., the matrix-vector multiplication is a stencil update sweep. Performance is printed in millions of lattice site updates per second.
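To make the "matrix-free" MVM idea concrete, here is a minimal sketch of what such a kernel could look like in C; the function name, array layout, and stencil coefficients are illustrative assumptions and are not taken from mfcg.c:

/* Matrix-free "MVM": the constant stencil coefficients are hard-coded,
 * so y = A*x becomes a Jacobi-like 5-point sweep over the inner grid points.
 * Names and coefficients are illustrative, not copied from mfcg.c. */
void apply_stencil(int nx, int ny, const double *x, double *y)
{
    for (int i = 1; i < nx - 1; ++i) {
        for (int j = 1; j < ny - 1; ++j) {
            y[i * ny + j] = 4.0 * x[i * ny + j]
                          - x[(i - 1) * ny + j] - x[(i + 1) * ny + j]
                          - x[i * ny + j - 1]   - x[i * ny + j + 1];
        }
    }
}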
Time profile
Compile the code with the -pg switch to instrument with gprof:
$ icx -Ofast -xHost -std=c99 -pg -o ./mfcg ./mfcg.c
After running the code you end up with a gmon.out file, which can be converted to readable form by the gprof tool:
$ gprof ./mfcg
The result is a "flat profile" and a "butterfly graph" (call graph) of the application. Look at the flat profile and think about what went wrong here. Fix it and redo the profile.
What is the "hot spot" of the program?
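If the combined output is long, the flat profile and the call graph can also be requested separately with standard gprof options (shown here only as a convenience):
$ gprof -p ./mfcg gmon.out
$ gprof -q ./mfcg gmon.out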
OpenMP parallelization (optional)
The code in mfcg-omp.c already has OpenMP directives. Remove the "-pg" option and add the "-qopenmp" option to the compiler command line to activate OpenMP. Run the code on the 36 physical cores of one Fritz socket. What is the performance on the full socket as compared to the serial version? Do you have a hypothesis about what the bottleneck might be?
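One possible way to build the OpenMP version and keep the threads on the 36 physical cores of one socket is via the standard OpenMP affinity variables; the executable name and problem size below are our own choices, and the exact pinning behavior depends on the machine configuration:
$ icx -Ofast -xHost -std=c99 -qopenmp -o ./mfcg-omp ./mfcg-omp.c
$ OMP_NUM_THREADS=36 OMP_PLACES=cores OMP_PROC_BIND=close ./mfcg-omp 8000 8000
With OMP_PLACES=cores and OMP_PROC_BIND=close the threads are typically placed on consecutive physical cores, which on most dual-socket nodes fills one socket first.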