Exercise: Profiling a simple CG solver
In this hands-on we analyze a conjugate-gradient (CG) linear solver which is derived from the popular HPCG benchmark. It is "matrix free" because it does not actually store the coefficient matrix; the matrix is hard-coded so that the sparse MVM kernel is actually a Jacobi-like stencil update scheme. The source can be found in the folder MFCG.
Build and run
Build the executable from C with:
$ icx -Ofast -xHost -std=c99 -o ./mfcg ./mfcg.c
For Fortran:
$ ifx -Ofast -xHost -o ./mfcg ./mfcg.f90
Test if it is working:
$ ./mfcg 8000 8000
The problem size is specified as two numbers: The outer (first argument) and the inner dimension (second argument) of the grid. The code implements a standard Conjugate-Gradient (CG) algorithm without a stored matrix, i.e., the matrix-vector multiplication is a stencil update sweep. Performance is printed in millions of lattice site updates per second.
Time profile
Compile the code with the -pg switch to instrument with gprof:
For C:
$ icx -Ofast -xHost -std=c99 -pg -o ./mfcg ./mfcg.c
For Fortran:
$ ifx -Ofast -xHost -pg -o ./mfcg ./mfcg.f90
After running the the code you end up with a gmon.out file, which can be converted to readable form by the gprof tool:
$ gprof ./mfcg
The result is a "flat profile" and a "butterfly graph" of the application. Look at the flat profile: Which functions take most of the runtime (the "hot spots")? Is this useful information? Think about what went wrong here (this problem appears only in the C version). Fix it and do the profile again.