In this hands-on we analyze a conjugate-gradient (CG) linear solver which is derived from the popular HPCG benchmark. It is "matrix free" because it does not actually store the coefficient matrix; the matrix is hard-coded so that the sparse MVM kernel is actually a Jacobi-like stencil update scheme. The source can be found in the folder MFCG.

Build and run

Build the executable from C with:

$ icx -Ofast -xHost -std=c99 -o ./mfcg ./mfcg.c

Test if it is working:

$ ./mfcg  8000 8000

The problem size is specified as two numbers: The outer (first argument) and the inner dimension (second argument) of the grid. The code implements a standard Conjugate-Gradient (CG) algorithm without a stored matrix, i.e., the matrix-vector multiplication is a stencil update sweep. Performance is printed in millions of lattice site updates per second

Time profile

Compile the code with the -pg switch to instrument with gprof:

$ icx -Ofast -xhost -std=c99 -pg -o ./mfcg ./mfcg.c
After running the the code you end up with a gmon.out file, which can be converted to readable form by the gprof tool:

$ gprof ./mfcg

The result is a "flat profile" and a "butterfly graph" of the application. look at the flat profile and think about what went wrong here. Fix it and do the profile again. 

What is the "hot spot" of the program?

OpenMP parallelization (optional)

The code in mfcg-omp.c already has OpenMP directives. Remove the "-pg" option and add the "-qopenmp" option to the compiler command line to activate OpenMP. Run the code on the 36 physical cores of one Fritz socket. What is the performance on the full socket as compared to the serial version? Do you have a hypothesis about what the bottleneck might be?

Last modified: Thursday, 22 February 2024, 11:32 AM