In this hands-on we will explore performance properties of dense matrix-vector multiplication: 

        for (int c=0; c<N_cols; c++) {
            for (int r=0; r<N_rows; r++) {
                y[r] = y[r] + a[c*N_rows+r] * x[c];
            }
        }

Preparation

The code is available in the folder DMVM.

The C and F90 source codes are available in the folders C and F90, respectively. The job script job-marvin.sh builds the binary and runs it in a loop: the number of columns is fixed at 10000, while the number of rows goes from 1000 to 300000. For each setting of N_rows, the script reports the performance in Mflop/s.
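
As a rough guide to how such a Mflop/s number can be obtained: each inner-loop iteration performs one multiplication and one addition, so one sweep over the matrix amounts to 2 * N_rows * N_cols flops. The following is a minimal sketch, not the code from the DMVM folder; the function names dmvm and dmvm_mflops and the n_sweeps parameter are made up for illustration, and double precision is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Dense matrix-vector multiply, y = y + A * x, with A stored column by
 * column. Same loop order as the kernel above: columns outer, rows inner. */
static void dmvm(double *y, const double *a, const double *x,
                 size_t n_rows, size_t n_cols)
{
    for (size_t c = 0; c < n_cols; c++) {
        for (size_t r = 0; r < n_rows; r++) {
            y[r] = y[r] + a[c * n_rows + r] * x[c];
        }
    }
}

/* Performance in Mflop/s for n_sweeps complete matrix-vector products
 * that took `seconds` of wall-clock time (2 flops per inner iteration). */
static double dmvm_mflops(size_t n_rows, size_t n_cols,
                          size_t n_sweeps, double seconds)
{
    assert(seconds > 0.0);
    double flops = 2.0 * (double)n_rows * (double)n_cols * (double)n_sweeps;
    return flops / seconds / 1.0e6;
}
```

To use it, time n_sweeps calls of dmvm with a wall-clock timer and pass the elapsed seconds to dmvm_mflops. For example, a single sweep with N_rows = 1000 and N_cols = 10000 that takes one second corresponds to 20 Mflop/s.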

Look at the performance vs. the number of rows. What could have happened here?

---- END OF PART 1 --------------

 
 

Performance profiling

There are C and F90 versions of the source code that use the LIKWID marker API around the computational kernel. You can uncomment the appropriate lines in the job script to build and run this version.

(If you want to use the F90 code, you must build it with gfortran, since the LIKWID module was built with GCC.)
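
For orientation, instrumenting a kernel with the LIKWID marker API typically looks like the following sketch. It is not the instrumented code shipped in the DMVM folder: the function name dmvm_profiled_run, the region name "dmvm", and the n_sweeps parameter are assumptions. The header likwid-marker.h is the one used by recent LIKWID releases (older releases provide the macros via likwid.h), and the empty fallback macros let the code build without LIKWID when LIKWID_PERFMON is not defined.

```c
#include <stddef.h>

#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
/* Without -DLIKWID_PERFMON the marker macros expand to nothing. */
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif

/* Runs n_sweeps matrix-vector products inside one marker region.
 * Assumption: this is called exactly once per program run, because
 * LIKWID_MARKER_INIT and LIKWID_MARKER_CLOSE belong at program start
 * and program end, respectively. */
static void dmvm_profiled_run(double *y, const double *a, const double *x,
                              size_t n_rows, size_t n_cols, size_t n_sweeps)
{
    LIKWID_MARKER_INIT;

    LIKWID_MARKER_START("dmvm");   /* counters are read from here ... */
    for (size_t s = 0; s < n_sweeps; s++) {
        for (size_t c = 0; c < n_cols; c++) {
            for (size_t r = 0; r < n_rows; r++) {
                y[r] = y[r] + a[c * n_rows + r] * x[c];
            }
        }
    }
    LIKWID_MARKER_STOP("dmvm");    /* ... to here */

    LIKWID_MARKER_CLOSE;
}
```

Such a binary is typically built with -DLIKWID_PERFMON and linked against the LIKWID library, then run under likwid-perfctr with the -m switch and a performance group selected via -g. The job script is expected to take care of these steps once the appropriate lines are uncommented.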

The script runs the code three times, each time measuring the data traffic over a different data path in bytes per iteration:

  • L2 -> L1 read traffic
  • L3 <-> L2 read/write traffic
  • Memory <-> CPU read/write traffic
 

The output of each run has three columns: the number of matrix rows, the performance, and the number of bytes transferred over the respective data path.
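
If you want to cross-check the third column against a raw data volume (LIKWID performance groups report data volumes in GBytes), the normalization to bytes per inner-loop iteration could look like the sketch below. The function name bytes_per_iteration and the n_sweeps parameter are assumptions, not part of the provided code.

```c
#include <assert.h>
#include <stddef.h>

/* Converts a total data volume in GBytes (1 GByte = 1e9 bytes) into
 * bytes per inner-loop iteration, for n_sweeps complete matrix-vector
 * products of size n_rows x n_cols. */
static double bytes_per_iteration(double volume_gbytes, size_t n_rows,
                                  size_t n_cols, size_t n_sweeps)
{
    double iterations = (double)n_rows * (double)n_cols * (double)n_sweeps;
    assert(iterations > 0.0);
    return volume_gbytes * 1.0e9 / iterations;
}
```

As a point of reference, assuming double-precision data, every matrix element occupies 8 bytes and is read exactly once per sweep.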

What are your observations? Can you correlate the observed performance drops with traffic behavior? Which data structure is responsible for this?





Last modified: Tuesday, 17 March 2026, 5:50 AM