In this hands-on we will explore performance properties of dense matrix-vector multiplication: 

        for (int c=0; c<N_cols; c++) {
            for (int r=0; r<N_rows; r++) {
                y[r] = y[r] + a[c*N_rows+r] * x[c];
            }
        }
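To make the storage layout explicit, the kernel can be wrapped in a small function. This is a minimal sketch, assuming double precision data; the function name `dmvm_kernel` is hypothetical and not taken from the course sources. The index `c*N_rows+r` means that `a` is stored column by column, so the inner loop walks through `a` with stride one and performs two flops (one multiply, one add) per iteration.

```c
#include <stddef.h>

/* Hypothetical wrapper around the kernel above; assumes double precision.
 * a holds an N_rows x N_cols matrix column by column (column-major), so
 * element (r, c) lives at a[c * N_rows + r]. The result is accumulated
 * into y, which the caller must initialize. */
void dmvm_kernel(double *y, const double *a, const double *x,
                 int N_rows, int N_cols)
{
    for (int c = 0; c < N_cols; c++) {
        for (int r = 0; r < N_rows; r++) {
            /* 2 flops per iteration: one multiply, one add */
            y[r] = y[r] + a[(size_t)c * N_rows + r] * x[c];
        }
    }
}
```

The cast to `size_t` is a precaution added in this sketch so that the index does not overflow an `int` for large matrices.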

Preparation

Copy the source files to your home directory via:

$ cp -a ~dc-grub1/NLPE-DURHAM $HOME
$ cd NLPE-DURHAM/DMVM

Read and edit the job script

The job script lets you decide whether you want to use C or Fortran and whether GCC or the Intel ICX compilers should be used.

First, choose the compiler by loading the appropriate module(s). The default is GCC. To switch to Intel, comment out the gcc module line and uncomment the two module lines for the Intel compiler.

Below that are two compilation blocks; comment lines in or out to select the desired compiler and language.

The job script compiles the code and runs it with srun. It fixes the CPU frequency to 2.1 GHz and requests the performance CPU governor to get stable results at base frequency. Afterwards it compiles the code again with the LIKWID Marker API enabled and reruns it with the same CPU settings, measuring the L1 <-> L2 traffic, the L2 <-> L3 traffic, and the main memory traffic.

Submit the job


$ sbatch job-dine2-part1.sh

The job creates a common SLURM output file with the output of the run. The output of dmvm shows the number of repetitions, the problem size, and the performance in Mflop/s.

Then it runs the helper script ./bench.pl in the DMVM folder, which scans the data set size. bench.pl keeps the number of columns constant (as given; we recommend 10000) and scans the number of rows from 1000 to 200000. It stops once the overall working set exceeds 2 GB. The output of this scan is stored in bench-perf.dat.
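To get a feeling for where the scan ends, the working set can be estimated by hand. The following is a rough sketch, assuming 8-byte double precision elements and that the working set consists of the matrix `a` plus the vectors `x` and `y`; how bench.pl actually computes its limit is not shown here, and both function names are hypothetical.

```c
#include <stdint.h>

/* Bytes needed for a (n_rows x n_cols), x (n_cols) and y (n_rows),
 * assuming 8-byte double precision elements (an assumption). */
int64_t working_set_bytes(int64_t n_rows, int64_t n_cols)
{
    return (n_rows * n_cols + n_rows + n_cols) * (int64_t)sizeof(double);
}

/* Largest row count whose working set still fits into limit_bytes. */
int64_t max_rows_within(int64_t limit_bytes, int64_t n_cols)
{
    int64_t elements = limit_bytes / (int64_t)sizeof(double);
    if (elements <= n_cols) {
        return 0; /* not even x fits */
    }
    /* n_rows * (n_cols + 1) + n_cols <= elements */
    return (elements - n_cols) / (n_cols + 1);
}
```

Under these assumptions, with 10000 columns the 2 GB limit would be reached at roughly 25000 to 27000 rows, depending on whether GB is read as decimal or binary, so the scan may end well before 200000 rows.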

 
 

Benchmarking

What do we expect based on the static code analysis? 

 

Take a look at bench-perf.dat. The first column is the number of rows, the second the performance in Mflop/s. What do we learn from the result? Is this what we expected?

# dmvm 10000
1000 5191.22
1100 5070.24
[...]

You can plot the data on the frontend in the terminal with gnuplot and the provided config. 

$ cd NLPE-DURHAM/DMVM
$ ln bench-perf.dat bench.dat
$ gnuplot bench.plot
 
 

--- END OF PART 1 --------------

--- BEGIN OF PART 2 --------------

 
 

Performance profiling

 

Copy the source code and instrument it with the LIKWID marker API.

$ cp C/dmvm.c C/dmvm-marker.c
# $ cp F90/dmvm.f90 F90/dmvm-marker.f90
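Instrumentation with the marker API usually follows the pattern below. This is a minimal sketch, not the course solution: the function names, the region name "dmvm" and the repetition loop are assumptions, and the header name `likwid-marker.h` applies to recent LIKWID versions (older releases ship the macros in `likwid.h`). The fallback macros let the same file compile when `-DLIKWID_PERFMON` is not set.

```c
#include <stddef.h>

#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
/* Empty fallbacks so the code also builds without LIKWID */
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif

/* Hypothetical instrumented kernel: only the code between START and STOP
 * is attributed to the region named "dmvm". */
void dmvm_marked(double *y, const double *a, const double *x,
                 int N_rows, int N_cols, int iter)
{
    LIKWID_MARKER_START("dmvm");
    for (int j = 0; j < iter; j++) {
        for (int c = 0; c < N_cols; c++) {
            for (int r = 0; r < N_rows; r++) {
                y[r] = y[r] + a[(size_t)c * N_rows + r] * x[c];
            }
        }
    }
    LIKWID_MARKER_STOP("dmvm");
}

/* Hypothetical driver: INIT once before the first region,
 * CLOSE once after the last one (typically in main). */
void run_marked(double *y, const double *a, const double *x,
                int N_rows, int N_cols, int iter)
{
    LIKWID_MARKER_INIT;
    dmvm_marked(y, a, x, N_rows, N_cols, iter);
    LIKWID_MARKER_CLOSE;
}
```

Placing the markers around the repetition loop rather than inside it keeps the marker overhead out of the measured kernel. The markers only take effect when the program is built with `-DLIKWID_PERFMON` and run under likwid-perfctr with the `-m` option.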

When you have done so, run it on the DINE2 nodes:

$ sbatch job-dine2-part2.sh

Look at the SLURM output. It contains the output of one run with likwid-perfctr measuring the L2 cache traffic.

Take a look at the hardware performance counter measurements in bench-<L2|L3|MEM>.dat. The first column is again the number of rows, the second is the performance in Mflop/s (not relevant here), and the third is the data volume per inner loop iteration to and from the L2 cache, the L3 cache, or main memory, depending on the file.

# dmvm 10000 L2
1000 589.98 8.05
1100 626.41 8.05

You can plot the data on the frontend in the terminal with gnuplot and the provided config. 

$ cd NLPE-DURHAM/DMVM
# choose one of L2, L3, MEM; -f replaces an existing bench.dat link (e.g. from part 1)
$ ln -f bench-<L2|L3|MEM>.dat bench.dat
# edit bench.plot to use 1:3 instead of 1:2 in plot line
$ gnuplot bench.plot

What are your observations? Can you correlate the observed performance drops with traffic behavior? Which data structure is responsible for this?

--- END OF PART 2 --------------

--- BEGIN OF PART 3 --------------

Optional: Optimization and Validation

What can we do about the performance drops?

Plan and implement an optimization called spatial cache blocking. Make the target cache you block for configurable.

Copy the instrumented code and start your implementation:

$ cp C/dmvm-marker.c C/dmvm-opt.c
# $ cp F90/dmvm-marker.f90 F90/dmvm-opt.f90
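One possible shape of a blocked kernel is sketched below; it is not the reference solution. It assumes that the rows are split into blocks so that the active part of `y` stays in the target cache while all columns are traversed. The function names and the "half of the cache" heuristic for the block size are assumptions made for illustration.

```c
#include <stddef.h>

/* Heuristic (an assumption, not a rule from the course): let the y block
 * occupy at most half of the target cache, leaving room for a and x. */
int block_rows_for_cache(size_t cache_bytes)
{
    size_t rows = cache_bytes / 2 / sizeof(double);
    return rows > 0 ? (int)rows : 1;
}

/* Row-blocked variant of the kernel: for each block of rows, sweep over
 * all columns before moving on to the next block. */
void dmvm_blocked(double *y, const double *a, const double *x,
                  int N_rows, int N_cols, int block_rows)
{
    if (block_rows <= 0) {
        block_rows = N_rows; /* fall back to the unblocked loop order */
    }
    for (int rb = 0; rb < N_rows; rb += block_rows) {
        int r_end = (N_rows - rb < block_rows) ? N_rows : rb + block_rows;
        for (int c = 0; c < N_cols; c++) {
            for (int r = rb; r < r_end; r++) {
                y[r] = y[r] + a[(size_t)c * N_rows + r] * x[c];
            }
        }
    }
}
```

A call such as `dmvm_blocked(y, a, x, N_rows, N_cols, block_rows_for_cache(cache_bytes))` then selects the target cache through a single size parameter. Because every `y[r]` still receives its contributions in ascending column order, the result is identical to that of the unblocked kernel.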

You can get measurements again by submitting the third job script

$ sbatch job-dine2-part3.sh

Did your optimization work out? Can you point to differences in the measurements that show the success?

--- END OF PART 3 --------------

--- BEGIN OF PART 4 --------------

Going parallel

Parallelize both the initial and the optimized versions with OpenMP. Take care of the reduction on y!
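Two common ways to handle `y` are sketched below, as a starting point rather than the intended solution; both function names are hypothetical. The first distributes the column loop and uses an array-section reduction, which requires OpenMP 4.5 or newer. The second distributes the row loop, so each thread updates a disjoint part of `y` and no reduction is needed.

```c
#include <stddef.h>

/* Variant 1: threads share the column loop. Each thread accumulates into
 * a private, zero-initialized copy of y[0..N_rows-1]; the copies are
 * added to y at the end of the loop. */
void dmvm_omp_reduction(double *y, const double *a, const double *x,
                        int N_rows, int N_cols)
{
    #pragma omp parallel for schedule(static) reduction(+:y[0:N_rows])
    for (int c = 0; c < N_cols; c++) {
        for (int r = 0; r < N_rows; r++) {
            y[r] += a[(size_t)c * N_rows + r] * x[c];
        }
    }
}

/* Variant 2: threads share the row loop. With a static schedule every
 * thread works on the same rows for each column, so the updates to y
 * never overlap and the barrier can be dropped with nowait. */
void dmvm_omp_rows(double *y, const double *a, const double *x,
                   int N_rows, int N_cols)
{
    #pragma omp parallel
    {
        for (int c = 0; c < N_cols; c++) {
            #pragma omp for schedule(static) nowait
            for (int r = 0; r < N_rows; r++) {
                y[r] += a[(size_t)c * N_rows + r] * x[c];
            }
        }
    }
}
```

With the reduction variant, the private copies of `y` cost extra memory per thread and, depending on the compiler, may be placed on the thread stack, which can become a problem for long vectors. Compiled without OpenMP support, both functions fall back to the serial kernel.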

Benchmark the results and scale out within one socket. What are the results?

Use the file names dmvm-omp.c and dmvm-omp-opt.c (suffix f90 for Fortran). The job script executes compilation of dmvm-omp-opt.c only if it exists.

$ sbatch job-dine2-part4.sh



Last modified: Friday, 12 June 2026, 6:39 PM