Hands-on: Dense matrix-vector multiplication
In this hands-on we will explore performance properties of dense matrix-vector multiplication:
for (int c=0; c<N_cols; c++) {
    for (int r=0; r<N_rows; r++) {
        y[r] = y[r] + a[c*N_rows+r] * x[c];   /* a is stored column-major */
    }
}
Preparation
Copy the source files to your home directory via:
$ cp -a ~r14s0001/DMVM ~
The C and Fortran (F90) source code is available in the C and F90 folders, respectively. Build the executable from C with:
$ icx -Ofast -xHost -std=c99 -o ./dmvm ./dmvm.c
For Fortran:
$ ifx -Ofast -xHost -o ./dmvm ./dmvm.f90
Test if it is working:
$ likwid-pin -c S0:2 ./dmvm 5000 5000 # takes rows and columns as args
The output shows the number of repetitions, the problem size, and the performance in Mflop/s. There is a helper script ./bench.pl in the DMVM folder that allows you to scan the data set size. Use it as follows (here using the compiled Fortran code as an example):
$ ./bench.pl F90/dmvm <N columns>
It is important to fix the clock frequency for these measurements, but we have already done that for you!
The script keeps the number of columns constant (as given; we recommend 10000) and scans the number of rows from 1000 to 200000. It stops once the overall working set exceeds 2 GB. If you pipe the output of bench.pl into the file bench.dat, you can generate a PNG plot of the result with gnuplot:
$ gnuplot bench.plot
Remember that the input data for this gnuplot script is expected in bench.dat!
Since gnuplot is not installed on the compute nodes, you have to load a module first:
$ module use ~unrz139/.modules/modulefiles
$ module load 000-all-spack-pkgs/0.18.0
$ module load fontconfig
$ module load gnuplot
Benchmarking
What do we expect based on the static code analysis? What does this mean for benchmark planning?
Set the number of columns to 10000 and scan the number of rows with the following command (this should take less than two minutes):
$ ./bench.pl ./C/dmvm 10000 | tee bench.dat
What do we learn from the result? Is this what we expected? How can we measure what is going on?
---- END OF PART 1 --------------
Performance profiling
Instrument the source code with the LIKWID marker API.
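For orientation, here is a minimal sketch of what the instrumentation can look like in C; the region name "dmvm", the function name dmvm_kernel, and the exact placement of the markers are illustrative assumptions, not necessarily what dmvm-marker.c does:

/* Hedged sketch of LIKWID marker instrumentation. With -DLIKWID_PERFMON the
 * markers are active; without it they expand to nothing, so the same source
 * also builds as a plain benchmark. */
#ifdef LIKWID_PERFMON
#include <likwid-marker.h>              /* use <likwid.h> for LIKWID < 5.x */
#else
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_START(name)
#define LIKWID_MARKER_STOP(name)
#define LIKWID_MARKER_CLOSE
#endif

void dmvm_kernel(int N_rows, int N_cols,
                 const double *a, const double *x, double *y)
{
    LIKWID_MARKER_START("dmvm");        /* region name is an example */
    for (int c = 0; c < N_cols; c++) {
        for (int r = 0; r < N_rows; r++) {
            y[r] = y[r] + a[c*N_rows+r] * x[c];
        }
    }
    LIKWID_MARKER_STOP("dmvm");
}

/* In main(): call LIKWID_MARKER_INIT once before the first measured region
 * and LIKWID_MARKER_CLOSE once before the program exits. */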
Build the new version with:
$ icx -Ofast -xHost -std=c99 -DLIKWID_PERFMON -o ./dmvm $LIKWID_INC ./dmvm-marker.c $LIKWID_LIB -llikwid
or
$ ifx -Ofast -xHost -o dmvm $LIKWID_INC ./dmvm-marker.f90 $LIKWID_LIB -llikwid
Test your new version using:
$ likwid-perfctr -C S0:3 -g MEM_DP -m ./dmvm 15000 10000
Repeat the scan of row count using the following command (the last option is a LIKWID metric group name):
$ ./bench-perf.pl C/dmvm 10000 MEM
The output has three columns: the number of matrix rows, the performance, and the number of bytes per iteration transferred to and from memory (when using the MEM group). If you specify "L2" or "L3", you get the traffic to the L2 or L3 cache, respectively.
If you want, you can modify the bench.plot script (change "1:2" on the last line to "1:3") to plot the data traffic (bytes per iteration) versus the number of rows, but looking at the raw data does the job pretty well, too.
What are your observations? Can you correlate the observed performance drops with traffic behavior? Which data structure is responsible for this?
--- END OF PART 2 --------------
Optional: Optimization and Validation
What can we do about the performance drops?
Plan and implement an optimization called spatial cache blocking. Make the target cache you block for configurable.
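A minimal sketch of the idea, assuming the original column-major kernel: the row loop is split into stripes of rb rows, so the corresponding chunk of y stays resident in the target cache while the column loop sweeps over it. The function name and the block size rb are placeholders; derive rb from the size of the cache you block for.

/* Spatial cache blocking sketch: rb rows per stripe (placeholder value,
 * chosen so that rb*8 bytes of y fit comfortably into the target cache). */
void dmvm_blocked(int N_rows, int N_cols, const double *a,
                  const double *x, double *y, int rb)
{
    for (int rs = 0; rs < N_rows; rs += rb) {
        int re = (rs + rb < N_rows) ? (rs + rb) : N_rows;   /* stripe end */
        for (int c = 0; c < N_cols; c++) {
            for (int r = rs; r < re; r++) {
                y[r] = y[r] + a[c*N_rows+r] * x[c];
            }
        }
    }
}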
Repeat the plain benchmarking first (build without the -DLIKWID_PERFMON define), then validate the results with profiling.
Going parallel
Parallelize both the initial and optimized version with OpenMP. Take care of the reduction on y!
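One possible approach, shown here as a hedged sketch for the unblocked kernel: parallelizing the column loop makes every thread update the full y, so an array reduction on y is required (OpenMP 4.5 array-section reductions assumed; build with -qopenmp).

/* OpenMP sketch: each thread handles a chunk of columns and accumulates
 * into a private copy of y; the copies are summed at the end of the loop. */
void dmvm_parallel(int N_rows, int N_cols, const double *a,
                   const double *x, double *y)
{
    #pragma omp parallel for reduction(+: y[0:N_rows]) schedule(static)
    for (int c = 0; c < N_cols; c++) {
        for (int r = 0; r < N_rows; r++) {
            y[r] = y[r] + a[c*N_rows+r] * x[c];
        }
    }
}

The blocked version can be parallelized analogously; alternatively, parallelizing over row stripes avoids the reduction entirely.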
Benchmark both versions and scale out within one socket. What do you observe?