Hands-on: dense matrix-vector multiplication
In this hands-on we will explore performance properties of dense matrix-vector multiplication:
for (int c=0; c<N_cols; c++) {
    for (int r=0; r<N_rows; r++) {
        y[r] = y[r] + a[c*N_rows+r] * x[c];
    }
}
Preparation
The code is available in the folder DMVM.
The C (F90) source code is available in the C (F90) folder, respectively. The job script builds the binary and runs it in a loop: the number of columns is fixed at 10000, and the number of rows goes from 1000 to 300000. The script reports the performance in Mflop/s for each setting of N_rows.
Look at the performance vs. the number of rows: What could have happened here?
---- END OF PART 1 --------------
Performance profiling
Instrument the source code with the LIKWID marker API.
Build the new version with:
$ icx -Ofast -xHost -std=c99 -DLIKWID_PERFMON -o ./dmvm $LIKWID_INC ./dmvm-marker.c $LIKWID_LIB -llikwid
or
$ ifx -Ofast -xHost -o dmvm $LIKWID_INC ./dmvm-marker.f90 $LIKWID_LIB -llikwid
Test your new version using:
$ likwid-perfctr -C S0:3 -g MEM_DP -m ./dmvm 15000 10000
For these runs, fixing the clock frequency is not necessary: we are measuring data traffic, and absolute performance is not important.
Repeat the scan of row count using the following command (the last option is a LIKWID metric group name):
$ ./bench-perf.pl C/dmvm 10000 MEM
The output of this has three columns: the number of matrix rows, the performance, and the number of bytes transferred to and from memory per iteration (when using the MEM group). If you specify "L2" or "L3" instead, you get the traffic to and from the L2 or L3 cache, respectively.
If you want, you can modify the bench.plot script (change "1:2" to "1:3" on the last line) to plot data traffic (bytes per iteration) versus the number of rows, but looking at the raw data does the job pretty well, too.
What are your observations? Can you correlate the observed performance drops with traffic behavior? Which data structure is responsible for this?
---- END OF PART 2 --------------
Optional: Optimization and Validation
What can we do about the performance drops?
Plan and implement an optimization called spatial cache blocking. Make it configurable which target cache you block for.
Repeat the benchmarking only (build without the -DLIKWID_PERFMON define) and validate the results with profiling.
Going parallel
Parallelize both the initial and the optimized version with OpenMP. Take care of the reduction on y!
Benchmark the results and scale out within one socket. What are the results?