Hands-on: Dense matrix-vector multiplication
In this hands-on we will explore performance properties of dense matrix-vector multiplication:
for (int c=0; c<N_cols; c++) {
    for (int r=0; r<N_rows; r++) {
        y[r] = y[r] + a[c*N_rows+r] * x[c];   /* a is stored column-major: column c starts at a[c*N_rows] */
    }
}
Preparation
Copy the source files to your home directory via:
$ cp -a ~n66c0001/DMVM ~
The C and Fortran (F90) source code is available in the C and F90 folders, respectively. Build the executable from C with:
$ icx -Ofast -xHost -std=c99 -o ./dmvm ./dmvm.c
For Fortran:
$ ifx -Ofast -xHost -o ./dmvm ./dmvm.f90
Test if it is working:
$ likwid-pin -c S0:2 ./dmvm 5000 5000 # takes rows and columns as args
The output shows the number of repetitions, the problem size, and the performance in Mflop/s. There is a helper script ./bench.pl in the DMVM folder that allows you to scan the data set size. Use it as follows (here using the compiled Fortran code as an example):
$ ./bench.pl F90/dmvm <N columns>
It is important to fix the clock frequency for these measurements, but we have already done that for you!
The script keeps the number of columns constant (as given; we recommend 10000) and scans the number of rows from 1000 to 200000. It stops once the overall working set exceeds 2 GB. If you pipe the output of bench.pl into the file bench.dat, you can generate a PNG plot of the result with gnuplot:
$ gnuplot bench.plot
Remember that the input data for this gnuplot script is expected in bench.dat!
Benchmarking
What do we expect based on the static code analysis? What does this mean for benchmark planning?
Set the number of columns to 10000 and scan the number of rows with (this should take less than two minutes):
$ ./bench.pl ./C/dmvm 10000 | tee bench.dat
What do we learn from the result? Is this what we expected? How can we measure what is going on?
---- END OF PART 1 --------------
Performance profiling
Instrument the source code with the LIKWID marker API.
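For orientation, here is a minimal sketch of what the instrumented C kernel could look like. The region name "dmvm" is an arbitrary choice, and the header name assumes a recent LIKWID release (older releases provide the macros via likwid.h); the marker macros expand to nothing unless -DLIKWID_PERFMON is defined, and the rest of dmvm.c stays unchanged.
#include <likwid-marker.h>   /* assumption: LIKWID 5.x; older releases: #include <likwid.h> */
/* ... allocation and initialization of a, x, y as in dmvm.c ... */
LIKWID_MARKER_INIT;              /* set up the marker API (no-op without -DLIKWID_PERFMON) */
LIKWID_MARKER_START("dmvm");     /* begin the measured region; the name is our choice */
for (int c=0; c<N_cols; c++) {
    for (int r=0; r<N_rows; r++) {
        y[r] = y[r] + a[c*N_rows+r] * x[c];
    }
}
LIKWID_MARKER_STOP("dmvm");      /* end the measured region */
LIKWID_MARKER_CLOSE;             /* write out the measurement results */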
Build the new version with:
$ icx -Ofast -xHost -std=c99 -DLIKWID_PERFMON -o ./dmvm $LIKWID_INC ./dmvm-marker.c $LIKWID_LIB -llikwid
or
$ ifx -Ofast -xHost -o dmvm $LIKWID_INC ./dmvm-marker.f90 $LIKWID_LIB -llikwid
Test your new version using:
$ likwid-perfctr -C S0:3 -g MEM_DP -m ./dmvm 15000 10000
Repeat the scan of row count using the following command (the last option is a LIKWID metric group name):
$ ./bench-perf.pl C/dmvm 10000 MEM
The output has three columns: the number of matrix rows, the performance, and the number of bytes per iteration transferred to and from memory (when using the MEM group). If you specify "L2" or "L3", you get the traffic to the L2 or L3 cache, respectively.
If you want, you can modify the bench.plot script (change "1:2" on the last line to "1:3") to plot data traffic (bytes per iteration) versus the number of rows, but looking at the raw data does the job pretty well, too.
What are your observations? Can you correlate the observed performance drops with traffic behavior? Which data structure is responsible for this?
--- END OF PART 2 --------------
Optional: Optimization and Validation
What can we do about the performance drops?
Plan and implement an optimization called spatial cache blocking, and make it configurable which target cache you block for (see the sketch below).
Repeat the benchmarking (without setting the -DLIKWID_PERFMON define) and validate the results with profiling.
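For orientation, a minimal sketch of row-wise spatial blocking in C; the block size rb is a hypothetical tuning parameter that you would derive from the capacity of the target cache, everything else follows the original kernel.
int rb = 2048;   /* hypothetical block size: choose so that rb doubles of y (plus x and the streamed part of a) fit into the target cache */
for (int rs=0; rs<N_rows; rs+=rb) {
    int re = (rs+rb < N_rows) ? (rs+rb) : N_rows;   /* end of the current row block */
    for (int c=0; c<N_cols; c++) {
        for (int r=rs; r<re; r++) {
            y[r] = y[r] + a[c*N_rows+r] * x[c];
        }
    }
}
The point is that the current block of y stays resident in the chosen cache while all columns are traversed, so y has to come from memory only once per block instead of once per column sweep.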
Going parallel
Parallelize both the initial and the optimized version with OpenMP. Take care of the reduction on y (see the sketch at the end)!
Benchmark the results and scale out within one socket. What are the results?
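As a starting point, here is a minimal sketch of the parallel initial kernel in C, assuming OpenMP 4.5 array reductions are available (compile with the compiler's OpenMP flag, e.g. -qopenmp for icx); the array-section syntax in the reduction clause is the assumption here.
/* Each thread accumulates into a private copy of y; the copies are summed
   into the shared y when the loop finishes (array reduction, OpenMP >= 4.5). */
#pragma omp parallel for reduction(+: y[0:N_rows])
for (int c=0; c<N_cols; c++) {
    for (int r=0; r<N_rows; r++) {
        y[r] = y[r] + a[c*N_rows+r] * x[c];
    }
}
For the blocked version, an alternative is to parallelize over the row blocks instead; then every thread owns a disjoint part of y and no reduction is needed.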