Assignment 11 (last)
- Sparse MVM.
You are given a sparse matrix stored in CRS format with the following properties:
- Number of nonzeros: \( 225\times 10^6 \)
- Matrix value data type: float
- Number of rows: \( 26.5\times 10^6 \)
- Column index data type: uint32
- Row pointer data type: uint32
- Data type of the SpMV input and output vectors: float
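For reference, here is a minimal sketch of a CRS SpMV kernel for \( y=y+Ax \) (the array names val, colIdx, and rowPtr follow generic CRS conventions and are not taken from any specific code):

#include <stdint.h>

/* y = y + A*x in CRS format.
   val[]    : nonzero values (float), length Nnz
   colIdx[] : column index of each nonzero (uint32), length Nnz
   rowPtr[] : first index of each row in val/colIdx (uint32), length Nrows+1 */
void spmv_crs(int Nrows, const float *val, const uint32_t *colIdx,
              const uint32_t *rowPtr, const float *x, float *y)
{
    for (int r = 0; r < Nrows; ++r) {
        float tmp = y[r];
        for (uint32_t j = rowPtr[r]; j < rowPtr[r + 1]; ++j)
            tmp += val[j] * x[colIdx[j]];  /* two flops per nonzero */
        y[r] = tmp;
    }
}

Counting which of these arrays must be streamed through the memory interface, and how often, yields the optimistic code balance needed below.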
(a) (10 crd) Calculate the optimistic memory-bound upper performance limit in Gflop/s for an SpMV operation \( y=y+Ax \) with this matrix on an NVIDIA A100 GPU (memory bandwidth 1.3 Tbyte/s, single-precision peak performance 19.4 Tflop/s). Also calculate the minimum time in seconds that one SpMV takes (assuming that all data is already on the device).
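Hint (standard roofline reasoning): with memory bandwidth \( b_S \), total data volume \( V \) per SpMV, and code balance \( B_c \) in byte/flop, the memory-bound limits are
\[ P_\mathrm{max} = \frac{b_S}{B_c} \qquad \text{and} \qquad T_\mathrm{min} = \frac{V}{b_S}. \]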
(b) (5 crd) You run an SpMV benchmark with this matrix and measure a performance of 190 Gflop/s, which falls short of your expectations (well, at least it should). With a profiling tool you also measure an overall data traffic volume over the memory bus of 2.3 Gbyte per SpMV. From these measurements, calculate the actual code balance of this SpMV.
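Hint: \( y=y+Ax \) performs two flops (one multiply, one add) per matrix nonzero, so the measured code balance follows from
\[ B_c = \frac{V_\mathrm{meas}}{2\,N_\mathrm{nz}}. \]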
(c) (10 crd) Calculate how often the RHS vector is loaded from memory and what fraction of the maximum memory bandwidth the SpMV utilizes. What would you suggest trying in order to improve the performance of the kernel on this GPU?
- Slow computing.
It is often stated (sometimes humorously) that all you have to do to make your program scale is to compile it with the "-O0" optimization flag. Substantiate this claim using a speedup model based on Amdahl's Law.
Assume that Amdahl's Law holds for a given code with serial execution time \(T(1)=T_s+T_p\), where \(T_s\) is the serial runtime of the non-parallelizable part and \(T_p\) is the serial runtime of the perfectly parallelizable part. In the parallel case with \(N\) workers, an additional runtime overhead of \(T_c(N)\) applies.
(a) (5 crd) Formulate a model for the strong-scaling speedup function \(S(N)\) in terms of the dimensionless quantities \(s\), \(p\), and \(c(N)\) for the serial fraction, the parallel fraction, and the dimensionless communication overhead, respectively. Hint: \( 0\leq s\leq 1 \) and \( 0\leq p\leq 1 \).
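Reminder: without any overhead, Amdahl's Law reads
\[ S(N) = \frac{1}{s + p/N}, \qquad s+p=1; \]
the task here is to extend this by \( c(N) \).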
(b) (15 crd) Now assume that everything that pertains to code execution is slowed down by a given factor \(\mu<1\), i.e., for \(\mu=0.5\) the code executes at half the original performance. However, the communication time stays unchanged. This mimics the "slow program" scenario, the use of slower CPUs, or a clock frequency reduction. The speedup is now a function of \(\mu\), i.e., \(S(N,\mu)\). Explain whether the statement from the beginning is justified, i.e., can you improve the scaling of your code by making it run slower? What is the only prerequisite for that (assuming that Amdahl's Law with communication holds)? Is speedup a good metric for assessing the quality of a parallel code?
- A CG solver.
In ~ptfs100h/GettingStarted/mfcg.c you will find a simple matrix-free conjugate-gradient solver. It is derived from the popular HPCG benchmark. It is "matrix free" because it does not actually store the coefficient matrix; the matrix is hard-coded so that the sparse MVM kernel is effectively a Jacobi-like stencil update scheme.
Build the executable with:
$ icx -Ofast -xHost -std=c99 -o ./mfcg ./mfcg.c
Test if it is working:
$ likwid-pin -c S0:2 ./mfcg 8000 20000
The problem size is specified as two numbers: the outer (first argument) and the inner (second argument) dimension of the grid. Performance-wise, this matters only for the stencil update part of the algorithm. The code implements a standard Conjugate-Gradient (CG) algorithm without a stored matrix, i.e., the matrix-vector multiplication is a stencil update sweep. Performance is printed in millions of lattice site updates per second (MLUP/s). This number pertains to the whole algorithm (i.e., the iteration loop excluding initializations), not just the stencil update.
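In other words, if the iteration loop runs \( n_\mathrm{it} \) CG iterations in \( t \) seconds, the reported performance is
\[ \frac{8000 \times 20000 \times n_\mathrm{it}}{t \times 10^6}\ \mathrm{MLUP/s}. \]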
(a) (15 crd) Parallelize the program with OpenMP. Include the code in your submission. The iteration loop of the algorithm is implemented in the function CG(); this gives you a hint as to which loops need to be parallelized. A generic sketch of the relevant loop patterns is shown below.
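The loops in a CG iteration are typically of two kinds: vector updates and dot products. A minimal OpenMP sketch of both patterns (identifier names are illustrative, not the actual ones in mfcg.c):

#include <stddef.h>

/* axpy-type vector update: embarrassingly parallel */
void axpy(size_t n, double a, const double *p, double *x)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        x[i] += a * p[i];
}

/* dot product: the reduction clause avoids a race on sum */
double dot(size_t n, const double *u, const double *v)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; ++i)
        sum += u[i] * v[i];
    return sum;
}

Remember to add -qopenmp (or -fopenmp) to the icx compile line.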
(b) (15 crd) Run the code with 1..18 threads on the cores of one ccNUMA domain of a Fritz CPU with a fixed clock speed of 2.4 GHz and the problem size as above (8000x20000). Make a graph of performance in MLUP/s vs. #cores.
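In likwid-pin notation, a ccNUMA (memory) domain is addressed via the M selector; e.g., for six threads on the first domain (likwid-pin also sets the number of OpenMP threads accordingly if OMP_NUM_THREADS is unset):
$ likwid-pin -c M0:0-5 ./mfcg 8000 20000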
(c) (25 crd) Calculate an upper performance limit in MLUP/s for the code on one ccNUMA domain (memory bandwidth about 80 Gbyte/s) by looking at the loops you parallelized and estimating the data traffic. Remember that one LUP in this case is one inner loop update of the whole algorithm, i.e., one iteration of the CG algorithm executes 8000x20000 LUPs. Draw this limit into your diagram from (b).
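Hint: if your traffic estimate comes out at \( V_\mathrm{LUP} \) bytes per lattice site update (summed over all loops of one CG iteration), the limit is
\[ P_\mathrm{max} = \frac{b_S}{V_\mathrm{LUP}}; \]
e.g., a (purely hypothetical) \( V_\mathrm{LUP} = 40\,\mathrm{byte} \) would yield \( 2000\,\mathrm{MLUP/s} \) at \( b_S = 80\,\mathrm{Gbyte/s} \).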