



# Case study: Sparse Matrix-Vector Multiplication



## Sparse Matrix Vector Multiplication (SpMV)

- Key ingredient in numerous sparse linear algebra solvers
- Store only N<sub>nz</sub> nonzero elements of matrix and RHS, LHS vectors with N<sub>r</sub> (number of matrix rows) entries
- "Sparse": N<sub>nz</sub> ~ N<sub>r</sub>
- Average number of nonzeros per row:  $N_{nzr} = N_{nz}/N_{r}$



## SpMVM characteristics

- For large problems, SpMV is inevitably memory-bound
  - Intra-socket saturation effect on modern multicores
- SpMV is easily parallelizable in shared and distributed memory
  - Load balancing
  - Communication overhead
- Data storage format is crucial for performance properties
  - Most useful general format on CPUs: Compressed Row Storage (CRS)
  - Depending on compute architecture

### CRS matrix storage scheme



- val[] stores all the nonzeros (length N<sub>nz</sub>)
- col\_idx[] stores the column index of each nonzero (length N<sub>nz</sub>)
- row\_ptr[] stores the starting index of each new row in val[] (length: N<sub>r</sub>)



### Case study: Sparse matrix-vector multiply

- Strongly memory-bound for large data sets
  - Streaming, with partially indirect access:

```
\label{eq:somp} \begin{tabular}{ll} !\$OMP parallel do schedule(???) \\ do i = 1,N_r \\ do j = row\_ptr(i), row\_ptr(i+1) - 1 \\ C(i) = C(i) + val(j) * B(col\_idx(j)) \\ enddo \\ enddo \\ !\$OMP end parallel do \\ \end{tabular}
```

- Usually many spMVMs required to solve a problem
- Now let's look at some performance measurements...

#### Performance characteristics

- Strongly memory-bound for large data sets → saturating performance across cores on the chip
- Performance seems to depend on the matrix
- Can we explain this?

Is there a "light speed" for SpMV?

Optimization?



## SpMV node performance model – CRS (1)

```
do i = 1, N_r
do j = row_ptr(i), row_ptr(i+1) - 1
C(i) = C(i) + val(j) * B(col_idx(j))
enddo
enddo
```

```
real*8 val(N<sub>nz</sub>)
integer*4 col_idx(N<sub>nz</sub>)
integer*4 row_ptr(N<sub>r</sub>)
real*8 C(N<sub>r</sub>)
real*8 B(N<sub>c</sub>)
```

Min. load traffic [B]:  $(8 + 4) N_{nz} + (4 + 8) N_r + 8 N_c$ 

Min. store traffic [B]:  $8 N_r$ 

Total FLOP count [F]:  $2 N_{nz}$ 

$$B_{C,min} = \frac{12 N_{nz} + 20 N_r + 8 N_c}{2 N_{nz}} \frac{B}{F} = \frac{12 + 20/N_{nzr} + 8/N_{nzc}}{2} \frac{B}{F}$$
Nonzeros per row  $(N_{nzr} = N_{nz}/N_r)$  or column  $(N_{nzc} = N_{nz}/N_c)$ 

Lower bound for code balance:  $B_{C,min} \ge 6 \frac{B}{F} \longrightarrow I_{max} \le \frac{1}{6} \frac{F}{B}$ 

## SpMV node performance model – CRS (2)

do i = 1, 
$$N_r$$
  
do j =  $row_ptr(i)$ ,  $row_ptr(i+1) - 1$   
 $C(i) = C(i) + val(j) * B(col_idx(j))$   
enddo  
enddo



$$B_C(\alpha) = \frac{12 + 20/N_{nzr} + 8 \alpha}{2} \frac{B}{F}$$





Parameter ( $\alpha$ ) quantifies additional traffic for **B** (:) (irregular access):

$$\alpha \ge 1/N_{nzc}$$

$$\alpha N_{nzc} \geq 1$$

#### The " $\alpha$ effect"

#### DP CRS code balance

- α quantifies the traffic for loading the RHS
  - $\alpha = 0 \rightarrow RHS$  is in cache
  - $\alpha = 1/N_{nzr}$   $\rightarrow$  RHS loaded once
  - $\alpha = 1 \rightarrow \text{no cache}$
  - $\alpha > 1 \rightarrow$  Houston, we have a problem!
- "Target" performance =  $b_S/B_c$
- Caveat: Maximum memory BW may not be achieved with spMVM (see later)

#### Can we predict $\alpha$ ?

- Not in general
- Simple cases (banded, block-structured): Similar to layer condition analysis
- $\rightarrow$  Determine  $\alpha$  by measuring the actual memory traffic ( $\rightarrow$  measured code balance  $B_C^{meas}$ )

$$B_{C}(\alpha) = \frac{12 + 20/N_{nzr} + 8 \alpha}{2} \frac{B}{F}$$
$$= \left(6 + 4 \alpha + \frac{10}{N_{nzr}}\right) \frac{B}{F}$$

## Determine $\alpha$ (RHS traffic quantification)

$$B_C(\alpha) = \left(6 + 4\alpha + \frac{10}{N_{nzr}}\right) \frac{B}{F} = \frac{V_{meas}}{N_{nz} \cdot 2 F} \quad (= B_C^{meas})$$

- $V_{meas}$  is the measured overall memory data traffic (using, e.g., likwid-perfctr)
- Solve for  $\alpha$ :

$$\alpha = \frac{1}{4} \left( \frac{V_{meas}}{N_{nz} \cdot 2 \text{ bytes}} - 6 - \frac{10}{N_{nzr}} \right)$$

Example: kkt\_power matrix from the UoF collection on one Intel SNB socket

- $N_{nz} = 14.6 \cdot 10^6$ ,  $N_{nzr} = 7.1$
- $V_{meas} \approx 258 \text{ MB}$
- $\rightarrow \alpha = 0.36$ ,  $\alpha N_{nzr} = 2.5$
- → RHS is loaded 2.5 times from memory



11% extra traffic → optimization potential!

10

### Three different sparse matrices

Benchmark system: Intel Xeon Ivy Bridge E5-2660v2, 2.2 GHz,  $b_S = 46.6 \, \mathrm{GB/s}$ 

$$\rightarrow$$
 Roofline:  $P_{opt} = {}^{b_S}/_{B_{C,min}}$ 

| Matrix    | N         | $N_{nzr}$ | $B_{C,min}$ [B/F] | $P_{opt}$ [GF/s] |
|-----------|-----------|-----------|-------------------|------------------|
| DLR1      | 278,502   | 143       | 6.1               | 7.64             |
| scai1     | 3,405,035 | 7.0       | 8.0               | 5.83             |
| kkt_power | 2,063,494 | 7.08      | 8.0               | 5.83             |







DLR1

scai1

kkt\_power

#### Now back to the start...



- $b_S = 46.6 \, \text{GB/s}$ ,  $B_c = 6 \, \text{B/F}$
- Maximum spMVM performance:

$$P_{max} = 7.8 \,\mathrm{GF/s}$$

 DLR1 causes (almost) minimum CRS code balance (as expected)

scai1 measured balance:

 $B_c^{meas} \approx 8.5 \text{ B/F} > B_{C,min}$  (6% higher than min)

- $\rightarrow$  good BW utilization, slightly non-optimal  $\alpha$
- kkt\_power measured balance:

 $B_c^{meas} \approx 8.8 \text{ B/F} > B_{C,min}$  (10% higher than min)

→ performance degraded by load imbalance, fix by block-cyclic schedule

#### Investigating the load imbalance with kkt\_power



Measurements with likwid-perfctr (MEM\_DP group)

static

static,2048

- → Fewer overall instructions, (almost) BW saturation, 50% better performance with load balancing
- → CPI value unchanged!



#### SpMV node performance model – CPU



Intel Xeon Platinum 9242 24c@2.8GHz (turbo)  $b_S = 122 GB/s$ 

15

Matrices taken from: C. L. Alappat et al.: *ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX.* DOI: 10.1002/cpe.6512

# When Roofline for SpMV may not work

#### Reasons for performance not attaining the limit

- 1. Intensity lower than the minimum
  - More RHS traffic than the optimistic limit (  $\frac{4}{N_{nzr}}$  B/F)
- 2. "Slow code"
  - "invisible" performance ceiling due to inefficient instructions or inefficient execution
- 3. Load imbalance
  - A single process/thread cannot saturate the memory bandwidth
- 4. Erratic memory access patterns for RHS
  - Latency dominates

Roofline Case Studies | SpMV

#### What about GPUs?

- GPUs need
  - Enough work per kernel launch in order to leverage their parallelism

 Coalesced access to memory (consecutive threads in a warp should access consecutive memory addresses)

- Plain CRS for SpMV on GPUs is not a good idea
  - 1. Short inner loop
  - Different amount of work per thread
  - 3. Non-coalesced memory access
- Remedy: Use SIMD/SIMT-friendly storage format
  - ELLPACK, SELL-C-σ, DIA, ESB,...



# CRS SpMV in CUDA (y = Ax)

```
template <typename VT, typename IT>
global static void
spmv csr(const ST num rows,
         const IT * RESTRICT row_ptrs, const IT * RESTRICT col_idxs,
         const VT * RESTRICT values, const VT * RESTRICT x,
                                             VT * RESTRICT V)
   ST row = threadIdx.x + blockDim.x * blockIdx.x; // 1 thread per row
   if (row < num rows) {</pre>
       VT sum{};
        for (IT j = row_ptrs[row]; j < row_ptrs[row + 1]; ++j) {
            sum += values[j] * x[col idxs[j]];
       y[row] = sum;
```

$$B_c(\alpha) = \left(6 + 4 \alpha + \frac{6}{N_{nzr}}\right) \frac{B}{F}$$

No write-allocate on GPUs for consecutive stores

# SpMV CRS performance on a GPU



NVIDIA Ampere A100 Memory bandwidth  $b_S = 1400 \text{ GB/s}$ 

- Strong " $\alpha$  effect" large deviation from optimal  $\alpha$  for many matrices
  - Many cache lines touched b/c every thread handles one row → bad cache usage
- Mediocre memory bandwidth usage
   (<< 1400 GB/s) in many cases</li>
  - Non-coalesced memory access
  - Imbalance across rows/threads of warps

19

M. Kreutzer et al.: A Unified Sparse Matrix Data Format For Efficient General Sparse Matrix-vector Multiplication On Modern Processors With Wide SIMD Units, SIAM SISC 2014, DOI: 10.1137/130930352

20

#### Idea

- Sort rows according to length within sorting scope  $\sigma$
- Store nonzeros column-major in zero-padded chunks of height C





# SELL-C- $\sigma$ SpMV in CUDA (y=Ax)

```
template <typename VT, typename IT> global static void
spmv scs(const ST C, const ST n chunks, const IT * RESTRICT chunk ptrs,
        const IT * RESTRICT chunk lengths, const IT * RESTRICT col_idxs,
        const VT * RESTRICT values, const VT * RESTRICT x, VT * RESTRICT y)
  ST row = threadIdx.x + blockDim.x * blockIdx.x;
  ST c = row / C; // the no. of the chunk
  ST idx = row % C; // index inside the chunk
  if (row < n chunks * C) {
      VT tmp{};
      IT cs = chunk ptrs[c]; // points to start indices of chunks
      for (ST j = 0; j < chunk lengths[c]; ++j) {
          tmp += values[cs + idx] * x[col idxs[cs + idx]];
          cs += C;
      y[row] = tmp;
```

Roofline Case Studies | SpMV (c) NHR@FAU 2025

21

# Code balance of SELL-C- $\sigma$ (y=Ax)



When measuring  $B_C^{meas}$ , take care to use the "useful" number of flops (excluding zero padding) for work



22

## How to choose the parameters C and $\sigma$ on GPUs?

- **-** C
  - n × warp size to allow good utilization of GPU threads and cache lines

- **•** 0
  - As small as possible, as large as necessary
  - Large  $\sigma$  reduces zero padding (brings  $\beta$  closer to 1)
  - Sorting alters RHS access pattern  $\rightarrow \alpha$  depends on  $\sigma$



# SpMV node performance model – GPU



**NVIDIA** Ampere A100

 $b_S = 1400 \text{ GB/s}$ 



24

## Roofline analysis for spMVM

- Conclusion from the Roofline analysis
  - The roofline model does not "work" for spMVM due to the RHS traffic uncertainties
  - We have "turned the model around" and measured the actual memory traffic to determine the RHS overhead
  - Result indicates:
    - 1. how much actual traffic the RHS generates
    - 2. how efficient the RHS access is (compare BW with max. BW)
    - 3. how much optimization potential we have with matrix reordering
- Do not forget about load balancing!
- Sparse matrix times multiple vectors bears the potential of huge savings in data volume
- Consequence: Modeling is not always 100% predictive. It's all about learning more about performance properties!