



Friedrich-Alexander-Universität Erlangen-Nürnberg

## Case study: Sparse Matrix-Vector Multiplication



## Sparse Matrix Vector Multiplication (SpMV)

- Key ingredient in some matrix diagonalization algorithms
  - Lanczos, Davidson, Jacobi-Davidson
- Store only N<sub>nz</sub> nonzero elements of matrix and RHS, LHS vectors with N<sub>r</sub> (number of matrix rows) entries
- "Sparse": N<sub>nz</sub> ~ N<sub>r</sub>
- Average number of nonzeros per row:  $N_{nzr} = N_{nz}/N_r$



## **SpMVM** characteristics

- For large problems, SpMV is inevitably memory-bound
  - Intra-socket saturation effect on modern multicores
- SpMV is easily parallelizable in shared and distributed memory
  - Load balancing
  - Communication overhead
- Data storage format is crucial for performance properties
  - Most useful general format on CPUs: Compressed Row Storage (CRS)
  - Depending on compute architecture

## CRS matrix storage scheme



- val[] stores all the nonzeros (length N<sub>nz</sub>)
- col\_idx[] stores the column index of each nonzero (length N<sub>nz</sub>)
- row\_ptr[] stores the starting index of each new row in val[] (length: N<sub>r</sub>)



## Case study: Sparse matrix-vector multiply

- Strongly memory-bound for large data sets
  - Streaming, with partially indirect access:

```
!$OMP parallel do schedule(???)
do i = 1,Nr
  do j = row_ptr(i), row_ptr(i+1) - 1
    C(i) = C(i) + val(j) * B(col_idx(j))
  enddo
enddo
!$OMP end parallel do
```

Usually many spMVMs required to solve a problem

Now let's look at some performance measurements...

- Strongly memory-bound for large data sets → saturating performance across cores on the chip
- Performance seems to depend on the matrix
- Can we explain this?
- Is there a "light speed" for SpMV?

Optimization?



### SpMV node performance model – CRS (1)

```
      do i = 1, N_r \\       do j = row_ptr(i), row_ptr(i+1) - 1 \\       C(i) = C(i) + val(j) * B(col_idx(j)) \\       enddo \\        enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       enddo \\       end \\       end \\       end \\       end \\       end \\       end
```

Min. load traffic [B]: 
$$(8 + 4) N_{nz} + (4 + 8)N_r + 8 N_c$$
  
Min. store traffic [B]:  $8 N_r$   
Total FLOP count [F]:  $2 N_{nz}$ 

$$B_{C,min} = \frac{12 N_{nz} + 20 N_r + 8 N_c}{2 N_{nz}} \frac{B}{F} = \frac{12 + 20/N_{nzr} + 8/N_{nzc}}{2} \frac{B}{F}$$
Nonzeros per row  $(N_{nzr} = N_{nz}/N_r)$  or column  $(N_{nzc} = N_{nz}/N_c)$ 
Lower bound for code balance:  $B_{C,min} \ge 6 \frac{B}{F} \rightarrow I_{max} \le \frac{1}{6} \frac{F}{B}$ 

### SpMV node performance model – CRS (2)

$$B_{C,min} = \frac{12 + 20/N_{nzr} + 8/N_{nzc}}{2} \frac{B}{F}$$
$$B_{C}(\alpha) = \frac{12 + 20/N_{nzr} + 8\alpha}{2} \frac{B}{F}$$

Consider square matrices:  $N_{nzc} = N_{nzr}$  and  $N_c = N_r$ Note:  $B_C (1/N_{nzr}) = B_{C,min}$ 



Parameter ( $\alpha$ ) quantifies additional traffic for **B(:)** (irregular access):

$$\alpha \ge \frac{1}{N_{nzc}}$$

$$\alpha N_{nzc} \geq 1$$

## The " $\alpha$ effect"

- DP CRS code balance
- α quantifies the traffic for loading the RHS
  - $\alpha = 0 \rightarrow \text{RHS}$  is in cache
  - $\alpha = 1/N_{nzr}$   $\rightarrow$  RHS loaded once
  - $\alpha = 1 \rightarrow$  no cache
  - $\alpha > 1 \rightarrow$  Houston, we have a problem!
- "Target" performance =  $b_S/B_c$
- Caveat: Maximum memory BW may not be achieved with spMVM (see later)
- Can we predict  $\alpha$ ?
- Not in general
- Simple cases (banded, block-structured): Similar to layer condition analysis

 $\rightarrow$  Determine  $\alpha$  by measuring the actual memory traffic ( $\rightarrow$  measured code balance  $B_C^{meas}$ )

$$B_{C}(\alpha) = \frac{12 + 20/N_{nzr} + 8\alpha}{2} \frac{B}{F}$$
$$= \left(6 + 4\alpha + \frac{10}{N_{nzr}}\right) \frac{B}{F}$$

## Determine $\alpha$ (RHS traffic quantification)

$$B_C(\alpha) = \left(6 + 4\alpha + \frac{10}{N_{nzr}}\right) \frac{B}{F} = \frac{V_{meas}}{N_{nz} \cdot 2F} \quad (= B_C^{meas})$$

- V<sub>meas</sub> is the measured overall memory data traffic (using, e.g., likwid-perfctr)
- Solve for α:

$$\alpha = \frac{1}{4} \left( \frac{V_{meas}}{N_{nz} \cdot 2 \text{ bytes}} - 6 - \frac{10}{N_{nzr}} \right)$$

Example: kkt\_power matrix from the UoF collection on one Intel SNB socket

• 
$$N_{nz} = 14.6 \cdot 10^6$$
,  $N_{nzr} = 7.1$ 

•  $V_{meas} \approx 258 \text{ MB}$ 

$$\rightarrow \alpha = 0.36, \, \alpha N_{nzr} = 2.5$$

 $\rightarrow$  RHS is loaded 2.5 times from memory

 $B_{C,min}$ 

 $\frac{B_C(\alpha)}{1} = 1.11$ 

and:



### Three different sparse matrices

#### Benchmark system: Intel Xeon Ivy Bridge E5-2660v2, 2.2 GHz, $b_S = 46.6 \text{ GB/s}$

→ Roofline:  $P_{opt} = {}^{b_S} / {}_{B_{C,min}}$ 

| Matrix    | Ν         | N <sub>nzr</sub> | B <sub>C,min</sub> [B/F] | $P_{opt}$ [GF/s] |
|-----------|-----------|------------------|--------------------------|------------------|
| DLR1      | 278,502   | 143              | 6.1                      | 7.64             |
| scai1     | 3,405,035 | 7.0              | 8.0                      | 5.83             |
| kkt_power | 2,063,494 | 7.08             | 8.0                      | 5.83             |



Roofline Case Studies | SpMV

### Now back to the start...



## Roofline analysis for spMVM

- Conclusion from the Roofline analysis
  - The roofline model does not "work" for spMVM due to the RHS traffic uncertainties
  - We have "turned the model around" and measured the actual memory traffic to determine the RHS overhead
  - Result indicates:
    - 1. how much actual traffic the RHS generates
    - 2. how efficient the RHS access is (compare BW with max. BW)
    - 3. how much optimization potential we have with matrix reordering
- Do not forget about load balancing!
- Consequence: Modeling is not always 100% predictive. It's all about *learning more* about performance properties!

# What about GPUs?

- GPUs need
  - Enough work per kernel launch in order to leverage their parallelism
  - Coalesced access to memory (consecutive threads in a warp should access consecutive memory addresses)
- Plain CRS for SpMV on GPUs is not a good idea
  - 1. Short inner loop
  - 2. Different amount of work per thread
  - 3. Non-coalesced memory access
- Remedy: Use SIMD/SIMT-friendly storage format
   ELLPACK, SELL-C-σ, DIA, ESB,...



# CRS SpMV in CUDA (y = Ax)

```
template <typename VT, typename IT>
global static void
spmv csr(const ST num rows,
          const IT * RESTRICT row_ptrs, const IT * RESTRICT col_idxs,
          const VT * RESTRICT values, const VT * RESTRICT x,
                                                  VT * RESTRICT \mathbf{v})
{
    ST row = threadIdx.x + blockDim.x * blockIdx.x; // 1 thread per row
    if (row < num rows) {</pre>
        VT sum{};
        for (IT j = row ptrs[row]; j < row ptrs[row + 1]; ++j) {</pre>
             sum += values[j] * x[col idxs[j]];
        y[row] = sum;
                                                          B_c(\alpha) = \left(6 + 4\alpha + \frac{6}{N_m}\right)\frac{B}{F}
```

No write-allocate on GPUs for consecutive stores

# SpMV CRS performance on a GPU



NVIDIA Ampere A100 Memory bandwidth  $b_S = 1400 \text{ GB/s}$ 

- Strong "α effect" large deviation from optimal α for many matrices
  - Many cache lines touched b/c every thread handles one row → bad cache usage
- Mediocre memory bandwidth usage (« 1400 GB/s) in many cases
  - Non-coalesced memory access
  - Imbalance across rows/threads of warps

# SELL-C- $\sigma$

Idea

M. Kreutzer et al.: A Unified Sparse Matrix Data Format For Efficient General Sparse Matrix-vector Multiplication On Modern Processors With Wide SIMD Units, SIAM SISC 2014, DOI: <u>10.1137/130930352</u>

- Sort rows according to length within sorting scope  $\sigma$
- Store nonzeros column-major in zero-padded chunks of height C



# SELL-C- $\sigma$ SpMV in CUDA (y=Ax)

```
ST row = threadIdx.x + blockDim.x * blockIdx.x;
ST c = row / C; // the no. of the chunk
ST idx = row % C; // index inside the chunk
```

```
if (row < n_chunks * C) {
    VT tmp{};
    IT cs = chunk_ptrs[c]; // points to start indices of chunks</pre>
```

```
for (ST j = 0; j < chunk_lengths[c]; ++j) {
    tmp += values[cs + idx] * x[col_idxs[cs + idx]];
    cs += C;
}
y[row] = tmp;</pre>
```



# Code balance of SELL-C- $\sigma$ (y=Ax)



When measuring  $B_C^{meas}$ , take care to use the "useful" number of flops (excluding zero padding) for work

# How to choose the parameters *C* and $\sigma$ on GPUs?

#### • *C*

 n × warp size to allow good utilization of GPU threads and cache lines

#### • *σ*

- As small as possible, as large as necessary
- Large  $\sigma$  reduces zero padding (brings  $\beta$  closer to 1)
- Sorting alters RHS access pattern  $\rightarrow \alpha$  depends on  $\sigma$



# SpMV node performance model – GPU



 $\square - \square B_C(alpha)$ 

6 byte/flop

**Θ-Θ** P measured

♦ –♦  $P(B_{C,min})$ 

rrze3\_vv

rrze3

scail

scai2

 $\square - \square P(B_C(alpha))$ 

♦-♦ B<sub>C,min</sub>