# **Sparse Matrix-Vector Multiplication**

$$y = A x$$

Prof. Dr. G. Wellein<sup>(a,b)</sup>, Dr. G. Hager<sup>(a)</sup>

- (a) Erlangen National High Performance Computing Center (NHR@FAU)
- (b) Department für Informatik

Friedrich-Alexander-Universität Erlangen-Nürnberg, Sommersemester 2024



# Motivation

#### Algorithm 2. HPCG

```
1: while k \leq iter \& r_{norm}/r_0 > tol do
2: z = MG(A,r)
```

$$3: \quad oldrtz = rtz$$

4: 
$$rtz = \langle r, z \rangle$$

5: 
$$\beta = rtz/oldrtz$$

6: 
$$p = \beta * p + z$$

$$7: \qquad Ap = A * p$$

8: 
$$pAp = \langle p, Ap \rangle$$

9: 
$$\alpha = rtz/pAp$$

10: 
$$x = x + \alpha * p$$

11: 
$$r = r - \alpha * Ap$$

12: 
$$r_{norm} = \langle r, r \rangle$$

13: 
$$r_{norm} = sqrt(r_{norm})$$

14: 
$$k + +$$



Performance Modelling?
Optimal Performance?
Performance Optimizations?

# Our SpMV plan

Performance Engineering for SpMV – CPU

Data layout considerations – GPUs

#### Boundary conditions:

- Node-level (OpenMP / CUDA)
- Application problems / matrices: Standard collection / own work

# Roofline Model - Sparse Matrix Vector Multiplication

SpMV:

$$y = A x$$

Performance engineering of a single SpMV – general structure

- How to store and traverse SpMV
- Can we use RLM? What is the intenstiy of SpMV?
- Is there an maximum code intensity I for SpMV?
- Impact of matrix structure / OpenMP parallelization?
- CPU vs. GPU: Data layouts and more





# Performance Engineering for Sparse Matrix-Vector Multiplication



# Sparse Matrix Vector Multiplication (SpMV)

Key ingredient in many sparse matrix solvers / matrix diagonalization algorithms

Lanczos, Davidson, Jacobi-Davidson, CG, GMRES,.....

#### Minimize memory footprint:

- Store only N<sub>nz</sub> nonzero elements of matrix with N<sub>r</sub> (number of matrix rows) entries
- "Sparse": N<sub>nz</sub> ~ N<sub>r</sub>

(assume square matrices  $N_C = N_R$ )



# SpMVM characteristics

- For large problems, SpMV is inevitably memory-bound
  - Intra-socket saturation effect on modern multicores



- SpMV is easily parallelizable in shared and distributed memory
  - Load balancing
  - Communication overhead
- Data storage format is crucial for performance properties
  - Most useful general format on CPUs: Compressed Row Storage (CRS)
  - May depend on compute architecture and problem (sparsity pattern)

## CRS matrix storage scheme



CRS data structure contains:

- val[] stores all the nonzeros (length  $N_{nz}$ )  $\rightarrow$  double
- col\_idx[] stores column index of each nonzero (length  $N_{nz}$ )  $\rightarrow$  int
- row\_ptr[] stores the starting index of each new row in val[] (length:  $N_r$ )  $\rightarrow$  int



## Case study: Sparse matrix-vector multiply

- Strongly memory-bound for large data sets
  - Mainly streaming data access (matrix data) mixed with partially indirect access (RHS data):

```
\label{eq:somp} \begin{tabular}{ll} !\$OMP parallel do schedule(???) \\ do i = 1,N_r \\ do j = row\_ptr(i), row\_ptr(i+1) - 1 \\ C(i) = C(i) + val(j) * B(col\_idx(j)) \\ enddo \\ enddo \\ !\$OMP end parallel do \\ \end{tabular}
```

- Usually many spMVs required to solve a problem
- Typical dimensions:  $N_R \approx 10^5, ..., 10^9 \& N_{NZR} \approx 10, ..., 100$
- Now let's look at some performance measurements...

#### Performance characteristics

 Strongly memory-bound for large data sets → saturating performance across cores on the chip

 Performance seems to depend on the matrix

- Can we explain this?
- Is there a "light speed" for SpMV?
- Optimization?



# SpMV node performance model – CRS (1)

```
do i = 1, N_r
do j = row_ptr(i), row_ptr(i+1) - 1
C(i) = C(i) + val(j) * B(col_idx(j))
enddo
enddo
```

```
real*8  val(N<sub>nz</sub>)
integer*4 col_idx(N<sub>nz</sub>)
integer*4 row_ptr(N<sub>r</sub>)
real*8  C(N<sub>r</sub>)
real*8  B(N<sub>c</sub>)
```

Min. load traffic [B]:  $(8 + 4) N_{nz} + (4 + 8) N_r + 8 N_c$ 

Min. store traffic [B]:  $8 N_r$ 

Total FLOP count [F]:  $2 N_{nz}$ 

$$B_{C,min} = \frac{12 N_{nz} + 20 N_r + 8 N_c}{2 N_{nz}} \frac{B}{F} = \frac{12 + 20/N_{nzr} + 8/N_{nzc}}{2} \frac{B}{F}$$
Nonzeros per row  $(N_{nzr} = N_{nz}/N_r)$  or column  $(N_{nzc} = N_{nz}/N_c)$ 

Lower bound for code balance:  $B_{C,min} \ge 6 \frac{\mathrm{B}}{\mathrm{F}} \longrightarrow I_{\mathrm{max}} \le \frac{1}{6} \frac{\mathrm{F}}{\mathrm{B}}$ 

# SpMV node performance model – CRS (2)

do i = 1, 
$$N_r$$
  
do j =  $row_ptr(i)$ ,  $row_ptr(i+1) - 1$   
 $C(i) = C(i) + val(j) * B(col_idx(j))$   
enddo  
enddo



$$B_C(\alpha) = \frac{12 + 20/N_{nzr} + 8 \alpha}{2} \frac{B}{F}$$





Parameter  $(\alpha)$  quantifies additional traffic for **B** (:) (irregular access):

$$\alpha \ge 1/N_{nzc}$$

$$\alpha N_{nzc} \geq 1$$

#### The " $\alpha$ effect"

#### CRS code balance

•  $\alpha$  quantifies the traffic for loading the Right Hand Side (RHS) vector

$$= \left(6 + 4\alpha + \frac{10}{N_{nzr}}\right) \frac{B}{F}$$

#### Can we predict $\alpha$ ?

- Not in general
- Simple cases (banded, block-structured): Similar to layer condition analysis
- $\rightarrow$  Determine  $\alpha$  by measuring the actual memory traffic ( $\rightarrow$  measured code balance  $B_C^{meas}$ )

# Determine $\alpha$ (RHS traffic quantification)

$$B_C(\alpha) = \left(6 + 4\alpha + \frac{10}{N_{nzr}}\right) \frac{B}{F} = \frac{V_{meas}}{N_{nz} \cdot 2 F} \quad (= B_C^{meas})$$

- $V_{meas}$  is the measured overall memory data traffic (using, e.g., likwid-perfctr)
- Solve for  $\alpha$ :

$$\alpha = \frac{1}{4} \left( \frac{V_{meas}}{N_{nz} \cdot 2 \text{ bytes}} - 6 - \frac{10}{N_{nzr}} \right)$$

Example: kkt\_power matrix from the UoF collection (one Intel SNB socket)

- $N_{nz} = 14.6 \cdot 10^6$ ,  $N_{nzr} = 7.1$
- $V_{meas} \approx 258 \text{ MB}$
- $\rightarrow \alpha = 0.36$ ,  $\alpha N_{nzr} = 2.5$
- → RHS is loaded 2.5 times from memory



11% extra traffic → optimization potential!

## Three different sparse matrices

Roofline performance prediction : 
$$P_{opt} = I * b_S = {}^{b_S}/_{B_{C,min}}$$

Benchmark system: Intel Xeon Ivy Bridge E5-2660v2, 2.2 GHz,  $b_S = 46.6 \, \mathrm{GB/s}$ 

| Matrix    | N         | $N_{nzr}$ | $B_{C,min}$ [B/F] | P <sub>opt</sub> [GF/s] |
|-----------|-----------|-----------|-------------------|-------------------------|
| DLR1      | 278,502   | 143       | 6.1               | 7.64                    |
| scai1     | 3,405,035 | 7.0       | 8.0               | 5.83                    |
| kkt_power | 2,063,494 | 7.08      | 8.0               | 5.83                    |







DLR1 scai1 kkt\_power

#### Now back to the start...



- $b_S = 46.6 \, \text{GB/s}$ ,  $B_c = 6 \, \text{B/F}$
- Maximum spMVM performance:

$$P_{max} = 7.8 \,\mathrm{GF/s}$$

**DLR1** causes (almost) minimum CRS code balance (as expected)

scai1 measured balance:

 $B_c^{meas} \approx 8.5 \text{ B/F} > B_{C,min}$  (6% higher than min)

- $\rightarrow$  good BW utilization, slightly non-optimal  $\alpha$
- kkt\_power measured balance:

 $B_c^{meas} \approx 8.8 \text{ B/F} > B_{C.min}$  (10% higher than min)

→ performance degraded by load imbalance, fix by block-cyclic schedule

## Investigating the load imbalance with kkt\_power



Measurements with likwid-perfctr (MEM\_DP group)

static

static,2048

- → Fewer overall instructions, (almost) BW saturation, 50% better performandce with load balancing
- → CPI value unchanged!



## SpMV node performance model – CPU



Intel Xeon Platinum 9242 24c@2.8GHz (turbo)  $b_S = 122 \ GB/s$ 

Matrices taken from: C. L. Alappat et al.: *ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX.* In print. Preprint: arXiv:2103.0301



# Data layout considerations – GPUs



## What about GPUs?

- GPUs need
  - Sufficient work per kernel launch in order to leverage their parallelism

 Coalesced access to memory (consecutive threads in a warp should access consecutive memory addresses)

- Plain CRS for SpMV on GPUs is not a good idea
  - Short inner loop
  - 2. Different amount of work per thread
  - 3. Non-coalesced memory access
- Remedy: Use SIMD/SIMT-friendly storage format
  - ELLPACK, SELL-C-σ, DIA, ESB,...



### What about GPUs?



- Each GPU thread computes one row, iterates over column indices
- This is the best mapping for CRS:
  - Enough parallelism to saturate the GPU (unless matrix is small)
  - Consecutive threads use similar data, spatial locality is used
  - No reduction among threads, each thread computes its own sum

But plain CRS has problems on GPUs!

# CRS SpMV in CUDA (y = Ax)

```
template <typename VT, typename IT>
global static void
spmv csr(const ST num rows,
         const IT * RESTRICT row ptrs, const IT * RESTRICT col idxs,
         const VT * RESTRICT values, const VT * RESTRICT x,
                                             VT * RESTRICT V)
   ST row = threadIdx.x + blockDim.x * blockIdx.x; // 1 thread per row
   if (row < num rows) {</pre>
       VT sum{};
       for (IT j = row ptrs[row]; j < row ptrs[row + 1]; ++j) {
            sum += values[j] * x[col idxs[j]];
       y[row] = sum;
```

$$B_c(\alpha) = \left(6 + 4 \alpha + \frac{6}{N_{nzr}}\right) \frac{B}{F}$$

No write-allocate on GPUs for consecutive stores

# SpMV CRS performance on a GPU



NVIDIA Ampere A100 Memory bandwidth  $b_S = 1400 \text{ GB/s}$ 

- Strong " $\alpha$  effect" large deviation from optimal  $\alpha$  for many matrices
  - Many cache lines touched b/c every thread handles one row → bad cache usage
- Mediocre memory bandwidth usage (<< 1400 GB/s) in many cases</li>
  - Non-coalesced memory access
  - Imbalance across rows/threads of warps

# CRS SpMV on GPUs: scattered loads





- Loads are executed in lock step on GPUs too
- GPUs prefer compact "coalescable" addresses for each load (i.e. consecutive access across threads)

#### CRS vs. GPU

- Row-wise storage format but access pattern orthogonal! → Scattered loads within warp
- Scattered loads need more cycles
- Scattered values occupy more cache lines
- Higher latencies and redundant data transfers

## CRS SPMV on GPUs – Problems: Idle threads



- Threads are grouped in warps
- Threads in a warp execute in lockstep, similar to SIMD
- Problem: loop over column indices can have different trip count for each vector
- Threads in a warp that have completed the loop are masked off
- All threads in a warp have to wait for the thread with most non-zeros

### SELL-C-σ

M. Kreutzer et al.: A Unified Sparse Matrix Data Format For Efficient General Sparse Matrix-vector Multiplication On Modern Processors With Wide SIMD Units, SIAM SISC 2014, DOI: 10.1137/130930352

#### Idea

- Sort rows according to length within sorting scope  $\sigma$
- Store nonzeros column-major in zero-padded chunks of height C





# SELL-C- $\sigma$ SpMV in CUDA (y=Ax)

```
template <typename VT, typename IT> global static void
spmv_scs(const ST C, const ST n_chunks, const IT * RESTRICT chunk_ptrs,
        const IT * RESTRICT chunk lengths, const IT * RESTRICT col idxs,
        const VT * RESTRICT values, const VT * RESTRICT x, VT * RESTRICT y)
  ST row = threadIdx.x + blockDim.x * blockIdx.x;
  ST c = row / C; // the no. of the chunk
  ST idx = row % C; // index inside the chunk
  if (row < n chunks * C) {
      VT tmp{};
      IT cs = chunk ptrs[c]; // points to start indices of chunks
      for (ST j = 0; j < chunk lengths[c]; ++j) {
          tmp += values[cs + idx] * x[col idxs[cs + idx]];
          cs += C;
      y[row] = tmp;
```

# Code balance of SELL-C- $\sigma$ (y=Ax)



When measuring  $B_C^{meas}$ , take care to use the "useful" number of flops (excluding zero padding) for work



PTfS 2024 July 1, 2024 29

# How to choose the parameters C and $\sigma$ on GPUs?

- **.** C
  - n× warp size to allow good utilization of GPU threads and cache lines

- **•** 0
  - As small as possible, as large as necessary
  - Large  $\sigma$  reduces zero padding (brings  $\beta$  closer to 1)
  - Sorting alters RHS access pattern  $\rightarrow \alpha$  depends on  $\sigma$



# SpMV node performance model – GPU



**NVIDIA Ampere A100** 

 $b_S = 1400 \, \text{GB/s}$ 



### SELL-C- $\sigma$ kernel on CPUs

#### Example C = 4 without further unrolling

```
for(i = 0; i < N/4; ++i)
  for(j = 0; j < cl[i]; ++j)
   y[i*4+0] += val[cs[i]+j*4+0] *
              x[col[cs[i]+j*4+0]];
   y[i*4+1] += val[cs[i]+j*4+1] *
              x[col[cs[i]+j*4+1]];
    y[i*4+2] += val[cs[i]+j*4+2] *
              x[col[cs[i]+j*4+2]];
    y[i*4+3] += val[cs[i]+j*4+3] *
              x[col[cs[i]+j*4+3]];
```

#### Choice of C for CPUs:

• (Multiple of) SIMD length

 $C = 4 \rightarrow AVX$  instructions

# SpMV node performance model – CPU



# Roofline analysis for spMVM

- Conclusion from the Roofline analysis
  - The roofline model does not "work" for spMVM due to the RHS traffic uncertainties
  - We have "turned the model around" and measured the actual memory traffic to determine the RHS overhead
  - Result indicates:
    - 1. how much actual traffic the RHS generates
    - 2. how efficient the RHS access is (compare BW with max. BW)
    - 3. how much optimization potential we have with matrix reordering
- Do not forget about load balancing!
- Sparse matrix times multiple vectors bears the potential of huge savings in data volume
- Consequence: Modeling is not always 100% predictive. It's all about learning more about performance properties!



# **BACKUP**







Applying sparse matrix to multiple vectors (Sparse Matrix Multiple Vectors: SpMMV)





## Multiple RHS vectors (SpMMV)

Unchanged matrix applied to multiple RHS (r) vectors to yield multiple LHS (r) vectors

```
do s = 1,r
do i = 1, Nr
do j = row_ptr(i),row_ptr(i+1)-1
   C(i,s) = C(i,s) + val(j) *
        B(col_idx(j),s)
   enddo
enddo
B_c unchanged, no
enddo
reuse of matrix data
```

```
do i = 1, Nr

do j = row_ptr(i),row_ptr(i+1)-1

do s = 1,r

C(i,s) = C(i,s) + val(j) *

B(col_idx(j),s)

enddo

enddo Higher B_c due to max

enddo reuse of matrix data
```

PTfS 2024 July 1, 2024 38

## SpMMV code balance

#### One complete inner (s) loop traversal:

- $\blacksquare$  2r flops
- 12 bytes from matrix data (value + index)
- $\frac{16r}{N_{nzr}}$  bytes from the r LHS updates
- $\frac{4}{N_{nzr}}$  bytes from the row pointer
- $8r\alpha(r)$  bytes from the r RHS reads

$$B_c(r) = \frac{1}{2r} \left( 12 + 8r\alpha(r) + \frac{16r + 4}{N_{nzr}} \right) \frac{B}{F}$$

$$= \left(\frac{6}{r} + 4\alpha(r) + \frac{8 + 2/r}{N_{nzr}}\right) \frac{B}{F}$$
 OK so what now???

```
do i = 1, Nr
do j = row ptr(i), row ptr(i+1)-1
 do s = 1,r
   C(s,i) = C(s,i) + val(j) *
            B(s,col idx(j))
  enddo
enddo
enddo
```

# SpMMV code balance

Let's check some limits to see if this makes sense!



M. Kreutzer et al.: Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems.

Proc. <u>IPDPS15</u>, <u>DOI: 10.1109/IPDPS.2015.76</u>

### SELL-C- $\sigma$ kernel on CPUs

#### Example C = 4 without further unrolling

```
for(i = 0; i < N/4; ++i)
  for(j = 0; j < cl[i]; ++j)
   y[i*4+0] += val[cs[i]+j*4+0] *
              x[col[cs[i]+j*4+0]];
   y[i*4+1] += val[cs[i]+j*4+1] *
              x[col[cs[i]+j*4+1]];
                                       C = 4
   y[i*4+2] += val[cs[i]+j*4+2] *
              x[col[cs[i]+j*4+2]];
    y[i*4+3] += val[cs[i]+j*4+3] *
              x[col[cs[i]+j*4+3]];
```

PTfS 2024 July 1, 2024 41