



# Programming Techniques for Supercomputers: Performance Modelling

Motivation Roofline Model

<u>Prof. Dr. G. Wellein<sup>(a,b)</sup></u>, Dr. G. Hager<sup>(a)</sup> <sup>(a)</sup>HPC Services – Regionales Rechenzentrum Erlangen <sup>(b)</sup>Department für Informatik Friedrich-Alexander-Universität Erlangen-Nürnberg, Sommersemester 2025



A performance model brings together what you need (application requirements) and what you get (hardware capabilities)

A series of measurements from benchmarks is NOT a performance model\*

\*Bill Gropp, PASC2015

### Scope of the lecture – a typical example



# How model-building works: Physics



#### Code optimization/parallelization – no black boxes!



## Questions to ask in high performance computing

- Do I understand the performance behavior of my code?
  - Does the performance match a model I have made?
- What is the optimal performance for my code on a given machine?
  - High Performance Computing == Computing at the bottleneck
- Can I change my code so that the "optimal performance" gets higher?
  - Circumventing/ameliorating the impact of the bottleneck
- My model does not work what's wrong?
  - This is the good case, because you learn something
  - Performance monitoring / microbenchmarking may help clear up the situation
- Use your brain! Tools may help, but you do the thinking.





# "Simple" performance modeling: The Roofline Model

#### Loop-based performance modeling: Execution vs. data transfer





#### Simplistic view of hardware



## Naïve Roofline Model

What performance can the software achieve on a given hardware? P [flop/s]

#### The performance bottleneck is either

The execution of work (flops):

• The data path: (requested flops by incoming data)

[flop/s] <sup>P</sup>peak  $I \cdot b_S$ [flop/byte x byte/s]







### Roofline Model (RLM) – Basics Consider two bottlenecks only



- Hardware  $\rightarrow$  Peak performance:
- Hardware  $\rightarrow$  Peak memory bandwidth:
- Application/SW  $\rightarrow$  Computational Intensity: I

$$P_{peak} \left[\frac{F}{s}\right]$$
$$b_{S} \left[\frac{B}{s}\right]$$
$$I \left[\frac{F}{B}\right]$$

Machine model:  

$$P_{peak} = 3 \frac{GF}{s}$$

$$b_{s} = 10 \frac{GB}{s}$$

$$P = \min(P_{peak}, I * b_{s}) = \min(3 \frac{GF}{s}, 0.05 * 10 \frac{GF}{s}) = 0.5 \frac{GF}{s}$$

## The Roofline Model: A graphical view

н.

Plot max. attainable performance *P* as a function of *I* (application) for a given hardware  $\{P_{peak}, b_S\}$ 



## The Roofline Model – Basics

#### Compare capabilities of different machines



### The Roofline Model – Basics: Summary

 $P = \min(P_{peak}, I * b_S)$ 

Determine machine model for full chip/node/device:

- Peak performance  $P_{peak} = P_{chip} = n_{core} \cdot n_{super}^{FP} \cdot n_{FMA} \cdot n_{SIMD} \cdot f$
- Peak memory bandwidth: See fact sheet, e.g.  $b_S = #Channels \times f_{MEM} \times 8 \frac{B}{cvcle}$

#### So far the model is very restricted:

- Machine and application models are completely independent
- RLM always provides upper bound but is it realistic?
- Only two bottlenecks are considered
  - Peak Performance
  - Main memory transfers

```
double s=0, a[];
for(i=0; i<N; ++i) {
    s = s + a[i];}
```

- What if, e.g. there is no MULT and/or no SIMD vectorization?
  - $\rightarrow P_{peak}$  is not a realistic limit! Implementation may have lower "horizontal roof"

#### Machine model with $P_{peak}$ =4.5 TF/s and $b_s$ =300 GB/s



# **Roofline Model: Application information**



More realistic bounds for "bad" implementations





No SIMD, no pipelining, 1 FMA only  $\rightarrow$  64 x decrease in P<sub>Peak</sub>



Reality: Lower horizontal roofs (P<sub>peak</sub>) are typically not known



## The Roofline model: Extending more bottlenecks

Choose time based view:

Hardware bottlenecks impose upper (lower) performance (runtime) limits



\*Williams, Waterman, Patterson (2009), DOI: <u>10.1145/1498765.1498785</u>





### Roofline Model (RLM) – Refined Consider multiple independent bottlenecks



# The Roofline model: Extending more bottlenecks

Extend towards mutiple (independent) bottlenecks



 $\rightarrow$  Model very successfull if bottleneck can be saturated  $\rightarrow$  full CPU chip

\*Williams, Waterman, Patterson (2009), DOI: <u>10.1145/1498765.1498785</u>

# The Roofline Model – refined

- P<sub>max</sub> = Applicable peak performance of a loop, assuming that data comes from the level 1 cache (this is not necessarily P<sub>peak</sub>)
   → e.g., P<sub>max</sub> = 176 GFlop/s
- 2. *I* = Computational intensity ("work" per byte transferred) over the slowest data path utilized (code balance  $B_{\rm C} = I^{-1}$ ) → e.g., *I* = 0.167 Flop/Byte →  $B_{\rm C}$  = 6 Byte/Flop
- 3.  $b_{\rm S}$  = Applicable (saturated) peak bandwidth of the slowest data path utilized  $\rightarrow$  e.g.,  $b_{\rm S}$  = 56 GByte/s Expected performance:

$$P = \min(P_{\max}, I \cdot b_S) = \min\left(P_{\max}, \frac{b_S}{B_C}\right)$$
 [Byte/Flop]

R.W. Hockney and I.J. Curington: f<sub>1/2</sub>: A parameter to characterize memory and communication bottlenecks.

Parallel Computing 10, 277-286 (1989). DOI: 10.1016/0167-8191(89)90100-2

W. Schönauer: <u>Scientific Supercomputing: Architecture and Use of Shared and Distributed</u> <u>Memory Parallel Computers</u>. Self-edition (2000)

S. Williams: Auto-tuning Performance on Multicore Computers. UCB Technical Report No. UCB/EECS-2008-164. PhD thesis (2008)

# The Roofline Model – getting it right

Applicable peak performance:  $P_{max} = n_{core} * P_{max}^{core}$ 

*P*<sup>core</sup><sub>max</sub> : single core maximum performance from L1: determine according to slides 22-41@03b\_05\_13-2025\_PTfS.pdf

#### Computational intensity: I

• Determine data transfer volume over slowest data path – for main memory:  $I = 1/B_c^{mem}$  (for  $B_c$  see 05\_05\_20-2025\_PTfS.pdf)

#### Applicable (saturated) peak bandwidth: $b_S$

- Determine with appropriate benchmark, e.g. for main memory choose the STREAM benchmark test that best matches your access pattern
  - See later for STREAM
- Or write own microbenchmark if relevant access pattern not available, e.g. read-only

## Realistic baseline for memory bandwidth: STREAM

- Assumption: STREAM (or similar, like vector triad) kernel benchmarks achieve an upper bandwidth limit from main memory
  - i.e., no code can draw more bandwidth
  - Theoretical BW limits are usually not achievable
  - Use STREAM as BW limit rather than the theoretical numbers!
- STREAM: <u>http://www.cs.virginia.edu/stream/</u>
  - Set of 4 standard benchmarks

COPY: A(:) = C(:) SCALE: A(:) = s \* C(:) ADD: A(:) = B(:) + C(:) TRIAD: A(:) = B(:) + s \* C(:)

- In practice, COPY & SCALE (ADD & TRIAD) draw the same bandwidth
- Advantage of STREAM: Many results published, well-defined benchmark
- Disadvantage of STREAM: Reported and actual BW numbers may differ

## STREAM: write-allocate and efficiency



|                                                                 |       | with write-allocate |         |                   | w/o write |                   |                      |  |
|-----------------------------------------------------------------|-------|---------------------|---------|-------------------|-----------|-------------------|----------------------|--|
|                                                                 | Туре  | reported            | actual  | $b_S/b_{\rm max}$ | reported  | $b_S/b_{\rm max}$ |                      |  |
|                                                                 | COPY  | 34079 <u>x3/2</u>   | → 51119 | 0.75              | 47281     | 0.69              | 14-core<br>(non-CoD) |  |
|                                                                 | SCALE | 33758 <u>x3/2</u>   | →50637  | 0.74              | 48025     | 0.70              | z 14<br> l (no       |  |
|                                                                 | ADD   | 38174 <sup></sup>   | →50899  | 0.75              | 51068     | 0.75              | 2.3 GHz<br>Haswell   |  |
|                                                                 | TRIAD | 38866 <u>x4/3</u>   | →51820  | 0.76              | 51107     | 0.75              | Ц С                  |  |
| gnize the benchmark and avoid the write-allocate                |       |                     |         |                   |           |                   |                      |  |
| gnize the benchmark and avoid the write-allocate 70-75% efficie |       |                     |         |                   |           |                   |                      |  |

State of the art compilers recog automatically

PTfS 2025





### Roofline Model (RLM) – Refined Arithmetic Intensity / Code Balance: Gymnastics



#### Arithmetic Intensity / Code Balance: Basic Examples



# Approaches to determine Computational Intensity

1. Analysis of loop body  $\rightarrow$  determine all load / stores that go to memory



- 3 LD (b, c, d) + 1 ST (a) + 1 WA (a) per iteration
  - Each LD / ST / WA is 8 Byte (double)

• 2 FLOP

• 
$$I = \frac{2 FLOP}{5*8 Byte} = \frac{1 FLOP}{20 Byte}$$
  $(B_C = \frac{20 Byte}{1 FLOP})$ 

- Cache vs. Memory Access??!! → DMVM; stencils, SpMV
- 2. Analysis of data structure  $\rightarrow$  Assume each element is touched only once

```
double a[N], b[N], c[N], d[N];
for(i=0; i<N; ++i)
{
        a[i] = b[i] + c[i] * d[i];
}</pre>
```

- 4 arrays (of size: N \* 8 Byte) + WA on a[]  $\rightarrow 2x$  $\rightarrow 5 * N * 8 Byte = 40 * N Byte$
- Total FLOP count: 2 \* N FLOP

• 
$$I = \frac{2 * N FLOP}{40 * N Byte} = \frac{1 FLOP}{20 Byte} \quad (B_C = \frac{20 Byte}{1 FLOP})$$

• Lower bound for memory traffic  $\rightarrow$  Upper bound for *I* 

# Approaches to determine Computational Intensity

```
double precison A(R,C), x(C), y(R)
...
do c = 1 , C
   tmp=x(c)
   do r = 1 , R
      y(r) = y(r) + A(r,c) * tmp
   enddo
enddo
```

Loop body analysis:

- LD A( r, c) to memory  $\rightarrow$  8 Byte
- $x(c) \leftrightarrow register \rightarrow 0$  Byte
- LD/ST y(r)  $\leftarrow \rightarrow$  Cache  $\rightarrow 0$  Byte

 $\rightarrow$  2 FLOP



#### Data structure analysis:

• A(R,C) • X(C) • Y(R): LD/ST  $\rightarrow 8 * R * C$   $\rightarrow 8 * R * C$   $\rightarrow 8 * R C$   $\rightarrow 8 * C$   $\rightarrow 2 * 8 * R$   $\rightarrow 2 * 8 * R$   $\rightarrow 2 * R * C$  FLOP  $= \frac{2 * R * C FLOP}{(R * C + C + 2 * R) * 8 Byte} = \frac{2 FLOP}{(1 + \frac{1}{R} + \frac{2}{C}) * 8 Byte} \approx \frac{2 FLOP}{8 Byte}$ R,C >> 1





### Roofline Model (RLM) – Refined Vector triads



### The Roofline Model – refined: Vector triads: $P_{max}$

Machine: 7 cores of Haswell@2.3GHz ( $n_{core} = 7; f = 2.3 \frac{Gcy}{c}$ ) 



AVX performance on 1 core Haswell / Broadwell

| Bottlene    | Execution Units / Ports |          |        |        |        |  |
|-------------|-------------------------|----------|--------|--------|--------|--|
|             | AVX ADD                 |          |        |        |        |  |
| • 2 AVX ite | AVX MULT                | AVX MULT |        |        |        |  |
| • 2 AVX ite | AVX FMA                 | AVX FMA  | AVX ST | AVX LD | AVX LD |  |
| iterations  |                         |          |        |        |        |  |

| AVX LD | AVX LD | AVX ST |         |                     |
|--------|--------|--------|---------|---------------------|
| AVX LD | AVX LD |        | AVX FMA | 2 AVX<br>iterations |
| AVX LD | AVX LD | AVX ST | AVX FMA |                     |

eck: LD

2 AVX iteration: 
$$T_{max}^{inst} = 3cy$$

• 
$$P_{max}^{core} = {}^{16F}/_{3cy} = 5.33 \ {}^{F}/_{cy}$$

#### The Roofline Model – refined: Vector triads: $I \cdot b_S \& P$

- Machine: 7 cores of Haswell @2.3 (CoD)
- STREAM triads BW:  $b_S = 29 \frac{GB}{s}$
- Computational Intensity (incl. WA; double precision):  $I = \frac{2F}{5*8R} = 0.05 \frac{F}{R}$



do i = 1, N

enddo

A(i) = B(i) + C(i) \* D(i)

$$P = \min\left(85.8\frac{GF}{s}, 0.05\frac{F}{B} * 29\frac{GB}{s}\right) = \min\left(85.8\frac{GF}{s}, 1.45\frac{GF}{s}\right) = \mathbf{1.45}\frac{GF}{s}$$

#### The Roofline Model – refined: Validate RLM

7 cores of Haswell @2.3 GHz (CoD)







### Roofline Model (RLM) – Refined Dense Matrix Vector Multiplication



## The Roofline Model – refined: Dense MVM : $P_{max}$

Machine: 7 cores of Haswell @2.3GHz ( $n_{core} = 7; f = 2.3 \frac{Gcy}{s}$ )

```
do c = 1 , C
   tmp=x(c)
   do r = 1 , R
      y(r)=y(r) + A(r,c) * tmp
   enddo
enddo
```

For 1 AVX iteration (r:r+3) 2 AVX LDs + 1 AVX ST + 1 AVX FMA

#### AVX performance on 1 core Haswell / Broadwell

AVX LD

AVX LD

| Execution Units / Ports |          |        |         |         |  |  |  |
|-------------------------|----------|--------|---------|---------|--|--|--|
|                         | AVX ADD  |        |         |         |  |  |  |
|                         | AVX MULT |        |         |         |  |  |  |
| AVX LD                  | AVX LD   | AVX ST | AVX FMA | AVX FMA |  |  |  |
|                         |          |        |         |         |  |  |  |

AVX ST

AVX FMA

1 AVX

iteration

Bottleneck: LD

• 1 AVX iteration: 
$$T_{max}^{inst} = 1cy$$

 1 AVX iteration → 4 loop iterations → 8 F

• 
$$P_{max}^{core} = {}^{8F}/_{1cy} = 8 {}^{F}/_{cy}$$

## The Roofline Model – refined: Dense MVM: $I \cdot b_S \& P$

- Machine: 7 cores of Haswell@2.3 GHz
- Read-OnlyBW:  $b_S = 32 \frac{GB}{S}$
- Computational Intensity (double precision):  $I = 1/B_C^{mem} = \frac{1}{\frac{B}{2F}} = 0.25 \frac{F}{R}$



Putting it together

$$P_{max} = 7 * 2.3 \frac{r}{s} * 8 \frac{r}{cy} = 128.8 \frac{r}{s}$$

$$P = \min\left(128.8\frac{GF}{s}, 0.25\frac{F}{B} * 32\frac{GB}{s}\right) = \min\left(128.8\frac{GF}{s}, 8\frac{GF}{s}\right) = 8\frac{GF}{s}$$



### The Roofline Model – refined: Dense MVM: Validate







## Roofline Model (RLM) – Refined Bad Code Implementation & Lower roofs



# A not so simple Roofline example

#### Example: do i=1,N; s=s+a(i); enddo

in single precision on a 2.2 GHz Sandy Bridge (3-stage FP add pipeline) socket @ "large" N



Plain scalar code, no SIMD

```
LOAD r1.0 \leftarrow 0

i \leftarrow 1

loop:

LOAD r2.0 \leftarrow a(i)

ADD r1.0 \leftarrow r1.0+r2.0

++i \rightarrow? loop

result \leftarrow r1.0
```



 $\rightarrow$  1/24 of ADD peak

# Applicable peak for the summation loop

### Scalar code, 3-way unrolling LOAD r1.0 $\leftarrow$ 0 LOAD r2.0 $\leftarrow$ 0 LOAD r3.0 $\leftarrow$ 0 $i \leftarrow 1$ loop: LOAD r4.0 $\leftarrow$ a(i) LOAD r5.0 $\leftarrow$ a(i+1) LOAD r6.0 $\leftarrow$ a(i+2) ADD $r1.0 \leftarrow r1.0 + r4.0$ ADD $r2.0 \leftarrow r2.0 + r5.0$ ADD $r3.0 \leftarrow r3.0 + r6.0$ $i+=3 \rightarrow ?$ loop result $\leftarrow$ r1.0+r2.0+r3.0

ADD pipes utilization:



→ 1/8 of ADD peak

# Applicable peak for the summation loop

### SIMD-vectorized, 3-way unrolled

```
LOAD [r1.0,...,r1.7] \leftarrow [0,...,0]
LOAD [r2.0,...,r2.7] \leftarrow [0,...,0]
LOAD [r3.0,...,r3.7] \leftarrow [0,...,0]
i \leftarrow 1
```

loop:

LOAD 
$$[r4.0,...,r4.7] \leftarrow [a(i),...,a(i+7)]$$
  
LOAD  $[r5.0,...,r5.7] \leftarrow [a(i+8),...,a(i+15)]$   
LOAD  $[r6.0,...,r6.7] \leftarrow [a(i+16),...,a(i+23)]$ 

ADD  $r1 \leftarrow r1 + r4$ ADD  $r2 \leftarrow r2 + r5$ ADD  $r3 \leftarrow r3 + r6$ 

i+=24 →? loop result ← r1.0+r1.1+...+r3.6+r3.7

#### ADD pipes utilization:





## Input to the roofline model

#### ... on the example of do i=1,N; s=s+a(i); enddo in single precision







Friedrich-Alexander-Universität Erlangen-Nürnberg

# Roofline Model (RLM) – Refined Summary



# Prerequisites for the Roofline Model

- Data transfer and core execution overlap perfectly!
  - Either the limit is core execution or it is data transfer
- Slowest limiting factor "wins"; all others are assumed to have no impact
  - If two bottlenecks are "close", no interaction is assumed
- Data access latency is ignored, i.e. perfect streaming mode
   Achievable bandwidth is the limit

- Chip must be able to saturate the bandwidth bottleneck(s)
  - Always model for full chip

# Factors to consider in the roofline model

#### Bandwidth-bound (simple case)

- Accurate traffic calculation (write-allocate, strided access, ...) → Intensity calculation
- Attainable ≠ theoretical BW

Core-bound (may be complex)

- Multiple bottlenecks: LD/ST, arithmetic, pipelines, SIMD, execution ports
- Limit is linear in # of cores (or clock speed)



#### Erratic access patterns may violate model assumptions

- Hit the BW bottleneck by good serial code (e.g., Ninja C++ → Fortran)
- 2. Increase intensity to make better use of BW bottleneck (e.g., spatial loop blocking)
- 3. Increase intensity and go from memory bound to core bound (e.g., temporal blocking)
- 4. Hit the core bottleneck by good serial code (e.g., -fno-alias, SIMD intrinsics)



# Monitoring jobs running on Fritz in the Roofline diagram

#### Two cluster jobs...

Rooflines:  $P = min (P_{peak}, I * b_s)$ 



ClusterCockpit collects data and presents is

٠

# Shortcomings of the roofline model

- Saturation effects in multicore chips are not explained
  - Reason: Intra-Cache and memory transfers do (frequently) not overlap on a single core
     → Overlapp only between cores
  - Increase "pressure" on memory interface until it saturates  $\rightarrow$  bottleneck:  $b_s$
  - It is not sufficient to measure single-core STREAM to make it work
- In-cache performance is not correctly predicted
- The ECM performance model gives more insight:









Friedrich-Alexander-Universität Erlangen-Nürnberg

## Roofline Model (RLM) – Refined Code Balance and Machine Balance



# Machine balance for hardware characterization

• For quick comparisons the concept of machine balance is useful

$$B_m = \frac{b_S}{P_{\text{peak}}}$$

- Machine Balance = How much input data can be delivered for each FP operation? ("Memory Gap characterization")
  - Assuming balanced MULT/ADD
- Rough estimate:  $B_m \ll B_c \rightarrow$  strongly memory-bound code
- Typical values (main memory):

```
Intel Haswell 14-core 2.3 GHz<br/>B_m = 60 \text{ GB/s} / (14 \times 2.3 \times 16) \text{ GF/s} \approx 0.12 \text{ B/F}Intel Sandy Bridge 8-core 2.7 GHz\approx 0.23 \text{ B/F}Nvidia P100\approx 0.10 \text{ B/F}Intel Xeon Phi Knights Landing (HBM)\approx 0.16 \text{ B/F}
```



### Machine balance over time







Friedrich-Alexander-Universität Erlangen-Nürnberg

# **RLM Case Study**

Tall & Skinny Matrix-Transpose Times Tall & Skinny Matrix (TSMTTSM) Multiplication



### **TSMTTSM Multiplication**

- Block of vectors → Tall & Skinny Matrix (e.g.  $10^7 \times 10^1$  dense matrix)
- Row-major storage format
- Block vector subspace orthogonalization procedure requires, e.g. computation of scalar product between vectors of two blocks

• TSMTTSM Mutliplication  $K \gg N, M$ 

Assume:  $\alpha = 1$ ;  $\beta = 0$ 



### **TSMTTSM Multiplication**

 General rule for dense matrix-matrix multiply: Use vendor-optimized GEMM, e.g. from Intel MKL<sup>1</sup>:

$$C_{mn} = \sum_{k=1}^{N} A_{mk} B_{kn}$$
,  $m = 1...M, n = 1...N$ 

double P<sub>peak</sub> [GF/s]  $b_{s}[GB/s]$ Efficiency System Perf. Size SQ 160 GF/s 91% Intel Xeon E5 2660 v2 176 GF/s 52 GB/s 10c@2.2 GHz TS 16.6 GF/s 6% SQ 550 GF/s 95% Intel Xeon E5 2697 v3 582 GF/s 65 GB/s 14c@2.6GHz TS 22.8 GF/s 4% complex double TS@MKL: Matrix sizes: Good or bad? Square (SQ): M=N= K=15,000 Tall&Skinny (TS): M=N=16; K=10,000,000

<sup>1</sup>Intel Math Kernel Library (MKL) 11.3

### **TSMTTSM** Roofline model



### **TSMTTSM** Roofline performance prediction

Now choose 
$$M = N = 16 \Rightarrow I_d \approx \frac{16}{8} \frac{F}{B}$$
 and  $I_z \approx \frac{16}{4} \frac{F}{B}$ , i.e.  $B_d \approx 0.5 \frac{B}{F}$ ,  $B_z \approx 0.25 \frac{B}{F}$   
Intel Xeon E5 2660 v2  $(b_S = 52 \frac{GB}{s}) \Rightarrow P = I_d \times b_S = 104 \frac{GF}{s}$  (double)  
Measured (MKL): 16.6  $\frac{GF}{s}$ 

Intel Xeon E5 2697 v3 (
$$b_S = 65 \frac{GB}{s}$$
)  $\rightarrow P = I_Z \times b_S = 240 \frac{GF}{s}$  (double complex)  
Measured (MKL): 22.8  $\frac{GF}{s}$ 

### → Potential speedup: 6–10x vs. MKL

### Can we implement a better TSMTTSM kernel than Intel?



Not shown: Inner Loop boundaries (n,m) known at compile time (kernel generation) k assumed to be even

### TS: M=N=16; K=10,000,000

| System                | P <sub>peak</sub> / b <sub>S</sub> | Version | Performance | <b>RLM Efficiency</b> |
|-----------------------|------------------------------------|---------|-------------|-----------------------|
| Intel Xeon E5 2660 v2 | 176 GF/s<br>52 GB/s                | TS OPT  | 98 GF/s     | 94 %                  |
| 10c@2.2 GHz           |                                    | TS MKL  | 16.6 GF/s   | 16 %                  |
| Intel Xeon E5 2697 v3 | 582 GF/s                           | TS OPT  | 159 GF/s    | 66 %                  |
| 14c@2.6GHz            | 65 GB/s                            | TS MKL  | 22.8 GF/s   | 9.5 %                 |