



# "Simple" performance modeling: The Roofline Model

Loop-based performance modeling: Execution vs. data transfer

R.W. Hockney and I.J. Curington:  $f_{1/2}$ : A parameter to characterize memory and communication bottlenecks. Parallel Computing 10, 277-286 (1989). DOI: 10.1016/0167-8191(89)90100-2

W. Schönauer: <u>Scientific Supercomputing</u>: <u>Architecture and Use of Shared and Distributed Memory Parallel Computers</u>. Self-edition (2000)

S. Williams: <u>Auto-tuning Performance on Multicore Computers</u>. UCB Technical Report No. UCB/EECS-2008-164. PhD thesis (2008)



# Analytic white-box performance models

An analytic white-box performance model is a simplified mathematical description of the hardware and its interaction with software. It is able to predict the runtime/performance of code from "first principles."

## A simple performance model for loops

#### Simplistic view of the hardware:



Simplistic view of the software:

```
! may be multiple levels
do i = 1,<sufficient>
        <complicated stuff doing
        N flops causing
        V bytes of data transfer>
enddo
```

Computational intensity  $I = \frac{N}{V}$   $\rightarrow$  Unit: flop/byte

## Naïve Roofline Model

How fast can tasks be processed? *P* [flop/s]

#### The bottleneck is either

• The execution of work:  $P_{\text{peak}}$ 

• The data path:  $I \cdot b_S$  [flop/byte x byte/s]

[flop/s]

 $P = \min(P_{\text{peak}}, I \cdot b_S)$ 

This is the "Naïve Roofline Model"

- High intensity: P limited by execution
- Low intensity: P limited by data transfer
- "Knee" at  $P_{peak} = I \cdot b_S$ : Best use of resources
- Roofline is an "optimistic" model (think "light speed")



# The Roofline Model in computing – Basics

#### Apply the naive Roofline model in practice

Machine parameter #1:

Peak performance:

 $P_{peak} \left[ \frac{F}{s} \right]$ 

Machine model

Machine parameter #2:

Memory bandwidth:

 $b_S\left[\frac{B}{s}\right]$ 

Application model

Code characteristic:

Computational intensity: I

 $\left[\frac{F}{B}\right]$ 

Machine properties:

$$P_{peak} = 4 \frac{GF}{S}$$

$$b_S = 10 \frac{\text{GB}}{\text{S}}$$

Application property: I



## Prerequisites for the Roofline Model

- Data transfer and core execution overlap perfectly!
  - Either the limit is core execution or it is data transfer

- Slowest limiting factor "wins"; all others are assumed to have no impact
  - If two bottlenecks are "close," no interaction is assumed
- Data access latency is ignored, i.e. perfect streaming mode
  - Achievable bandwidth is the limit
- Chip must be able to saturate the bandwidth bottleneck(s)
  - Always model the full chip





## Roofline for architecture and code comparison

#### With Roofline, we can

- Compare capabilities of different machines
- Compare performance expectations for different loops

- Roofline always provides upper bound but is it realistic?
  - Simple case: Loop kernel has loop-carried dependencies → cannot achieve peak
  - Other bandwidth bottlenecks may apply



[Byte/s]

- 1.  $P_{\text{max}}$  = Applicable peak performance of a loop, assuming that data comes from the level 1 cache (this is not necessarily  $P_{\text{peak}}$ )
  - $\rightarrow$  e.g.,  $P_{\text{max}} = 176 \text{ GFlop/s}$
- 2.  $b_S$  = Applicable (saturated) peak bandwidth of the slowest data path utilized  $\rightarrow$  e.g.,  $b_S$  = 56 GByte/s
- 3.  $I = \text{Computational intensity ("work" per byte transferred) over the slowest data path utilized (code balance <math>B_C = I^{-1}$ )

 $\rightarrow$  e.g., I = 0.167 Flop/Byte  $\rightarrow B_C = 6$  Byte/Flop

$$P = \min(P_{\text{max}}, I \cdot b_S) = \min\left(P_{\text{max}}, \frac{b_S}{B_C}\right)$$
[Byte/Flop]

Roofline Model (c) NHR@FAU 2025

#### Full Roofline for the sum reduction from the intro

Example: do i=1,N; s=s+a(i); enddo

in single precision on an 8-core 2.2 GHz Sandy Bridge socket @ "large" N



## Complexities of in-core execution $(P_{\text{max}})$

#### Multiple bottlenecks:

- Decode/retirement throughput
- Port contention (direct or indirect)
- Arithmetic pipeline stalls (dependencies)
- Overall pipeline stalls (branching)
- L1 Dcache bandwidth (LD/ST throughput)
- Scalar vs. SIMD execution
- L1 Icache (LD/ST) bandwidth
- Alignment issues
- ...



Skylake

Tool for  $P_{\text{max}}$  analysis: OSACA

http://tiny.cc/OSACA

DOI: <u>10.1109/PMBS49563.2019.00006</u>

DOI: <u>10.1109/PMBS.2018.8641578</u>

| https://software.intel.com/content/www/us/en/develop/download/i |                                                         |            |
|-----------------------------------------------------------------|---------------------------------------------------------|------------|
| ö                                                               |                                                         |            |
|                                                                 |                                                         |            |
| 8                                                               |                                                         |            |
| Q                                                               | ᇷ                                                       |            |
| d                                                               | اڠ                                                      |            |
| <u></u>                                                         | ē                                                       |            |
| 9                                                               | ē                                                       |            |
| ğ                                                               | ē                                                       |            |
| S                                                               | Ī                                                       |            |
| )S                                                              | .0                                                      |            |
| $\mathbb{Z}$                                                    | ā                                                       |            |
| ⋛                                                               |                                                         |            |
| ⋛                                                               | 타                                                       |            |
| ₹                                                               | 엉                                                       |            |
| e                                                               | S-(                                                     |            |
| Ĭ                                                               | 9                                                       |            |
| ဗ                                                               | 댗                                                       |            |
| $\equiv$                                                        | 6                                                       |            |
| ᅙ                                                               | ١                                                       |            |
| <u></u>                                                         | 5                                                       |            |
| Œ                                                               | à                                                       |            |
| `.                                                              | 32                                                      |            |
| E<br>E                                                          | ä                                                       |            |
| 8                                                               | <u> </u>                                                |            |
|                                                                 | ĭ                                                       | 뒫          |
| )S                                                              | 10                                                      | <u>=</u>   |
| ?:                                                              | 64                                                      | US         |
| 2                                                               | 늉                                                       | a          |
| E                                                               | ntel-64-and-ia-32-architectures-optimization-reference- | manual.htm |
|                                                                 |                                                         |            |

#### Code balance: more examples

```
double a[], b[];
for(i=0; i<N; ++i)
    a[i] = a[i] + b[i];
double a[], b[];
for(i=0; i<N; ++i)
    a[i] = a[i] + (s) * b[i];
float s=0, a[];
for(i=0; i<N; ++i)
    s = (s) + a[i] * a[i];
float s=0, a[], b[];
for(i=0; i<N; ++i)
    s = (s) + a[i] * b[i];
```

```
B_{\rm C} = 24B / 1F = 24 B/F
I = 0.042 F/B
```

$$B_{\rm C} = 24$$
B / 2F = 12 B/F  
 $I = 0.083$  F/B

Scalar – can be kept in register

$$B_{\rm C} = 4B / 2F = 2 B/F$$
  
 $I = 0.5 F/B$ 

Scalar – can be kept in register

```
B_{\rm C} = 8B / 2F = 4 B/F
I = 0.25 F/B
```

Scalar – can be kept in register

```
float s=0, a[], b[];

for(i=0; i<N; ++i)

for(j=0; j<N; ++j)

b[i][j] = a[i][j]

+ a[i-1][j];

B_C = 16B/2F or even ???

8B/2F or even ???

20 B/2F
```

Streaming, perfect spatial locality, no temporal locality  $\rightarrow$  simple

And what about this?

```
float s=0, a[], b[];
int idx[];
for(i=0; i<N; ++i)
    s = s + a[i]
    * b[idx[i]];</pre>
```

Possible cache reuse → tricky!

We'll get to it!

# Is there anything to ease the construction of the model?

## Code balance $B_C$

- Close inspection and hard thinking
- Simplifying assumptions
  - "What is the minimum possible amount of traffic?"
  - "What is the worst case?"

- Tools
  - Kerncraft https://github.com/RRZE-HPC/kerncraft

## In-core $P_{\text{max}}$

- Inspection of assembly code and manual modeling
- Simplifying assumptions
  - "What is the required minimum number of arithmetic/load/store instructions?"
  - $P_{\text{max}} = P_{peak}$
- Tools
  - OSACA https://github.com/RRZE-HPC/OSACA

# Refined Roofline model: graphical representation

### Multiple ceilings may apply

- Different bandwidths / data paths
  - → different inclined ceilings
  - → possibly different *I* for one kernel
- Different P<sub>max</sub>
  - → different flat ceilings

In fact,  $P_{\text{max}}$  should always come from code analysis; generic ceilings are usually impossible to attain



### Tracking code optimizations in the Roofline Model

- Hit the BW bottleneck by good serial code
   (e.g., Ninja C++ → Fortran)
- 2. Increase intensity to make better use of BW bottleneck (e.g., spatial loop blocking)
- 3. Increase intensity and go from memory bound to core bound (e.g., temporal blocking)
- 4. Hit the core bottleneck by good serial code (e.g., -fno-alias, SIMD intrinsics)



## Roofline: How can it "fail"?

... assuming that you did the math right?

- Load imbalance
  - May be impossible to saturate memory bandwidth
  - This includes serial code
- "Slow code"
  - "Invisible" performance ceiling due to inefficient instructions or inefficient execution





- Erratic memory access patterns
  - Latency rains on your parade

```
for(int i=0; i<N; ++i)
  a[i] = s * b[index[i]];</pre>
```



# Diagnostic / phenomenological Roofline modeling



# Diagnostic modeling

- What if we cannot predict the intensity/balance?
  - Code very complicated
  - Code not available
  - Parameters unknown
  - Doubts about correctness of analysis
- Measure data volume  $V_{meas}$  (and work  $N_{meas}$ )
  - Hardware performance counters
  - Tools: likwid-perfctr, PAPI, Intel Vtune,...



- Compare analytic model and measurement → validate model
- Can be applied (semi-)automatically
- Useful in performance monitoring of user jobs on clusters



# Roofline and performance monitoring of clusters

## Two cluster jobs...





Which of them is "good" and which is "bad"?



# Diagnostic modeling of a complex code (3 kernels)

Multiple bandwidth bottlenecks

 $\rightarrow$  need I for each one  $(I_{mem}, I_{L3}, I_{L2}, ...)$ 



#### Kernel 1

- Performance close to memory BW ceiling but far away from others
  - → indicates memory bound

#### Kernel 2

- Performance not near any BW ceiling
- Performance close to scalar peak ceiling
  - → indicates scalar core-bound peak code

#### Kernel 3

- Performance not anywhere near any ceiling
  - $\rightarrow$  There must be an (as yet) unknown in-core performance limit  $P_{\text{max}}$

## Roofline conclusion

- Roofline = simple first-principle model for upper performance limit of datastreaming loops
  - Machine model  $(P_{max}, b_S,...)$  + application model  $(I_{mem},...)$
  - Conditions apply, extensions exist
- Two modes of operation; per kernel:
  - Predictive: Calculate  $I_i$ , calculate upper limit, validate model, optimize, iterate
  - Diagnostic: Measure  $I_j$  and P, compare with ceilings
- Challenge of predictive modeling: Getting  $P_{max}$  and I right