



# Modern computer architecture

#### An introduction for software developers



## Multi-core today: Intel Xeon Ice Lake (2021)

- Xeon "Ice Lake SP" (Platinum/Gold/Silver/Bronze): Up to 40 cores running at 2+ GHz (+ "Turbo Mode" 3.7 GHz),
- Simultaneous Multithreading  $\rightarrow$  reports as 80-way chip
- ~15 Billion Transistors / ~10 nm / up to 270 W
- flexible ©



**Basic Node Architecture** 





# A deeper dive into core architecture





#### **Stored Program Computer**



#### From high level code to actual execution



#### General-purpose cache based microprocessor core



- Implements "Stored Program Computer" concept (Turing 1936)
- Similar designs on all modern systems
- (Still) multiple potential bottlenecks

The clock cycle is the "heartbeat" of the core







## **In-core features**

#### Pipelining, Superscalarity, SIMD, SMT



#### Important in-core features





#### Single Instruction Multiple Data:

Multiple operations per instruction



#### Simultaneous Multi-Threading: Multiple instruction sequences in parallel



## Instruction level parallelism (ILP): pipelining, superscalarity

#### Pipelining

Independent instructions (of one kind, e.g., ADD):

| 15 | 14 | 13 | 12 | 11 |
|----|----|----|----|----|
|    |    |    |    |    |

Single instruction takes 5 cycles (latency)



#### Throughput:

- 1 instruction per cycle after pipeline is full
- $\rightarrow$  5x speedup

Superscalar execution across multiple pipelines 4-way superscalar:



- → Massive boost in instruction throughput
- → Instructions can be reordered on the fly

#### Superscalar out-of-order execution and steady state



Hardware takes care of executing instructions as soon as their operands are available: Out-Of-Order (OOO) execution

## Simultaneous multi-threading (SMT)



#### **Basic Node Architecture**

### SIMD processing

- Single Instruction Multiple Data (SIMD) operations allow the execution of the same operation on "wide" registers from a single instruction
- x86 SIMD instruction sets:
  - SSE: register width = 128 Bit  $\rightarrow$  2 double precision floating point operands
  - AVX: register width = 256 Bit  $\rightarrow$  4 double precision floating point operands
  - AVX-512: ... you guessed it!
- Adding two registers holding double precision floating point operands:



## Single-core DP floating-point performance



## Multi-core today: Turbo mode

The processor dynamically overclocks to exploit more of the **TDP** envelope if fewer cores are active







# **Example: The sum reduction**



```
for (int i=0; i<N; i++) {
    sum += a[i];
}</pre>
```

...In single precision on an AVXcapable core (ADD latency = 3 cy)

How fast can this loop possibly run with data in the L1 cache?

- Loop-carried dependency on summation variable
- Execution stalls at every ADD until previous ADD is complete

→No pipelining?→No SIMD?

## Applicable peak for the sum reduction (I)



 $\rightarrow$  1/24 of ADD peak

## Applicable peak for the sum reduction (II)

Scalar code, 3-way "modulo variable expansion" LOAD r1.0  $\leftarrow$  0 LOAD  $r2.0 \leftarrow 0$ LOAD r3.0  $\leftarrow$  0 i ← 1 loop: LOAD r4.0  $\leftarrow$  a(i) LOAD r5.0  $\leftarrow$  a(i+1) LOAD r6.0  $\leftarrow$  a(i+2) ADD  $r1.0 \leftarrow r1.0 + r4.0 \# scalar ADD$ ADD  $r2.0 \leftarrow r2.0 + r5.0 \# scalar ADD$ ADD  $r3.0 \leftarrow r3.0 + r6.0 \# scalar ADD$ i+=3 **→**? loop result  $\leftarrow$  r1.0+r2.0+r3.0

for (int i=0; i<N; i+=3) {
 s1 += a[i+0];
 s2 += a[i+1];
 s3 += a[i+2];
}
sum = sum + s1+s2+s3;</pre>



 $\rightarrow$  1/8 of ADD peak

## Applicable peak for the sum reduction (III)

```
SIMD vectorization (8-way MVE) x
      pipelining (3-way MVE)
```

```
LOAD [r1.0, ..., r1.7] \leftarrow [0, ..., 0]
LOAD [r2.0, ..., r2.7] \leftarrow [0, ..., 0]
LOAD [r3.0, ..., r3.7] \leftarrow [0, ..., 0]
i ← 1
```

```
for (int i=0; i<N; i+=24) {</pre>
  s10 += a[i+0]; s20 += a[i+8]; s30 += a[i+16];
 s11 += a[i+1]; s21 += a[i+9]; s31 += a[i+17];
 s12 += a[i+2]; s22 += a[i+10]; s32 += a[i+18];
 s13 += a[i+3]; s23 += a[i+11]; s33 += a[i+19];
 s14 += a[i+4]; s24 += a[i+12]; s34 += a[i+20];
 s15 += a[i+5]; s25 += a[i+13]; s35 += a[i+21];
 s16 += a[i+6]; s26 += a[i+14]; s36 += a[i+22];
 s17 += a[i+7]; s27 += a[i+15]; s37 += a[i+23];
```

```
sum = sum + s10 + s11 + ... + s37;
```

|                                                                                                                        | s10 s20 s30 |
|------------------------------------------------------------------------------------------------------------------------|-------------|
| oop:                                                                                                                   | s11 s21 s31 |
| LOAD $[r4.0,,r4.7] \leftarrow [a(i),,a(i+7)] \# SIMD LOAD$<br>LOAD $[r5.0,,r5.7] \leftarrow [a(i+8),,a(i+15)] \# SIMD$ | s12 s22 s32 |
| LOAD $[r6.0,,r6.7] \leftarrow [a(i+16),,a(i+23)] \# SIMD$                                                              |             |
| ADD $r1 \leftarrow r1 + r4 \# SIMD ADD$                                                                                |             |
| ADD $r2 \leftarrow r2 + r5 \# SIMD ADD$                                                                                | s15 s25 s35 |
| ADD r3 $\leftarrow$ r3 + r6 # SIMD ADD                                                                                 | s16 s26 s36 |
| $i + = 24 \rightarrow ?$ loop<br>esult $\leftarrow$ r1 0+r1 1+ +r3 6+r3 7                                              | s17 s27 s37 |

ADD  $r1 \leftarrow r1 + r4 \# SIMD ADD$ ADD  $r2 \leftarrow r2 + r5 \# SIMD ADD$ ADD r3  $\leftarrow$  r3 + r6 # SIMD ADD i+=24 →? loop result  $\leftarrow$  r1.0+r1.1+...+r3.6+r3.7

loop:

# Sum reduction

#### Questions

- When can this performance actually be achieved?
  - No data transfer bottlenecks
  - No other in-core bottlenecks
    - Need to execute (3 LOADs + 3 ADDs + 1 increment + 1 compare + 1 branch) in 3 cycles
- What does the compiler do?
  - If allowed and capable, the compiler will do this automatically
- Is the compiler allowed to do this at all?
  - Not according to language standards
  - High optimization levels can violate language standards
- What about the "accuracy" of the result?
  - Good question ;-)





# **Memory Hierarchy**

#### In-cache performance (L2, L3) Main memory performance



## Von Neumann bottleneck reloaded: "DRAM gap"

DP peak performance and peak main memory bandwidth for a single Intel processor (chip)



**Basic Node Architecture** 

You can either build a small and fast memory or a large and slow memory



#### Purpose of many optimizations: use data in fast memory

**Basic Node Architecture** 

## Data transfers in a memory hierarchy

Caches help with getting instructions and data to the CPU "fast" How does data travel from memory to the CPU and back?

- Remember: Caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except registers)
- Registers can only "talk" to the L1 cache
- MISS: Load or store instruction does not find the data in acache level
  - $\rightarrow$  CL transfer required



Example: Array copy A(:)=C(:)

## Avoiding the write-allocate transfer

#### Disadvantages of write-allocate:

- Cache pollution (if data not needed anytime soon)
- Additional data traffic

#### Solution 1: Nontemporal stores

- A.k.a. "streaming stores," store instruction with a "nontemporal hint"
- Write "directly" to memory, ignoring the normal cache hierarchy
- Avoids cache pollution, but stored data ends up in memory



Solution 2: Cache line claim

- Special instructions (e.g., on POWER, A64FX) or automatic in hardware (Arm, Intel Ice Lake)
- Core claims CL in some level when guranteed to be overwritten completely
- Allows stored data to remain in cache
   → does not reduce cache pollution





## Getting the data from far away







# **Multicore Chips**

Memory bandwidth scaling Node topology and performance



### Node topology of HPC systems



Basic Node Architecture

#### Putting the cores & caches together AMD Epyc 7742 64-Core Processor («Rome»)

- Core features:
  - Two-way SMT
  - Two 256-bit SIMD FMA units (AVX2)
     →16 flops/cycle
  - 32 KiB L1 data cache per core
  - 512 KiB L2 cache per core
- 64 cores per socket hierarchically built up from
  - I6 CCX with 4 cores and 16 MiB of L3 cache
  - 2 CCX form 1 CCD (silicon die)
  - 8 CCDs connected to IO device "Infinity Fabric" (memory controller & PCIe)
- 8 channels of DDR4-3200 per IO device
  - MemBW: 8 ch x 8 byte x 3.2 GHz = 204.8 GB/s
- ccNUMA feature (boot time option):
  - Nodes Per Socket (NPS)=1, 2 or 4
  - NPS=4  $\rightarrow$  4 ccNUMA domains



### Scalable and saturating behavior

Clearly distinguish between "saturating" and "scalable" performance on the chip level One of the most important performance signatures



parallel resources show scalable performance 3 5 6 7 Cores

8

## Parallelism in a modern compute node

Parallel and shared resources within a shared-memory node



#### **Parallel resources:**

- Execution/SIMD units 1
- Cores
- Inner cache levels 3
- Sockets / ccNUMA domains
- Multiple accelerators 5

#### **Shared resources:**

- Outer cache level per socket
- Memory bus per socket 7
- Intersocket link 8
- PCIe bus(es) 9
- Other I/O resources 10

#### How does your application react to all of those details?

4





# Interlude: A glance at accelerator technology

NVIDIA "Ampere" A100 vs. AMD Zen2 "Rome"



### Nvidia A100 "Ampere" SXM4 specs

#### Architecture

- 54.2 B Transistors
- ~ 1.4 GHz clock speed
- ~ 108 "SM" units
  - 64 SP "cores" each (FMA)
  - 32 DP "cores" each (FMA)
  - 4 "Tensor Cores" each
  - 2:1 SP:DP performance
- 9.7 TFlop/s DP peak (FP64)
- 40 MiB L2 Cache
- 40 GB (5120-bit) HBM2
- MemBW ~ 1555 GB/s (theoretical)
- MemBW ~ 1400 GB/s (measured)



© Nvidia



# Trading single thread performance for parallelism: *GPGPUs vs. CPUs*

| GPU vs. CPU<br>light speed estimate |                         | LU ALU     |                            |  |
|-------------------------------------|-------------------------|------------|----------------------------|--|
|                                     | Cache                   |            |                            |  |
|                                     | DRAM                    |            | DRAM                       |  |
|                                     | CPU                     |            | GPU                        |  |
|                                     | 2 x AMD EPYC 77         | 742 "Rome" | NVidia Tesla A100 "Ampere" |  |
| Cores@Clock                         | 2 x 64 @ 2.25 GHz       |            | 108 SMs @ ~1.4 GHz         |  |
| FP32 Performance/core               | 72 GFlop                | /s         | ~179 GFlop/s               |  |
| Threads@STREAM                      | ~16                     |            | ~ 100000                   |  |
| FP32 peak                           | 9.2 TFlop/s ~2          |            | ~19.5 TFlop/s              |  |
| Stream BW (meas.)                   | 2 x 180 GB/s ~4         |            | -4x 1400 GB/s              |  |
| Transistors / TDP                   | ~2x40 Billion / 2x225 W |            | 54 Billion/400 W           |  |





# Node topology and programming models



## Parallel programming models: Pure MPI



## Parallel programming models: Pure threading



#### **Conclusions about architecture**

- Performance is a result of
  - How many instructions you require to implement an algorithm
  - How efficiently those instructions are executed on a processor
  - Runtime contribution of the triggered data transfers
- Modern computer architecture has a rich "topology"
- Node-level hardware parallelism takes many forms
  - Sockets/devices CPU: 1-4 or more, GPGPU: 1-8
  - Cores moderate (CPU: 20-128, GPGPU: 10-100)
  - SIMD moderate (CPU: 2-16) to massive (GPGPU: 10's-100's)
  - Superscalarity (CPU: 2-6)
- Performance of programs is sensitive to architecture
  - Topology/affinity influences overheads of popular programming models
  - Standards do not contain (many) topology-aware features
    - Things are starting to improve slowly (MPI 3.0, OpenMP 4.0)
  - Apart from overheads, performance features are largely independent of the programming model