Core-Level Performance Engineering

While many developers put a lot of effort into optimizing large-scale parallelism, they often neglect the importance of efficient serial code. Even worse, slow serial code tends to scale very well, which hides the fact that resources are wasted because no definite hardware performance limit (“bottleneck”) is exhausted. This tutorial conveys the knowledge required to develop a thorough understanding of the interactions between software and hardware at the level of a single CPU core and the lowest memory hierarchy level (the L1 cache). We introduce general out-of-order core architectures and their typical performance bottlenecks using modern x86-64 (Intel Ice Lake) and ARM (Fujitsu A64FX) processors as examples. We then go into detail about x86 and AArch64 assembly code, specifically including vectorization (SIMD), pipeline utilization, critical paths, and loop-carried dependencies. We also demonstrate performance analysis and performance engineering using the Open-Source Architecture Code Analyzer (OSACA) in combination with a dedicated instance of the well-known Compiler Explorer. Various hands-on exercises will allow attendees to run their own experiments and measurements and to identify in-core performance bottlenecks. Furthermore, we show real-life use cases to emphasize how profitable in-core performance engineering can be.
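To give a flavor of what a loop-carried dependency looks like in practice, here is a minimal sketch in C; it is not taken from the course material. The function names `dot_single` and `dot_unrolled4` are hypothetical and chosen only for this illustration. The first version accumulates into a single variable, so each addition has to wait for the previous one. The second version uses four independent accumulators, which gives an out-of-order core several shorter dependency chains to overlap.

```c
#include <stddef.h>

/* Single accumulator: every addition depends on the result of the
   previous iteration, so the dependency on `sum` is loop-carried and
   limits how fast the loop can run on one core. */
double dot_single(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Four independent accumulators: the single long dependency chain is
   split into four shorter ones that can execute in an overlapped
   fashion. The summation order changes, so the result may differ from
   dot_single in the last bits for general floating-point input. */
double dot_unrolled4(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    for (; i + 3 < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i)          /* remainder iterations */
        s0 += a[i] * b[i];

    return (s0 + s1) + (s2 + s3);
}
```

Whether the second variant is actually faster depends on the compiler, the flags, and the target core. Because floating-point addition is not associative, compilers typically do not perform this transformation on their own unless relaxed floating-point options are enabled, so comparing the generated assembly of both versions is a useful first experiment.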

Prerequisites:
a) Attendees should have a basic understanding of the Roofline model. You can find some information here (lecture slides) and here (publication by S. Williams).
b) Attendees should have some experience in using the Compiler Explorer. You can find a 30-minute tutorial video (two parts) here:

Part 1

Part 2

This is a full-day on-site tutorial at CGO26.

Lecturers: Jan Laukemann and Dr. Georg Hager

Course date: January 31, 2026

Course program:

8:45      Introduction
8:55      Basic processor and core architecture
▪ Intel Sapphire Rapids architecture
▪ Scheduling in an out-of-order backend
9:30      Terminology and code execution on out-of-order CPUs
▪ Throughput, Latency, Critical Path and Loop-carried Dependencies
▪ Hands-on: Out-of-order code execution
10:30    Break

11:00    x86 ISA introduction
▪ Understanding scalar and vectorized assembly code
11:45    Performance analysis of simple kernels
▪ Example: STREAM Triad
▪ Hands-on: Dot product
▪ Hands-on: PI by integration

12:45    Lunch
1:45      OSACA introduction
▪ How to use OSACA
▪ How to use the Compiler Explorer
▪ Analyze kernels using OSACA to find potential bottlenecks
2:45      In-core analysis for Arm
▪ Fujitsu A64FX core architecture
▪ AArch64 ISA introduction
▪ Understanding scalar and vectorized Arm assembly
3:00      Case study: Sparse Matrix-Vector (SpMV) Multiplication on A64FX

3:30      Break

4:00      Case study: Lattice Quantum Chromodynamics (QCD) on A64FX
4:30      Hands-on: 2D Gauss-Seidel on Sapphire Rapids (SPR)
▪ Performance analysis
▪ Optimization techniques
▪ Performance impact of different compilers and flags
4:55      Summary and take-home messages
5:00      End of tutorial
- Course slides from the CGO26 full-day tutorial (PDF)
- Hands-On Repository (URL)