Course: Core-Level Performance Engineering

General

While many developers put a lot of effort into optimizing large-scale parallelism, they often neglect the importance of efficient serial code. Even worse, slow serial code tends to scale very well, which hides the fact that resources are wasted because no definite hardware performance limit (“bottleneck”) is exhausted. This tutorial conveys the knowledge required to develop a thorough understanding of the interactions between software and hardware at the level of a single CPU core and the lowest memory hierarchy level (the L1 cache). We introduce general out-of-order core architectures and their typical performance bottlenecks using modern x86-64 (Intel Ice Lake) and ARM (Fujitsu A64FX) processors as examples. We then go into detail about x86 and AArch64 assembly code, specifically covering vectorization (SIMD), pipeline utilization, critical paths, and loop-carried dependencies. We also demonstrate performance analysis and performance engineering using the Open-Source Architecture Code Analyzer (OSACA) in combination with a dedicated instance of the well-known Compiler Explorer. Various hands-on exercises allow attendees to run their own experiments and measurements and to identify in-core performance bottlenecks. Furthermore, we show real-life use cases to emphasize how profitable in-core performance engineering can be.

Prerequisites:
a) It is recommended for attendees to have a basic understanding of the Roofline model. You can find some information here (lecture slides) and here (publication by S. Williams).
b) It is recommended for attendees to have some experience in using the Compiler Explorer. You can find a 30 min tutorial video (two parts) here:
Part 1

Part 2

This is a half-day tutorial at ISC High Performance 2025 in Hamburg, Germany.

Lecturers: Jan Laukemann and Dr. Georg Hager

Course date: June 13, 2025

Course program:

- Introduction
- Basic processor and core architecture
  - Intel Ice Lake (Server) architecture
  - Scheduling in an out-of-order backend
- Terminology and code execution on out-of-order CPUs
  - Throughput, Latency, Critical Path and Loop-carried Dependencies
- x86 ISA introduction
  - Understanding scalar and vectorized assembly code
- Performance analysis of simple kernels
  - STREAM Triad
  - Dot product
  - PI by integration
- OSACA introduction
  - How to use OSACA
  - How to use the Compiler Explorer
  - Analyze kernels using OSACA to find potential bottlenecks
- AArch64 ISA introduction
- In-core performance engineering using OSACA
  - Sparse Matrix-Vector (SpMV) Multiplication on A64FX
  - Lattice Quantum Chromodynamics (QCD) on A64FX (optional)
  - 2D Gauss-Seidel on ICX
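
For orientation, the three simple kernels named in the program (STREAM Triad, dot product, PI by integration) could look roughly like the C sketch below. It is not taken from the course material: the function names, signatures, and the midpoint-rule integrand 4/(1+x²) are assumptions chosen for illustration. The point is the contrast the tutorial topics suggest: the triad has independent iterations, whereas the dot product and the integration each carry a dependency through `sum` from one iteration to the next.

```c
#include <stddef.h>

/* STREAM Triad: a[i] = b[i] + s * c[i].
 * Every iteration is independent of the others, so there is no
 * loop-carried dependency. (Hypothetical signature.) */
void stream_triad(double *restrict a, const double *restrict b,
                  const double *restrict c, double s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] + s * c[i];
}

/* Dot product: each iteration adds to the same accumulator, so the
 * floating-point add forms a loop-carried dependency chain on `sum`. */
double dot_product(const double *restrict a, const double *restrict b,
                   size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* PI by integration: midpoint rule for the integral of 4/(1+x^2) over
 * [0,1] (assumed integrand). The loop contains a divide and again
 * accumulates into a single `sum`. */
double pi_by_integration(size_t slices)
{
    const double delta_x = 1.0 / (double)slices;
    double sum = 0.0;
    for (size_t i = 0; i < slices; ++i) {
        const double x = ((double)i + 0.5) * delta_x;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * delta_x;
}
```

These functions can be pasted into the Compiler Explorer as they are (no `main` is needed to inspect the generated assembly), for example with `-O3 -march=icelake-server`. Without options that permit reassociation of floating-point operations, compilers typically keep the two reductions as a single scalar dependency chain.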
- Course slides (update 2025-06-13 11:30 a.m. CET) (File)
- Hands-On Repository (URL)
Part 1
- Hands-on #1: Dot product manual throughput analysis (H5P)
- Hands-on #2: Dot product measurement (Page)
- Hands-on #3: Dot product with OSACA (Page)
- Hands-on #4: PI by integration (Page)
Part 2
- Hands-on #5: Gauss-Seidel (Page)
Supplementary material
- OSACA (URL)
- uiCA (URL)
- LLVM-MCA (URL)
- IACA (EoL) (URL)
- Compiler Explorer (URL)
Feedback Survey
Please fill out the feedback form at:
https://submissions.supercomputing.org/?page=SessionEval&new_year=sc24&id=sess434&eval_stype=stype393
