Course: Node-Level Performance Engineering @HLRS

General

This course covers performance engineering approaches at the compute node level. Even application developers who are fluent in OpenMP and MPI often lack a good grasp of how much performance their code could achieve at best. This is because parallelism takes us only halfway to good performance. Even worse, slow serial code tends to scale very well, hiding the fact that resources are wasted.

This course conveys the knowledge required to develop a thorough understanding of the interactions between software and hardware. This process must start at the core, socket, and node level, where the code that does the actual computational work is executed. We introduce the basic architectural features and bottlenecks of modern processors and compute nodes, including pipelining, SIMD, superscalarity, caches, memory interfaces, and ccNUMA. A cornerstone of node-level performance analysis is the Roofline model, which is introduced in detail and applied to various examples from computational science. We also show how simple software tools can be used to acquire knowledge about the system, run code in a reproducible way, and validate hypotheses about resource consumption.

Finally, once the architectural requirements of a code are understood and correlated with performance measurements, the potential benefit of code changes can often be predicted, replacing hope-for-the-best optimizations with a scientific process. The last course day focuses on lectures and exercises using Score-P and Vampir for performance engineering, showing how these more traditional parallel performance analysis tools can be applied at the node level as well.
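To give a flavor of the Roofline model named in the description: in its simplest form, it bounds the attainable performance of a loop by the minimum of the peak arithmetic performance and the product of computational intensity (flops per byte of memory traffic) and memory bandwidth. The following is a minimal sketch under stated assumptions, not course material. The function name `roofline` and the machine numbers (1000 GFlop/s peak, 100 GB/s bandwidth) are hypothetical, and the traffic count for the vector triad ignores write-allocate transfers.

```python
def roofline(peak_gflops, mem_bw_gbs, intensity_flop_per_byte):
    """Simplest Roofline bound in GFlop/s: min(P_peak, I * b_S)."""
    return min(peak_gflops, intensity_flop_per_byte * mem_bw_gbs)


# Hypothetical machine parameters (assumptions, not measurements).
PEAK_GFLOPS = 1000.0   # peak arithmetic performance
MEM_BW_GBS = 100.0     # saturated memory bandwidth

# Vector triad a[i] = b[i] + s * c[i] in double precision:
# 2 flops per iteration; 2 loads + 1 store of 8 bytes each = 24 bytes
# (write-allocate traffic is ignored here for simplicity).
triad_intensity = 2.0 / 24.0

bound = roofline(PEAK_GFLOPS, MEM_BW_GBS, triad_intensity)
knee = PEAK_GFLOPS / MEM_BW_GBS  # intensity above which the loop is core-bound

print(f"triad bound: {bound:.2f} GFlop/s")
print(f"machine balance point: {knee:.1f} flop/byte")
```

With these assumed numbers the triad is far below the balance point, so its bound is set by memory bandwidth rather than by peak performance. This is the kind of prediction that can then be checked against measurements.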

Lecturers: Georg Hager and Jan Eitzinger, Erlangen National High Performance Computing Center; Bill Williams, Center for Information Services and High Performance Computing (ZIH)

Course date: June 9-11, 2026 (9:00 am - 4:00 pm) and June 12, 2026 (9:00 am - 12:15 pm)

This course will be conducted online as a Zoom event. Details will be sent via e-mail to registered participants.

Course Outline:

Introduction

Basic architecture of multicore systems: threads, cores, caches, sockets, memory

The important role of system topology

Tools: topology and affinity in multicore environments

Overview

likwid-topology and likwid-pin

Tools: hardware performance counters

Why hardware performance counters?

likwid-perfctr

Applications

Roofline model: basics

Model assumptions and construction

Simple examples

Limitations of the Roofline model

Roofline case studies

Stencil algorithms

Tall & Skinny dense matrix-matrix multiplication

Sparse matrix-vector multiplication

Optimal use of parallel resources

Single Instruction Multiple Data (SIMD)

Cache-coherent Non-Uniform Memory Architecture (ccNUMA)

Basic skills in performance engineering

Performance Engineering using Score-P and Vampir

Analyzing MiniMD

Analyzing SpMV
- Course schedule (day 1-3) (File)
- Tools Day schedule (day 4) (File)
- Tools day codes (File, GZ)
- Vampir Demo+ Clients
  - vampir-10.8.1-Demo+-linux-aarch64-setup.sh
  - vampir-10.8.1-Demo+-linux-ppc64le-setup.sh
  - vampir-10.8.1-Demo+-linux-x86_64-setup.sh
  - Vampir-10.8.1-Demo+-macOS.dmg
  - Vampir-10.8.1-Demo+-win64-setup.exe
  - vampir.license

Day 1
- General intro (File)
- Introduction to computer node architecture (File)
- Hands-on: Logging in (Page)
- Hands-on: The divide instruction (Page)
- LIKWID tools: topology, affinity, clock speed (File)
- Hands-on: likwid-topology, likwid-pin, memory bandwidth (Page)
- LIKWID tools: hardware performance counters (File)
- The Roofline model: Introduction (File)
- Small affinity check application (File, TGZ)

Day 2
- Roofline case study: Stencils (File)
- Hands-on: Performance counters and memory bandwidth (Page)
- Hands-on: Dense matrix-vector multiplication (Page)
- Roofline case study: Sparse matrix-vector multiplication (SpMV) (File)
- Cache-coherent Non-Uniform Memory Architecture (ccNUMA) (File)

Day 3
- Roofline case study: "Tall & Skinny" dense matrix-matrix multiplication (File)
- Hands-on: Matrix-free CG solver (Page)
- Optimized mfcg solution (File, C)
- Matrix-free CG Hands-On Walkthrough (Fritz Icelake Node!) (Page)
- Single Instruction Multiple Data (SIMD) (File)
- Hands-on: Analyzing the MiniMD proxy app (Page)
- Analysis spreadsheet template (File)
- Performance Engineering basics (File)
- The Bandwidth Benchmark GitHub Repository (URL)

Day 4
For the tools day, please download and install the appropriate Vampir client from the "General" page for your local system.

The exercises can be done on "Woody" or "Fritz" with minor adjustments; the reservation should cover "Woody" for Friday, and the scripts are set up with that in mind.
- Trace-based performance engineering (File)
- Introduction to Score-P (File)
- Trace analysis with Vampir (File)
- Exercise: MiniMD Trace Collection (Page)
- Walkthrough: MiniMD Trace Collection (Page)
- Exercise: Load imbalance: SMxV (Page)
- Solution: Load imbalance: SMxV (Page)

Additional material

Important links:

LIKWID tool suite: https://github.com/RRZE-HPC/likwid

LIKWID documentation Wiki: http://tiny.cc/LIKWID

Online Layer Condition calculator: http://tiny.cc/LayerConditions

Kerncraft automatic Roofline/ECM modeling tool: https://github.com/RRZE-HPC/kerncraft

MachineState system configuration script: https://github.com/RRZE-HPC/MachineState

Feedback
Please fill out the feedback form at:

https://survey.hlrs.de/index.php/865587

After you have filled out the form, send an email to training@hlrs.de to confirm that you have done so.
You will also receive an email about the feedback form.
