In all modern HPC systems, the compute node is where code is executed
and "performance is generated." Hence, this is where a deep
understanding of the performance issues of any application must start.
At first glance, computer architecture appears extremely intricate,
making it next to impossible to derive general rules for good
performance. However, on closer inspection it turns out that there is a
surprisingly small number of guiding principles which govern most of the
performance behavior of HPC codes.
This online tutorial conveys those components of compute node
architecture that are most relevant for performance in HPC. We start
with the core level and cover code execution via pipelining and
out-of-order processing, Single Instruction Multiple Data (SIMD), and
Simultaneous Multi-Threading (SMT). Advancing through the memory
hierarchy, we look at cache hierarchies, main memory, and cache-coherent
non-uniform memory access (ccNUMA) architecture. We describe the
commonalities and differences between CPUs and GPUs. Using simple
compute kernels from computational science, we show how architectural
features interact with code. We also introduce the Roofline performance
model as a simple way to formulate quantitative performance
expectations, compare them with observations, and derive possible
optimizations. Finally, we introduce simple performance tools that
favor insight over automation.
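As a rough illustration of the kind of expectation the Roofline model
yields, the following minimal sketch computes an upper performance limit
as the minimum of peak performance and the product of computational
intensity and memory bandwidth. The machine numbers and the function
name roofline are hypothetical and chosen only for this example; the
traffic count for the vector triad assumes double precision and ignores
write-allocate transfers.

```python
def roofline(peak_flops, mem_bandwidth, intensity):
    """Upper performance limit in flop/s: min(P_peak, I * b_S).

    peak_flops    -- peak arithmetic performance in flop/s
    mem_bandwidth -- attainable memory bandwidth in byte/s
    intensity     -- computational intensity of the loop in flop/byte
    """
    return min(peak_flops, intensity * mem_bandwidth)


# Hypothetical machine parameters (assumptions, not measurements).
PEAK = 1.0e12  # 1 Tflop/s
BW = 1.0e11    # 100 Gbyte/s

# Vector triad a[i] = b[i] + s * c[i] in double precision:
# 2 flops per iteration; 2 loads + 1 store = 24 bytes per iteration
# (write-allocate traffic is ignored here as a simplification).
triad_intensity = 2 / 24
triad_limit = roofline(PEAK, BW, triad_intensity)

# A hypothetical kernel with very high intensity hits the compute limit.
compute_bound_limit = roofline(PEAK, BW, 100.0)

print(f"triad limit:         {triad_limit / 1e9:.2f} Gflop/s")
print(f"compute-bound limit: {compute_bound_limit / 1e9:.2f} Gflop/s")
```

Comparing such a limit with a measured value indicates whether a loop
is close to the relevant hardware bottleneck or whether something else
is holding it back.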
To make this online event interactive, several online quizzes are
interspersed with lectures. Participants can also solve exercise
problems using H5P online content and our interactive "Layer Condition
Calculator" for stencil codes.