Hands-on #1: likwid-topology, likwid-pin, memory bandwidth
In this hands-on exercise you will compile and run a main memory bandwidth benchmark. You will learn how to explore node properties and topology with likwid-topology and how to use likwid-pin to explicitly control thread affinity. Finally, you will learn how to determine the maximum sustained memory bandwidth of one socket and of a complete node.
Preparation
You can find the benchmark code in the BWBENCH folder of the teacher account (i.e., in ~bymmwitt/staging/BWBENCH).
- Get an interactive single-node job on the Lise cluster for four hours (-t 4:0:0):
$ srun --nodes=1 -t 4:0:0 --partition=standard96 --pty --interactive /bin/bash
- Load the Intel compiler and LIKWID modules:
$ module load intel/2021.2 likwid/5.2.1
- Only on HLRN systems: unset OMP_NUM_THREADS, which is set automatically by the HLRNenv module (a sanity-check sketch follows this list):
$ unset OMP_NUM_THREADS
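A quick sanity check after these preparation steps (a minimal sketch; it assumes a standard Slurm and environment-modules setup):
$ hostname                        # should print a compute-node name, not the login node
$ echo $SLURM_JOB_ID              # confirms you are inside the interactive allocation
$ module list                     # intel/2021.2 and likwid/5.2.1 should be listed
$ echo ${OMP_NUM_THREADS:-unset}  # should print "unset" after the step above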
Explore node topology
Execute likwid-topology:
$ likwid-topology      # without graphical output
$ likwid-topology -g   # with graphical output
Answer the following questions:
- How many cores are available in one socket, and in the whole node?
- Is SMT enabled?
- What is the aggregate size of the last level cache in MB per socket?
- How many ccNUMA memory domains are there?
- What is the total installed memory capacity?
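Two further likwid-topology switches can help with the cache and memory questions. This is a sketch assuming the LIKWID 5.x option names; check likwid-topology -h on your system:
$ likwid-topology -c   # detailed per-level cache information (sizes, sharing)
$ likwid-topology -O   # machine-readable CSV output, convenient for grepping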
Compile the benchmark
Compile a threaded OpenMP binary with optimizing flags:
- For C:
$ icc -Ofast -xHost -std=c99 -qopenmp -o bwBench-ICC bwBench.c
- For Fortran:
$ ifort -Ofast -xHost -qopenmp -o bwBench-ICC bwBench.f90
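If the Intel compiler is unavailable, a roughly equivalent GCC build is possible. Note that this is an assumption and not part of the original exercise, and the bwBench-GCC output name is only an illustration:
$ gcc -Ofast -march=native -std=c99 -fopenmp -o bwBench-GCC bwBench.c    # C version
$ gfortran -Ofast -march=native -fopenmp -o bwBench-GCC bwBench.f90      # Fortran version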
Run the benchmark
Execute with 24 threads without explicit pinning:
$ OMP_NUM_THREADS=24 ./bwBench-ICC
Repeat the run several times (a simple loop is sketched after the questions below).
- Do the results fluctuate?
- By how much?
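One simple way to collect several measurements in a row (a bash one-liner sketch):
$ for i in 1 2 3 4 5; do OMP_NUM_THREADS=24 ./bwBench-ICC; done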
Run again with 24 threads, this time explicitly pinned to 24 physical cores of socket 0 (a pinning variant is sketched after the questions below):
$ likwid-pin -c S0:0-23 ./bwBench-ICC
- Is the result different? If yes: why is it different?
- Can you recover the previous (best) result?
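To see how thread placement interacts with the memory domains, you can spread the same 24 threads over both sockets. This sketch uses likwid-pin's expression concatenation; the @ syntax and the S0/S1 domain names are as described in the likwid-pin documentation:
$ likwid-pin -c S0:0-11@S1:0-11 ./bwBench-ICC   # 12 threads per socket, two memory domains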
Benchmark the memory bandwidth scaling within one ccNUMA domain, in 1-core steps from 1 to 24 cores (a scan loop is sketched after the questions below):
- What is the maximum memory bandwidth in GB/s?
- Which benchmark case reaches the highest bandwidth?
- At which core count can you saturate the main memory bandwidth?
- Does using the SMT threads help with anything?
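A possible core-count scan within one ccNUMA domain (a bash sketch; -q silences likwid-pin's affinity report, and the core list mirrors the S0:0-23 selection used above):
$ for n in $(seq 1 24); do echo "== $n cores =="; likwid-pin -q -c S0:0-$((n-1)) ./bwBench-ICC; done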
Add some secret sauce
Recompile with the extra command-line options -qopt-streaming-stores=always -fno-inline (a recompile sketch follows below).
- What changes in the bandwidth readings?
- Can you explain what is going on?
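A possible recompile line for the C version, combining the original flags with the two extra options (the bwBench-ICC-nt output name is only an illustration; the Fortran build is analogous):
$ icc -Ofast -xHost -std=c99 -qopenmp -qopt-streaming-stores=always -fno-inline -o bwBench-ICC-nt bwBench.c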