Hands-on: likwid-topology, likwid-pin, memory bandwidth
In this hands-on exercise you will compile and run a main memory bandwidth benchmark. You will learn how to explore node properties and topology with likwid-topology and how to use likwid-pin to explicitly control thread affinity.
Finally, you will learn how to determine the maximum sustained memory bandwidth of one socket and of a complete node.
Preparation
The benchmark code is in the BWBENCH folder of the teacher account. Copy it to your home directory:
$ cp -a ~r14s0001/BWBENCH ~
Explore node topology
Execute likwid-topology:
$ likwid-topology -g | less -S
(The "less -S" is for enabling horizontal panning because the output is too wide for most screens.) Answer the following questions:
- How many cores are available in one socket and in the whole node?
- Is SMT enabled?
- What is the aggregate size of the last level cache in MB per socket?
- How many ccNUMA memory domains are there?
- What is the total installed memory capacity?
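As a cross-check for the first questions, the summary lines can be filtered out of the text output. This is only a sketch: the helper name topo_summary is made up for this example, and it assumes that your likwid-topology version labels the values "Sockets:", "Cores per socket:", "Threads per core:", and "NUMA domains:". Adjust the patterns if your output looks different.

```shell
# Hypothetical helper: keep only the summary lines of likwid-topology
# output read from stdin. The labels below are an assumption about the
# output format of the installed LIKWID version.
topo_summary() {
    grep -E '^(Sockets|Cores per socket|Threads per core|NUMA domains):'
}

# Only run the tool if it is available (module load likwid).
if command -v likwid-topology >/dev/null 2>&1; then
    likwid-topology | topo_summary
fi
```

A "Threads per core" value greater than 1 would indicate that SMT is enabled. Cache sizes and memory capacity still have to be read from the full output.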
Compile the benchmark
(If not already done, run module load intel.)
Compile a threaded OpenMP binary with optimizing flags:
$ icx -Ofast -xHost -std=c99 -qopenmp -o bwBench-ICC bwBench.c
Or, for Fortran:
$ ifx -Ofast -xHost -qopenmp -o bwBench-ICC bwBench.f90
Run the benchmark
BWBENCH runs a couple of different data-streaming loops with large
arrays and reports the observed memory bandwidth per loop. Basically
it's an improved version of the popular STREAM benchmark.
Execute with 18 threads without explicit pinning:
$ env OMP_NUM_THREADS=18 ./bwBench-ICC
Repeat the run several times.
- Do the results fluctuate?
- By how much?
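To compare runs side by side, it can help to store the output of each run in its own file. The following is a minimal sketch; the helper name repeat_runs, the run count of 5, and the log file names run_1.log, run_2.log, ... are choices made for this example only.

```shell
# Hypothetical helper: run a command N times and write the output of
# run i to run_i.log in the current directory.
repeat_runs() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        "$@" > "run_$i.log" 2>&1
        i=$((i + 1))
    done
}

# Five unpinned runs with 18 threads (only if the binary has been built).
if [ -x ./bwBench-ICC ]; then
    repeat_runs 5 env OMP_NUM_THREADS=18 ./bwBench-ICC
fi
```

Afterwards, compare the bandwidth reported for the same loop across the log files to estimate the spread.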
Run again with 18 threads, this time explicitly pinned to 18 physical cores of socket 0 (if not already done, run module load likwid):
$ likwid-pin -c S0:0-17 ./bwBench-ICC
- Is the performance different? If yes: why is it different?
- Can you recover the previous (best) performance result?
Benchmark the memory bandwidth scaling within one ccNUMA domain (in 1-core steps from 1 to 18 cores):
- What is the maximum memory bandwidth in GB/s?
- Which benchmark case reaches the highest bandwidth?
- At which core count can you saturate the main memory bandwidth?
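One possible way to automate the scaling run is a loop over the core count. This sketch assumes that socket 0 coincides with one ccNUMA domain, as the pinning command above suggests; the helper name pin_expr and the log file names scan_1.log, scan_2.log, ... are made up for this example. OMP_NUM_THREADS is set explicitly so that a value left over in the environment cannot change the thread count.

```shell
# Hypothetical helper: build the likwid-pin expression for the first N
# physical cores of socket 0 (e.g. N=4 gives S0:0-3).
pin_expr() {
    if [ "$1" -eq 1 ]; then
        echo "S0:0"
    else
        echo "S0:0-$(( $1 - 1 ))"
    fi
}

# Scan 1..18 cores; the output of each run goes to scan_<n>.log.
if command -v likwid-pin >/dev/null 2>&1 && [ -x ./bwBench-ICC ]; then
    n=1
    while [ "$n" -le 18 ]; do
        env OMP_NUM_THREADS="$n" likwid-pin -c "$(pin_expr "$n")" ./bwBench-ICC > "scan_$n.log"
        n=$((n + 1))
    done
fi
```

Plotting the reported bandwidth of one loop against the core count shows where the curve flattens, which is the saturation point asked for above.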