Hands-on: topology, affinity, memory bandwidth
In this hands-on exercise you will compile and run a main memory bandwidth benchmark. You will learn how to explore node properties and topology with likwid-topology and how to use likwid-pin to explicitly control thread affinity.
Finally, you will learn how to determine the maximum sustained memory bandwidth for one socket and for a complete node.
Preparation
You can find the benchmark code in the BWBENCH folder of the teacher account.
- Get the source from the teacher's account:
$ cp -a ~g64g0000/BWBENCH ~
- Get an interactive single-node job on the Fritz cluster:
$ salloc -p singlenode -N 1 --time=01:00:00
- Load Intel compiler and LIKWID modules:
$ module load intel likwid
Explore node topology
Execute likwid-topology:
$ likwid-topology -g | less -S
Answer the following questions:
- How many cores are available in one socket and in the whole node?
- Is SMT enabled?
- What is the aggregate size of the last level cache in MB per socket?
- How many ccNUMA memory domains are there?
- What is the total installed memory capacity?
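The likwid-topology output is sufficient to answer all of the above. If you want to cross-check your answers, standard Linux tools report similar information (assuming they are installed on the node):
$ lscpu
$ numactl --hardware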
Compile the benchmark
Compile a threaded OpenMP binary with optimizing flags:
$ icx -Ofast -xHost -std=c99 -qopenmp -o bwBench-ICC bwBench.c
Or, for Fortran:
$ ifx -Ofast -xHost -qopenmp -o bwBench-ICC bwBench.f90
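If the Intel compilers are unavailable, a roughly equivalent GCC build is a possible fallback (a sketch, assuming a gcc module is loaded; the binary name bwBench-GCC is only a suggestion):
$ gcc -Ofast -march=native -std=c99 -fopenmp -o bwBench-GCC bwBench.c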
Run the benchmark
BWBENCH runs several different data-streaming loops over large arrays and reports the observed memory bandwidth for each loop. It is essentially an extended version of the popular STREAM benchmark.
Execute with 18 threads without explicit pinning:
$ env OMP_NUM_THREADS=18 ./bwBench-ICC
Perform multiple runs, for example with the loop sketched after the questions below.
- Do the results fluctuate?
- By how much?
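A minimal way to automate the repeated runs is a shell loop (the run count of 5 is arbitrary):
$ for i in $(seq 1 5); do env OMP_NUM_THREADS=18 ./bwBench-ICC; done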
Run again with explicit pinning, still using 18 threads but now pinned to the 18 physical cores of socket 0:
$ likwid-pin -c S0:0-17 ./bwBench-ICC
- Is the result different? If yes: why is it different?
- Can you recover the previous (best) result?
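If you are unsure which affinity domains (S0, S1, the ccNUMA domains, ...) exist on the node, likwid-pin can list them without starting a benchmark run:
$ likwid-pin -p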
Benchmark the memory bandwidth scaling within one ccNUMA domain, in 1-core steps from 1 to 18 cores (a loop that automates this scan is sketched at the end of this sheet):
- What is the maximum memory bandwidth in GB/s?
- Which benchmark case reaches the highest bandwidth?
- At which core count can you saturate the main memory bandwidth?
- Does the clock frequency impact the observed bandwidth numbers? Try setting 1.2 and 2.4 GHz with 1 and 18 threads, respectively.
Remember that in order to set the clock speed you have to wrap the command in srun:
$ srun --cpu-freq=1200000-1200000:performance likwid-pin -c S0:0-17 ./bwBench-ICC
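To automate the core-count scan within one ccNUMA domain, a simple shell loop can be combined with the srun frequency setting. A minimal sketch, assuming you are inside the interactive job and using the 2.4 GHz setting from above (if the degenerate range S0:0-0 is rejected for the single-core run, use S0:0 instead):
$ for n in $(seq 1 18); do srun --cpu-freq=2400000-2400000:performance likwid-pin -c S0:0-$((n-1)) ./bwBench-ICC; done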