Hands-on #1: likwid-topology, likwid-pin, memory bandwidth
In this hands-on exercise you will compile and run a main memory bandwidth benchmark. You will learn how to explore node properties and topology with likwid-topology and how to use likwid-pin to explicitly control thread affinity. Finally, you will learn how to determine the maximum sustained memory bandwidth of one socket and of a complete node.
Preparation
You can find the benchmark code in the BWBENCH folder of the teacher account (i.e., in ~bymmwitt/staging/BWBENCH).
- Get an interactive single-node job on the Lise cluster for four hours (-t 4:0:0):
$ srun --nodes=1 -t 4:0:0 --partition=standard96 --pty --interactive /bin/bash
- Load the Intel compiler and LIKWID modules:
$ module load intel/2021.2 likwid/5.2.1
- Only on HLRN systems: unset OMP_NUM_THREADS, which is set automatically by the HLRNenv module (a sanity-check sketch follows this list):
$ unset OMP_NUM_THREADS
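A quick sanity check after these preparation steps (a minimal sketch; it assumes a standard Slurm and environment-modules setup):
$ hostname                        # should print a compute-node name, not the login node
$ echo $SLURM_JOB_ID              # confirms you are inside the interactive allocation
$ module list                     # intel/2021.2 and likwid/5.2.1 should be listed
$ echo ${OMP_NUM_THREADS:-unset}  # should print "unset" after the step above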
Explore node topology
Execute likwid-topology:
$ likwid-topology      # without graphical output
$ likwid-topology -g   # with graphical output
Answer the following questions:
- How many cores are available in one socket, and in the whole node?
- Is SMT enabled?
- What is the aggregate size of the last level cache in MB per socket?
- How many ccNUMA memory domains are there?
- What is the total installed memory capacity?
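Two further likwid-topology switches can help with the cache and memory questions. This is a sketch assuming the LIKWID 5.x option names; check likwid-topology -h on your system:
$ likwid-topology -c   # detailed per-level cache information (sizes, sharing)
$ likwid-topology -O   # machine-readable CSV output, convenient for grepping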
Compile the benchmark
Compile a threaded OpenMP binary with optimizing flags:
- For C:
$ icc -Ofast -xHost -std=c99 -qopenmp -o bwBench-ICC bwBench.c
- For Fortran:
$ ifort -Ofast -xHost -qopenmp -o bwBench-ICC bwBench.f90
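If the Intel compiler is unavailable, a roughly equivalent GCC build is possible. Note that this is an assumption and not part of the original exercise, and the bwBench-GCC output name is only an illustration:
$ gcc -Ofast -march=native -std=c99 -fopenmp -o bwBench-GCC bwBench.c    # C version
$ gfortran -Ofast -march=native -fopenmp -o bwBench-GCC bwBench.f90      # Fortran version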
Run the benchmark
Execute with 24 threads without explicit pinning:
$ OMP_NUM_THREADS=24 ./bwBench-ICC
Repeat the run several times (a simple loop is sketched after the questions below).
- Do the results fluctuate?
- By how much?
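One simple way to collect several measurements in a row (a bash one-liner sketch):
$ for i in 1 2 3 4 5; do OMP_NUM_THREADS=24 ./bwBench-ICC; done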
Run again with 24 threads, this time explicitly pinned to 24 physical cores of socket 0 (a pinning variant is sketched after the questions below):
$ likwid-pin -c S0:0-23 ./bwBench-ICC
- Is the result different? If yes: why is it different?
- Can you recover the previous (best) result?
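To see how thread placement interacts with the memory domains, you can spread the same 24 threads over both sockets. This sketch uses likwid-pin's expression concatenation; the @ syntax and the S0/S1 domain names are as described in the likwid-pin documentation:
$ likwid-pin -c S0:0-11@S1:0-11 ./bwBench-ICC   # 12 threads per socket, two memory domains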
Benchmark the memory bandwidth scaling within one ccNUMA domain, in 1-core steps from 1 to 24 cores (a scan loop is sketched after the questions below):
- What is the maximum memory bandwidth in GB/s?
- Which benchmark case reaches the highest bandwidth?
- At which core count can you saturate the main memory bandwidth?
- Does using the SMT threads help with anything?
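A possible core-count scan within one ccNUMA domain (a bash sketch; -q silences likwid-pin's affinity report, and the core list mirrors the S0:0-23 selection used above):
$ for n in $(seq 1 24); do echo "== $n cores =="; likwid-pin -q -c S0:0-$((n-1)) ./bwBench-ICC; done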
Add some secret sauce
Recompile with the extra command-line options -qopt-streaming-stores=always -fno-inline (a recompile sketch follows below).
- What changes in the bandwidth readings?
- Can you explain what is going on?
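A possible recompile line for the C version, combining the original flags with the two extra options (the bwBench-ICC-nt output name is only an illustration; the Fortran build is analogous):
$ icc -Ofast -xHost -std=c99 -qopenmp -qopt-streaming-stores=always -fno-inline -o bwBench-ICC-nt bwBench.c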