Hands-on: Performance counters and memory bandwidth
Task: Explore the behavior of a memory benchmark using likwid-perfctr
In this exercise you will analyze typical streaming kernels, predict their memory data traffic, and validate your prediction with `likwid-perfctr` measurements.
Preparation
You can find the benchmark code in the BWBENCH folder of the teacher account. Copy it again since there might have been updates.
Investigate the benchmark code
Analyze the bwBench source code and derive the relation between read and write data volume for all benchmark cases.
Take into account possible write-allocate transfers!
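One way to organize the write-allocate bookkeeping is to classify each array in a loop as load-only, store-only, or updated (loaded and stored). The following is a minimal sketch and not part of bwBench: the type and function names are hypothetical, and it assumes that every store-only stream triggers a write-allocate read of the same size and that the compiler generates no non-temporal stores. Whether these assumptions hold for the compiled binary is what the measurements later in this exercise should show.

```c
#include <stddef.h>

/* Hypothetical description of one loop kernel: the number of arrays it
   touches per iteration, grouped by access type. */
typedef struct {
    int load_only;   /* arrays that are only read                    */
    int store_only;  /* arrays that are only written                 */
    int updated;     /* arrays that are read and written (a = a + x) */
} kernel_streams;

typedef struct {
    size_t read_bytes;
    size_t write_bytes;
} traffic;

/* Expected memory traffic per loop iteration.
   Assumption: a store-only stream causes a write-allocate, i.e. the cache
   line is read from memory before it is overwritten. An updated stream is
   read anyway, so it adds no extra transfer. */
static traffic traffic_per_iteration(kernel_streams k, size_t elem_bytes)
{
    traffic t;
    t.read_bytes  = (size_t)(k.load_only + k.updated + k.store_only) * elem_bytes;
    t.write_bytes = (size_t)(k.store_only + k.updated) * elem_bytes;
    return t;
}
```

For a loop of the form `a[i] = b[i]` on doubles (one load-only and one store-only stream), this gives 16 bytes read and 8 bytes written per iteration, whereas counting only the accesses visible in the source gives 8 bytes each.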
Run benchmark
Data traffic analysis
Instrument the source code yourself using the LIKWID Marker API or use the provided bwBench-likwid.{c,f90}. Load the likwid and compiler modules:
$ module load likwid intel
Compile the code with:
$ icx -Ofast -xHost -fno-alias -std=c99 -qopenmp -DLIKWID_PERFMON ${LIKWID_INC} -o bwBench-perf bwBench-likwid.c ${LIKWID_LIB} -llikwid
or
$ ifx -Ofast -xHost -fno-alias -qopenmp ${LIKWID_INC} -o bwBench-perf bwBench-likwid.f90 ${LIKWID_LIB} -llikwid
These command lines use the module variables from the likwid module, so they are not portable to other systems.
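If you instrument the C code yourself, the following sketch shows the usual shape of a Marker API region. It is not taken from bwBench-likwid.c: the function name and the region tag "COPY" are made up for illustration, and it assumes a LIKWID version that ships `likwid-marker.h` (older releases provide the same macros through `likwid.h`). The macros are only active when the code is compiled with -DLIKWID_PERFMON, as in the compile line above.

```c
#include <stddef.h>

#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
/* Without -DLIKWID_PERFMON the macros expand to nothing, so the same
   source also builds on systems without LIKWID. */
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#endif

/* Hypothetical example: a copy loop wrapped in a region named "COPY".
   In a real benchmark, INIT and CLOSE would sit once at the start and
   once at the end of main() rather than around a single kernel. */
static void instrumented_copy(double *a, const double *b, size_t n)
{
    LIKWID_MARKER_INIT;   /* serial code, before the first region */

    #pragma omp parallel
    {
        /* Every thread starts and stops the region itself. */
        LIKWID_MARKER_START("COPY");

        #pragma omp for
        for (size_t i = 0; i < n; i++) {
            a[i] = b[i];
        }

        LIKWID_MARKER_STOP("COPY");
    }

    LIKWID_MARKER_CLOSE;  /* serial code, after the last region */
}
```

When the binary is run under `likwid-perfctr` with the `-m` option, the counts are reported separately for each region tag.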
First, allocate a cluster node and run on a single core with the MEM group:
$ salloc -p singlenode -N 1 --time=01:00:00 -C hwperf # last option necessary for likwid-perfctr
$ likwid-perfctr -g MEM -C S0:0 -m ./bwBench-perf
Look at the following derived metrics (concentrate on the COPY, TRIAD, and DAXPY loops):
- Memory read data volume
- Memory write data volume
- Overall memory bandwidth
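These derived metrics are computed from raw counts of cache-line transfers at the memory controllers. As a rough sketch of that arithmetic, assuming 64-byte cache lines and counters that count whole cache lines (the function names are hypothetical; the authoritative formulas are in the MEM group definition for the node's architecture):

```c
/* Data volume in GBytes for a given number of cache-line transfers.
   Assumption: one counted event corresponds to one 64-byte cache line. */
static double volume_gbytes(double cache_lines)
{
    return 1.0e-9 * cache_lines * 64.0;
}

/* Overall memory bandwidth in MBytes/s: read plus write volume
   divided by the runtime of the measured region in seconds. */
static double bandwidth_mbytes_per_s(double read_lines, double write_lines,
                                     double seconds)
{
    return 1.0e-6 * (read_lines + write_lines) * 64.0 / seconds;
}
```

Note that the overall bandwidth includes every transfer the memory controllers see, not only the loads and stores that appear in the source code.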
Questions:
- Is there anything unexpected in the data?
- Does the memory bandwidth reported by the benchmark match the bandwidth measured by likwid-perfctr?
Now execute the benchmark using all cores on one ccNUMA domain:
$ likwid-perfctr -g MEM -C S0:0-17 -m ./bwBench-perf
Questions:
- Is anything different from the single-core run? (look at the data volumes!)
- What could have happened?