Hands-On: Getting to know the system (job script)
In this hands-on exercise you will compile and run a main memory bandwidth benchmark. You will learn how to explore node properties and topology with likwid-topology and how to use likwid-pin to explicitly control thread affinity.
Finally, you will learn how to determine the maximum sustained memory bandwidth for one socket and for a complete node.
Preparation
Copy all required files to your home directory. You can find the benchmark code in the BWBENCH folder.
$ cp -a ~dc-grub1/NLPE-Durham $HOME
$ cd NLPE-Durham/BWBENCH
Explore node topology
Submit job-dine2-part1.sh
$ sbatch job-dine2-part1.sh
Check the output
$ less -S job-dine2-part1sh.o*
(The -S option of less enables horizontal scrolling, which is needed because the output is too wide for most screens.)
Answer the following questions:
- How many cores are available in one socket and in the whole node?
- Is SMT enabled?
- What is the aggregate size of the last level cache in MB per socket?
- How many ccNUMA memory domains are there?
- What is the total installed memory capacity?
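The answers can be cross-checked against what standard Linux tools report on the same node. The following is a minimal sketch, not the content of job-dine2-part1.sh. It assumes a Linux compute node with /proc/meminfo and, optionally, lscpu from util-linux. Run it inside a batch job, since the login node may have a different topology.

```shell
#!/bin/bash
# Cross-check of the likwid-topology answers with standard Linux tools.
# Assumption: this runs on a compute node (e.g. inside a batch job).

# Logical CPUs visible to this process (Slurm may restrict this set).
ncpus=$(nproc)

# MemTotal in /proc/meminfo is given in kB and is slightly below the
# installed capacity because the kernel reserves some memory.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_gib=$(( mem_kb / 1024 / 1024 ))

echo "logical CPUs visible: $ncpus"
echo "total memory:         ${mem_gib} GiB"

# Sockets, cores per socket, threads per core (a value > 1 means SMT is
# enabled), NUMA node count and L3 size, if lscpu is installed.
if command -v lscpu >/dev/null 2>&1; then
    LC_ALL=C lscpu | grep -E '^(Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core|NUMA node\(s\)|L3 cache)' || true
fi
```

If the numbers disagree with likwid-topology, check whether the job was given only part of the node.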
Run the benchmark
BWBENCH runs several data-streaming loops over large arrays and reports the observed memory bandwidth for each loop. It is essentially an improved version of the popular STREAM benchmark.
Submit job-dine2-part2.1.sh
$ sbatch job-dine2-part2.1.sh
The script runs the benchmark with 16 threads and without explicit pinning. Repeat the run several times. Do the results fluctuate? What is the average bandwidth reading for, e.g., the Triad benchmark?
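To judge the fluctuation, it helps to collect the Triad reading of each run and compute the mean, minimum, and maximum. The following is a minimal sketch that assumes you have already extracted one bandwidth value per run, one number per line. The helper name bw_stats is hypothetical, and how you extract the values depends on the BWBENCH output format, which is not reproduced here.

```shell
# bw_stats: read one bandwidth value per line from stdin and print
# mean, min, max and the number of samples. (Hypothetical helper name.)
bw_stats() {
    awk '
        NF == 0 { next }                      # skip empty lines
        {
            v = $1 + 0
            if (n == 0 || v < min) min = v
            if (n == 0 || v > max) max = v
            sum += v; n++
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "mean=%.1f min=%.1f max=%.1f n=%d\n", sum / n, min, max, n
        }'
}

# Example with made-up numbers (not measurements):
printf '100\n110\n90\n' | bw_stats
```

A large gap between min and max across otherwise identical runs is a typical sign that thread placement differs from run to run.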
$ sbatch job-dine2-part2.2.sh
This runs BWBENCH again with 16 threads, this time explicitly pinned to 16 physical cores of socket 0 (if you have not already done so, run module load likwid first):
- Is the performance different? If yes: why is it different?
- Can you recover the previous (best) performance result?
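For reference, explicit pinning with likwid-pin can be sketched as follows. This is not necessarily what job-dine2-part2.2.sh contains. The binary name ./bwbench and the run helper are hypothetical, and the sketch only prints the command unless DRY_RUN=0 is set, so it can be tried without LIKWID loaded.

```shell
# Hypothetical benchmark binary name; use whatever your build produced.
BENCH=${BENCH:-./bwbench}
NTHREADS=16
export OMP_NUM_THREADS=$NTHREADS

# S0:0-15 selects the first 16 physical cores of socket 0
# (LIKWID logical numbering inside the affinity domain S0).
PIN_EXPR="S0:0-$(( NTHREADS - 1 ))"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run likwid-pin -c "$PIN_EXPR" "$BENCH"
```

Changing the affinity domain in the expression (for example N for the whole node instead of S0) is one way to experiment with different thread placements.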
Benchmark the memory bandwidth scaling within one ccNUMA domain (in 1-core steps from 1 to 32 cores):
$ sbatch job-dine2-part3.sh
- What is the maximum memory bandwidth in GB/s?
- Which benchmark case reaches the highest bandwidth?
- At which core count can you saturate the main memory bandwidth?
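Once you have one bandwidth value per core count, the saturation point can be read off programmatically. The following is a minimal sketch that assumes you have prepared a two-column list of "cores bandwidth" pairs, sorted by core count. The helper name saturation_point and the 95% threshold are illustrative assumptions, not part of BWBENCH or LIKWID.

```shell
# saturation_point: read "cores bandwidth" pairs from stdin (sorted by
# core count) and print the peak bandwidth plus the smallest core count
# that reaches a given fraction of it (default 0.95, an arbitrary choice).
saturation_point() {
    awk -v frac="${1:-0.95}" '
        NF >= 2 { c[++n] = $1; b[n] = $2 + 0; if (b[n] > peak) peak = b[n] }
        END {
            for (i = 1; i <= n; i++)
                if (b[i] >= frac * peak) {
                    printf "peak=%.1f saturated_at=%d\n", peak, c[i]
                    exit
                }
        }'
}

# Made-up scaling data for illustration only (not measurements):
printf '1 10\n2 20\n3 28\n4 30\n5 30\n' | saturation_point
```

With the made-up data above, the helper reports a peak of 30.0 reached (within 5%) at 4 cores. A stricter or looser threshold can be passed as the first argument.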