Assignment 5: OpenMP Basics
Note 1: It is essential to take control of core-thread affinity in OpenMP programs. This means that it should not be left to chance where in the machine the threads of an OpenMP program are running. The OpenMP standard defines a way to bind threads to sockets, cores, or hardware threads. For example,
$ OMP_NUM_THREADS=10 OMP_PLACES=cores OMP_PROC_BIND=close ./a.out
binds the ten threads of the binary to the first ten physical cores of the machine. On Fritz, these would be the first ten cores of one socket. The OMP_PLACES variable denotes the entities ("places") used for pinning threads. You can set it to "threads" (this means hardware [i.e., SMT] threads), "cores" (this means full cores), or "sockets" (this means full sockets). E.g., with OMP_PLACES=cores, each OpenMP thread will be bound to its own physical core. The OMP_PROC_BIND variable determines how the OpenMP threads are pinned to the places. Here "close" means to fill the places from "left to right," while "spread" keeps an even spacing between the OpenMP threads.
Further example:
$ OMP_NUM_THREADS=20 OMP_PLACES=cores OMP_PROC_BIND=spread ./a.out
will run 10 threads on one socket and 10 on the other socket of a Fritz node. In fact, the OpenMP runtime library will "spread out" the threads evenly across the node. There are more options in this scheme, but this information will be sufficient to get you going.
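To verify that the binding is what you expect, you can let every thread report the core it is currently running on. Below is a minimal sketch of such a check (not part of the assignment code; it is Linux-specific because it relies on sched_getcpu()):
// check_affinity.c -- compile with -qopenmp (Intel) or -fopenmp (GCC)
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        // Each OpenMP thread prints its ID and the core it currently runs on.
        printf("Thread %2d runs on core %2d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}
Running this with the OMP_PLACES/OMP_PROC_BIND settings from above should show the expected core numbers.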
Note 2: The LIKWID tool suite is a collection of simple, easy-to-use tools that ease the handling of multicore nodes. You know likwid-perfctr already from previous assignments. The two most important other tools you may want to look at are likwid-topology and likwid-pin. You can watch a couple of short videos we made about these tools:
- (finding out about node topology)
- (enforcing thread affinity)
In order to use LIKWID on Fritz, you have to load the module first:
$ module load likwid
Then, to bind the threads of an OpenMP application to cores, you use likwid-pin as a wrapper:
$ likwid-pin -C S0:0-9 ./a.out
This has the same effect as the first example above: It pins the threads of a.out to the first ten physical cores of socket 0. Note that we did not have to set OMP_NUM_THREADS; likwid-pin will set it for you and infer its value from the pin mask given with the -C option. If you set it explicitly, likwid-pin will leave it alone. If the pin mask comprises more cores than OMP_NUM_THREADS, that's OK - only the first OMP_NUM_THREADS entries in the pin mask will be used then.
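For example (an illustrative combination; adjust the numbers to your needs):
$ OMP_NUM_THREADS=5 likwid-pin -C S0:0-9 ./a.out
would run five threads, pinned to the first five cores of socket 0; the remaining entries of the pin mask are ignored.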
It does not matter whether you use OMP_PLACES/OMP_PROC_BIND or likwid-pin to pin your threads. Choose the mechanism that suits you best.
- Machine Topology.
The Fritz cluster actually has three partitions: The largest one comprises 992 dual-socket Intel Xeon "Ice Lake" nodes with 2x36 cores each. The smaller ones have 48 and 16 Intel Xeon "Sapphire Rapids" nodes with 1 TiB and 2 TiB of memory, respectively (but both have the same CPU type). If you specify -p spr1tb
on the command line when submitting a job, you will get one of the 1 TiB Sapphire Rapids (SPR) nodes.
Use "likwid-topology -g" to find out about the organization of an SPR "Fritz" node.
(a) How many cores are there in one node? (5 crd)
(b) What are the L1/L2/L3 cache sizes? How many cores share an L1/L2/L3 cache segment? (10 crd)
(c) How many sockets and how many ccNUMA domains per socket does the node have? (5 crd)
- Parallel STREAM benchmark.
Parallelize the STREAM Triad benchmark loop (a[i] = b[i] + s*c[i]) with OpenMP. To do this, compile and link with the -qopenmp switch and use the "fused" parallel for directive to distribute iterations across threads:
#pragma omp parallel for
for(int i=0; i<N; ++i)
  a[i] = b[i] + s * c[i];
For the following measurements, you can either fix the clock speed (to 2 GHz) or leave Turbo mode on. Whatever you do, document it in your solution.
Note: You have to pin the OpenMP threads to cores as shown above. Pinning takes time and typically happens when the first OpenMP parallel region is encountered. It is therefore a good idea to execute at least one parallel region before the actual benchmark loop to get the pinning overhead out of the way.
(a) (20 crd) Perform benchmark runs with this loop on one Ice Lake Fritz socket. Draw graphs of performance in Gflop/s versus loop length N for N = 10^1 ... 10^8 and for 1, 2, 4, 8, 12, 16, 18 threads with "close" pinning (see above). Use a log scale on the x axis and a scaling factor of 1.2 from one N to the next. Draw them all in the same diagram.
Hint: Remember to use the benchmarking "harness" that we showed you; the whole loop should be repeated often enough to make the overall runtime long enough to be accurately measured (at least about 0.2 seconds).
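One possible structure for such a harness is sketched below. This is only an illustration, not necessarily identical to the harness from the lecture; the repetition logic and the dummy anti-optimization call are placeholders you may replace with your own:
// stream_triad.c -- sketch of a benchmark harness (compile with -qopenmp)
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    long N = 100000000;            // loop length (vary from 10^1 to 10^8)
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double s = 3.0;

    // Warm-up parallel region: triggers thread creation and pinning,
    // and initializes the arrays, so this overhead is not timed.
    #pragma omp parallel for
    for (long i = 0; i < N; ++i) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    // Repeat the triad until the overall runtime is at least ~0.2 s.
    long repeat = 1;
    double runtime = 0.0;
    while (runtime < 0.2) {
        double start = omp_get_wtime();
        for (long r = 0; r < repeat; ++r) {
            #pragma omp parallel for
            for (long i = 0; i < N; ++i)
                a[i] = b[i] + s * c[i];
            // Dummy check so the compiler cannot remove the triad.
            if (a[N >> 1] < 0.0) printf("dummy\n");
        }
        runtime = omp_get_wtime() - start;
        repeat *= 2;
    }
    repeat /= 2;  // number of repetitions actually timed in the last pass

    // 2 flops (one add, one multiply) per loop iteration
    double gflops = 2.0 * (double)N * (double)repeat / runtime / 1.0e9;
    printf("N=%ld  threads=%d  %.2f Gflop/s\n",
           N, omp_get_max_threads(), gflops);

    free(a); free(b); free(c);
    return 0;
}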
(b) (15 crd) Discuss the scalability of the benchmark for different working set sizes (L1 cache, L2 cache, L3 cache, memory); i.e., can n cores really compute n times faster? What could be the reason if they can't? To do this, compare with the purely serial code (compiled without -qopenmp). For which working set sizes do you get a "good" speedup with multiple threads?
(c) (15 crd) For N=10^8 run the benchmark with 1,2,3,...,n cores on one ccNUMA domain (n being the number of cores on the domain) and draw a scaling graph, i.e., performance in Gflop/s vs. number of cores. From the data, calculate the maximum memory bandwidth that you can achieve.
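As a guide for that calculation (assuming the usual accounting of 2 flops per iteration and 24 bytes of data traffic for the three 8-byte array accesses, or 32 bytes if the write-allocate transfer for a[] is counted): a measured performance of P Gflop/s corresponds to a memory bandwidth of roughly B = P/2 x 24 GByte/s (or P/2 x 32 GByte/s, respectively).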
Parallelize the "π by integration" code from Assignment 0 with OpenMP and run it with 1,2,3,...,36 threads on one socket of an Ice Lake node of Fritz. Use a loop length of 2 x 109 and scalar code (i.e., compile with-O1 -no-vec
). Run two experiments:
(a) (10 crd) Fix the clock speed to 2.4 GHz
(b) (10 crd) Use Turbo Mode (--cpu-freq=performance)
The "speedup" quantifies how much faster you can compute with n cores than with 1 core. Draw the speedup vs. number of threads for both runs in one diagram. Discuss the qualitative difference between the two scaling behaviors. What could be the reason for this difference? (10 crd)
Note: As in the STREAM benchmark above, pin the OpenMP threads to cores and execute at least one parallel region before the timed loop to get the pinning overhead out of the way.
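For reference, here is a minimal sketch of how the integration loop could be parallelized, assuming the usual midpoint-rule form of the Assignment 0 code (variable names and output format are placeholders; only the reduction clause is the essential point):
// pi.c -- sketch of the parallel "pi by integration" loop
//         (compile with -qopenmp -O1 -no-vec)
#include <stdio.h>
#include <omp.h>

int main(void) {
    const long SLICES = 2000000000L;      // loop length 2 x 10^9
    const double delta = 1.0 / SLICES;    // width of one slice
    double sum = 0.0;

    double start = omp_get_wtime();
    // The reduction clause gives each thread a private partial sum
    // and combines the partial sums at the end of the loop.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < SLICES; ++i) {
        double x = (i + 0.5) * delta;     // midpoint of slice i
        sum += 4.0 / (1.0 + x * x);
    }
    double pi = sum * delta;
    double runtime = omp_get_wtime() - start;

    printf("pi = %.15f  (runtime %.3f s, %d threads)\n",
           pi, runtime, omp_get_max_threads());
    return 0;
}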