Assignment 6: OpenMP
Note 1: It is essential to take control of core-thread affinity in OpenMP programs. This means that it should not be left to chance where in the machine the threads of an OpenMP program are running. The OpenMP standard defines a way to bind threads to sockets, cores, or hardware threads. For example,
$ OMP_NUM_THREADS=10 OMP_PLACES=cores OMP_PROC_BIND=close ./a.out
binds the ten threads of the binary to the first ten physical cores of the machine. On Fritz, these are the first ten cores of a socket. The OMP_PLACES variable denotes the entities ("places") used for pinning threads. You can set it to "threads" (hardware [i.e., SMT] threads), "cores" (full cores), or "sockets" (full sockets). For example, with OMP_PLACES=cores, each OpenMP thread is bound to its own physical core. The OMP_PROC_BIND variable determines how the OpenMP threads are distributed across the places. Here, "close" fills the places from left to right, while "spread" keeps an even spacing between the OpenMP threads.
Further example:
$ OMP_NUM_THREADS=20 OMP_PLACES=cores OMP_PROC_BIND=spread ./a.out
will run 10 threads on one socket and 10 on the other socket of a Fritz node. In fact, the OpenMP runtime library will "spread out" the threads evenly across the node. There are more options in this scheme, but this information will be sufficient to get you going.
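To check that a given setting of OMP_PLACES and OMP_PROC_BIND has the intended effect, the program can ask the OpenMP runtime where each thread ended up. The following is a minimal sketch, assuming a compiler that supports OpenMP 4.5 or later (for omp_get_place_num()); the function names record_thread_places and print_thread_places are made up for this illustration. Place numbers are indices into the place list defined by OMP_PLACES, not necessarily the operating system's CPU numbers.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: store in place_of_thread[t] the OpenMP place number
   that thread t is bound to (-1 if the thread is unbound or the code was
   built without OpenMP). Returns the number of threads in the team. */
int record_thread_places(int *place_of_thread, int max_threads)
{
    int nthreads = 1;
    #pragma omp parallel
    {
        int tid = 0, place = -1;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        place = omp_get_place_num();
        #pragma omp single
        nthreads = omp_get_num_threads();
#endif
        if (tid < max_threads)
            place_of_thread[tid] = place;
    }
    return nthreads;
}

/* Hypothetical helper: print one line per thread of the team. */
void print_thread_places(void)
{
    int places[1024];
    int n = record_thread_places(places, 1024);
    for (int t = 0; t < n && t < 1024; ++t)
        printf("thread %d -> place %d\n", t, places[t]);
}
```

Calling print_thread_places() once at program start and running the binary with the two example settings above should show consecutive place numbers for "close" and evenly spaced ones for "spread".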
Note 2: The LIKWID tool suite is a collection of simple, easy-to-use tools that ease the handling of multicore nodes. You know likwid-perfctr already from previous assignments. The two most important other tools you may want to look at are likwid-topology and likwid-pin. You can watch a couple of short videos we made about these tools:
- (finding out about node topology)
- (enforcing thread affinity)
In order to use LIKWID on Fritz, you have to load the module first:
$ module load likwid
Then, to bind the threads of an OpenMP application to cores, you use likwid-pin as a wrapper:
$ likwid-pin -C S0:0-9 ./a.out
This has the same effect as the first example above: It pins the threads of a.out to the first ten physical cores of socket 0. Note that we did not have to set OMP_NUM_THREADS; likwid-pin will set it for you and infer its value from the pin mask given with the -C option. If you set it explicitly, likwid-pin will leave it alone. If the pin mask comprises more cores than OMP_NUM_THREADS, that's OK: only the first OMP_NUM_THREADS entries in the pin mask will then be used.
It does not matter whether you use OMP_PLACES/OMP_PROC_BIND or likwid-pin to pin your threads. Choose the mechanism that suits you best.
- Machine Topology.
The 64 Sapphire Rapids (SPR) nodes in the Fritz cluster use different CPUs than the ones you are used to (Ice Lake). You can allocate such a node by adding "-p spr1tb" to the sbatch or salloc command line. Use "likwid-topology -g" to find out about the CPUs and configuration of a Fritz SPR node.
(a) How many cores are there in one node? (5 crd)
(b) What are the L1/L2/L3 cache sizes? How many cores share an L1/L2/L3 cache segment? (5 crd)
(c) How many sockets and how many ccNUMA domains per socket does the node have? (5 crd)
- Parallel Accumulate benchmark.
Now, let's switch back to the "normal" Ice Lake nodes in the Fritz CPU cluster. Parallelize the "Accumulate" benchmark loop (a[i] = a[i] + b[i]) from Assignment 2.2 with OpenMP. To do this, compile and link with the -qopenmp switch and use the "fused" parallel for directive to distribute iterations across threads:
#pragma omp parallel for
for(int i=0; i<N; ++i)
    a[i] = a[i] + b[i];
For the following measurements, you can fix the clock speed (to 2 GHz) or you can also leave Turbo mode on. Whatever you do, document it in your solution.
Note: You have to pin the OpenMP threads to cores as shown above. Pinning takes time and typically happens when the first OpenMP parallel region is encountered. In order to eliminate this overhead, it is a good idea to have at least one parallel region execute before the actual benchmark loop to get the pinning overhead out of the way.
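The warm-up region from the note, together with the measurement requirements of part (a) below (runtime of at least about 0.2 seconds, loop lengths growing by a factor of 1.2), could be sketched as follows. This is only one possible shape of such a harness, not necessarily the one shown in the lecture; all function names are made up for this illustration, and the element type double is an assumption.

```c
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Run one trivial parallel region so that the runtime creates (and pins) its
   threads before any timed code starts. Returns the number of threads. */
int warm_up_threads(void)
{
    int nthreads = 0;
    #pragma omp parallel reduction(+:nthreads)
    nthreads += 1;
    return nthreads;
}

/* Wall-clock time in seconds. */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);  /* C11 fallback when built without OpenMP */
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
#endif
}

/* Next loop length in a series with factor 1.2 (integer arithmetic: n*6/5).
   For very small n the factor rounds down to n, so step by one instead. */
static long next_size(long n)
{
    long next = n * 6 / 5;
    return next > n ? next : n + 1;
}

/* Time the accumulate loop of length n. The repetition count is doubled until
   one measurement lasts at least min_seconds (which must be positive).
   Returns Gflop/s (one addition per iteration), or -1.0 if allocation fails. */
double accumulate_gflops(long n, double min_seconds)
{
    double *a = malloc((size_t)n * sizeof *a);
    double *b = malloc((size_t)n * sizeof *b);
    if (!a || !b) { free(a); free(b); return -1.0; }

    /* Initialize in parallel so that each thread touches the part of the
       arrays it will work on later. */
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) { a[i] = 0.0; b[i] = 1.0; }

    long niter = 1;
    double elapsed = 0.0;
    for (;;) {
        double start = wall_seconds();
        for (long k = 0; k < niter; ++k) {
            #pragma omp parallel for
            for (long i = 0; i < n; ++i)
                a[i] = a[i] + b[i];
            if (a[n / 2] < 0.0) break;  /* never true; keeps the compiler from removing the loop */
        }
        elapsed = wall_seconds() - start;
        if (elapsed >= min_seconds) break;
        niter *= 2;
    }
    double gflops = (double)n * (double)niter / elapsed * 1.0e-9;
    free(a);
    free(b);
    return gflops;
}
```

A driver would call warm_up_threads() once and then iterate n from 10 up to 100000000 via next_size(n), printing n and accumulate_gflops(n, 0.2) for each loop length.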
(a) (20 crd) Perform benchmark runs with this loop on one Ice Lake Fritz socket. Draw graphs of performance in Gflop/s versus loop length N for \(N = 10^1 \ldots 10^8\) and for 1, 2, 4, 8, 12, and 18 threads with "close" pinning (see above). Use a log scale on the x axis and a scaling factor of 1.2 from one N to the next. Draw them all in the same diagram.
Hint: Remember to use the benchmarking "harness" that we showed you; the whole loop should be repeated often enough to make the overall runtime long enough to be accurately measured (at least about 0.2 seconds).
(b) (10 crd) Now change the benchmark so that the repetition loop is within the parallel construct:
#pragma omp parallel
{
    for(int k=0; k<NITER; ++k) {
        #pragma omp for
        for(int i=0; i<N; ++i)
            a[i] = a[i] + b[i];
    }
}
Repeat the 18-thread run from above and draw the result in the same diagram. What changes do you observe compared to the 18-thread run with the fused "parallel for"? Do you have an explanation?
(c) (10 crd) For \(N = 10^8\), run the benchmark with 1,2,3,...,18 cores on one ccNUMA domain and draw a scaling graph of memory bandwidth, i.e., Gbyte/s vs. number of cores. What speedup do you observe from 1 to 18 cores?
- Parallel \(\pi\).
Parallelize the "\(\pi\) by integration" code from Assignment 0 with OpenMP and run it with 1,2,3,...,36 threads on one socket of an Ice Lake node of Fritz. (If you do not have the code, please use the attached integrate.c file.) Use a loop length of \(2 \times 10^9\) and scalar code (i.e., compile with -O1 -no-vec). Run two experiments:
(a) (10 crd) Fix the clock speed to 2.0 GHz
(b) (5 crd) Use Turbo Mode (--cpu-freq=performance)
The "speedup" quantifies how much faster you can compute with n cores than with 1 core. Draw the speedup vs. number of threads for both runs in one diagram. Discuss the qualitative difference between the two scaling behaviors. What could be the reason for this difference? (10 crd)
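Written as a formula, the verbal definition above can be sketched as follows. The symbol \(T(n)\), the measured runtime of the fixed-size loop with n threads, is introduced here only for illustration and does not appear in the assignment text.

```latex
S(n) = \frac{T(1)}{T(n)}
```

With this definition, \(S(1) = 1\), and perfect scaling corresponds to \(S(n) = n\).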
Note: You have to pin the OpenMP threads to cores as shown above. Pinning takes time and typically happens when the first OpenMP parallel region is encountered. In order to eliminate this overhead, it is a good idea to have at least one parallel region execute before the actual benchmark loop to get the pinning overhead out of the way.
- Resource-driven modeling.
There are two fundamental bottlenecks that govern the runtime of a loop on a CPU chip: memory data transfer and instruction execution. Based on this insight, we can construct a simple model for the runtime of a loop.
Assume a loop of length N which, per iteration, requires a memory data transfer volume of V (in bytes) and performs W instructions. The CPU has a memory bandwidth of \(b_S\) (in bytes/s) and a peak execution capability of \(p_E\) (in instructions per cycle). The clock frequency is f. Assume that memory transfers and instruction execution are the only relevant resources.
(a) (10 crd) Using the information from above, calculate the expected execution time of the loop in cycles per iteration, assuming that code execution and data transfer cannot overlap.
(b) (5 crd) How does your answer to (a) change if you assume that there is full overlap between execution and data transfer?
(c) (5 crd) Construct a model for execution performance (instructions per second) from each of the cases (a) and (b).
- 10 June 2026, 2:36 PM