Assignment 9: Stencils, OpenMP overhead, performance counters

Opened: Tuesday, 1 July 2025, 12:00 AM
Due: Thursday, 10 July 2025, 10:05 AM
  1. Stencils. Consider the following stencil update sweep (all data is double precision):

    #pragma omp parallel for schedule(static)
    for(int j=1; j<N-2; ++j)
      for(int i=1; i<M-1; ++i)
        y[j][i] = c * (x[j][i-1] + x[j][i+1] + x[j+1][i]
                       + x[j+2][i] + x[j+2][i-1]) + f[j][i];
    (a) (10 crd) Formulate the relevant layer condition that determines the code balance in a given memory hierarchy level (e.g., memory). (A generic form of such a condition is sketched after this problem.)
    (b) (10 crd) Assuming that none of the arrays fit into any cache, calculate the best-case and worst-case in-memory code balance in byte/LUP.
    (c) (10 crd) Calculate the absolute upper performance limit for an in-memory problem (i.e., none of the arrays fit into any cache) on a full Fritz socket (memory bandwidth 160 Gbyte/s)! Knowing the cache sizes of the CPU (if you don't remember, use likwid-topology to find out), what does the condition from (a) look like in this particular case? Hint: Take into account that the L3 cache on the Ice Lake CPUs is an exclusive victim cache, i.e., in order to calculate the available cache per core you can add the L2 and L3 cache sizes.
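
    As a reminder of the general pattern (a sketch, not the answer for this specific stencil; the safety factor of 2 is the usual rule of thumb): a 2D layer condition requires that all rows ("layers") of a source array that are touched between two successive accesses to the same element fit in the cache simultaneously. With n_L such rows of length M in double precision and T threads sharing a cache of size C, this reads

        \[ n_L \cdot M \cdot 8\,\mathrm{byte} \;\le\; \frac{C}{2T} \]

    If the condition holds, each element of x is loaded from memory only once per sweep; if it is violated, rows of x are streamed from memory repeatedly and the in-memory code balance grows accordingly.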


  2. OpenMP overhead. Barrier synchronization and thread team "wakeup" time are major performance problems in many OpenMP programs. Here we want to quantify this overhead.

    In Assignment 7.2 you investigated the OpenMP-parallel DAXPY benchmark and how OpenMP parallelization hurts performance for small loop lengths. Comparing the runtime of the serial code with the runtime of the parallel code at appropriate loop lengths yields an estimate of the overhead (a minimal timing sketch follows this problem). Set the CPU frequency to 2.0 GHz in all cases.

    (a) (10 crd) Measure and report the serial (non-OpenMP) DAXPY performance at a loop length of 2000 with your code. Calculate how much time (in cycles) one run of the loop takes.
    (b) (10 crd) Now measure and report the performance of the OpenMP-parallel DAXPY loop (using the fused "omp parallel for") with 1, 2, 4, and 18 OpenMP threads on a ccNUMA domain of a Fritz CPU. Which problem size is appropriate for comparing the runtimes and calculating the OpenMP overhead, and why?  
    (c) (20 crd) Use your measurements from (a) and (b) to calculate the OpenMP overhead in cycles for 1, 2, 4, and 18 threads. 
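
    A minimal timing sketch for these measurements (illustrative only; the repetition count R and the dummy printf that defeats dead-code elimination are assumptions, not part of the assignment):

        #include <stdio.h>
        #include <stdlib.h>
        #include <omp.h>

        int main(void) {
            const int N = 2000;          /* loop length as in part (a) */
            const long R = 1000000;      /* repetitions for a measurable runtime */
            const double freq = 2.0e9;   /* fixed clock: 2.0 GHz */
            double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b);
            const double s = 3.0;
            for (int i = 0; i < N; ++i) { a[i] = 1.0; b[i] = 2.0; }

            double t0 = omp_get_wtime();
            for (long r = 0; r < R; ++r) {
                #pragma omp parallel for
                for (int i = 0; i < N; ++i)
                    a[i] = a[i] + s * b[i];               /* DAXPY kernel */
                if (a[N/2] < 0.0) printf("%f\n", a[N/2]); /* defeat dead-code elim. */
            }
            double t = omp_get_wtime() - t0;

            printf("cycles per loop run: %.0f\n", t / R * freq);
            free(a); free(b);
            return 0;
        }

    For the serial baseline in (a), remove the pragma; subtracting the per-run cycle counts of the serial and parallel variants then gives the overhead estimate asked for in (c).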



  3. Hardware performance counters. There will be an introduction to using hardware performance counters via likwid-perfctr in the lecture this week. Performance counters let us count certain events in the hardware. Here you will investigate the data transfers through the cache hierarchy for the Schönauer Vector Triad (a[:]=b[:]+c[:]*d[:]). 

    Note: Make sure to allocate your job with --constraint=hwperf to be able to read out performance counters!

    (a) (10 crd) You can use your code from Assignment 2.3(a) as a starting point. Parallelize the benchmark loop with OpenMP and instrument it with LIKWID marker calls. Place the marker calls outside of the repetition loop (why?). A minimal instrumentation sketch follows this problem.
    (b) (10 crd) Run the code with 1 thread and with 18 threads, using a loop length of 10^8 on one ccNUMA domain of a Fritz CPU. Observe the overall data traffic to and from memory for both cases (using the likwid-perfctr MEM group). Does the measured data volume match your expectations? What is the qualitative difference between the two cases? What is the reason for this?
    Note: Do not be alarmed by the fact that almost all cores have zero event counts for memory traffic. This is because there is only one counter for these events, and one core gets to count all the events.
    (c) (10 crd) For the same problem size and 18 threads, measure the data traffic to and from the L3 and L2 caches and compare with the memory traffic. What is so special about the L3 cache? 
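
    A minimal sketch of the instrumentation asked for in (a), assuming the standard LIKWID marker API (the region label "triad" and the function name are illustrative; compile with -DLIKWID_PERFMON and link with -llikwid):

        #include <likwid-marker.h>   /* older LIKWID versions ship <likwid.h> */
        #include <omp.h>

        void triad_bench(double *a, const double *b, const double *c,
                         const double *d, long N, int R) {
            LIKWID_MARKER_INIT;                    /* once, by the master thread */
            #pragma omp parallel
            {
                LIKWID_MARKER_REGISTER("triad");   /* once per thread */
            }
            #pragma omp parallel
            {
                /* markers outside the repetition loop: the comparatively
                   expensive counter reads happen only once per thread */
                LIKWID_MARKER_START("triad");
                for (int r = 0; r < R; ++r) {
                    #pragma omp for
                    for (long i = 0; i < N; ++i)
                        a[i] = b[i] + c[i] * d[i]; /* Schoenauer vector triad */
                }
                LIKWID_MARKER_STOP("triad");
            }
            LIKWID_MARKER_CLOSE;
        }

    The instrumented binary is then run under likwid-perfctr with the marker option, e.g. likwid-perfctr -C M0:0-17 -g MEM -m ./triad (the pinning expression is an assumption; adjust it to your ccNUMA domain).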
