Assignment 8: Stencils, multicore power
Completion requirements
Opened: Wednesday, 26 June 2024, 12:00 AM
Due: Thursday, 4 July 2024, 10:03 AM
- Stencils. Consider the following stencil update sweep (all data is double precision):
#pragma omp parallel for schedule(static)
for(int j=0; j<N-2; ++j) for(int i=1; i<M-1; ++i) y[j][i] = c * (x[j][i-1] + x[j][i+1] + x[j+1][i] + x[j+2][i] + x[j-1][i]);
(a) (15 crd) Formulate the relevant layer condition that determines the code balance in memory. Which cache(s) is/are relevant? How would that layer condition change if we chose "static,1" instead of "static" as a loop schedule?
(b) (10 crd) Assuming that none of the arrays fit into any cache, calculate the best-case and worst-case in-memory code balance (for "static" scheduling).
(c) (10 crd) Calculate the absolute upper performance limit for an in-memory problem (i.e., none of the arrays fit into any cache) on a full Fritz socket (memory bandwidth 150 Gbyte/s) and static scheduling! Knowing the cache sizes of the CPU (if you don't remember, use likwid-topology to find out), what does the condition from (a) look like in this particular case? Hint: Take into account the the L3 cache on the Ice Lake CPUs is an exclusive victim cache, i.e., in order to calculate the available cache per core you can add the L2 and L3 cache sizes. - Optimal energy to solution. We want to model the power consumption of a 36-core processor under a certain workload, and make the following assumptions:
- The chip dissipates a "baseline power" \(W_0\) when all cores are idle, independent of the clock frequency.
- If a core is executing instructions, it dissipates the additional ("dynamic") power \(W_d\), which adds to the chip's baseline power.
- The performance of a certain code running on a single core of this processor is \(P(1)\).
- The
maximum performance of the parallel version of this code is
\(P_{limit}>P(1)\). Hence, when solving a given problem in parallel
with \(n\) cores, the overall performance is
\(P(n)=\min\left(nP(1),P_{limit}\right)\). This models a behavior where a
code is limited by a bottleneck if it is running on multiple cores
(i.e., the performance "saturates" at some point).
We want to solve a given problem with \(n=1\ldots 36\) cores and calculate the energy it takes to solve it. This is our "energy to solution" metric \(E(n)\) . You can assume that the amount of work to be done is normalized to 1, i.e., the time to solution on \(n\) cores is \(T(n)=1/P(n)\). For the following tasks you may find it helpful to use a spreadsheet program:
(a) (15 crd) Calculate \(E(n)\) for \(W_0=80 \,\mathrm W\), \(W_d=4 \,\mathrm W\), \(P(1)=1 \,\mathrm s^{-1}\), and \(P_{limit}=15 \,\mathrm s^{-1}\). Draw a diagram of \(E(n)\) vs. \(P(n)\), with \(n\) being the parameter along the curve. This is called a Z-plot. For which \(n_\min\) is \(E(n_\min)\) minimal? Can it make sense to use more than \(n_\min\) cores?
(b) (15 crd) Now assume we apply a code optimization (such as SIMD vectorization) that improves the single-core performance to \(P(1)=2.5\,\mathrm s^{-1}\), but \(P_{limit}\) is unaffected. How does that change \(n_\min\) and \(E(n_\min)\)? Draw the new function \(E(n)\) in the diagram from (a). What is the general conclusion from this result?
(c) (15 crd) Keeping \(P(1)\) as it is (i.e., the optimized value from (b)), we now perform an optimization that improves the saturated performance \(P_{limit}\) to \(25\,\mathrm s^{-1}\). How does that change \(n_\min\) and \(E(n_\min)\)? What happens if \(P_{limit}=72\,\mathrm s^{-1}\)? Draw both data sets into the diagram. What is the general conclusion from this result?
(d) (20 crd) Use your parallel "Pi by integration" code to measure the energy consumption and the CPU power vs. number of cores at fixed frequencies of 2.4 and 1.6 GHz on one Fritz socket. Instrument the numerical loop with LIKWID markers, compile with -O3 -xHost -qopenmp and use 2 x 10^9 iterations. Note that you can do the measurements with the ENERGY group from likwid-perfctr. Draw a Z-plot of energy vs. performance for the two frequencies with the number of cores as a parameter along each curve. Which frequency generates the least energy consumption, and how many cores do you need to use to get there? Discuss the tradeoff of energy vs. performance.
Note: If you haven't compiled with LIKWID markers before, use:icx -O3 -xHost -qopenmp -DLIKWID_PERFMON $LIKWID_INC mySource.c $LIKWID_LIB -llikwid
and do not forget to use the -m flag with likwid-perfctr, else your markers will have no effect.
If you don't remember anymore the introduction lecture on LIKWID, you can find an example of how to use the Marker API here: Tutorial Likwid Marker in C