PTfS25: Assignment 5: Multicore power, code balance

Opened: Wednesday, 4 June 2025, 12:00 AM

Due: Thursday, 12 June 2025, 10:05 AM

Multicore power envelope. We wish to understand why processor vendors trade clock frequency for concurrency when introducing multicore chips, so we establish a simple model for power dissipation. Assume that a certain chip (with a given manufacturing process, e.g., 10 nm) dissipates a power of \( W=W_d \) when running at its base frequency \( f_0\). Furthermore we assume that \(W\) is proportional to \( f^2 \).

(a) (5 crd) "Overclocking" is a popular strategy (especially among gamers, but also used for "Turbo Boost") to speed up the CPU. Calculate the power dissipation \(W\) when the clock frequency is increased by 40% (e.g., from 2.0 GHz to 2.8 GHz).
(b) (10 crd) If we require that the overall power dissipation of a chip shall be constant (\(W_d\)), we can trade clock speed for cores: By reducing the clock speed by some relative amount \(\Delta f/f_0\), we can put more cores (transistors) on the same chip and still stay within the power limit. Calculate the required clock speed reduction \(\Delta f/f_0\) for an \(m\)-core chip (instead of single core) with a power dissipation of \(W_d\). Calculate how many cores can be put on the chip (approximately) when the clock speed is cut in half.
(c) (5 crd) Now assume \(m=4\). Calculate the minimum and maximum performance gain (compared to the single-core chip with \(m\) times fewer transistors but the same power envelope, and the same memory interface) when using all four cores. For which types of code do you expect these extremal performance gains?
Optimal energy to solution. We want to model the power consumption of a 52-core processor under a certain workload, extending the crude power model from Task 1. We make the following assumptions:
- The chip dissipates a "baseline power" \(W_0\) when all cores are idle, independent of the clock frequency.
- If a core is executing instructions, it dissipates the additional ("dynamic") power \(W_d\), which adds to the chip's baseline power linearly with the number of cores.
- The performance of a certain code running on a single core of this processor is \(P(1)\).
- The maximum performance of the parallel version of this code is \(P_{limit}>P(1)\). Hence, when solving a given problem in parallel with \(n\) cores, the overall performance is \(P(n)=\min\left(nP(1),P_{limit}\right)\). This models a behavior where a code is limited by a bottleneck if it is running on multiple cores (i.e., the performance "saturates" at some point).
We want to solve a given problem with \(n=1\ldots 52\) cores and calculate the energy it takes to solve it. This is our "energy to solution" metric \(E(n)\) . You can assume that the amount of work to be done is normalized to 1, i.e., the time to solution on \(n\) cores is \(T(n)=1/P(n)\). For the following tasks you may find it helpful to use a spreadsheet program:

(a) (15 crd) Calculate \(E(n)\) for \(W_0=150 \,\mathrm W\), \(W_d=4 \,\mathrm W\), \(P(1)=1 \,\mathrm s^{-1}\), and \(P_{limit}=20 \,\mathrm s^{-1}\). Draw a diagram of \(E(n)\)(y-axis) vs. \(P(n)\) (x-axis), with \(n\) being the parameter along the curve (i.e., you will have 52 data points per series in this case). This is called a Z-plot. For which \(n_\min\) is \(E(n_\min)\) minimal? Can it make sense to use more than \(n_\min\) cores?
You can learn more about Z-plots in this blog post.

(b) (15 crd) Now assume we apply a code optimization (such as SIMD vectorization) that improves the single-core performance to \(P(1)=2.5\,\mathrm s^{-1}\), but \(P_{limit}\) is unaffected. How does that change \(n_\min\) and \(E(n_\min)\)? Draw the new function \(E(n)\) in the diagram from (a). What is the general conclusion from this result?

(c) (15 crd) Keeping \(P(1)\) as it is (i.e., the optimized value from (b)), we now perform an optimization that improves the saturated performance \(P_{limit}\) to \(30\,\mathrm s^{-1}\). How does that change \(n_\min\) and \(E(n_\min)\)? What happens if \(P_{limit}=200\,\mathrm s^{-1}\)? Draw both data sets into the diagram. What is the general conclusion from this result?
Balance. Calculate the in-memory code balance for loops with the following loop bodies, assuming that all arrays do not fit into any cache (a loop over i [and j, if present] is always implied, nontemporal stores do not apply):

(a) (5 crd) s = s + a[i]; (double precision)
(b) (5 crd) s = s + a[i]*b[i]; (double precision)
(c) (5 crd) a[i][j] = b[i][j]+ s*c[i][j]; (single precision, inner loop over j)
(d) (5 crd) a[i] = s*a[i]; (double precision)
(e) (15 crd) x[i] = x[i] + s*v[index[i]]; (DP for x[] and v[], and int64 for index[])

In case (e) you have to be a little imaginative; obviously it is impossible to give a single, definite answer...