PTfS26: Assignment 5: Multicore power and energy

Opened: Tuesday, 26 May 2026, 12:00 PM

Due: Thursday, 11 June 2026, 10:05 AM

Multicore power envelope. We wish to understand why processor vendors trade clock frequency for concurrency when introducing multicore chips, i.e., why chips with more cores usually run at lower frequencies. To this end, we establish a simple model for power dissipation. Assume that a certain single-core chip (with a given manufacturing process, e.g., 10 nm) dissipates a power of \( W=W_d \) when running at its base frequency \( f_0\). Furthermore, we assume that \(W\) is proportional to \( f^3 \).
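Written out, the two assumptions above combine into a single scaling law for the power dissipation at an arbitrary frequency \(f\). This is only a restatement of the model, not an additional assumption:

\[ W(f) = W_d \left(\frac{f}{f_0}\right)^3 \]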

(a) (5 crd) "Overclocking" is a popular strategy (especially among gamers, but also used for "Turbo Boost") to speed up the CPU. Calculate the power dissipation \(W\) when the clock frequency is increased by 30% (e.g., from 2.0 GHz to 2.6 GHz).
(b) (10 crd) If we require that the overall power dissipation of a chip shall be constant (\(W_d\)), we can trade clock speed for cores: By reducing the clock speed by some relative amount \(\Delta f/f_0\), we can put more cores (transistors) on the same chip and still stay within the power limit. Calculate the required clock speed reduction \(\Delta f/f_0\) for an \(m\)-core chip (instead of a single-core chip) with a power dissipation of \(W_d\). Calculate how many cores can be put on the chip (approximately) when the clock speed is cut in half.
(c) (5 crd) Now assume \(m=8\). Calculate the minimum and maximum performance gain (compared to the single-core chip with \(m\) times fewer transistors but the same power envelope, and the same memory interface) when using all eight cores. For which types of code do you expect these extremal performance gains?
(d) (10 crd) If all we cared about were the energy needed to complete a computation (i.e., a program) on the chip, what would be the optimal frequency to choose according to the model above? You can assume a compute-bound workload, i.e., a workload whose runtime is inversely proportional to the clock frequency. The "energy to solution" is the product of the power dissipation and the runtime.

Optimal energy to solution. The power/energy model in Task 1 was very crude, and we need some refinements. We want to model the power consumption of a 52-core processor under a certain workload, extending the model from Task 1 to be more realistic. We make the following assumptions:
- The whole chip dissipates a "baseline power" \(W_0\) when all cores are idle, independent of the clock frequency.
- If a core is executing instructions, it dissipates the additional ("dynamic") per-core power \(W_d\), which adds to the chip's baseline power linearly with the number of cores.
- The performance of a certain code running on a single core of this processor is \(P(1)\).
- The maximum performance of the parallel version of this code is \(P_{limit}>P(1)\). Hence, when solving a given problem in parallel with \(n\) cores, the overall performance is \(P(n)=\min\left(nP(1),P_{limit}\right)\). This models a behavior where a code is limited by a bottleneck if it is running on multiple cores (i.e., the performance "saturates" at some point).
We want to solve a given problem with \(n=1\ldots 52\) cores and calculate the energy it takes to solve it. This is our "energy to solution" metric \(E(n)\). You can assume that the amount of work to be done is normalized to 1, i.e., the time to solution on \(n\) cores is \(T(n)=1/P(n)\). For the following tasks you may find it helpful to use a spreadsheet program:
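As an alternative to a spreadsheet, the model assumptions above can be written down as a short script. The following is a minimal sketch in Python, not a reference solution: the function names and all parameter values (`W0`, `WD`, `P1`, `P_LIMIT`, `CORES`) are hypothetical and were chosen only for illustration. They are deliberately not the values used in the tasks.

```python
def performance(n, p1, p_limit):
    """P(n) = min(n * P(1), P_limit): linear scaling until saturation."""
    return min(n * p1, p_limit)


def energy_to_solution(n, w0, wd, p1, p_limit):
    """E(n) = W(n) * T(n) for one normalized unit of work."""
    power = w0 + n * wd                          # baseline plus n active cores
    runtime = 1.0 / performance(n, p1, p_limit)  # T(n) = 1 / P(n)
    return power * runtime


# Hypothetical example parameters (illustration only, not the task values)
W0, WD, P1, P_LIMIT, CORES = 50.0, 10.0, 1.0, 4.0, 8

# One (P(n), E(n)) pair per core count: these are the points of a Z-plot
zplot = [
    (performance(n, P1, P_LIMIT), energy_to_solution(n, W0, WD, P1, P_LIMIT))
    for n in range(1, CORES + 1)
]

# Core count with the lowest energy to solution
n_min = min(
    range(1, CORES + 1),
    key=lambda n: energy_to_solution(n, W0, WD, P1, P_LIMIT),
)

for n, (p, e) in enumerate(zplot, start=1):
    print(f"n={n}  P={p:.1f} 1/s  E={e:.2f} J")
print("n_min =", n_min)
```

Plotting the pairs in `zplot` with the performance on the x-axis and the energy on the y-axis gives the Z-plot for these example parameters.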

(a) (15 crd) Calculate \(E(n)\) for \(W_0=120 \,\mathrm W\), \(W_d=4 \,\mathrm W\), \(P(1)=1 \,\mathrm s^{-1}\), and \(P_{limit}=20 \,\mathrm s^{-1}\). Draw a diagram of \(E(n)\) (y-axis) vs. \(P(n)\) (x-axis), with \(n\) being the parameter along the curve (i.e., you will have 52 data points per series in this case). This is called a Z-plot. For which \(n_\min\) is \(E(n_\min)\) minimal? Can it make sense to use more than \(n_\min\) cores?
You can learn more about Z-plots in this blog post.

(b) (15 crd) Now assume we apply a code optimization (such as SIMD vectorization) that improves the single-core performance to \(P(1)=2.5\,\mathrm s^{-1}\), but \(P_{limit}\) is unaffected. How does that change \(n_\min\) and \(E(n_\min)\)? Draw the new function \(E(n)\) in the diagram from (a). What is the general conclusion from this result?

(c) (15 crd) Keeping \(P(1)\) as it is (i.e., the optimized value from (b)), we now perform an optimization that improves the saturated performance \(P_{limit}\) to \(30\,\mathrm s^{-1}\). How does that change \(n_\min\) and \(E(n_\min)\)? What happens if \(P_{limit}=200\,\mathrm s^{-1}\)? Draw both data sets into the diagram. What is the general conclusion from this result?

(d) (15 crd) Now assume a purely core-bound workload (\( P_{limit}=\infty\)) whose performance is linear in the clock frequency. We also assume that the chip's power dissipation is \( W=W_0+nW_2f^2 \), i.e., the per-core dynamic power is \( W_d=W_2f^2 \). This is a realistic model for many contemporary high-end chips. Calculate the frequency \( f_{opt} \) at which the energy to solution is minimal. Interpret the result: What happens for very large and very small \(W_0\)?

Peak performance and bandwidth. Floating-point computations and memory bandwidth are important resources of a processor. Both can be used to derive upper limits for code performance. Assume a processor chip with the following properties:
- Clock frequency of 2.0 GHz
- 48 cores
- AVX-512 instruction set (512-bit wide SIMD registers); each core is capable of retiring two full-width double-precision FMA instructions per cycle
- Memory: 12-channel DDR4-3200 DRAM (3200 MT/s per channel)
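Both limits follow from multiplying out hardware properties like the ones listed above. The sketch below shows the arithmetic for a different, hypothetical chip (8 cores, 3.0 GHz, 256-bit SIMD, two channels at 2400 MT/s); none of these numbers come from the assignment. It assumes that one FMA counts as two flops per SIMD lane and that each DDR channel has a 64-bit (8-byte) data bus.

```python
def peak_flops(cores, freq_hz, simd_bits, fma_per_cycle, bits_per_value=64):
    """Peak floating-point rate in flop/s for FMA-capable SIMD cores."""
    lanes = simd_bits // bits_per_value   # values per SIMD register
    flops_per_fma = 2                     # one multiply plus one add per lane
    return cores * freq_hz * fma_per_cycle * lanes * flops_per_fma


def memory_bandwidth(channels, transfers_per_s, bytes_per_transfer=8):
    """Theoretical bandwidth in byte/s (assumes a 64-bit bus per channel)."""
    return channels * transfers_per_s * bytes_per_transfer


# Hypothetical example chip (illustration only, not the chip from the task)
peak = peak_flops(cores=8, freq_hz=3.0e9, simd_bits=256, fma_per_cycle=2)
bandwidth = memory_bandwidth(channels=2, transfers_per_s=2400e6)

print(f"peak performance: {peak / 1e12:.3f} Tflop/s")
print(f"memory bandwidth: {bandwidth / 1e9:.1f} Gbyte/s")
```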
(a) (5 crd) Calculate the double-precision floating-point peak performance of the chip in Tflop/s.
(b) (5 crd) Calculate the theoretical memory bandwidth of the chip in Gbyte/s.
(c) (20 crd) Consider the following "STREAM Triad" loop on double-precision data:
```
// 10^9 iterations; a, b, c are arrays of double, s is a double scalar
for (int i = 0; i < 1000000000; ++i)
    a[i] = b[i] + s * c[i];
```
Assume that nontemporal stores can be used here, and that the calculation can be distributed evenly across all cores of the chip. Calculate the minimum time it takes to do the arithmetic in this loop (considering only the floating-point units and no other bottlenecks) and the minimum time it takes to transfer the data over the memory bus. From these numbers, what would you conclude is the relevant performance bottleneck when executing the loop on the above chip?

Is the assumption justified that the cores' peak performance can be a bottleneck? Describe how one would have to refine the model to make it more realistic.
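The two lower bounds follow from counting flops and bytes per loop iteration and dividing by the respective machine limit. A minimal sketch, assuming 2 flops per iteration and, with nontemporal stores, 24 bytes of memory traffic per iteration (two 8-byte loads and one 8-byte store, no write-allocate transfer). The machine limits `PEAK_FLOPS` and `MEM_BW` and the loop length `N` are hypothetical values for illustration, not those of the chip above.

```python
FLOPS_PER_ITER = 2        # one multiply and one add
BYTES_PER_ITER = 3 * 8    # load b[i], load c[i], store a[i] (nontemporal)

# Hypothetical machine limits and loop length (illustration only)
PEAK_FLOPS = 3.84e11      # flop/s
MEM_BW = 3.84e10          # byte/s
N = 100_000_000           # loop iterations

t_arith = N * FLOPS_PER_ITER / PEAK_FLOPS   # time if only arithmetic mattered
t_mem = N * BYTES_PER_ITER / MEM_BW         # time if only data transfer mattered

# In this simple model the larger of the two times is the bottleneck
bottleneck = "memory" if t_mem > t_arith else "arithmetic"

print(f"arithmetic: {t_arith * 1e3:.3f} ms")
print(f"memory:     {t_mem * 1e3:.3f} ms")
print("bottleneck:", bottleneck)
```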

(d) (10 crd) Assuming the bottleneck analysis above is correct, how much faster would the code be on an NVIDIA A100-SXM4 GPU (see slide 5 in lecture 6)?

Energy of \(\pi\). (20 crd) It is possible to measure the energy consumption of a program on the CPUs of the Fritz cluster. You have to submit your job with the extra option

      -C hwperf

to salloc or sbatch. This enables the use of the performance counter infrastructure needed for doing energy measurements. In addition, you have to load the likwid module and wrap your binary into the likwid-perfctr tool:

      $ module load likwid
      $ srun <srun-options> likwid-perfctr -C 0 \
           -g ENERGY ./a.out <your_options>

This will run your code on core 0 of the first CPU and give you info about the power dissipation (output line "Power [W]") and the energy consumption (output line "Energy [J]") on the whole chip while your code was running. The output will look something like this:

Group 1: ENERGY
+-----------------------+---------+-------------+
|         Event         | Counter |  HWThread 0 |
+-----------------------+---------+-------------+
|   INSTR_RETIRED_ANY   |  FIXC0  |  3015640354 |
| CPU_CLK_UNHALTED_CORE |  FIXC1  |  4725394022 |
|  CPU_CLK_UNHALTED_REF |  FIXC2  |  9451143744 |
|     TOPDOWN_SLOTS     |  FIXC3  | 23626970110 |
|       TEMP_CORE       |   TMP0  |          36 |
|     PWR_PKG_ENERGY    |   PWR0  |    400.5558 |
|     PWR_PP0_ENERGY    |   PWR1  |           0 |
|    PWR_DRAM_ENERGY    |   PWR3  |     29.2418 |
|  PWR_PLATFORM_ENERGY  |   PWR4  |           0 |
|   UNCORE_CLOCKTICKS   | UBOXFIX |  8563231784 |
+-----------------------+---------+-------------+

+----------------------+------------+
|        Metric        | HWThread 0 |
+----------------------+------------+
|  Runtime (RDTSC) [s] |     4.2873 |
| Runtime unhalted [s] |     1.9689 |
|      Clock [MHz]     |  1199.9548 |
|  Uncore Clock [MHz]  |  1997.3679 |
|          CPI         |     1.5670 |
|    Temperature [C]   |         36 |
|      Energy [J]      |   400.5558 |
|       Power [W]      |    93.4294 |
|    Energy PP0 [J]    |          0 |
|     Power PP0 [W]    |          0 |
|    Energy DRAM [J]   |    29.2418 |
|    Power DRAM [W]    |     6.8206 |
|  Energy PLATFORM [J] |          0 |
|  Power PLATFORM [W]  |          0 |
+----------------------+------------+
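When collecting results for several runs, it can be convenient to extract the relevant numbers from the metric table automatically. The following is a minimal sketch that assumes the output has been captured as text in exactly the table layout shown above; the function name `parse_metrics` and the variable `sample` are hypothetical, and the parser is not part of likwid.

```python
import re

# Matches two-column rows such as "|      Energy [J]      |   400.5558 |"
ROW = re.compile(
    r"^\s*\|\s*([^|]+?)\s*\|\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*\|\s*$"
)


def parse_metrics(text):
    """Return {metric name: value} for all numeric two-column table rows."""
    metrics = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


# Excerpt of the sample output shown above
sample = """
+----------------------+------------+
|        Metric        | HWThread 0 |
+----------------------+------------+
|  Runtime (RDTSC) [s] |     4.2873 |
|      Energy [J]      |   400.5558 |
|       Power [W]      |    93.4294 |
|    Energy DRAM [J]   |    29.2418 |
+----------------------+------------+
"""

metrics = parse_metrics(sample)
print("Energy [J]:", metrics["Energy [J]"])
print("Power [W]: ", metrics["Power [W]"])
```

Header rows, border lines, and the three-column event table do not match the pattern and are skipped.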

Run your code at three clock frequency settings: in "Turbo Mode", at the base frequency of 2.4 GHz, and at 1.2 GHz. For each setting, use both the scalar and the AVX-512 variant (see the previous assignment on how to do this) and report the energy consumption in all six cases. Which of the six variants consumed the least energy?