PTfS25: Assignment 3: Pmax, data transfer

Opened: Tuesday, 20 May 2025, 12:00 AM

Due: Thursday, 29 May 2025, 10:05 AM

Max in-core performance.
(a) (20 credits) Calculate the maximum achievable in-core performance P_max in Gflop/s of an AVX512-vectorized double-precision three-point stencil on an Ice Lake SP core at 1.9 GHz:
```
for(int i=0; i<N; ++i)
  for(int j=1; j<M-1; ++j)
    a[i][j] = 0.2*(b[i][j] + b[i][j-1] + b[i][j+1]);
```
What fraction of the theoretical core peak performance is this? Also calculate the expected P_max for the AVX2 and scalar variants.
Note: You may assume that every load in the high-level code is an actual load instruction on the machine-code level.

(b) (10 credits) Assume that nontemporal stores and cache line claim cannot be used. How many bytes are transferred per inner loop iteration if N and M are so large that the working set does not fit into any cache? What is the resulting minimum code balance in byte/flop?
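
As an illustration of the reasoning behind part (a), here is a minimal sketch of a throughput-bound estimate. It assumes that all instruction streams overlap perfectly, so that the slowest one sets the cycle count per iteration. The type and function names (`iter_mix`, `core_model`, `cycles_per_iter`, `pmax_gflops`) are hypothetical, and the per-cycle limits are inputs that must come from the machine model used in the lecture; they are not given in this assignment.

```c
#include <assert.h>

/* Instructions issued per loop iteration (one SIMD vector or one scalar). */
typedef struct {
    double loads;
    double stores;
    double fp_instr;
} iter_mix;

/* ASSUMPTION: sustained per-cycle instruction limits of the core.
   These are model inputs, not values taken from the assignment. */
typedef struct {
    double loads_per_cy;
    double stores_per_cy;
    double fp_per_cy;
} core_model;

static double max2(double x, double y) { return x > y ? x : y; }

/* Cycles per iteration under full overlap: the slowest stream wins. */
static double cycles_per_iter(iter_mix m, core_model c) {
    assert(c.loads_per_cy > 0.0 && c.stores_per_cy > 0.0 && c.fp_per_cy > 0.0);
    double cy = m.loads / c.loads_per_cy;
    cy = max2(cy, m.stores / c.stores_per_cy);
    cy = max2(cy, m.fp_instr / c.fp_per_cy);
    return cy;
}

/* P_max in Gflop/s = flops per iteration / cycles per iteration * GHz. */
static double pmax_gflops(iter_mix m, core_model c,
                          double flops_per_iter, double clock_ghz) {
    return flops_per_iter / cycles_per_iter(m, c) * clock_ghz;
}
```

Here `flops_per_iter` is the number of flops per updated element times the SIMD width in doubles (8 for AVX-512, 4 for AVX2, 1 for scalar code).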

Latency and bandwidth. Consider a CPU with an asymptotic memory bandwidth of \( b_S=250\,\mathrm{GB/s} \) and a memory latency of \( T_\ell=130\,\mathrm{ns} \). Assume that the duration of data transfer follows the Hockney model as described in the lecture.

(a) (10 credits) Calculate the time of data transfer and the effective memory bandwidth for a message of size 4096 byte.
(b) (10 credits) For general values of \( T_\ell \) and \( b_S \), calculate the message size \( N_{1/2}(T_\ell,b_S) \) at which \( B_\mathrm{eff}(N_{1/2})=b_S/2 \), i.e., at which the effective bandwidth is half the asymptotic bandwidth.
(c) (10 credits) At a cache line size of 64 byte, calculate how many outstanding prefetches are needed to fully hide the memory latency and how much data (in bytes) must thus be kept "in flight."
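
The Hockney model is commonly written as \( T(N)=T_\ell+N/b_S \) for a message of \( N \) bytes, with effective bandwidth \( B_\mathrm{eff}(N)=N/T(N) \). The following minimal sketch assumes this form; if the lecture uses a different convention, adapt it accordingly. The function names `hockney_time` and `hockney_beff` are hypothetical.

```c
#include <assert.h>

/* Transfer time in seconds for n_bytes, assuming T(N) = T_l + N / b_S.
   latency_s is T_l in seconds, bandwidth_Bps is b_S in byte/s. */
static double hockney_time(double n_bytes, double latency_s,
                           double bandwidth_Bps) {
    assert(bandwidth_Bps > 0.0);
    return latency_s + n_bytes / bandwidth_Bps;
}

/* Effective bandwidth in byte/s: message size divided by transfer time. */
static double hockney_beff(double n_bytes, double latency_s,
                           double bandwidth_Bps) {
    return n_bytes / hockney_time(n_bytes, latency_s, bandwidth_Bps);
}
```

For small messages the latency term dominates and the effective bandwidth stays far below \( b_S \); for very large messages it approaches \( b_S \).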

Strided access. Write a benchmark code for the double-precision "vector update" kernel and modify it so that only every Mth element is used:
```
for(int i=0; i<N; i+=M)
   a[i] = s * a[i];
```
This is called a "strided loop" with stride M.
(a) (20 credits) Plot the performance of this loop versus M∈{1,2,4,8} for N=10⁸ on one Fritz core (do not forget to fix the clock frequency) and explain the change in performance with growing M. What happens if you increase M even further, using powers of two up to 2²⁰? Include these data in the plot.
(b) (20 credits) Explore what happens if you do not choose powers of 2 for M. Use strides of M=8*1.2ⁿ, with n a positive integer and M≤10⁶. Add these data to the plot from (a). Explain the difference in behavior.

Note: Use the benchmarking harness from previous tasks to make sure that the benchmark loop is repeated often enough to get a runtime of at least 500 ms.
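
The repetition logic mentioned in the note could look roughly like the following sketch. It is not the harness from the previous tasks: `wall_time`, `strided_update`, `bench_strided`, and `sink` are hypothetical names, timing uses the POSIX `clock_gettime` call, and the stride is assumed to be a positive integer. The repeat count is doubled until the measured runtime reaches the requested minimum.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Volatile sink: reading a[0] into it after each sweep makes the result
   observable. This is a sketch, not a guarantee against every compiler
   transformation. */
static volatile double sink;

/* Wall-clock time in seconds from the POSIX monotonic clock. */
static double wall_time(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* One sweep of the strided vector update; returns the number of updates. */
static size_t strided_update(double *a, size_t n, size_t stride, double s) {
    assert(stride > 0);
    size_t updates = 0;
    for (size_t i = 0; i < n; i += stride) {
        a[i] = s * a[i];
        ++updates;
    }
    return updates;
}

/* Repeats the sweep, doubling the repeat count until the timed region
   lasts at least min_seconds. Returns updates per second, which equals
   flop/s here because each update performs one multiplication. */
static double bench_strided(double *a, size_t n, size_t stride,
                            double s, double min_seconds) {
    size_t reps = 1;
    for (;;) {
        size_t updates = 0;
        double start = wall_time();
        for (size_t r = 0; r < reps; ++r) {
            updates += strided_update(a, n, stride, s);
            sink = a[0];
        }
        double elapsed = wall_time() - start;
        if (elapsed >= min_seconds)
            return (double)updates / elapsed;
        reps *= 2;
    }
}
```

A caller would allocate and initialize `a` with N doubles, then call `bench_strided(a, N, M, s, 0.5)` for each stride M. Choosing `s` different from 1.0 keeps the compiler from simplifying the multiplication away.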