
Assignment 3: Pmax, data transfer

Opened: Tuesday, 20 May 2025, 12:00 AM
Due: Thursday, 29 May 2025, 10:05 AM
  1. Max in-core performance.
    (a) (20 credits) Calculate the maximum achievable in-core performance Pmax in Gflop/s of an AVX512-vectorized double-precision three-point stencil on an Ice Lake SP core at 1.9 GHz: 

    for(int i=0; i<N; ++i)
      for(int j=1; j<M-1; ++j)
        a[i][j] = 0.2*(b[i][j] + b[i][j-1] + b[i][j+1]);
    What fraction of the theoretical core peak performance is this?  Also calculate the expected Pmax for the AVX2 and scalar variants.
    Note: You may assume that every load in the high-level code is an actual load instruction on the machine-code level.

    (b) (10 credits) Assuming that nontemporal stores or cache line claim cannot be used, what is the data transfer in bytes per inner loop iteration if N and M are large so that the working set does not fit into any cache? What is thus the minimum code balance in byte/flop?

  2. Latency and bandwidth. Consider a CPU with an asymptotic memory bandwidth of \( b_S=250\,\mathrm{GB/s} \) and a memory latency of \( T_\ell=130\,\mathrm{ns} \). Assume that the duration of data transfer follows the Hockney Model as described in the lecture. 

    (a) (10 credits) Calculate the time of data transfer and the effective memory bandwidth for a message of size 4096 byte.
    (b) (10 credits) For general values of \( T_\ell \) and \( b_S \), calculate the message size \( N_{1/2}(T_\ell,b_S) \) at which \( B_\mathrm{eff}(N_{1/2})=b_S/2 \), i.e., at which the effective bandwidth is half the asymptotic bandwidth.
    (c) (10 credits) At a cache line size of 64 byte, calculate how many outstanding prefetches are needed to fully hide the memory latency and how much data (in bytes) must thus be kept "in flight."

  3. Strided access. Write a benchmark code for the double-precision "vector update" kernel a[i] = s * a[i] and modify it so that only every Mth element is used:

    for(i=0; i<N; i+=M)
       a[i] = s * a[i];

    This is called a "strided loop" with stride M.
    (a) (20 credits) Plot the performance of this loop versus \( M\in\{1,2,4,8\} \) for \( N=10^8 \) on one Fritz core (do not forget to fix the clock frequency) and explain the change in performance with growing M. What happens if you increase M even further, using powers of two up to \( 2^{20} \)? Include that data into the plot.
    (b) (20 credits) Explore what happens if you do not choose powers of 2 for M. Use strides of \( M=8\cdot 1.2^n \), with n a positive integer and \( M\le 10^6 \). Plot the data into the diagram from (a). Explain the difference in behavior.

    Note: Use the benchmarking harness from previous tasks to make sure that the benchmark loop is repeated often enough to get a runtime of at least 500 ms.
