NHR Learning Platform
Assignment 4: Data transfer

Opened: Monday, 5 May 2025, 12:00 AM
Due: Thursday, 5 June 2025, 10:05 AM
  1. Peak performance and bandwidth. Floating-point execution throughput and memory bandwidth are two key resources of a processor; both can be used to derive upper limits for code performance. Assume a processor chip with the following properties:

    • Clock frequency of 2.0 GHz
    • 48 cores
    • AVX-512 instruction set (512-bit wide SIMD registers); each core is capable of retiring two full-width double-precision FMA instructions per cycle
    • Memory: 12-channel DDR4-3200 DRAM (3200 MT/s per channel)

    (a) (5 crd) Calculate the double-precision floating-point peak performance of the chip in Tflop/s.
    (b) (5 crd) Calculate the theoretical memory bandwidth of the chip in Gbyte/s.
    (c) (20 crd) Consider the following "STREAM Triad" loop on double-precision data:

    for(int i=0; i<1000000000; ++i)   // 10^9 iterations
        a[i] = b[i] + s * c[i];
    Assume that nontemporal stores cannot be used here, and that the calculation can be distributed evenly across all cores of the chip. Calculate the minimum time it takes to do the arithmetic in this loop (considering only the floating-point units and no other bottlenecks) and the minimum time it takes to transfer the data over the memory bus. From these numbers, what would you conclude is the relevant performance bottleneck when executing the loop on the above chip?

    Is the assumption that the cores' peak performance is a bottleneck justified? Describe how one would have to refine the model to make it more realistic. 
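    For experimentation, the Triad loop can be wrapped in a minimal, self-contained benchmark. This is only an illustrative sketch, not part of the assignment: the array length is scaled down from the 10^9 elements above so it runs anywhere, and a single untimed-warm-up-free run will not give publication-quality bandwidth numbers.

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        const long n = 1L << 20;              /* 2^20 elements, scaled down from 10^9 */
        const double s = 3.0;
        double *a = malloc(n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        double *c = malloc(n * sizeof *c);
        assert(a && b && c);

        for (long i = 0; i < n; ++i) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < n; ++i)          /* the STREAM Triad kernel */
            a[i] = b[i] + s * c[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double dt = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
        /* 3 explicit streams + 1 write-allocate stream = 32 byte per iteration */
        printf("Triad: %.2f Gbyte/s (single run)\n", 32.0 * n / dt / 1e9);

        for (long i = 0; i < n; ++i) assert(a[i] == 7.0);  /* 1 + 3*2, exact in FP */
        puts("OK");
        free(a); free(b); free(c);
        return 0;
    }
    ```

    Note that without nontemporal stores, each store to a[] also triggers a write-allocate read, which is why the sketch counts four 8-byte streams per iteration.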

  2. Outer product. The following C loop nest computes the outer product of two vectors x[] and y[] in order to generate a matrix a[][]:

    double a[N][N], x[N], y[N];
    for(int j=0; j<N; ++j)
        for(int i=0; i<N; ++i)
            a[j][i] = x[i] * y[j];

    You can assume that N is "large," i.e., the matrix does not fit into any cache, and that all data in a, x, and y are double-precision floating-point numbers. This serial code is run on one core of a 24-core processor with cache sizes of 32 KiB (per-core L1), 1 MiB (per-core L2), and 27 MiB (shared L3). The L3 is an exclusive victim cache: data coming from memory is loaded directly into the L2, and cache lines evicted from L2 are placed in the L3. This is all quite close to an Intel Xeon "Skylake" CPU.

    (a) (10 crd) Identify spatial and temporal access locality in the code. Which data structures lead to which locality of access? Does this depend on N? If yes, how?
    (b) (20 crd) Someone says "To optimize the code, one must block the loop nest so that the matrix is divided into tiles. The tile size must be chosen so that one matrix tile fits into the cache." 
    Extend the loop nest above to implement this strategy. For simplicity you can assume square tiles (the block sizes in the i and j directions are the same).
    Calculate the necessary tile size to block for the L2 cache.
    (c) (20 crd) At N=10000, will the tiling improve the performance compared to the untiled code? For both the untiled and the tiled code, calculate how much data must be transferred to and from main memory, L3, L2, and L1 per iteration (i.e., calculate the code balance in byte/it for all levels of the memory hierarchy). Assume that NT (nontemporal) stores cannot be used.
    Based on these numbers, explain whether tiling helps and evaluate the statement from (b).
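    As a starting point for measurements, the untiled loop nest can be made into a complete program. This is a sketch under assumptions: N is set small here so it runs quickly (the assignment assumes N is "large"), and the fill values of x and y are arbitrary test data.

    ```c
    #include <assert.h>
    #include <stdio.h>

    #define N 1024   /* small test size; scale up for real experiments */

    int main(void) {
        static double a[N][N], x[N], y[N];   /* static: avoids a large stack frame */
        for (int i = 0; i < N; ++i) { x[i] = (double)i; y[i] = 2.0; }

        /* the untiled outer product from the assignment text */
        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                a[j][i] = x[i] * y[j];

        for (int j = 0; j < N; ++j)
            for (int i = 0; i < N; ++i)
                assert(a[j][i] == 2.0 * (double)i);
        puts("OK");
        return 0;
    }
    ```

    A tiled variant from part (b) can be dropped in place of the middle loop nest and validated against the same assertions.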

  3. Optimizing for code balance. Suppose you have a code like this:

    for(i=0; i<N; ++i) {
      a[i] += b[i] * c[i];
    }
    
    for(i=0; i<N; ++i) {
      z[i] = a[i] * 0.5f;
    }
    
    All arrays hold single-precision floating-point numbers and N is "large." You can assume that write-allocate transfers do happen on store misses. 

    (a) (5 crd) Rewrite this code in a way that will improve its performance. What is the expected change in the memory code balance?
    (b) (15 crd) Consider the optimized code on a hypothetical CPU architecture with the following properties:

    • L1-L2 data path bandwidth: 64 byte/cy, half-duplex (i.e., either a read or a write of a cache line in any given cycle)
    • L2-L3 data path bandwidth: 16 byte/cy, half-duplex
    • Inclusive cache hierarchy
    • Cache line length: 64 byte
    • Maximum memory bandwidth: 110 Gbyte/s (independent of read vs. write, half-duplex)
    • Clock frequency: 2.2 GHz

    Calculate the time in CPU cycles per 16 iterations (one cache line length) that it takes to transfer the required data over each of the data paths (L1-L2, L2-L3, L3-memory) for a single core.
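    For reference, the two-loop fragment from problem 3 can be turned into a self-contained harness, so that any rewrite from (a) can be checked for correctness against it. The size N and the fill values are arbitrary choices; they are picked so the expected results are exact in single precision.

    ```c
    #include <assert.h>
    #include <stdio.h>

    #define N 4096   /* test size; the assignment assumes N is "large" */

    int main(void) {
        static float a[N], b[N], c[N], z[N];
        for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; c[i] = 3.0f; }

        /* original version: two separate loops, both traversing a[] */
        for (int i = 0; i < N; ++i)
            a[i] += b[i] * c[i];

        for (int i = 0; i < N; ++i)
            z[i] = a[i] * 0.5f;

        for (int i = 0; i < N; ++i) {
            assert(a[i] == 7.0f);   /* 1 + 2*3 */
            assert(z[i] == 3.5f);   /* 7 * 0.5 */
        }
        puts("OK");
        return 0;
    }
    ```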