Assignment 4: Data transfers galore!
- Peak performance and bandwidth.
Floating-point computations and memory bandwidth are important resources
of a processor. Both can be used to derive upper limits for code
performance. Assume a processor chip with the following properties:
- Clock frequency of 2.4 GHz
- 64 cores
- AVX2 instruction set (256-bit wide SIMD registers); each core is capable of retiring two full-width double-precision FMA instructions per cycle
- Memory: 8-channel DDR4 DRAM, 3200 MHz
(a) (5 crd) Calculate the double-precision floating-point peak performance of the chip in Tflop/s.
(b) (5 crd) Calculate the theoretical memory bandwidth of the chip in Gbyte/s.
(c) (20 crd) Consider the following "STREAM Triad" loop on double-precision data:
for(int i=0; i<1000000000; ++i)
    a[i] = b[i] + s * c[i];
Assume that nontemporal stores cannot be used here and that the work can be distributed evenly across all cores of the chip. Calculate the minimum time it takes to do the arithmetic in this loop (considering only the floating-point units and no other bottlenecks) and the minimum time it takes to transfer the data over the memory bus. From these numbers, what would you conclude is the relevant performance bottleneck when executing the loop on the above chip?
- Outer product. The following C loop nest computes the outer product of two vectors x[] and y[] in order to generate a matrix a[][]:
double a[N][N], x[N], y[N];
for(j=0; j<N; ++j)
for(i=0; i<N; ++i)
a[j][i] = x[i] * y[j];
You can assume that N is "large," i.e., the matrix does not fit into any cache, and that all data in a, x, and y are double-precision floating-point numbers. This serial code is run on one core of a 24-core processor with cache sizes of 32 KiB (per-core L1), 1 MiB (per-core L2), and 27 MiB (shared L3), a MULT peak performance of 32 GFlop/s per core, and a memory bandwidth of 104 GB/s. A core can execute two LOAD and one STORE instruction per cycle (throughput) in addition to two MULT instructions (and a few other instructions that are of no interest here). SIMD is supported up to AVX-512. The MULT instruction has a latency of four cycles. This is all quite close to an Intel Xeon "Skylake" CPU.
(a) (10 crd) Identify spatial and temporal access locality in the code. Which data structures lead to which locality of access? Does this depend on N? If yes, how?
(b) (20 crd) Calculate how much data must be transferred to and from main memory per iteration. If this depends on N, quantify the dependence. Can nontemporal stores be used? If they can, how would this change the memory data traffic per iteration?
- Optimizing for code balance. Suppose you have a code like this:
for(i=0; i<N; ++i) { a[i] += b[i] * c[i]; }
z[0] = 0.f;
for(i=1; i<N; ++i) { z[i] = a[i-1] * 0.5f; }
All arrays hold single-precision floating-point numbers and N is "large." You can assume that write-allocate transfers do happen on store misses.
(a) (20 crd) Rewrite this code in a way that will improve its performance. What is the expected change in the memory code balance? Discuss whether the optimized code is SIMD vectorizable.
(b) (20 crd) Consider the optimized code on a hypothetical CPU architecture with the following properties:
- L1-L2 data path bandwidth: 64 byte/cy, half duplex
- L2-L3 data path bandwidth: 16 byte/cy, half duplex
- inclusive cache hierarchy
- cache line length of 64 byte
- maximum memory bandwidth: 110 Gbyte/s (independent of RD vs. WR, half duplex)
- clock frequency: 2.2 GHz
Calculate the time, in CPU cycles per 16 iterations (one cache line length), that it takes to transfer the required data over each of the data paths (L1-L2, L2-L3, L3-memory) for a single core.