PTfS26: Assignment 4: Memory hierarchy, square roots

Opened: Tuesday, 19 May 2026, 12:00 PM

Due: Wednesday, 27 May 2026, 10:05 AM

Optimizing for code balance. Suppose you have a code like this:
```
for(i=0; i<N; ++i) {
  a[i] += b[i] * c[i];
}

for(i=0; i<N; i+=2) {
  z[i] = a[i] * 0.5f;
}
```
All arrays hold single-precision floating-point numbers and N is "large." You can assume that write-allocate transfers do happen on store misses.

(a) (15 crd) Rewrite this code in a way that will improve its performance. Calculate the expected change in the memory code balance.
(b) (25 crd) Consider the optimized code on a hypothetical CPU architecture with the following properties:
* L1-L2 data path bandwidth 64 byte/cy half-duplex (i.e., it's either read or write of a cache line in any cycle)
* L2-L3 data path bandwidth 16 byte/cy half-duplex
* inclusive cache hierarchy
* Cache line length of 64 byte
* Maximum memory bandwidth 110 Gbyte/s (independent of RD vs. WR, half duplex)
* clock frequency 2.4 GHz
Calculate the time in CPU cycles per 16 iterations (1 cache line length) that it takes to transfer the required data over each of the data paths: L1-L2, L2-L3, L3-memory for a single core.
Balance. Calculate the in-memory code balance for loops with the following loop bodies, assuming that all arrays do not fit into any cache (a loop over i [and j, if present] is always implied, nontemporal stores do not apply):

(a) (5 crd) s = s + a[i]; (single precision)
(b) (5 crd) s = s + a[i]*a[i]; (double precision)
(c) (5 crd) a[i][j] = a[i][j] + s*c[i][j]; (single precision, inner loop over j)
(d) (5 crd) a[i] = s*b[i]; (int32)
(e) (15 crd) x[i] = s*v[index[i]]; (DP for x[] and v[], and int32 for index[])

In case (e) you have to be a little imaginative; obviously it is impossible to give a single, definite answer...
Go back to Assignment 0 (the \( \pi \) code. If you do not have the code, please use the attached integrate.c file). Compile the code with the following options:
(a) -O3 -xCORE-AVX512 -qopt-zmm-usage=high (AVX512)
(b) -O3 -xAVX2 (AVX2)
(c) -O3 -xSSE4.2 (SSE 4.2)
(d) -O1 -no-vec (scalar)

Run the four binaries on a Fritz core with appropriate problem sizes (i.e., the loop length should be long enough to get stable measurements). The vector length decreases from 512 bit to 128 bit and finally to 64 bit (no vectorization at all).
The SQRT instruction is very slow on this CPU. You can safely assume that it dominates the runtime of each loop iteration, and that all other work is hidden behind it. It exists in all variants, from 512-bit wide to scalar.

(10 crd) How many cycles does each variant of the SQRT instruction take?

(15 crd) Is the result what you expected? If not, what could be the reason?

integrate.c
19 May 2026, 8:40 AM