NHR Learning Platform
Assignment 3: Data transfer

Opened: Tuesday, 12 May 2026, 12:00 PM
Due: Wednesday, 20 May 2026, 10:05 AM
  1. Cache optimization. (10 crd) Consider the following code:

    for(int r=0; r<Nrows; ++r) {
      for(int c=0; c<Ncols; ++c) {
        s += m[r][c];
      }
    }
        
    Someone tells you: "To make the code faster, we have to optimize cache usage. To do that, we have to work on the matrix in tiles of \( b\times b \) elements and choose \( b \) so that a submatrix of \( b\times b \) elements fits into cache. Hence, we do not go through the matrix row by row and column by column but read submatrices of size \( b\times b \):" 
     
        for(int rs=0; rs<Nrows; rs+=b) {
          for(int cs=0; cs<Ncols; cs+=b) {
            for(int r=rs; r<rs+b; ++r) {
              for(int c=cs; c<cs+b; ++c) {
                s += m[r][c];
              }
            }
          }
        }
    

    (For the sake of simplicity, assume that Nrows and Ncols are multiples of \( b \).)
    What is your opinion on that? Give a detailed explanation. 

  2. An intricate stencil code. (30 crd) Predict the maximum in-L1 performance of the following loop nest on an "Ivy Bridge" core in flops/cycle (assuming double-precision arithmetic):

    for(int k=4; k < N-4; k++) {
     for(int j=4; j < N-4; j++) {
      for(int i=4; i < N-4; i++) {
       U[k][j][i] = 2*V[k][j][i] - U[k][j][i] + C[k][j][i] * 
        ( c0 *  V[k][j][i]
         +c1 * (V[ k ][ j ][i+1]+V[ k ][ j ][i-1] 
               +V[ k ][j+1][ i ]+V[ k ][j-1][ i ]
               +V[k+1][ j ][ i ]+V[k-1][ j ][ i ]) 
         +c2 * (V[ k ][ j ][i+2]+V[ k ][ j ][i-2]
               +V[ k ][j+2][ i ]+V[ k ][j-2][ i ] 
               +V[k+2][ j ][ i ]+V[k-2][ j ][ i ])
         +c3 * (V[ k ][ j ][i+3]+V[ k ][ j ][i-3] 
               +V[ k ][j+3][ i ]+V[ k ][j-3][ i ]
               +V[k+3][ j ][ i ]+V[k-3][ j ][ i ]) 
         +c4 * (V[ k ][ j ][i+4]+V[ k ][ j ][i-4]
               +V[ k ][j+4][ i ]+V[ k ][j-4][ i ] 
               +V[k+4][ j ][ i ]+V[k-4][ j ][ i ])); 
      }
     }
    }
    
    Consider AVX and scalar execution and identify the relevant execution bottleneck in each case.

  3. Dense MVM. (15 crd) The following loop nest multiplies a matrix a[][] by a vector b[] and stores the result in a vector c[]:

    for(int i=0; i<N; ++i)
      for(int j=0; j<N; ++j)
        c[i] += a[i][j] * b[j];


    We set N=15000, and all data in c, a, and b are double-precision floating-point numbers. This serial code is run on a quad-core processor with cache sizes of 32 KiB (per-core L1), 256 KiB (per-core L2), and 8 MiB (shared L3), a peak performance of 4 GFlop/s (ADD+MULT) per core, and a memory bandwidth of 16 GB/s. Apart from one multiply and one add instruction, a core can execute one load or one store instruction per cycle (throughput). There is no SIMD capability. The ADD instruction has a latency of four cycles, and the MULT instruction has a latency of six cycles.

    Identify spatial and temporal access locality in this code. Explain which data structures lead to which locality of access, and why. 


  4. Loop nests.

    (a) (10 crd) What is "wrong" with the following code? How would you fix it?

    #define N 16384
    //...
    s = 0.0;
    for (int i=0; i<N; i++) {
      for(int j=0; j<N; j++) {
         s += a[j][i];  
      }
    }
    
    Now assume a processor (single core) with a clock speed of 2.0 GHz, a maximum memory bandwidth of 128 Gbyte/s, a cache line length of 64 bytes, and a memory latency of 100 ns. It executes the code above in its "fixed" form. We assume that the compiler did a good job and applied the necessary unrolling etc. to optimize for pipelining in the core. 

    (b) (15 crd) Calculate the maximum memory bandwidth the code can draw from memory on our CPU without any prefetching. How does this value change if we could double the cache line length to 128 bytes?
    (c) (20 crd) How many concurrently outstanding load misses are required to saturate the memory bandwidth of 128 Gbyte/s (with the default cache line size of 64 bytes)? Describe what other prerequisites must be fulfilled to actually achieve that value with the code above!