PTfS25: Assignment 1: Code execution

Opened: Tuesday, 6 May 2025, 12:00 AM

Due: Thursday, 15 May 2025, 10:05 AM

Pipelines. (20 credits) Assume a CPU core with floating-point multiply pipeline that has a depth (latency) of 5 cycles and can deliver 1 result per cycle at maximum. Calculate the required number of independent MULT instructions to achieve half of the maximum throughput!
More pipelines. (30 credits) Consider the following code:
```
    double a[...];
    double s=0.1;
    // a[] contains sensible data
    for(int i=2; i<N; ++i) {
      a[i] += s * a[i-2];
    }
```
This code is run on a superscalar, out-of-order CPU core with the following properties:
- Floating-point ADD pipeline depth of 6 cy
- Floating-point MULT pipeline depth of 8 cy
- Capability of executing 1 ADD, 1 MULT, 1 LOAD, and 1 STORE instruction per cycle (no FMA)
- Overall instruction throughput limit of 4 instructions retired per cycle
- Register set of 16 floating-point registers and 16 integer registers
- No SIMD capability
We assume that the required data (i.e., a[]) resides in the L1 cache. We also assume N to be large enough so that wind-up/down effects can be ignored. Assuming the compiler produces perfect code, calculate the expected performance of the loop in flops/cy.
Logarithm in depth. (20 credits) Look again at the integration code from Assignment 0. In the main loop, function values are accumulated into a summation variable. Assuming that the divide instruction is the sole bottleneck in this calculation, and that it has a latency of 6 cycles and a throughput of 0.25 ins/cy, describe the optimization the compiler must apply to achieve optimal performance for this loop!
Memory bandwidth. A 36-core CPU in the Ice Lake part of the Fritz cluster has an achievable memory bandwidth of about 160 Gbyte/s and runs at a base clock frequency of 2.4 GHz.

(a) (10 credits) Calculate how many bytes the memory interface can deliver per cycle and core!
(b) (20 credits) Assuming that each core has two double-precision ADD pipelines with a throughput of one ADD operation per cycle each, estimate the fraction of the ADD peak performance that can be achieved if each ADD operand must come from main memory and the results are stored back into memory:
```
double x[], y[], a[], b[], c[], d[];
for(i=0; i<N, i++) {
  y[i] = a[i] + b[i];
  x[i] = c[i] + d[i];
}
```