PTfS26: Assignment 1: Code execution

Opened: Wednesday, 29 April 2026, 12:00 PM

Due: Thursday, 7 May 2026, 10:05 AM

Pipelines. (25 credits) Assume a CPU core with two floating-point add pipelines, each with a depth (latency) of 4 cycles and each able to deliver at most 1 result per cycle. Calculate the number of independent ADD instructions required to achieve 90% of the maximum possible throughput.
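As a starting point, the textbook timing model for a single pipeline can be sketched in C. This is only a sketch under stated assumptions: one pipeline, no stalls, and all operations independent, so that n operations take depth + n - 1 cycles. The function names `pipeline_efficiency` and `min_ops_for_efficiency` are hypothetical and not part of the assignment; how the model carries over to two pipelines is left to the reader.

```c
#include <assert.h>

/* Fraction of peak throughput reached by ONE pipeline of the given depth
 * when it processes n_ops independent operations back to back.
 * Assumed model: the first result appears after `depth` cycles, then one
 * result per cycle, so the total time is depth + n_ops - 1 cycles. */
double pipeline_efficiency(int depth, long n_ops)
{
    assert(depth > 0 && n_ops > 0);
    double cycles = (double)depth + (double)n_ops - 1.0;
    return (double)n_ops / cycles;
}

/* Smallest number of independent operations that reaches at least the
 * target efficiency (0 < target < 1) on one pipeline. */
long min_ops_for_efficiency(int depth, double target)
{
    assert(depth > 0 && target > 0.0 && target < 1.0);
    long n = 1;
    while (pipeline_efficiency(depth, n) < target)
        ++n;
    return n;
}
```

For example, with a hypothetical depth of 5, `min_ops_for_efficiency(5, 0.5)` returns 4, since 4 / (5 + 4 - 1) = 0.5.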
More pipelines. Consider the following code:
```
    double a[...], b[...], c[...];
    double s=0.1;
    // a[], b[], c[] contain sensible data
    for(int i=2; i<N; ++i) {
      a[i] += s * a[i-2];
    }
    for(int i=2; i<N; ++i) {
      b[i] += s * b[i-2];
    }
    for(int i=2; i<N; ++i) {
      c[i] += s * c[i-2];
    }
```
This code is run on a superscalar, out-of-order CPU core with the following properties:
- Floating-point ADD pipeline depth of 6 cy
- Floating-point MULT pipeline depth of 8 cy
- Capability of executing 1 ADD, 1 MULT, 1 LOAD, and 1 STORE instruction per cycle (no FMA)
- Overall instruction throughput limit of 4 instructions retired per cycle
- Register set of 16 floating-point registers and 16 integer registers
- No SIMD capability
We assume that the required data (i.e., a[], b[], c[]) resides in the L1 cache. We also assume N to be large enough so that wind-up/down effects can be ignored.
(a) Assuming the compiler compiles each loop separately and produces perfect code, calculate the expected performance of each loop in flops/cy. (20 credits)
(b) Which simple optimization could the compiler apply to improve the performance of the code? What optimal performance could then be achieved? (15 credits)
Square root in depth. Look again at the integration code from Assignment 0. If you do not have the code, please use the attached integrate.c file.
In the main loop, function values are accumulated into a summation variable.
(a) Describe the optimization the compiler must apply to achieve optimal performance for this loop. (10 credits)
(b) Compile the code with the icx compiler using the options -O1 -no-vec -xHost. This prevents the compiler from vectorizing the loop (i.e., no SIMD instructions are used). Run it with a loop length of N=10⁹ and measure the time (in cycles) per iteration. Make reasonable assumptions about the machine instructions the loop comprises: there has to be a SQRT, some MULTs and ADDs, maybe FMAs, a conversion from integer to floating point (for the loop counter), and of course the "loop mechanics" (increment, compare, conditional branch). Calculate the IPC (instructions per cycle) value while the loop is running. (30 credits)
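The arithmetic behind part (b) can be sketched with two small C helpers. This is a minimal sketch, not a prescribed method: the function names are hypothetical, the clock frequency `clock_hz` must be determined for the actual machine (it is an assumption that the core runs at a fixed, known frequency during the measurement), and the number of instructions per iteration is your own estimate from the instruction mix described above.

```c
#include <assert.h>

/* Convert a measured wall-clock time into cycles per loop iteration.
 * clock_hz is the core clock frequency during the run; it is an
 * assumption here and must be determined for the actual machine. */
double cycles_per_iteration(double seconds, double n_iter, double clock_hz)
{
    assert(seconds > 0.0 && n_iter > 0.0 && clock_hz > 0.0);
    return seconds * clock_hz / n_iter;
}

/* IPC = instructions executed per iteration / cycles per iteration.
 * instr_per_iter is an estimate supplied by the caller. */
double instructions_per_cycle(double instr_per_iter, double cy_per_iter)
{
    assert(instr_per_iter > 0.0 && cy_per_iter > 0.0);
    return instr_per_iter / cy_per_iter;
}
```

With purely illustrative numbers (not measurements), a run of 2.0 s for 10⁹ iterations at 2.0 GHz gives `cycles_per_iteration(2.0, 1e9, 2.0e9)` = 4.0 cycles, and an estimate of 8 instructions per iteration then gives an IPC of 2.0.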

integrate.c
4 May 2026, 8:51 AM