
Assignment 1: Code execution

Opened: Wednesday, 29 April 2026, 12:00 PM
Due: Thursday, 7 May 2026, 10:05 AM

  1. Pipelines. (25 credits) Assume a CPU core with two floating-point add pipelines, each with a depth (latency) of 4 cycles and each able to deliver at most 1 result per cycle. Calculate the number of independent ADD instructions required to achieve 90% of the maximum possible throughput.
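A possible starting point for this task is the simple pipeline model in which a single pipeline delivers its first result after `depth` cycles and one further result per cycle afterwards. The sketch below encodes only that assumption for one pipeline; the function names `pipeline_cycles` and `pipeline_efficiency` are hypothetical, and how the two pipelines of the task share the instruction stream is deliberately left open.

```c
#include <assert.h>

/* Cycles needed to push n independent operations through ONE pipeline
 * of the given depth. Assumed model (not stated in the assignment):
 * the first result appears after `depth` cycles, then one per cycle. */
double pipeline_cycles(int depth, int n)
{
    assert(depth > 0 && n > 0);
    return (double)(depth + n - 1);
}

/* Fraction of the peak throughput (1 result per cycle) that n
 * independent operations reach on one pipeline under that model. */
double pipeline_efficiency(int depth, int n)
{
    return (double)n / pipeline_cycles(depth, n);
}
```

Calling `pipeline_efficiency(depth, n)` for increasing `n` shows how the wind-up phase is amortized as more independent instructions become available.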

  2. More pipelines. Consider the following code:

        double a[...], b[...], c[...];
        double s=0.1;
        // a[], b[], c[] contain sensible data
        for(int i=2; i<N; ++i) {
            a[i] += s * a[i-2];
        }
        for(int i=2; i<N; ++i) {
            b[i] += s * b[i-2];
        }
        for(int i=2; i<N; ++i) {
            c[i] += s * c[i-2];
        }
     

    This code is run on a superscalar, out-of-order CPU core with the following properties:

    • Floating-point ADD pipeline depth of 6 cy
    • Floating-point MULT pipeline depth of 8 cy
    • Capability of executing 1 ADD, 1 MULT, 1 LOAD, and 1 STORE instruction per cycle (no FMA)
    • Overall instruction throughput limit of 4 instructions retired per cycle
    • Register set of 16 floating-point registers and 16 integer registers
    • No SIMD capability

    We assume that the required data (i.e., a[], b[], c[]) resides in the L1 cache. We also assume N to be large enough so that wind-up/down effects can be ignored.
    (a) Assuming the compiler compiles each loop separately and produces perfect code, calculate the expected performance of each loop in flops/cy. (20 credits)
    (b) Which simple optimization could the compiler apply to improve the performance of the code? What is the optimal performance that could be achieved? (15 credits)
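For experimenting with the kernel from this task, the loop body can be wrapped in a small function that is called once per array. This is only a minimal sketch: the helper name `recurrence` and the use of `size_t` for the index are assumptions and not part of the assignment.

```c
#include <assert.h>
#include <stddef.h>

/* Applies x[i] += s * x[i-2] for i = 2 .. n-1, i.e. the same update
 * as each of the three loops in the task. Every element depends on the
 * element two positions earlier, which was itself just updated. */
void recurrence(double *x, size_t n, double s)
{
    assert(x != NULL);
    for (size_t i = 2; i < n; ++i) {
        x[i] += s * x[i - 2];
    }
}
```

Calling `recurrence(a, N, s)`, `recurrence(b, N, s)` and `recurrence(c, N, s)` one after the other corresponds to the three separate loops as written in the task.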

  3. Square root in depth. Look again at the integration code from Assignment 0. If you do not have the code, please use the attached integrate.c file.
    In the main loop, function values are accumulated into a summation variable.
    (a) Describe the optimization the compiler must apply to achieve optimal performance for this loop. (10 credits)
    (b) Compile the code with the icx compiler options -O1 -no-vec -xHost. This prevents the compiler from vectorizing the loop (i.e., no SIMD instructions are used). Run it with a loop length of N=10^9 and measure the time (in cycles) per iteration. Make reasonable assumptions about the machine instructions the loop comprises; e.g., there have to be a SQRT, some MULTs and ADDs, maybe FMAs, a conversion from integer to floating point (for the loop counter), and of course the "loop mechanics" (increment, compare, conditional branch). Calculate the IPC (instructions per cycle) value while the loop is running. (30 credits)
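The conversion from a measured wall-clock time to cycles per iteration, and from there to IPC, can be sketched as follows. The function names `cycles_per_iteration` and `ipc` are hypothetical. The clock frequency has to be the one the core actually ran at during the measurement, and the number of instructions per iteration is the estimate asked for in the task; neither value is given in the assignment.

```c
#include <assert.h>

/* Converts a measured runtime into cycles per loop iteration.
 * seconds:  wall-clock time of the whole loop
 * clock_hz: assumed core clock frequency in Hz during the run
 * n:        number of loop iterations */
double cycles_per_iteration(double seconds, double clock_hz, double n)
{
    assert(seconds > 0.0 && clock_hz > 0.0 && n > 0.0);
    return seconds * clock_hz / n;
}

/* Instructions per cycle, from an estimated instruction count per
 * iteration and the measured cycles per iteration. */
double ipc(double instr_per_iter, double cy_per_iter)
{
    assert(cy_per_iter > 0.0);
    return instr_per_iter / cy_per_iter;
}
```

If the core frequency varies during the run (e.g., because of turbo modes), the cycle count derived this way is only as accurate as the frequency assumed.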

  • integrate.c
    4 May 2026, 8:51 AM