Assignment 1: Code execution

Opened: Tuesday, 6 May 2025, 12:00 AM
Due: Thursday, 15 May 2025, 10:05 AM

  1. Pipelines. (20 credits) Assume a CPU core with a floating-point multiply pipeline that has a depth (latency) of 5 cycles and can deliver at most 1 result per cycle. Calculate the number of independent MULT instructions required to achieve half of the maximum throughput!
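
    Hint: for a pipeline of depth m (in cycles) that can accept one new, independent operation every cycle, N independent operations finish after roughly

        T(N) = m + N - 1   cycles,

    so the achieved throughput is p(N) = N / (m + N - 1) results/cy.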

  2. More pipelines. (30 credits) Consider the following code:

        double a[...];
        double s=0.1;
        // a[] contains sensible data
        for(int i=2; i<N; ++i) {
          a[i] += s * a[i-2];
        }

    This code is run on a superscalar, out-of-order CPU core with the following properties:

    • Floating-point ADD pipeline depth of 6 cy
    • Floating-point MULT pipeline depth of 8 cy
    • Capability of executing 1 ADD, 1 MULT, 1 LOAD, and 1 STORE instruction per cycle (no FMA)
    • Overall instruction throughput limit of 4 instructions retired per cycle
    • Register set of 16 floating-point registers and 16 integer registers
    • No SIMD capability

    We assume that the required data (i.e., a[]) resides in the L1 cache and that N is large enough for wind-up/wind-down effects to be negligible. Assuming the compiler produces perfect code, calculate the expected performance of the loop in flops/cy. (A sketch of the loop's dependency structure is shown below.)
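
    One way to make the dependency structure visible (a sketch only; it assumes N is even and performs exactly the same arithmetic as the loop above): unrolling by two separates the updates over even and odd indices into two independent chains.

        // Sketch: same computation, unrolled by two (assumes N is even).
        // Even-index and odd-index updates form two independent dependency chains.
        for(int i=2; i<N; i+=2) {
          a[i]   += s * a[i-2];   // even chain: waits for a[i-2], written two iterations earlier
          a[i+1] += s * a[i-1];   // odd chain:  waits for a[i-1], written two iterations earlier
        }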

  3. Logarithm in depth. (20 credits) Look again at the integration code from Assignment 0. In the main loop, function values are accumulated into a summation variable. Assuming that the divide instruction is the sole bottleneck in this calculation, and that it has a latency of 6 cycles and a throughput of 0.25 ins/cy, describe the optimization the compiler must apply to achieve optimal performance for this loop!
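
    For reference, a generic sketch of the loop shape in question (placeholder names and integrand; this is not the actual Assignment 0 code): each iteration performs one divide and accumulates the result into a single summation variable.

        // Generic sketch only (placeholder interval bounds a0, b0).
        double a0 = 1.0, b0 = 2.0;          // placeholder integration interval
        double dx = (b0 - a0) / N;          // N subintervals
        double sum = 0.0;
        for(int i=0; i<N; ++i) {
          double x = a0 + (i + 0.5) * dx;   // midpoint of the i-th subinterval
          sum += dx / x;                    // one divide per iteration feeds the single accumulator
        }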

  4. Memory bandwidth. A 36-core CPU in the Ice Lake part of the Fritz cluster has an achievable memory bandwidth of about 160 Gbyte/s and runs at a base clock frequency of 2.4 GHz.

    (a) (10 credits) Calculate how many bytes the memory interface can deliver per cycle and per core!
    (b) (20 credits) Assuming that each core has two double-precision ADD pipelines with a throughput of one ADD operation per cycle each, estimate the fraction of the ADD peak performance that can be achieved if every ADD operand must be loaded from main memory and the results must be stored back to main memory, as in the following loop:

    double x[], y[], a[], b[], c[], d[];
    for(int i=0; i<N; i++) {
      y[i] = a[i] + b[i];
      x[i] = c[i] + d[i];
    }
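
    Hint on the approach: for (a), dividing the bandwidth by the clock frequency gives the bytes delivered per cycle for the whole chip; dividing by the number of cores then gives the value per cycle and per core,

        bytes/(cy * core) = (bandwidth in byte/s) / (clock frequency in cy/s) / (number of cores).

    For (b), compare the data traffic the loop generates at peak ADD throughput with what the memory interface can deliver: with both ADD pipelines busy, 2 additions complete per cycle, and each addition in this loop loads two 8-byte operands and stores one 8-byte result.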



