NHR Learning Platform

Assignment 2: More code execution, loop benchmarking

Opened: Tuesday, 13 May 2025, 12:00 AM
Due: Thursday, 22 May 2025, 10:05 AM
  1. Peak performance. Floating-point compute capability is an important resource of a processor. Assume a CPU chip with the following properties:

    • Clock frequency of 2.0 GHz
    • 52 cores
    • AVX-512 instruction set (512-bit wide SIMD registers); each core is capable of retiring two full-width double-precision FMA instructions per cycle

    (10 credits) Calculate the theoretical double-precision floating-point peak performance of the chip in Tflop/s. You may assume that the FMA SIMD units are fully parallel, i.e., the computations across the SIMD lanes are done concurrently.

  2. Pipelines. Consider the following code:
     
        double a[...], b[...];
        double s = 0.0, t = 1.234;
        // a[] and b[] contain sensible data
        for(int i=0; i<N; ++i) {
            s += a[i]*a[i];
            b[i] *= t;
        }

    This code is run on a superscalar, out-of-order CPU core with the following properties:

    • Floating-point ADD pipeline depth of 5 cy
    • Floating-point MULT pipeline depth of 8 cy
    • Capability of executing 1 ADD, 1 MULT, 1 LOAD, and 1 STORE instruction per cycle (no FMA)
    • Overall instruction throughput limit of 4 instructions retired per cycle
    • Register set of 16 floating-point registers and 16 integer registers
    • No SIMD capability

    We assume that the required data (i.e., arrays a[] and b[]) resides in the L1 cache.
    Answer the following questions:

    (a) (15 credits) Assuming the compiler does not know about Modulo Variable Expansion (MVE) but otherwise produces perfect code, what is the hardware bottleneck that limits the performance of the code? Calculate the expected performance in flops/cy. You can ignore the effect of the "loop mechanics" (increment, compare, branch).
    (b) (15 credits) Now assume that the compiler can employ "perfect" MVE. How much MVE unrolling must be applied at least to get to optimal performance? What is the hardware bottleneck for the "best possible" code? Calculate the best possible performance in flops/cy.
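    For reference, Modulo Variable Expansion breaks the loop-carried dependence on s by accumulating into several independent partial sums, so that multiple adds can be in flight in the ADD pipeline at once. A two-way sketch of the summation part (the factor 2 is for illustration only; part (b) asks you to determine the factor actually required):

```c
/* Two-way MVE of the reduction: independent accumulators s0, s1
 * remove the serial add-to-add dependence of the original loop. */
double sum_of_squares_mve2(const double *a, int N) {
    double s0 = 0.0, s1 = 0.0;
    int i;
    for (i = 0; i < N - 1; i += 2) {
        s0 += a[i]   * a[i];
        s1 += a[i+1] * a[i+1];
    }
    for (; i < N; ++i)          /* remainder loop for odd N */
        s0 += a[i] * a[i];
    return s0 + s1;             /* combine partial sums at the end */
}
```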

  3. Loop kernel benchmarking is a variant of microbenchmarking - running simple, well-understood code that gives insight into the inner workings of a machine. The first step in this process is to get the measurement right and to present the data correctly. This is what this task is all about.

    Write a benchmark program that measures the performance in MFlop/s of the following two loop kernels (loop over i is implied):

    (a) (25 credits) Schönauer triad: a[i] = b[i] + c[i] * d[i]
    (b) (25 credits) DAXPY:  a[i] = s * b[i] + a[i]

    a, b, c, d are double-precision arrays of length N. s is a double-precision scalar that you should define as a compile-time constant:

    const double s = 1.00000000001;

    Allocate memory for all arrays on the heap, i.e., using malloc() in C or new in C++. Do not forget to initialize all data elements with valid floating-point (FP) numbers. Using calloc() is not sufficient - see this blog post for an explanation.

    Use compiler options -O3 -xHost -fno-alias and run your code using the following vector lengths (in elements): N = (int)(1.5^r), r = 8 ... 44. This will give you a decent resolution and equidistant points on a logarithmic x axis (see below).

    Perform the measurement on one core of the Fritz cluster for the given loop lengths (do not forget to set the clock frequency to 2.2 GHz). For reasons of accuracy, make sure that the runtime of each kernel at each vector length is larger than 0.1 seconds by repeating the computation kernel in an outer loop. It is a good idea to dynamically adjust the number of repetitions depending on the runtime of the kernel:

                NITER = 1;
                do {
                  // time measurement
                  wct_start = getTimeStamp();

                  // repeat measurement often enough
                  for(k=0; k<NITER; ++k) {
                    // This is the benchmark loop
                    for(i=0; i<N; ++i) {
                      // put loop body here: a[i] = ...
                    }
                    // end of benchmark loop
                    if(a[N/2]<0.) printf("%lf",a[N/2]); // prevent compiler from eliminating loop
                  }
                  wct_end = getTimeStamp();

                  if(wct_end-wct_start>0.1) break; // at least 100 ms
                  NITER = NITER*2;
                } while (1);
                printf("Total walltime: %f, NITER: %d\n", wct_end-wct_start, NITER);

    Make sure that the operations in the kernel actually get executed - compilers are smart! (This is what the bogus if statement is for: The compiler must not be able to determine the result of the condition at compile time. Example: if(a[N/2]<0.) - if all arrays are initialized with positive numbers, this condition is never true.) Use the standard compiler options -O3 -xHost -fno-alias. A compilable skeleton code can be found in ~ptfs100h/GettingStarted/scan_c.c.
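    Putting the pieces together, a minimal self-contained sketch of the triad measurement might look as follows. Note the assumptions: getTimeStamp() is implemented here via POSIX clock_gettime(), and the array initialization values are arbitrary valid FP numbers; the course skeleton in scan_c.c provides its own versions of these pieces.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Assumed timer: wall-clock seconds via POSIX clock_gettime() */
static double getTimeStamp(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Benchmark the Schoenauer triad a[i] = b[i] + c[i]*d[i]
 * for vector length N and return the performance in MFlop/s. */
double bench_triad(int N) {
    double *a = malloc(N * sizeof *a);   /* heap allocation */
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    double *d = malloc(N * sizeof *d);
    for (int i = 0; i < N; ++i) {        /* valid FP initialization */
        a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; d[i] = 0.5;
    }

    double wct_start, wct_end;
    int NITER = 1;
    do {
        wct_start = getTimeStamp();
        for (int k = 0; k < NITER; ++k) {
            for (int i = 0; i < N; ++i)
                a[i] = b[i] + c[i] * d[i];
            /* prevent the compiler from eliminating the loop */
            if (a[N/2] < 0.) printf("%lf", a[N/2]);
        }
        wct_end = getTimeStamp();
        if (wct_end - wct_start > 0.1) break; /* at least 100 ms */
        NITER *= 2;
    } while (1);

    /* triad: 2 flops (one add, one mult) per inner iteration */
    double mflops = 2.0 * (double)N * (double)NITER
                    / (wct_end - wct_start) / 1e6;
    free(a); free(b); free(c); free(d);
    return mflops;
}
```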

    Use your favorite graphics program (e.g., gnuplot or xmgrace or LibreOffice Calc or ...) to generate plots of the performance in MFlop/s vs. N. Choose a logarithmic scale on the x axis (think about why this is advisable). Always let the y axis start at zero; if you don't, the graph may be misleading and you will collect some very bad karma.

    Note: You know what these graphs should look like from the lecture. Explain if the general shape is what you expected (10 credits).

