NHR Learning Platform
PTfS26

Assignment 2: Loop kernels galore!

Opened: Tuesday, 5 May 2026, 12:00 PM
Due: Wednesday, 13 May 2026, 10:05 AM
  1. Peak performance. Floating-point computation capability is an important resource of a processor. The upcoming "FireBS" chip by IntVidia has the following properties:

    • Clock frequency of 2.05 GHz
    • 126 cores
    • The new, still undisclosed "ShmAVX-1024" instruction set (1024-bit wide SIMD registers); each core is capable of retiring one full-width double-precision FMA instruction per cycle, but the clock speed is reduced by 10% from the base value while a loop with such "hot code" is executed.

      (a) (10 credits) Calculate the theoretical double-precision floating-point peak performance of the chip in TFlop/s. You may assume that the FMA SIMD units are fully parallel, i.e., the computations across the SIMD lanes are done concurrently.
      (b) (20 credits) Each core also has a Load/Store unit, which can execute one full-width SIMD Load or one full-width SIMD Store per cycle. Consider the following code:

              double a[...],b[...];
              double s=0.0, t=1.234;
              // a[] and b[] contain sensible data
              for(int i=0; i<N; ++i) {
                 s += a[i]*a[i];
                 b[i] *= t;
              }
              
      Calculate the applicable peak performance Pmax in Gflop/s for this code on all cores of the chip.

  2. Loop kernel benchmarking is a variant of microbenchmarking - running simple code that gives you insight into the inner workings of a machine. The first step in this process is to get the measurement right and to present the data correctly. This is what this task is all about.

    Write a benchmark program (40 credits) that measures the performance in MFlop/s of the following two loop kernels (loop over i is implied):

    (a) Accumulate: a[i] = a[i] + b[i]
    (b) Update:  a[i] = s * a[i] + t

    a, b are double-precision arrays of length N. s and t are double-precision scalars that you should define as compile-time constants:

    const double s = 1.00000000001, t = 1.00000000001; 

    Allocate memory for all arrays on the heap, i.e., using malloc() in C or new in C++. Do not forget to initialize all data elements with valid floating-point (FP) numbers. Using calloc() is not sufficient - see this blog post for an explanation.

    Use the compiler options -O3 -xHost -fno-alias and run your code using the following vector lengths (in elements): N = int(1.5^r), r = 8, ..., 44. This will give you a decent resolution and equidistant points on a logarithmic x axis (see below).

    Perform the measurement on one core of the Fritz cluster for the given loop lengths (do not forget to set the clock frequency to 2.4 GHz). For reasons of accuracy, make sure that the runtime of each kernel at each vector length is larger than 0.1 seconds by repeating the computation kernel in an outer loop. It is a good idea to adjust the number of repetitions dynamically, depending on the runtime of the kernel:

                NITER=1;
                do {
                  // time measurement
                  wct_start = getTimeStamp();

                  // repeat measurement often enough
                  for(k=0; k<NITER; ++k) {
                    // This is the benchmark loop
                    for(i=0; i<N; ++i) {
                      // put loop body here: a[i] = ...
                    }
                    // end of benchmark loop
                    if(a[N/2]<0.) printf("%lf",a[N/2]); // prevent compiler from eliminating loop
                  }
                  wct_end = getTimeStamp();

                  if(wct_end-wct_start>0.1) break; // at least 100 ms
                  NITER = NITER*2;
                } while (1);
                printf("Total walltime: %f, NITER: %d\n",wct_end-wct_start,NITER);

    Make sure that the operations in the kernel actually get executed - compilers are smart! (This is what the bogus if statement is for: It tells the compiler that we "need" the result of the computation, but the compiler must not be able to determine the result of the condition at compile time. Example: if(a[N/2]<0.) - if all arrays are initialized with positive numbers, this condition is never true.) Use the standard compiler options -O3 -xHost -fno-alias. A compilable skeleton code can be found in ~ptfs100h/GettingStarted/scan_c.c.

    (c) (10 credits) Use your favorite graphics program (e.g., gnuplot, xmgrace, LibreOffice Calc, ...) to generate plots of the performance in MFlop/s vs. N. Choose a logarithmic scale on the x axis (think about why this is advisable). Always let the y axis start at zero; if you don't, the graph may be misleading and you will collect some very bad karma.

    (d) (20 credits) Explain the observable differences between the two graphs. 

