Assignment 2: Loop benchmarking, Hockney model
Completion requirements
Opened: Tuesday, 7 May 2024, 12:00 AM
Due: Thursday, 16 May 2024, 10:03 AM
- Loop kernel benchmarking
is a variant of microbenchmarking - running simple stuff that lets you
get insight into the inner workings of a machine. The first step in this
process is to get it right and to present the data correctly. This is
what this task is all about.
Write a benchmark program that measures the performance in MFlop/s of the following two loop kernels (loop over i is implied):
(a) (30 credits) STREAM triad:a[i] = b[i] + s * d[i]
(b) (30 credits) DAXPY:a[i] = s * b[i] + a[i]
a, b, d are double-precision arrays of length N. s is a double-precision scalar that you should define as a compile-time constant:
const double s = 1.00000000001;
Allocate memory for all arrays on the heap, i.e. , usingmalloc()
in C andnew
in C++. Do not forget to initialize all data elements with valid floating-point (FP) numbers. Using calloc() is not sufficient - see this blog post for an explanation.
Use compiler options -O3 -xHost -fno-alias and run your code using the following vector lengths (in elements): N=int(1.5r), r = 8 ... 44. This will give you a decent resolution and equidistant points on the x axis (see below).
Perform the measurement on one core of the Fritz cluster for the given loop lengths (do not forget to set the clock frequency to 2.2 GHz). For reasons of accuracy, make sure that the runtime of each kernel with each vector/matrix size is larger than 0.1 seconds by repeating the computation kernel in an outer loop. Maybe it is a good idea to dynamically adjust the number of repetitions depending on the runtime of the kernel:
NITER=1; do { // time measurement wct_start = getTimeStamp(); // repeat measurement often enough for(k=0; k<NITER; ++k) { // This is the benchmark loop for(i=0; i<N; ++i) { // put loop body here: a[i] = ... } // end of benchmark loop if(a[N/2]<0.) printf("%lf",a[N/2]); // prevent compiler from eliminating loop } wct_end = getTimeStamp(); NITER = NITER*2; } while (wct_end-wct_start<0.1); // at least 100 ms NITER = NITER/2; printf("Total walltime: %f, NITER: %d\n",wct_end-wct_start,NITER);
Make sure that the operations in the kernel actually get executed - compilers are smart! (This is what the bogusif
statement is for: The compiler must not be able to determine the result of the condition at compile time. Example:if(a[N/2]<0.)
- if all arrays are initialized with positive numbers, this condition is never true.) Use the standard compiler options-O3 -xHost -fno-alias
. A compilable skeleton code can be found in~ptfs100h/GettingStarted/scan_c.c
.
Use your favorite graphics program (e.g., gnuplot or xmgrace or LibreOffice Calc or ...) to generate plots of the performance in MFlop/s vs. N. Choose a logarithmic scale on the x axis (think about why this is advisable). Always let the y axis start at zero. If you don't, the graph may be misleading and you will collect some very bad karma.
(c) (10 credits) For both cases (a and b above), calculate the number of bytes that are transferred over the memory bus per loop iteration if the array size N is large. - Latency and bandwidth. Consider a CPU with an asymptotic memory bandwidth of \( b_S=150\,\mathrm{GB/s} \) and a memory latency of \( T_\ell=120\,\mathrm{ns} \). Assume that the duration of data transfer follows the Hockney Model as described in the lecture.
(a) (10 credits) Calculate the time of data transfer and the effective memory bandwidth for a message of size 4096 byte.
(b) (10 credits) For general values of \( T_\ell \) and \( b_S \), calculate the message size \( N_{1/2}(T_\ell,b_S) \) at which \( B_\mathrm{eff}(N_{1/2})=b_S/2 \), i.e., at which the effective bandwidth is half the asymptotic bandwidth.
(c) (10 credits) At a cache line size of 128 byte, calculate how many outstanding prefetches are needed to fully hide the memory latency and how much data (in bytes) must thus be kept "in flight."