Assignment 7: More OpenMP, Performance Modeling
Opened: Wednesday, 19 June 2024, 12:00 AM
Due: Thursday, 27 June 2024, 10:03 AM
- Weird π.
This problem was motivated by a student's question in the forum.
In our OpenMP-parallel "π by integration" program, a popular mistake is to forget to privatize the temporary variable x in the loop kernel:
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<N; i++) {
    x = (i + 0.5) * delta_x;
    sum = sum + 4.0 * sqrt(1.0 - x * x);
}

Interestingly, depending on the compilation flags we still get the correct value for π, but also wildly fluctuating parallel performance numbers. We want to investigate this effect using hardware performance counters. For this you have to submit your job on Fritz with the

-C hwperf

flag so you get access to the counters. You also need to load the likwid module. A working example code is available at ~ptfs100h/GettingStarted/WeirdPi.c
Always run with a fixed clock frequency of 2.0 GHz.
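For reference, a minimal sketch of the privatized version of the kernel. The function wrapper and the normalization by delta_x are assumptions about how WeirdPi.c is organized; the only relevant change compared to the snippet above is that x is declared inside the loop and is therefore private to each thread:

#include <math.h>

/* Sketch: same integration kernel as above, but x is declared inside the
 * loop body and hence thread-private. N, delta_x, and sum follow the
 * naming of the snippet above; the surrounding function is assumed. */
double pi_integrate(long N)
{
    double delta_x = 1.0 / N;
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++) {
        double x = (i + 0.5) * delta_x;   /* private by construction */
        sum += 4.0 * sqrt(1.0 - x * x);
    }
    return sum * delta_x;                 /* approximates pi */
}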
(a) (5 crd) Compile the example code with the flags -O1 -no-vec -qopenmp (scalar code) and execute it with 1, 2, 3, and 4 threads on the first cores of a socket of Fritz. Repeat the experiment 10 times. Report the result for π and the performance of the code in Gflop/s with minimum, maximum, and average values across runs.
(b) (5 crd) Compile the code again, but now with the options -O3 -xAVX -qopenmp. What is the difference (apart from the obviously better performance) compared to the experiment in (a)?
(c) (10 crd) Run the two binaries with likwid-perfctr as follows:
srun --cpu-freq=2000000-2000000 likwid-perfctr -g DATA -C 0 ./a.out
The DATA group counts, among other things, the LOAD and STORE instructions executed by the core. How many LOADs and STOREs are executed per iteration in both cases? (You may use the marker API to restrict the counting to the actual numerical loop, but it will not make much of a difference in this case.)
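If you use the marker API, a minimal sketch of the instrumentation follows; the region name "pi_loop" is arbitrary, and depending on the installed LIKWID version the macros are provided by likwid-marker.h or likwid.h:

#include <likwid-marker.h>   /* compile with -DLIKWID_PERFMON, link with -llikwid;
                                older LIKWID versions provide the macros in likwid.h */

/* ... around the numerical loop in main() ... */
LIKWID_MARKER_INIT;                    /* once, before the measured region */
#pragma omp parallel
{
    LIKWID_MARKER_START("pi_loop");    /* region name is arbitrary */
    #pragma omp for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        x = (i + 0.5) * delta_x;
        sum = sum + 4.0 * sqrt(1.0 - x * x);
    }
    LIKWID_MARKER_STOP("pi_loop");
}
LIKWID_MARKER_CLOSE;                   /* once, at the end of the program */

When using markers, add the -m switch to the measurement command (likwid-perfctr -m -g DATA -C 0 ./a.out) so that only the marked region is evaluated.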
(d) (10 crd) From your observations in (c), what is your hypothesis about why the performance of the scalar (and less optimized) code is sometimes so much worse in the parallel case? Also discuss a possible reason why the result for π still appears to be correct even though the code is obviously wrong.
- Resource-driven modeling.
There are two fundamental bottlenecks that govern the runtime of a loop on a CPU chip: memory data transfer and "work" execution. Based on this insight, we constructed the Roofline model in the lecture. One important assumption of the Roofline model was that the execution of instructions and the data transfer overlap in time. Here we want to drop this assumption and see what happens.
Assume that we have a CPU with a peak performance of 4.5 Tflop/s and a memory bandwidth of 300 Gbyte/s.
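As a reminder from the lecture, with computational intensity I the Roofline limit (assuming full overlap of data transfer and execution) reads

P_{\mathrm{Roofline}}(I) = \min\!\left(P_{\mathrm{peak}},\; I \cdot b_S\right),
\qquad P_{\mathrm{peak}} = 4.5\ \mathrm{Tflop/s},\quad b_S = 300\ \mathrm{Gbyte/s}.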
(a) (5 crd) In the Roofline plot for this CPU, what is the "critical intensity," i.e., at which intensity is the "knee" of the Roofline curve located? Make a plot of the Roofline limit, covering an intensity range of 0.01 to 1000 flop/byte.
(b) (10 crd) Now we assume that instruction execution and data transfer cannot overlap, i.e., the CPU either executes "work" or it transfers data. What is the upper performance limit as a function of computational intensity in this case? Draw this function into the same diagram as above.
Hint: Start by calculating the time that the execution of the flops takes (at peak performance) and the time that the data transfers take (at peak memory bandwidth). The sum of both is the overall execution time. You can take it from there.
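A sketch of how the hint can be formalized, with n_f the number of flops, V the transferred data volume, and I = n_f / V the computational intensity:

T = \frac{n_f}{P_{\mathrm{peak}}} + \frac{V}{b_S}
\quad\Longrightarrow\quad
P(I) = \frac{n_f}{T} = \left(\frac{1}{P_{\mathrm{peak}}} + \frac{1}{I\, b_S}\right)^{-1}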
(c) (5 crd) At which intensity do the two models differ most? By which factor? Is this ratio expected, and why?
- Parallel ray tracer.
Parallelize a simple ray tracer code using OpenMP. You can find the C source in ~ptfs100h/GettingStarted/raytrace_c.c. Use the standard compilation options -O3 -xHost -qopenmp.
The picture is divided into so-called tiles (see figure). The function responsible for calculating a tile is called calc_tile(). Important parameters are size (the number of pixels in one direction) and tilesize (the number of pixels of one tile in one direction), both hard-coded at the start of main(). The program prints the picture in PNM format; you can use it to check for correctness and display the image with the geeqie tool that is installed on csnhr.
(a) (20 crd) Parallelize the code with OpenMP (include the code with your submission!) and draw a diagram of performance (in Mpixels/s) vs. number of cores on up to 72 physical cores of a Fritz node, using a picture size of 15000x15000 pixels and a tile size of 5000x5000 pixels. When parallelizing the code, make sure that it really computes the image correctly! Set the clock speed to 2.4 GHz. You can set the picture size and the tile size by command line arguments:
./a.out 15000 5000
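A possible starting point for (a) is to parallelize the loop nest over tiles. The following is only a sketch under the assumption that main() contains a double loop over tiles calling calc_tile(); the actual loop structure and the signature of calc_tile() in raytrace_c.c may differ:

/* Hypothetical driver loop over tiles; calc_tile() and its signature are
 * assumed for illustration. Each tile writes only its own pixels, so the
 * tiles are independent work units that can be distributed over threads. */
int ntiles = size / tilesize;          /* tiles per direction */
#pragma omp parallel for
for (int ty = 0; ty < ntiles; ty++) {
    for (int tx = 0; tx < ntiles; tx++) {
        calc_tile(tx, ty);
    }
}

Note that with the given parameters there are only 15000/5000 = 3 tiles per direction, i.e., 9 tiles in total, which limits how far such a simple tile-level parallelization can scale (see part (b)).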
(b) (20 crd) Does your code scale up to 72 cores, i.e., does it run 72 times faster than in sequential mode? If it does not, improve the scaling! How did you accomplish this? You have to keep the image size constant at 15000x15000 pixels, but you are free to make any other changes to the code and its parameters. (Hint: Depending on how you did the parallelization, you may want to look up the collapse clause in OpenMP. Also remember that not all pixels are equal.) Document all your settings so that your experiment can be reproduced.
(c) (10 crd) Estimate the necessary data transfer to/from memory and calculate how much time this takes at the maximum memory bandwidth of a Fritz socket of about 160 Gbyte/s. Based on this calculation and your measurements, is the data transfer to/from memory a bottleneck for this code?
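A sketch of the estimate for (c); the number of bytes written per pixel, b_px, depends on how the image is stored and output in raytrace_c.c and is therefore left as a placeholder:

V \approx 15000 \times 15000 \times b_{\mathrm{px}}\ \mathrm{bytes},
\qquad
T_{\mathrm{mem}} \approx \frac{V}{160\ \mathrm{Gbyte/s}}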