PTfS25: Assignment 8: Resource-driven modeling, code optimization, raytracer

Opened: Monday, 23 June 2025, 12:00 AM

Due: Thursday, 3 July 2025, 2:15 PM

Resource-driven modeling. There are two fundamental bottlenecks that govern the runtime of a loop on a CPU chip: memory data transfer and instruction execution. Based on this insight, we can construct a simple model for the runtime of a loop.
Assume a loop of length N which, per iteration, requires a memory data transfer volume of V (in bytes) and performs W instructions. The CPU has a memory bandwidth of b_S (in bytes/s) and a peak execution capability of p_E (in instructions per cycle). The clock frequency is f. Assume that memory transfers and instruction execution are the only relevant resources.

(a) (5 crd) Using the information from above, calculate the expected execution time of the loop in cycles per iteration, assuming that code execution and data transfer cannot overlap.
(b) (5 crd) How does your answer to (a) change if you assume that there is full overlap between execution and data transfer?
(c) (10 crd) Construct a model for execution performance (instructions per second) from each of the cases (a) and (b).
Optimizing loop code. Look at the following loop code.
```
int N = 16384;
double mat[N][N],s[N][N];
// ...
srand(1);  // random seed
for(i=0; i<N ; ++i) { 
   val = (double)rand(); // random number
   for(j=0; j<N; ++j) {
      mat[j][i] = s[j][i]/val;
   }
 }
```
(a) (15 credits) What is wrong with this code (in terms of performance)? Suggest optimizations that will improve the performance (but will leave the numerical results intact, of course)! Which of them do you think may be employed by a compiler automatically?
(b) (15 credits) Construct a Roofline model for your optimized code for one socket of the Fritz cluster! You know the memory bandwidth and that a divide operation takes 2 cycles. According to your model, what is the expected upper performance limit and what is the expected bottleneck?
Parallel ray tracer.
Parallelize a simple ray tracer code using OpenMP. You can ﬁnd the C source in ~ptfs100h/GettingStarted/raytrace_c.c. Use the standard compilation options -O3 -xHost -qopenmp.

The picture is divided into so-called tiles (see ﬁgure). The function which is responsible for calculating a tile is called calc_tile(). Important parameters are size (the number of pixels in one direction) and tilesize (the number of pixels of one tile in one direction), both hard-coded at the start of main(). The program prints the picture in PNM format, you can use it to check for correctness and display the image with the geeqie tool that is installed on csnhr.

(a) (20 crd) Parallelize the code with OpenMP (include the code with your submission!) and draw a diagram of performance (in Mpixels/s) vs. number of cores on up to 72 physical cores of a Fritz node, using a picture size of 16000x16000 pixels and a tile size of 4000x4000 pixels. When parallelizing the code, make sure that it really computes the image correctly! Set the clock speed to 2.4 GHz. You can set the picture size and the tile size by command line arguments:
```
./a.out <size> <tilesize>
```
(b) (20 crd) Does your code scale up to 72 cores, i.e., does it run 72 times faster than in sequential mode? If it does not, improve the scaling! How did you accomplish this? You have to keep the image size constant at 16000x16000 pixels, but you are free to make any other changes to the code and its parameters. (Hint. Depending on how you did the parallelization, you may want to look up the collapse clause in OpenMP. Also remember that not all pixels are equal.) Document all your settings so that your experiment can be reproduced.

(c) (10 crd) One pixel of the image takes exactly 1 byte in memory. Estimate the necessary data transfer to/from memory and calculate how much time this takes at the maximum memory bandwidth of a Fritz socket of about 160 Gbyte/s. Based on this calculation and your measurements, is the data transfer to/from memory a bottleneck for this code?