Assignment 9: Slow computing, more OpenMP, histogram fun
Opened: Wednesday, 3 July 2024, 12:00 AM
Due: Monday, 15 July 2024, 10:03 AM
- Slow computing.
It is often stated (sometimes humorously) that all you have to do to make your program scale is to compile it with the "-O0" optimization flag. Substantiate this claim using a speedup model based on Amdahl's Law.
Assume that Amdahl's Law holds for a given code with serial execution time \(T(1)=T_s+T_p\), where \(T_s\) is the serial runtime of the non-parallelizable part and \(T_p\) is the serial runtime of the perfectly parallelizable part. In the parallel case with \(N\) workers, an additional runtime overhead of \(T_c(N)\) applies.
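Throughout, dimensionless quantities are obtained by normalizing runtimes to the serial runtime \(T(1)\); one common convention (an editorial note, not prescribed by the assignment) is
\[ s=\frac{T_s}{T(1)}, \qquad p=\frac{T_p}{T(1)}=1-s, \qquad c(N)=\frac{T_c(N)}{T(1)} . \]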
(a) (5 crd) Formulate a model for the strong-scaling speedup function \(S(N)\) in terms of the dimensionless quantities \(s\), \(p\), and \(c(N)\), denoting the serial fraction, the parallel fraction, and the dimensionless communication overhead, respectively.
(b) (20 crd) Now assume that everything that pertains to code execution is slowed down by a given factor \(\sigma<1\), i.e., for \(\sigma=0.5\) the code executes at half the original performance. The communication time, however, stays unchanged. This mimics the "slow program" scenario, the use of slower CPUs, or a clock frequency reduction. The speedup is now a function of \(\sigma\), i.e., \(S(N,\sigma)\). Explain whether the statement from the beginning is justified, i.e., can you improve the scaling of your code by making it run slower? What is the only prerequisite for that (assuming that Amdahl's Law with communication holds)? Is speedup a good metric for assessing the quality of a parallel code?
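Hint (a sketch under one consistent reading, not prescribed by the assignment text): only code execution is stretched by the slowdown, so the serial baseline becomes \(T(1,\sigma)=(T_s+T_p)/\sigma\), while \(T_c(N)\) keeps its original value; the speedup \(S(N,\sigma)\) is then taken relative to this slowed-down serial runtime.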
(c) (15 crd) Now assume a specific case of \(s=0\) (no serial fraction) and \(c(N)=\alpha N\), which describes a typical barrier synchronization overhead if the barrier is not implemented optimally. \(\alpha\) is a dimensionless constant. The single-worker performance (without slowdown) is \(P(1)\). Calculate how \(P(N,\sigma)\) scales for \(N\gg 1\). How does it depend on \(\sigma\)? Is this an expected result? Elaborate!
- (10 crd) OpenMP fun.
There is a correctness problem with the following OpenMP parallelization. Describe it and fix it.
double x[100],y[100],tmp;
int i;
double work1(int);
// initialization code etc. omitted
#pragma omp parallel
{
#pragma omp for
  for(i=0; i<100; i++) {
    tmp = work1(i);
    x[i] = x[i] + tmp;
  }
#pragma omp for
  for(i=1; i<100; i++) {
    y[i] = x[i-1]*y[i];
  }
}
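For reference, a minimal sketch of one possible correction (a hint only; other valid variants exist, e.g. declaring the temporary inside the loop body) is to make tmp private to each thread in the first loop:

#pragma omp for private(tmp)
  for(i=0; i<100; i++) {
    tmp = work1(i);      // tmp is now thread-private, so concurrently running iterations no longer interfere
    x[i] = x[i] + tmp;
  }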
- Parallel histogram computation.
The following code fragment computes a 16-bin histogram from the least significant four bits of the values returned by the standard rand_r() random number generator:
unsigned int seed = 1;
long hist[16];
for(int i=0; i<16; ++i)
  hist[i]=0;

wcstart = getTimeStamp();
for(long i=0; i<2000000000; ++i) {
  hist[rand_r(&seed) & 15]++;
}
wcend = getTimeStamp();

for(int i=0; i<16; ++i) {
  cout << "hist[" << i << "]=" << hist[i] << endl;
}
cout << "Time: " << wcend-wcstart << " sec" << endl;
(a) (20 crd) Parallelize the histogram calculation correctly using OpenMP. You can find serial example code in the ~ptfs100h/GettingStarted/HISTO folder.
Hint: Try different solutions. Think simple first.
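For orientation, a minimal sketch of one straightforward approach (an illustration under assumptions, not the reference solution): each thread works with its own seed and a private histogram that is merged at the end. Note that per-thread seeds change the random number stream, so the individual bin counts will differ slightly from the serial run.

#include <omp.h>
// hist[16], wcstart/wcend, and getTimeStamp() as in the serial fragment above

#pragma omp parallel
{
  unsigned int seed = 1 + omp_get_thread_num();  // thread-local seed
  long hist_local[16] = {0};                     // private per-thread histogram: no races, no false sharing
#pragma omp for
  for(long i=0; i<2000000000; ++i) {
    hist_local[rand_r(&seed) & 15]++;
  }
#pragma omp critical
  for(int b=0; b<16; ++b)
    hist[b] += hist_local[b];                    // each thread merges its result exactly once
}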
(b) (10 crd) Plot performance (iterations per second) versus the number of cores (use up to 18 cores on one ccNUMA domain of Fritz at a fixed clock frequency of 2.0 GHz). Does your solution scale? If it does not, make it scale!
(c) (20 crd) Use the scalable code to estimate the OpenMP overhead in cycles for a fused "parallel for" directive. To do this, reduce the number of iterations to a point where the overhead becomes large enough to be easily measurable. You may have to repeat the benchmark loop multiple times to get proper timings.
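For part (c), a hedged sketch of one possible measurement harness (getTimeStamp() is the timing function used above; the repetition count, the trivial loop body, and the 2.0 GHz cycle conversion are illustrative assumptions):

const long reps = 1000000;                 // repeat often enough for the total time to be measurable
double sum = 0.0;
wcstart = getTimeStamp();
for(long r=0; r<reps; ++r) {
#pragma omp parallel for reduction(+:sum)
  for(int i=0; i<16; ++i)                  // tiny trip count: the construct overhead dominates the work
    sum += i;
}
wcend = getTimeStamp();
cout << "checksum: " << sum << endl;       // keep sum alive so the compiler cannot drop the loop
cout << "approx. overhead per parallel for: "
     << (wcend-wcstart)/reps*2.0e9 << " cycles (at 2.0 GHz)" << endl;
// subtract a purely serial baseline of the same loop to isolate the pure OpenMP overhead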