Assignment 6
- Parallel Gauss-Seidel. We want to parallelize a 2D Gauss-Seidel solver using OpenMP and a pipeline parallel processing strategy (as outlined in the lecture). We stick to the original algorithm (i.e., no fishy red-black tricks etc.). A serial example code can be found in
~unrz55/GettingStarted/gs_plain.c.
(a) Parallelize the code along the same lines as shown in the lecture for the 3D case. Run your code with a problem size of 8000x8000 on 1..10 threads of an Emmy socket and report performance vs. thread count. What ccNUMA-aware data placement strategy should be implemented when running the code on two sockets?
(b) What is the expected full-socket performance assuming bandwidth-bound execution?
(c) Estimate the impact of thread synchronization (barriers), assuming a 2000-cycle penalty for a full-socket OpenMP barrier (5000 cycles for the full node). Does it impact the performance significantly?
(d) Will SMT (Simultaneous Multi-Threading) help with the performance of your parallel solver? Find out how to pin your threads correctly and report performance with 1..20 threads on one Emmy socket. Does the result match your expectations?
- Tasking for the ray tracer. Reimplement the parallel ray tracer using the task feature of OpenMP and compare the performance with your best result on 20 Emmy cores.
- SIMD for polynomials. Consider the following function, which evaluates a polynomial at position x:
double poly_eval(double x, int deg, double *coeff) {
  double f = 0.;
  for (int i = 0; i < deg + 1; ++i) {
    f = x * f + coeff[i];   /* Horner's scheme */
  }
  return f;
}
The coefficients of the polynomial are specified in the array coeff[]. Write a version of the function that can be called from a SIMD-vectorized loop:

double coeff[11];
#pragma omp simd
for(int i=0; i<N; ++i) {
  f[i] = poly_eval(x[i], 10, coeff);
}

Can you quantify the performance gain by running a suitable benchmark? Use a polynomial of degree 10.
Last modified: Thursday, 26 November 2020, 4:20 PM