1. Triangular parallel MVM. Consider the multiplication of a dense lower triangular matrix with a vector:

    for(r=0; r<N; ++r)
      for(c=0; c<=r; ++c)
        y[r] += a[r][c] * x[c];

    Write a benchmark code for this kernel.

    (a) Parallelize the code with OpenMP and static loop scheduling. Which loop should be parallelized, and why? Run the code with 1,...,10 threads on one Emmy socket at problem sizes of 1600x1600 and 8000x8000, respectively. Draw a graph of performance in Mflop/s vs. cores (with the two data sets). Discuss the emerging patterns! 
    (b) Can you improve the performance by choosing an appropriate loop schedule? Present the data as in (a) but with the best loop schedule you could find.
    (c) Use the Roofline model to calculate an upper performance limit of the code at problem size 8000x8000 on one Emmy socket. (Hint: The read-only memory bandwidth is about 47 Gbyte/s). Draw the limit into your diagrams above.
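    A minimal sketch of the parallelization asked for in (a) and (b), assuming the matrix is stored as a dense 2D array and the benchmark driver (timing, repetitions, flop counting) is added around it. The chunk size 16 in the schedule clause is an illustrative choice, not a claim about the best schedule on Emmy:

    ```cpp
    #include <cstdio>
    #include <vector>

    int main() {
      const int N = 1600;  // problem size (assumption for this sketch)
      std::vector<std::vector<double>> a(N, std::vector<double>(N, 1.0));
      std::vector<double> x(N, 1.0), y(N, 0.0);

      // Parallelize the outer (row) loop: its iterations are independent,
      // whereas the inner loop carries the accumulation into y[r].
      // Because the row length grows with r, a plain schedule(static)
      // gives the first thread much less work than the last; a chunked
      // static schedule interleaves rows across threads to balance the load.
      #pragma omp parallel for schedule(static,16)
      for (int r = 0; r < N; ++r)
        for (int c = 0; c <= r; ++c)
          y[r] += a[r][c] * x[c];

      // Sanity check: with a[r][c] = x[c] = 1, row r sums to r+1.
      printf("y[0]=%.0f y[N-1]=%.0f\n", y[0], y[N - 1]);
      return 0;
    }
    ```

    Compile with `-fopenmp` (the pragma is ignored otherwise and the loop runs serially, which is convenient for checking correctness).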
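    For (c), one way to set up the Roofline estimate, assuming that at 8000x8000 the matrix traffic dominates (x and y are only 64 kbyte each and fit in cache, so each inner iteration reads 8 bytes of the matrix and performs 2 flops):

    ```latex
    B_c = \frac{8\,\mathrm{byte}}{2\,\mathrm{flop}} = 4\,\frac{\mathrm{byte}}{\mathrm{flop}},
    \qquad
    P_{\max} = \frac{b_S}{B_c} = \frac{47\,\mathrm{Gbyte/s}}{4\,\mathrm{byte/flop}}
             \approx 11.8\,\mathrm{Gflop/s}.
    ```

    This is a sketch under the stated caching assumption; a write-allocate on y or imperfect caching of x would raise the code balance and lower the limit.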

  2. Parallel histogram. The following code fragment calculates a histogram with 16 bins from the least significant four bits of the numbers produced by the standard rand_r() random number generator:


    double wcstart, wcend, ct;  // timing() is provided by the course example codes
    unsigned int seed = 123;
    long hist[16];
    for(int i=0; i<16; ++i)
      hist[i] = 0;
    timing(&wcstart, &ct);
    for(long i=0; i<1000000000; ++i) {
      hist[rand_r(&seed) & 0xf]++;
    }
    timing(&wcend, &ct);
    for(int i=0; i<16; ++i) {
      cout << "hist[" << i << "]=" << hist[i] << endl;
    }
    cout << "Time: " << wcend-wcstart << " sec" << endl;


    Parallelize the histogram calculation using OpenMP. You can find serial (Fortran and C) example codes in the ~unrz55/GettingStarted/HISTO folder. Draw performance vs. number of cores in a diagram. Does your solution scale? If it does not, make it scale!

    Hint: Try different solutions. Think simple first.
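    One simple solution along the lines of the hint is to give each thread a private histogram and merge at the end, avoiding contention and false sharing on the shared bins. Note that rand_r() with a single shared seed cannot be parallelized as-is; the sketch below uses per-thread seeds, which changes the random stream (an assumption, not part of the original code). The iteration count is reduced here to keep the sketch quick:

    ```cpp
    #include <cstdio>
    #include <cstdlib>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    int main() {
      const long NITER = 10000000;  // reduced from 10^9 for this sketch
      long hist[16] = {0};

      #pragma omp parallel
      {
        long myhist[16] = {0};  // private histogram: no contention, no false sharing
    #ifdef _OPENMP
        // per-thread seed; this alters the random stream vs. the serial code
        unsigned int seed = 123 + omp_get_thread_num();
    #else
        unsigned int seed = 123;
    #endif
        #pragma omp for
        for (long i = 0; i < NITER; ++i)
          myhist[rand_r(&seed) & 0xf]++;

        // merge: 16 additions per thread, negligible serialized work
        #pragma omp critical
        for (int b = 0; b < 16; ++b)
          hist[b] += myhist[b];
      }

      long sum = 0;
      for (int b = 0; b < 16; ++b)
        sum += hist[b];
      printf("total counts: %ld\n", sum);  // must equal NITER
      return 0;
    }
    ```

    An alternative with the same effect is `#pragma omp parallel for reduction(+:hist[:16])` (OpenMP 4.5 array reduction); both privatize the bins, which is what makes the code scale.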


Last modified: Thursday, 19 November 2020, 12:57 PM