Hands-on #5: Gauss-Seidel
In this exercise we analyze the 2D Gauss-Seidel (GS) kernel. The algorithm sweeps over a two-dimensional grid and updates each cell in place based on its four neighboring cells.
You can find the code here: https://hpc-mover.rrze.uni-erlangen.de/compiler-explorer/z/jfn5jr
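For orientation, the core update of the plain in-place sweep looks roughly like the following sketch (phi, NI, NK, and the factor 0.25 are taken from the loop listed in task 3; the actual benchmark code additionally wraps this in timing and repetition logic):

// plain Gauss-Seidel sweep: each cell is replaced by the
// average of its four neighboring cells, updated in place
for (int i = 1; i < NI - 1; ++i) {
  for (int k = 1; k < NK - 1; ++k) {
    phi[i][k] = 0.25 * (phi[i-1][k] + phi[i+1][k]
                      + phi[i][k-1] + phi[i][k+1]);
  }
}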
1. Compile the code using icpc 2021.6.0 with the optimization flags "-Ofast -qopenmp-simd -qopt-zmm-usage=low -xHost -fargument-noalias -funroll-loops -fno-builtin" and create an executor with the same flags. Run the code with a grid of 20x200 (~31 kB).
2. Identify the region of interest (ROI) and analyze it with OSACA. Does your measurement meet your expectations? What is the limiting bottleneck?
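If you run OSACA on the generated assembly yourself, one way to delimit the ROI is to wrap the benchmark loop in the IACA-style byte markers that OSACA recognizes. The helper macros below are a sketch (the macro names are hypothetical); note that markers embedded in the source can perturb code generation, so inserting them directly into the assembly of the kernel loop is often the cleaner option.

// hypothetical marker macros; OSACA detects these IACA-style byte sequences
#define OSACA_START asm volatile("movl $111, %%ebx\n\t.byte 100,103,144" ::: "memory", "ebx")
#define OSACA_END   asm volatile("movl $222, %%ebx\n\t.byte 100,103,144" ::: "memory", "ebx")

Place OSACA_START; immediately before and OSACA_END; immediately after the benchmark loop to mark the region of interest.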
3. Now replace the benchmark loop (lines 142-154) with an optimized version that overcomes the dependencies by carrying the already-updated left neighbor phi[i][k-1] in a scalar temporary:

// benchmark loop
for (int i = 1; i < NI - 1; ++i) {
  t = phi[i][0];
  for (int k = 1; k < NK - 1; ++k) {
    t1 = (phi[i+1][k] + phi[i][k+1]) * 0.25;
    t2 = t1 + phi[i-1][k] * 0.25;
    t  = t2 + 0.25 * t;
    phi[i][k] = t;
  }
  // make sure the execution of rows can't overlap
  phi[i+1][0] = phi[i][NK-2];
}
// make sure the execution of grids can't overlap
phi[1][0] = phi[NI-2][NK-2];
// end of benchmark loop
a) Analyze the new code and measure it. Did the overall prediction and/or the bottleneck change?
b) Do the same analysis with the -O1 flag instead of -Ofast in your compiler options. With -O1 the compiler favors small code size over aggressive performance optimizations. What can you see now and why?
c) Test out at least two other compilers or compiler versions with the -Ofast, -O3, or -O1 flag and try to find out how they optimize the kernel. Note that for Clang and GCC you need to use -fopenmp-simd -march=native -funroll-loops -fno-builtin besides the -O option instead of the icpc flags mentioned in task 1.
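For example, a complete Compiler Explorer flag line for GCC or Clang in c) could read as follows (shown here with -O3; swap in -Ofast or -O1 as needed):

-O3 -fopenmp-simd -march=native -funroll-loops -fno-builtin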