
Prepare for these exercises:


cd ~; cp -a ~xwwclabs/MPIX-HLRS/J2D   .            
#   copy the exercises

cd ~/MPIX-HLRS/J2D/C            #   change into your C directory .OR. …

cd ~/MPIX-HLRS/J2D/F             #   …  .OR. into your Fortran directory





Contents:

   1.  BASIC EXERCISE → Everybody should finish the basic exercise!

   2.  INTERLUDE: ROOFLINE MODEL AND LIGHT-SPEED PERFORMANCE → Will be explained!

   3.  ADVANCED EXERCISE → If your exercise group is really fast, you might try it.

   4.  EXERCISE ON OVERLAPPING COMMUNICATION AND COMPUTATION → After the lecture!




Jacobi Exercises: (→ see also slides ##-##)


   ♦  This is a 2D Jacobi solver (5-point stencil) with a 1D domain decomposition and halo exchange (a sketch of the stencil update follows below this list).

   ♦  The given code is MPI-only.

   ♦  You can build it with:  make  (take a look at the Makefile)

   ♦  Edit the job-submission file:  vi  job.sh

   ♦  Run it with:  qsub  -q R_mpix  job.sh


   ♦  Solutions are provided in the directory:  solution/   
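
As a reminder of what a 5-point stencil sweep with halo rows looks like, here is a minimal sketch; the array and variable names (u, u_new, local_n, m) are placeholders, not the names used in the exercise code:

   /* Minimal sketch (not the exercise code): one 5-point Jacobi sweep on the
    * local subdomain of a 1D row decomposition.  Rows 0 and local_n+1 are halo
    * rows filled by the halo exchange; only the owned rows 1..local_n are updated. */
   void jacobi_sweep(int local_n, int m,
                     double u[local_n+2][m+2], double u_new[local_n+2][m+2])
   {
       for (int i = 1; i <= local_n; i++)
           for (int j = 1; j <= m; j++)
               u_new[i][j] = 0.25 * ( u[i-1][j] + u[i+1][j]
                                    + u[i][j-1] + u[i][j+1] );
   }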



1.  BASIC EXERCISE (see step-by-step below)


  • Parallelize the code with OpenMP to get a hybrid MPI+OpenMP code.
  • Run it effectively on the given hardware.
  • Learn how to take control of affinity with MPI and especially with MPI+OpenMP.
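
Where the ranks and threads actually end up is easiest to verify empirically. Below is a minimal stand-alone sketch (not part of the exercise code) that prints the core each OpenMP thread of each MPI rank is running on; it assumes Linux/glibc for sched_getcpu():

   #define _GNU_SOURCE
   #include <sched.h>
   #include <stdio.h>
   #include <mpi.h>
   #include <omp.h>

   /* Build e.g. with:  mpicc -fopenmp -o where where.c
    * (compiler wrapper and flags may differ on your system). */
   int main(int argc, char **argv)
   {
       int provided, rank;
       MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       #pragma omp parallel
       {
           /* each thread reports the core it is currently running on */
           printf("rank %d  thread %d  core %d\n",
                  rank, omp_get_thread_num(), sched_getcpu());
       }
       MPI_Finalize();
       return 0;
   }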

[Figure: vsc4 Jacobi pinning]

NOTES:

  • The code is strongly memory bound at the problem size set in the input file.
  • Always run multiple times and observe run-to-run performance variations.
  • If you know how, try to calculate the maximum possible performance (ROOFLINE).


STEP-BY-STEP:

  →  Run the MPI-only code with 1, 2, 3, 4, … processes (in the course you may use up to 4 nodes),
       and observe the achieved performance behavior.

  →  Learn how to take control of affinity with MPI.

[Figure: Jacobi roofline (standard)]

  →  Parallelize the appropriate loops with OpenMP (see Documentation links below; a sketch follows after this list).

  →  Run with OpenMP and only 1 MPI process ("OpenMP-only") on 1, 2, 3, 4, …, all cores of 1 node,
       and compare to the MPI-only run.

  →  Run hybrid variants with different MPI vs. OpenMP ratios.
       What's the best performance you can get with using all cores on 4 nodes?

  !!!  Does the OpenMP/hybrid code perform as well as the MPI-only code?
       →  If it doesn't, fix it!
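
If you are unsure which loop to parallelize: the sweep over the owned rows is the natural candidate. A minimal sketch, with the same placeholder names as in the stencil sketch above (the loop in the exercise code may additionally accumulate a residual, in which case a reduction clause is needed):

   /* Sketch: OpenMP work-sharing over the outer (row) loop of the sweep.
    * schedule(static) gives each thread the same rows in every iteration,
    * which matters for first-touch memory placement (see the RECAP below). */
   #pragma omp parallel for schedule(static)
   for (int i = 1; i <= local_n; i++)
       for (int j = 1; j <= m; j++)
           u_new[i][j] = 0.25 * ( u[i-1][j] + u[i+1][j]
                                + u[i][j-1] + u[i][j+1] );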


[Figure: Jacobi roofline (reduced)]

RECAP - you might want to look into the slides or documentation:
  →  Memory placement - First touch! (→ see also slides ##-##; a sketch follows after this list)
  →  PINNING → see the previous exercises (→ see also slides ##-##)

  →  Documentation (→ see the Miscellaneous information)
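
First touch means that a memory page is placed on the NUMA node of the core that writes it first. A minimal sketch of a first-touch initialization, again with placeholder names; the point is that the initialization loop is parallelized with the same schedule and (as far as possible) the same row distribution as the compute loop:

   /* Sketch: first-touch initialization.  Each thread initializes -- and
    * thereby places in its own NUMA domain -- the rows it will later update. */
   #pragma omp parallel for schedule(static)
   for (int i = 0; i <= local_n + 1; i++)
       for (int j = 0; j <= m + 1; j++)
           u[i][j] = u_new[i][j] = 0.0;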
   




2.  INTERLUDE: ROOFLINE MODEL AND LIGHT-SPEED PERFORMANCE

   → see images
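
The idea in short: attainable ("light-speed") performance is capped either by the peak compute rate or by the memory bandwidth times the computational intensity of the kernel. A back-of-the-envelope sketch for a 5-point Jacobi sweep follows; the flop count, the traffic estimate, and the bandwidth figure are assumptions to be replaced by what the actual code does and by a bandwidth measurement on the machine you use:

   % Roofline bound: P_peak = peak compute rate, b_S = memory bandwidth,
   % I = computational intensity (flop per byte moved to/from memory)
   P_{\max} = \min\bigl(P_{\text{peak}},\; I \cdot b_S\bigr)

   % 5-point Jacobi update: ~4 flop per grid point.  Traffic per point in double
   % precision, assuming write-allocate and no reuse between sweeps:
   % read u (8 B) + write u_new (8 B) + write-allocate read of u_new (8 B) = 24 B
   I \approx \frac{4\ \text{flop}}{24\ \text{byte}} \approx 0.17\ \text{flop/byte}

   % with an assumed b_S of 200 GB/s per node (measure it, e.g. with a STREAM-like benchmark):
   P_{\max} \approx 0.17\ \text{flop/byte} \cdot 200\ \text{GB/s} \approx 33\ \text{Gflop/s per node}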




3.  ADVANCED EXERCISE → REDUCE THE DATA TRAFFIC


   → Think about how you could improve the performance of the code...

   → Have a look at how the halo communication is done (a sketch of the typical pattern follows after these questions).

                    ==> halo communication is overlapped with inner update (U_old = U)

   → Why is non-blocking MPI communication used here?

   → Does it make sense to overlap halo communication with inner update?

   → If not, do you still need the non-blocking version of the MPI routines? Why? Why not?

   → Is there a way to increase the computational intensity of the code (reduce data traffic)?
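
For reference when answering the questions above: the usual shape of such a non-blocking halo exchange, written as a sketch with placeholder names (not the exercise code itself), is

   #include <mpi.h>

   /* Sketch: post the halo exchange of a 1D row decomposition with non-blocking
    * MPI.  'up'/'down' are the neighbouring ranks, or MPI_PROC_NULL at the ends
    * of the decomposition. */
   void post_halo_exchange(int local_n, int m, double u[local_n+2][m+2],
                           int up, int down, MPI_Comm comm, MPI_Request req[4])
   {
       MPI_Irecv(&u[0][0],         m+2, MPI_DOUBLE, up,   0, comm, &req[0]);
       MPI_Irecv(&u[local_n+1][0], m+2, MPI_DOUBLE, down, 1, comm, &req[1]);
       MPI_Isend(&u[1][0],         m+2, MPI_DOUBLE, up,   1, comm, &req[2]);
       MPI_Isend(&u[local_n][0],   m+2, MPI_DOUBLE, down, 0, comm, &req[3]);
       /* the caller can overlap independent work here and must call
        *   MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        * before reading the halo rows 0 and local_n+1 */
   }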




4.  EXERCISE ON OVERLAPPING COMMUNICATION AND COMPUTATION


Finally, let's overlap communication and computation...


1. replace the omp for worksharing construct with a taskloop (a sketch follows below):

   parallel{ single{ taskloop{ <compute stencil and update> }}}

   --> this allows you to see the overhead of taskloop

   --> maybe you need a larger problem size to work on (input)

   --> grainsize might help...
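
A minimal sketch of step 1, again with the placeholder names used in the earlier sketches (the grainsize value is only an example to play with):

   /* Sketch: the work-sharing loop replaced by a taskloop.  The thread that
    * executes single creates the loop tasks, all threads of the team execute
    * them, and the taskloop's implicit taskgroup waits until all chunks are done. */
   #pragma omp parallel
   #pragma omp single
   {
       #pragma omp taskloop grainsize(16)
       for (int i = 1; i <= local_n; i++)
           for (int j = 1; j <= m; j++)
               u_new[i][j] = 0.25 * ( u[i-1][j] + u[i+1][j]
                                    + u[i][j-1] + u[i][j+1] );
   }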


2. overlapping communication and computation:

   parallel{ single{ task{halo exchange + halo rows} taskloop{internal computation} }}
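
A sketch of step 2 under the same placeholder assumptions; post_halo_exchange is the sketch from the advanced exercise above, and update_row is a hypothetical helper that applies the stencil update to a single row. Since MPI calls are now made from within a parallel region, the MPI library must have been initialized with a sufficient thread level (at least MPI_THREAD_SERIALIZED for this pattern):

   /* Sketch: one explicit task posts the halo exchange, waits for it, and then
    * updates the first and last owned rows (the only rows that need the halo
    * values), while the taskloop updates the interior rows in the meantime. */
   #pragma omp parallel
   #pragma omp single
   {
       #pragma omp task
       {
           MPI_Request req[4];
           post_halo_exchange(local_n, m, u, up, down, comm, req);
           MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
           update_row(1,       m, u, u_new);    /* hypothetical helpers: same   */
           update_row(local_n, m, u, u_new);    /* stencil update, one row each */
       }

       #pragma omp taskloop grainsize(16)
       for (int i = 2; i <= local_n - 1; i++)   /* interior rows only */
           for (int j = 1; j <= m; j++)
               u_new[i][j] = 0.25 * ( u[i-1][j] + u[i+1][j]
                                    + u[i][j-1] + u[i][j+1] );
   }   /* the implicit barrier at the end of single guarantees all tasks are done */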




