MPI + accelerator exercises
Part 1: A simple integration code
We want to calculate the value of \( \pi \) by numerically integrating a function:
\( \displaystyle\pi=\int\limits_0^1\frac{4}{1+x^2}\,\mathrm dx \)
We use the mid-point rule integration scheme, which works by summing up the areas of rectangles centered around \(x_i\), each with a width of \(\Delta x\) and a height of \(f(x_i)\).
You can copy the example code from a teacher's directory:
$ cp -a ~sct50053/DIV ~/MPIX-HLRS
Example programs in C and Fortran are in the DIV folder. There are makefiles you can modify if you feel this is necessary. There are also simple job scripts, which you should use to run your tests. The program takes the number of slices as the first command line argument (the default is 2 billion).
size_t SLICES = 2000000000;
double delta_x = 1.0/SLICES;
for (size_t i = 0; i < SLICES; i++) {
    x = (i + 0.5) * delta_x;
    sum += (4.0 / (1.0 + x * x));
}
Pi = sum * delta_x;
- Investigate the node structure of the GPU nodes in the cluster with likwid-topology:
$ module load tools/likwid
$ likwid-topology -g | less -S
How many sockets/cores per socket/ccNUMA domains are there? Is hyper-threading enabled?
- There is a simple but functional MPI-only code in the OpenMP folder. Use the NVIDIA compiler module for this:
$ module load compiler/nvidia/mpi-25.3
This automatically includes an appropriate OpenMPI module.
(a) Parallelize the integration loop with OpenMP (no offloading) and run it with several combinations of MPI process count and thread count per process. Make sure that the result is still as expected!
(b) What is the best performance in Git/s (billion iterations per second) you can achieve on the CPU cores of a full node? This is our CPU baseline.
(c) Does the performance scale across cores and processes?
(d) How low can you go with the number of slices before you see an impact from the communication?
- Rewrite the code to use OpenMP target offloading and run it with 1 to 8 GPUs. For this you need to modify the compiler command line to include -mp=gpu -gpu=sm_70. (The Makefile.gpu already has these options; modify as appropriate.)
(a) How much faster than on the CPUs can you get?
(b) Investigate the impact of the number of teams and threads per team using the appropriate OpenMP clauses.
Part 2: Getting fancy
Here we investigate different options for distributing the work between CPUs and GPUs via asynchronous offloading. For this, we provide a baseline code that splits the workload into a GPU and a CPU part according to a command line parameter (div-het.c and div-het.f90). When running the code, you must specify the number of slices and the ratio:
$ mpirun -np 4 ... ./div-het.exe 2000000000 0.3
- Experiment with the ratio between GPU and CPU workloads. Look especially at corner cases (GPU only, CPU only).
- What ratio is expected to yield the best utilization of both devices?
- Implement at least two different variants of asynchronous offloading and investigate whether it works as expected. Look especially at performance variations (i.e., run the program multiple times). Hint: Start simple, with one GPU and one team of threads on the host.
Last modified: Thursday, 12 February 2026, 9:15 AM