A simple integration code

We want to calculate the value of \( \pi \) by numerically integrating a function:

\( \displaystyle\pi=\int\limits_0^1\frac{4}{1+x^2}\,\mathrm dx \)

We use the mid-point rule integration scheme, which works by summing up areas of rectangles centered around \(x_i\) with a width of \(\Delta x\) and a height of \(f(x_i)\):
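Written out, with \(N\) slices, \(\Delta x = 1/N\), and mid-points \(x_i=(i+0.5)\,\Delta x\), the sum is:

\( \displaystyle\pi\approx\sum_{i=0}^{N-1}\frac{4}{1+x_i^2}\,\Delta x \)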

size_t SLICES = 2000000000;
double delta_x = 1.0 / SLICES;
double x, sum = 0.0, Pi;
for (size_t i = 0; i < SLICES; i++) {
    x = (i + 0.5) * delta_x;       /* mid-point of slice i */
    sum += 4.0 / (1.0 + x * x);    /* f(x) = 4 / (1 + x^2) */
}
Pi = sum * delta_x;

You can copy the example code from the DIV folder. The program takes the number of slices as its first command-line argument (the default is 2 billion).

We start with a simple but functional MPI-only code.

(a) Parallelize the integration loop with OpenMP (no offloading) and run it with several combinations of MPI process count and thread count per process. Make sure that the result is still as expected! 
(b) What is the best performance in Giter/s (billions of loop iterations per second) you can achieve on the CPU cores of a full node?
(c) Does the performance scale across cores and processes?

Last modified: Thursday, 26 February 2026, 1:58 PM