We want to calculate the value of \( \pi \) by numerically integrating a function:

\( \displaystyle\pi=\int\limits_0^1\frac{4}{1+x^2}\,\mathrm dx \)

We use the mid-point rule integration scheme, which works by summing up areas of rectangles centered around \(x_i\) with a width of \(\Delta x\) and a height of \(f(x_i)\):

int SLICES = 2000000000;
double delta_x = 1.0 / SLICES, sum = 0.0, x, Pi;
for (int i = 0; i < SLICES; i++) {
    x = (i + 0.5) * delta_x;        /* midpoint of slice i */
    sum += 4.0 / (1.0 + x * x);     /* rectangle height f(x_i) */
}
Pi = sum * delta_x;

You can find example programs in C and Fortran in the DIV folder. The job script job-marvin.sh compiles and runs the code for you (you can choose between C and Fortran by editing the script). The clock speed is fixed at 2.0 GHz.
 
Make sure that the code actually computes an approximation to π, and look at the runtime as obtained on one core of the cluster. 
 
  1. How many loop iterations are performed per cycle? Since the divide operation dominates the runtime, this number is also the throughput (i.e., the number of operations per cycle) of a divide on this CPU with the AVX-512 instruction set.

  2. Now compile successively with the following options instead of "-O3 -xHOST -qopt-zmm-usage=high":

     -O3 -xAVX2
     -O3 -xSSE4.2
     -O1 -no-vec

     These produce AVX2, SSE, and scalar code, respectively.

    How does the divide throughput change? Did you expect this result?





Last modified: Friday, 13 March 2026, 2:06 PM