Hands-on: The divide instruction
We want to calculate the value of \( \pi \) by numerically integrating a function:
\( \displaystyle\pi=\int\limits_0^1\frac{4}{1+x^2}\,\mathrm dx \)
We use the mid-point rule integration scheme, which works by summing up the areas of rectangles centered around \(x_i\) with a width of \(\Delta x\) and a height of \(f(x_i)\):
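Written out, the approximation being computed (with \(N\) slices) is:

\( \displaystyle\pi\approx\Delta x\sum_{i=0}^{N-1}\frac{4}{1+x_i^2},\qquad x_i=(i+0.5)\,\Delta x,\qquad \Delta x=\frac{1}{N} \)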
int SLICES = 2000000000;
double delta_x = 1.0 / SLICES;
double x, sum = 0.0, Pi;

for (int i = 0; i < SLICES; i++) {
    x = (i + 0.5) * delta_x;
    sum += 4.0 / (1.0 + x * x);
}
Pi = sum * delta_x;
You can find example programs in C and Fortran in the ~q26z0001/DIV folder.
Copy the complete folder to your home:
$ cp -a ~q26z0001/DIV ~
Compile the code with the Intel compiler:
$ module load intel
$ icx -std=c99 -O3 -xHOST -qopt-zmm-usage=high div.c -o div.exe
or:
$ ifx -O3 -xHOST -qopt-zmm-usage=high div.f90 -o div.exe
This compiles the code with the largest possible SIMD width on this CPU (512 bit).
Make sure that your code actually computes an approximation to \( \pi \), and look at the runtime and performance in MFlops/s as obtained on one core of the cluster. The clock frequency is already fixed at 2.4 GHz.
- How many flops per second and per cycle are performed?
- Assuming that the divide instruction dominates the runtime of the code (and everything else is hidden behind the divides), can you estimate the throughput (i.e., the number of operations per cycle) for a divide operation in CPU cycles?
- Now compile successively with the following options instead of -O3 -xHOST -qopt-zmm-usage=high:
-O3 -xAVX
-O3 -xSSE4.2
-O1 -no-vec
These produce AVX, SSE, and scalar code, respectively.
How does the divide throughput change? Did you expect this result?
- (Advanced - day 2 or 3) Look at the assembly code that the compiler generates, either with
$ objdump -d div.exe | less
or by letting the compiler produce it:
$ icx -std=c99 -O3 -xHOST -qopt-zmm-usage=high -S div.c
In the latter case you will get the assembly in a file named "div.s".
Try to find the main loop of the code. Hint: the floating-point divide instruction follows the pattern "[v]div[s|p]d". What did the compiler do here to make it all work as expected? Why can we see the raw divide throughput at all?
Last modified: Monday, 17 June 2024, 9:57 PM