NLPE-Ansys: Hands-on: The divide instruction

First you need to start an interactive job on the Fritz cluster, which you should have done as "pre-homework" already. Here's a quick reminder. After logging into the frontend, you request one cluster node for, e.g., 1 hour of interactive work:

$ salloc -p singlenode -N 1 --time=01:00:00

This gives you a shell on a compute node. We have reservations in place so it should be no problem getting a node during normal working hours.

In order to run a binary, you can just start it on the command line as usual. However, if you need to fix the clock frequency then you need to use the srun command:

$ srun --cpu-freq=2000000-2000000:performance ./a.out

Note that the clock speed (min and max values) must be given in kHz.

Now you're good to go. Remember that it's a good idea to keep two shells open: One for running jobs on a cluster node (see above) and a second one to do the editing etc. on the frontend. The number of editors available on the compute nodes is limited. Compilers and other software modules are available on frontends and cluster nodes alike, but you will only be able to compile code on the frontends!

------------------------------------------------------------------------------------------

We want to calculate the value of $ \pi $ by numerically integrating a function:

$ \displaystyle\pi=\int\limits_0^1\frac{4}{1+x^2}\,\mathrm dx $

We use a very simple rectangular integration scheme that works by summing up areas of rectangles centered around $x_i$ with a width of $\Delta x$ and a height of $f(x_i)$:

int SLICES = 2000000000;
double delta_x = 1.0/SLICES;
double x, sum = 0.0;
for (int i=0; i < SLICES; i++) {
  x = (i+0.5)*delta_x;
  sum += (4.0 / (1.0 + x * x));
}
Pi = sum * delta_x;
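
Written as a formula, with $N$ standing for the number of slices (SLICES in the code), the loop above evaluates the midpoint rule:

$ \displaystyle\pi\approx\Delta x\sum_{i=0}^{N-1}\frac{4}{1+x_i^2},\qquad x_i=\left(i+\tfrac{1}{2}\right)\Delta x,\qquad \Delta x=\frac{1}{N} $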

You can find example programs in C and Fortran in the DIV folder.

Compile the code with the Intel compiler:

$ module load intel
$ icx -std=c99 -O3 -xHOST -qopt-zmm-usage=high div.c -o div.exe

or:

$ ifx -O3 -xHOST -qopt-zmm-usage=high div.f90 -o div.exe

This compiles the code with the largest possible SIMD width on this CPU (512 bit).

Make sure that your code actually computes an approximation to π, and look at the runtime and performance in MFlops/s as obtained on one core of the cluster at a fixed clock speed of 2 GHz. How many flops per cycle are performed?

Assuming that the divide instruction dominates the runtime of the code (and everything else is hidden behind the divides), can you estimate the inverse throughput (i.e., the number of cycles per operation) of a divide operation?

Now compile successively with the following options instead of -O3 -xHOST -qopt-zmm-usage=high:
```
-O3 -xSSE4.2
```
```
-O1 -no-vec
```
These produce SSE and scalar code, respectively.

How does the divide throughput change? Did you expect this result?

(Advanced - day 2 or 3) Look at the assembly code that the compiler generates, either with

$ objdump -d div.exe | less

or by letting the compiler produce it:

$ icx -std=c99 -O3 -xHOST -qopt-zmm-usage=high -S div.c

In the latter case you will get the assembly in a file named "div.s".

Try to find the main loop of the code. Hint: the floating-point divide instruction follows the pattern "[v]div[s|p]d". What did the compiler do here to make it all work as expected? Why can we see the raw divide throughput at all?

Last modified: Monday, 28 November 2022, 10:19 AM