Hands-on: The divide instruction
We want to calculate the value of \( \pi \) by numerically integrating a function:
\( \displaystyle\pi=\int\limits_0^1\frac{4}{1+x^2}\,\mathrm dx \)
We use the mid-point rule integration scheme, which works by summing up the areas of rectangles centered around \(x_i\) with a width of \(\Delta x\) and a height of \(f(x_i)\):
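Written out, the approximation being computed (with \(N\) slices) is:

\( \displaystyle\pi\approx\Delta x\sum_{i=0}^{N-1}\frac{4}{1+x_i^2},\qquad x_i=(i+0.5)\,\Delta x,\qquad \Delta x=\frac{1}{N} \)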
int SLICES = 2000000000;
double delta_x = 1.0 / SLICES;
double x, sum = 0.0, Pi;

for (int i = 0; i < SLICES; i++) {
    x = (i + 0.5) * delta_x;
    sum += 4.0 / (1.0 + x * x);
}
Pi = sum * delta_x;
You can find example programs in C and Fortran in the ~q26z0001/DIV folder.
Copy the complete folder to your home:
$ cp -a ~q26z0001/DIV ~
Compile the code with the Intel compiler:
$ module load intel
$ icx -std=c99 -O3 -xHOST -qopt-zmm-usage=high div.c -o div.exe
or:
$ ifx -O3 -xHOST -qopt-zmm-usage=high div.f90 -o div.exe
This compiles the code with the largest possible SIMD width on this CPU (512 bit).
Make sure that your code actually computes an approximation to \( \pi \), and look at the runtime and performance in MFlops/s as obtained on one core of the cluster. The clock frequency is already fixed at 2.4 GHz.
- How many flops per second and per cycle are performed?
- Assuming that the divide instruction dominates the runtime of the code (and everything else is hidden behind the divides), can you estimate the throughput (i.e., the number of operations per cycle) for a divide operation in CPU cycles?
- Now compile successively with the following options instead of -O3 -xHOST -qopt-zmm-usage=high:
-O3 -xAVX
-O3 -xSSE4.2
-O1 -no-vec
These produce AVX, SSE, and scalar code, respectively.
How does the divide throughput change? Did you expect this result?
- (Advanced - day 2 or 3) Look at the assembly code that the compiler generates, either with
$ objdump -d div.exe | less
or by letting the compiler produce it:
$ icx -std=c99 -O3 -xHOST -qopt-zmm-usage=high -S div.c
In the latter case you will get the assembly in a file named "div.s".
Try to find the main loop of the code. Hint: the floating-point divide instruction follows the pattern "[v]div[s|p]d". What did the compiler do here to make it all work as expected? Why can we see the raw divide throughput at all?
Last modified: Monday, 17 June 2024, 9:57 PM