Hands-on: The divide instruction
We want to calculate the value of \( \pi \) by numerically integrating a function:
\( \displaystyle\pi=\int\limits_0^1\frac{4}{1+x^2}\,\mathrm dx \)
We use the mid-point rule integration scheme, which works by summing
up areas of rectangles centered around \(x_i\) with a width of \(\Delta
x\) and a height of \(f(x_i)\):
You can find example programs in C and Fortran in the ~r14s0001/DIV folder.
int SLICES = 2000000000;
double delta_x = 1.0/SLICES;
for (int i=0; i < SLICES; i++) { </strong><strong> x = (i+0.5)*delta_x; </strong><strong> sum += (4.0 / (1.0 + x * x)); </strong><strong>} Pi = sum * delta_x;
You can find example programs in C and Fortran in the ~r14s0001/DIV folder.
Copy the complete folder to your home:
$ cp -a ~r14s0001/DIV ~
Compile the code with the Intel compiler:
$ module load intel
$ icx -std=c99 -O3 -xHOST -qopt-zmm-usage=high div.c -o div.exe
or:
$ ifx -O3 -xHOST -qopt-zmm-usage=high div.f90 -o div.exe
This compiles the code with the largest possible SIMD width on this CPU (512 bit).
Make sure that your code actually computes an approximation to
π, and look at the runtime and performance in MFlops/s as obtained on
one core of the cluster. The clock frequency is already fixed at 2.4 GHz.
- How many flops per second and per cycle are performed?
- Assuming that the divide instruction dominates the
runtime of the code (and everything else is hidden behind the divides),
can you estimate the throughput (i.e., the number of operations
per cycle) for a divide operation in CPU cycles?
- Now compile successively with the following options instead of -O3 -xHOST -qopt-zmm-usage=high:
-O3 -xAVX
-O3 -xSSE4.2-O1 -no-vec
These produce AVX, SSE, and scalar code, respectively.
How does the divide throughput change? Did you expect this result?
- (Advanced - day 2 or 3) look at the assembly code that the compiler generates, either with
$ objdump -d div.exe | less
or by letting the compiler produce it:
$ icc -std=c99 -O3 -xHOST -qopt-zmm-usage=high -S div.c
In the latter case you will get the assembly in a file named "div.s".
Try to find the main loop of the code. Hint: the floating-point divide instruction follows the pattern "[v]div[s|p]d". What did the compiler do here to make it all work as expected? Why can we see the raw divide throughput at all?
Last modified: Tuesday, 3 October 2023, 10:46 AM