Hands-on: The divide instruction
We want to calculate the value of \( \pi \) by numerically integrating a function:
\( \displaystyle\pi=\int\limits_0^1\frac{4}{1+x^2}\,\mathrm dx \)
We use the mid-point rule integration scheme, which works by summing up areas of rectangles centered around \(x_i\) with a width of \(\Delta x\) and a height of \(f(x_i)\):
int SLICES = 2000000000;
double delta_x = 1.0/SLICES;
for (int i=0; i < SLICES; i++) {
    x = (i+0.5)*delta_x;
    sum += (4.0 / (1.0 + x * x));
}
Pi = sum * delta_x;
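Written as a formula, the loop above evaluates the following sum. Here \(N\) stands for the number of slices (SLICES in the code); the symbol \(N\) is introduced only for illustration and does not appear in the example programs:
\( \displaystyle\pi\approx\Delta x\sum_{i=0}^{N-1}\frac{4}{1+x_i^2},\qquad x_i=\left(i+\tfrac{1}{2}\right)\Delta x,\qquad \Delta x=\frac{1}{N} \)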
You can find example programs in C and Fortran in the DIV folder. Compile the code with the Intel compiler:
$ module load intel
$ icx -std=c99 -O3 -xHOST -qopt-zmm-usage=high div.c -o div.exe
or:
$ ifx -O3 -xHOST -qopt-zmm-usage=high div.f90 -o div.exe
This compiles the code with the largest possible SIMD width on this CPU (512 bit).
Make sure that your code actually computes an approximation to π, and look at the runtime and the performance in MFlops/s obtained on one core of the cluster. To fix the clock frequency at a specific value, launch the program through srun with a clock frequency flag:
$ srun --cpu-freq=2400000-2400000:performance ./div.exe
This sets the clock speed to its base value of 2.4 GHz.
- How many flops per second and per cycle are performed?
- Assuming that the divide instruction dominates the runtime of the code (and everything else is hidden behind the divides), can you estimate the throughput of a divide operation, i.e., the number of divide operations per cycle or, equivalently, the number of CPU cycles per divide?
- Now compile successively with the following options instead of -O3 -xHOST -qopt-zmm-usage=high:
-O3 -xAVX
-O3 -xSSE4.2
-O1 -no-vec
These produce AVX, SSE, and scalar code, respectively.
How does the divide throughput change? Did you expect this result?
- (Advanced - day 2 or 3) Look at the assembly code that the compiler generates, either with
$ objdump -d div.exe | less
or by letting the compiler produce it:
$ icx -std=c99 -O3 -xHOST -qopt-zmm-usage=high -S div.c
In the latter case you will get the assembly in a file named "div.s".
Try to find the main loop of the code. Hint: the floating-point divide instruction follows the pattern "[v]div[s|p]d". What did the compiler do here to make it all work as expected? Why can we see the raw divide throughput at all?
Last modified: Tuesday, 3 December 2024, 11:46 AM