Hands-on: The divide instruction
First you need to start an interactive job on the Fritz cluster, which you should have done as "pre-homework" already. Here's a quick reminder. After logging into the frontend, you request one cluster node for, e.g., 1 hour of interactive work:
$ salloc -p singlenode -N 1 --time=01:00:00
This gives you a shell on a compute node. We have reservations in
place so it should be no problem getting a node during normal working
hours.
To run a binary, just start it on the command line as usual. If you need to fix the clock frequency, however, launch it with the srun command:
$ srun --cpu-freq=2000000-2000000:performance ./a.out
Note that the clock speed (min and max values) must be given in kHz.
Now you're good to go. Remember that it's a good idea to keep two
shells open: One for running jobs on a cluster node (see above) and a
second one to do the editing etc. on the frontend. The number of editors
available on the compute nodes is limited. Compilers and other software
modules are available on frontends and cluster nodes alike, but you will only be able to compile code on the frontends!
------------------------------------------------------------------------------------------
We want to calculate the value of \( \pi \) by numerically integrating a function:
int SLICES = 2000000000;
double delta_x = 1.0/SLICES;
for (int i=0; i < SLICES; i++) {
  x = (i+0.5)*delta_x;
  sum += (4.0 / (1.0 + x * x));
}
Pi = sum * delta_x;
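For reference, the loop above is the midpoint rule applied to the integrand \( 4/(1+x^2) \) on the interval \( [0,1] \). Written out, with \( N \) standing for SLICES and \( \Delta x \) for delta_x (the symbols \( N \), \( \Delta x \) and \( x_i \) are notation introduced here, not taken from the example programs):

```latex
\[
  \pi \;=\; \int_0^1 \frac{4}{1+x^2}\,dx
  \;\approx\; \Delta x \sum_{i=0}^{N-1} \frac{4}{1+x_i^2},
  \qquad x_i = (i+0.5)\,\Delta x,
  \qquad \Delta x = \frac{1}{N}
\]
```

Each term of the sum costs exactly one divide, so the loop performs \( N \) divides in total.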
You can find example programs in C and Fortran in the DIV folder.
$ module load intel
$ icx -std=c99 -O3 -xHOST -qopt-zmm-usage=high div.c -o div.exe
$ ifx -O3 -xHOST -qopt-zmm-usage=high div.f90 -o div.exe
- Assuming that the divide instruction dominates the
runtime of the code (and everything else is hidden behind the divides),
can you estimate the inverse throughput of a divide operation (i.e., the
number of CPU cycles per operation)?
- Now compile successively with the following options instead of -O3 -xHOST -qopt-zmm-usage=high:
-O3 -xSSE4.2
-O1 -no-vec
These produce SSE and scalar code, respectively.
How does the divide throughput change? Did you expect this result?
- (Advanced - day 2 or 3) Look at the assembly code that the compiler generates, either with
$ objdump -d div.exe | less
or by letting the compiler produce it:
$ icx -std=c99 -O3 -xHOST -qopt-zmm-usage=high -S div.c
In the latter case you will get the assembly in a file named "div.s".
Try to find the main loop of the code. Hint: the floating-point divide instruction follows the pattern "[v]div[s|p]d". What did the compiler do here to make it all work as expected? Why can we see the raw divide throughput at all?