5. Affinity and NUMA
= Affinity
Under the `affinity` subdirectory you find a very basic OpenMP application in C and Fortran which you can use for this task.
Set the environment variable `OMP_DISPLAY_AFFINITY` to make the runtime print on which place each OpenMP thread is allowed to run:
* Linux
** bash: `export OMP_DISPLAY_AFFINITY=true`
** csh: `setenv OMP_DISPLAY_AFFINITY true`
* Windows / cmd: `set OMP_DISPLAY_AFFINITY=true`
== Affinity Format
The environment variable `OMP_AFFINITY_FORMAT` controls what information is
printed when `OMP_DISPLAY_AFFINITY` is enabled.
It works like the format syntax of `printf`, but with its own field specifiers:
* `%n` / `%{thread_num}`: thread id
* `%N` / `%{num_threads}`: no. of threads in current parallel region
* `%H` / `%{host}`: host name
* `%P` / `%{process_id}`: process id
* `%i` / `%{native_thread_id}`: native thread id
* `%A` / `%{thread_affinity}`: calling thread's affinity
For more specifiers see the OpenMP standard.
Fields can be adjusted in width and padding, e.g. `%0.7n`:
* `0` pads the field with leading zeros
* `.` right-aligns the output
* `7` sets the field width
Example:
```
export OMP_AFFINITY_FORMAT="OpenMP thread id %.3n affinity %A"
```
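The same format string can also be used from inside a program. The following is a minimal sketch, assuming an OpenMP 5.0 runtime that provides `omp_display_affinity`; the helper name `report_affinity` is hypothetical and not part of the exercise code. When compiled without OpenMP support it only prints a notice.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: every thread of a parallel region prints its
 * affinity using the given format string (same syntax as
 * OMP_AFFINITY_FORMAT). Returns the number of threads in the region. */
int report_affinity(const char *format)
{
    int nthreads = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* Assumption: the runtime implements the OpenMP 5.0 routine. */
        omp_display_affinity(format);

        /* One thread records the team size; nthreads is shared. */
        #pragma omp single
        nthreads = omp_get_num_threads();
    }
#else
    (void)format;
    puts("compiled without OpenMP: no affinity information available");
#endif
    return nthreads;
}
```

Calling `report_affinity("OpenMP thread id %.3n affinity %A")` should print one line per thread in the same form as the environment variable example. According to the OpenMP standard, passing `NULL` makes the routine fall back to the value of `OMP_AFFINITY_FORMAT`.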
== Compile
To compile an OpenMP program, pass the OpenMP flag of your compiler:
* gcc, clang, gfortran, flang-new: `-fopenmp affinity.c -o affinity`
* icc, ifort, icx, ifx: `-qopenmp affinity.c -o affinity`
* nvc, nvfortran: `-mp affinity.c -o affinity`
Example:
```
gcc -fopenmp affinity.c -o affinity
```
== Tasks
* Explore topology with lscpu, ...
** separate HW thread (if available)
** separate core
** separate socket
* What happens if there are more threads than places available?
= NUMA
Under the `axpy/numa` subdirectory you find a simple application that implements the `daxpy` kernel.
Once compiled, the binary takes three optional arguments:
```
./axpy [vector size in bytes] [no. of iterations] [value for a]
```
== Tasks
* Without changes, you should be able to reproduce the graph from the slides. Use a vector size of 10 GiB, i.e.:
```
./axpy $((10 * 1024 * 1024 * 1024))
```
* Fix the serial initialization of the data and measure your code again.
* Run the fixed code prefixed with `numactl -m <id of your first NUMA node>` and observe the bandwidth. The command line looks like this:
```
numactl -m <id of your first NUMA node> ./axpy.c.exe $((10 * 1024 * 1024 * 1024))
```
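For the task of fixing the serial initialization, one common approach is parallel first-touch initialization. The sketch below assumes the default Linux first-touch page placement and arrays that were allocated (e.g. with `malloc`) but not yet written. The function names `init_first_touch` and `daxpy` are hypothetical and do not mirror the code in `axpy/numa`.

```c
#include <stddef.h>

/* Initialize x and y in parallel. With a first-touch policy, each page
 * is placed on the NUMA node of the thread that writes it first. */
void init_first_touch(double *x, double *y, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i) {
        x[i] = 1.0;
        y[i] = 2.0;
    }
}

/* daxpy kernel: y = a * x + y. The same static schedule as in the
 * initialization makes each thread work on the chunk it touched first. */
void daxpy(double a, const double *x, double *y, size_t n)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The important design choice is that both loops use the same iteration-to-thread mapping. Combined with thread binding (for example via `OMP_PROC_BIND`), the kernel then mostly accesses memory that is local to the executing thread.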