OMP2210: 5. Affinity and NUMA | NHR Learning Platform

= Affinity

Under the `affinity` subdirectory you will find a very basic OpenMP application in C and Fortran which you can use for this task.

Set the environment variable `OMP_DISPLAY_AFFINITY=true` to get information about
the place on which each OpenMP thread is allowed to run.

* Linux
** bash: `export OMP_DISPLAY_AFFINITY=true`
** csh: `setenv OMP_DISPLAY_AFFINITY true`
* Windows / cmd: `set OMP_DISPLAY_AFFINITY=true`

== Affinity Format

The environment variable `OMP_AFFINITY_FORMAT` controls what information is
printed when `OMP_DISPLAY_AFFINITY` is enabled.

It works like the format syntax for `printf`, but with its own fields:

* `%n` / `%{thread_num}`: thread id
* `%N` / `%{num_threads}`: no. of threads in current parallel region
* `%H` / `%{host}`: host name
* `%P` / `%{process_id}`: process id
* `%i` / `%{native_thread_id}`: native thread id
* `%A` / `%{thread_affinity}`: calling thread's affinity

For more specifiers see the OpenMP standard.

Fields can be adjusted in width and padding, e.g. `%0.7n`:

* `0` causes leading zeros to be printed
* `.` causes right-aligned output
* `7` sets the field size

Example:

```
export OMP_AFFINITY_FORMAT="OpenMP thread id %.3n affinity %A"
```
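The width and padding flags behave much like their `printf` counterparts, with the difference that OpenMP fields are left aligned unless `.` is given. As a rough analogy (plain shell `printf`, not the OpenMP runtime itself), the following sketch shows the padding one would expect for a thread with id 3. The variable names are hypothetical and only serve the illustration.

```shell
tid=3   # hypothetical thread id

# Analogue of %0.7n: right aligned, width 7, padded with leading zeros
zero_padded=$(printf '%07d' "$tid")

# Analogue of %.3n: right aligned, width 3, padded with spaces
right_aligned=$(printf '%3d' "$tid")

# Analogue of %3n: left aligned (the default), width 3
left_aligned=$(printf '%-3d' "$tid")

# Brackets make the padding visible
printf '[%s] [%s] [%s]\n' "$zero_padded" "$right_aligned" "$left_aligned"
```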

== Compile

To compile an OpenMP program, pass the OpenMP flag of your compiler (with the Fortran compilers, use the Fortran source file instead of `affinity.c`):

* gcc, clang, gfortran, flang-new: `-fopenmp affinity.c -o affinity`
* icc, ifort, icx, ifx: `-qopenmp affinity.c -o affinity`
* nvc, nvfortran: `-mp affinity.c -o affinity`

Example:

```
gcc -fopenmp affinity.c -o affinity
```
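With the binary built, a run that prints the affinity of each thread could look like the sketch below. It assumes the binary is named `./affinity` as in the compile example; the thread count of 4 and the exact format string are arbitrary choices for illustration.

```shell
# Assumption: 4 threads is an arbitrary choice, adapt it to your machine.
export OMP_NUM_THREADS=4
export OMP_DISPLAY_AFFINITY=true
export OMP_AFFINITY_FORMAT="host %H pid %P thread %0.3n of %N affinity %A"

# Run only if the binary has been built in the current directory.
if [ -x ./affinity ]; then
    ./affinity
fi
```

Each OpenMP thread should then report one line in the chosen format when it enters the parallel region.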

== Tasks

* Explore the topology with `lscpu`, ...

* Bind each thread to
** a separate HW thread (if available)
** a separate core
** a separate socket

* What happens if there are more threads than places available?
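
One possible starting point for the binding tasks uses the standard environment variables `OMP_PLACES` and `OMP_PROC_BIND`. This is a sketch rather than the reference solution: the binary name `./affinity` is taken from the compile example, and the thread count of 4 is an arbitrary assumption that should be adapted to the machine.

```shell
# Assumption: 4 threads; pick a value that fits your topology.
export OMP_NUM_THREADS=4
export OMP_DISPLAY_AFFINITY=true
export OMP_PROC_BIND=close   # assign threads to consecutive places

# Try each abstract place name in turn: HW threads, cores, sockets.
for places in threads cores sockets; do
    echo "== OMP_PLACES=$places =="
    # Run only if the binary has been built in the current directory.
    if [ -x ./affinity ]; then
        OMP_PLACES=$places ./affinity
    fi
done
```

Raising `OMP_NUM_THREADS` above the number of places available for a given setting is one way to approach the last task.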

= NUMA

Under the `axpy/numa` subdirectory you will find a simple application that implements the `daxpy` kernel.

Once compiled, the binary takes three optional arguments:

```
./axpy [vector size in bytes] [no. of iterations] [value for a]
```
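
Since the first argument is given in bytes, shell arithmetic is a convenient way to build it. A minimal sketch, assuming the binary is named `./axpy` as in the usage line; the size of 1 GiB, the 10 iterations and `a = 2.0` are arbitrary illustration values, not defaults of the program.

```shell
size=$((1 * 1024 * 1024 * 1024))   # 1 GiB expressed in bytes
iterations=10                      # arbitrary choice
a=2.0                              # arbitrary choice

echo "vector size: $size bytes"

# Run only if the binary has been built in the current directory.
if [ -x ./axpy ]; then
    ./axpy "$size" "$iterations" "$a"
fi
```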

== Tasks

* Without any changes, you should be able to reproduce the graph from the slides. Use a vector size of 10 GiB, i.e.:
```
./axpy $((10 * 1024 * 1024 * 1024))
```

* Fix the serial initialization of the data and measure your code again.

* Run the fixed code prefixed with `numactl -m <id of your first NUMA node>` and observe the bandwidth. The command line would look like:
```
numactl -m <id of your first NUMA node> ./axpy.c.exe $((10 * 1024 * 1024 * 1024))
```
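
To find the node ids for the last task, `numactl --hardware` lists the NUMA nodes of the machine together with their CPUs and memory sizes. The sketch below assumes that node 0 is the first NUMA node and that the binary is called `./axpy`; both the `node` and `binary` variables are hypothetical and should be adjusted to your system and build.

```shell
node=0                                # assumption: first NUMA node has id 0
binary=./axpy                         # assumption: adjust to your binary name
size=$((10 * 1024 * 1024 * 1024))     # 10 GiB expressed in bytes

if command -v numactl >/dev/null 2>&1; then
    # List NUMA nodes, their CPUs and memory sizes.
    numactl --hardware || echo "no NUMA information available"

    # Bind all memory allocations to the chosen node and run the kernel.
    if [ -x "$binary" ]; then
        numactl -m "$node" "$binary" "$size"
    fi
else
    echo "numactl is not installed"
fi
```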

Last modified: Monday, 11 March 2024, 7:32 PM