# axpy benchmark

The example implements a simple OpenMP-parallelized axpy kernel
(`y = a * x + y`).  The kernel is called with previously allocated arrays;
each kernel call is timed, and the resulting duration and bandwidth are
reported.
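
The description above can be sketched as follows.  This is a minimal
illustration, not the code in `axpy.c`: the function names, the `double`
element type, and the traffic model of three transfers per element (read `x`,
read `y`, write `y`) are assumptions made for this sketch.

```c
#include <stddef.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* y[i] = a * x[i] + y[i], with the loop shared among the OpenMP threads.
 * ASSUMPTION: double elements; the real benchmark may use another type. */
static void axpy(size_t n, double a, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Wall-clock time in seconds; falls back to C11 timespec_get()
 * when the code is compiled without OpenMP. */
static double wall_time(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
#endif
}

/* Bytes moved by one kernel call.
 * ASSUMPTION: x is read once, y is read once and written once. */
static double axpy_bytes(size_t n)
{
    return 3.0 * (double)n * sizeof(double);
}

/* Time a single kernel call and return its duration in seconds. */
static double time_axpy(size_t n, double a, const double *x, double *y)
{
    double start = wall_time();
    axpy(n, a, x, y);
    return wall_time() - start;
}
```

With `n` elements and a measured duration `t`, the bandwidth in GB/s under
this traffic model would be `axpy_bytes(n) / t * 1e-9`.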

The benchmark usage is:

```bash
./axpy [<array size in bytes> [<no. of iterations>]]
```

* The total working set of the benchmark is 2 x the array size.
* The `<no. of iterations>` determines how often the axpy kernel is called.

NOTE: If you need to use axpy, always use an optimized implementation, which
      is typically provided by a BLAS library.


## First touch

On machines with multiple NUMA locality domains (NUMA LDs), first touch plays a
critical role when initializing shared data.  The first core that writes to a
newly allocated variable or array determines in which NUMA LD it is placed.  By
default, it is the NUMA LD closest to that core.
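
As a generic illustration of first touch (not taken from `axpy.c`), the sketch
below allocates an array and initializes it in a parallel loop, so that each
thread performs the first write to the part it is assigned.  The helper name
`alloc_first_touch` and the `double` element type are hypothetical.  Placement
happens per memory page, and the effect is only predictable when the threads
are pinned to cores.

```c
#include <stddef.h>
#include <stdlib.h>

/* Allocate n doubles and initialize them in parallel.
 * malloc() typically only reserves address space for large arrays; a page
 * is placed in a NUMA LD when it is written for the first time. */
static double *alloc_first_touch(size_t n, double value)
{
    double *v = malloc(n * sizeof *v);
    if (v == NULL)
        return NULL;

    /* Static schedule: each thread touches one contiguous chunk.  A compute
     * loop that uses the same schedule and thread count then works mostly
     * on pages that were placed in the thread's own NUMA LD. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; ++i)
        v[i] = value;

    return v;
}
```

If the same array were initialized by a single thread instead, all of its
pages would typically end up in that thread's NUMA LD, regardless of which
threads access them later.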


## Compile

To compile an OpenMP program, pass the compiler's OpenMP flag:

* gcc, clang, gfortran, flang-new: `-fopenmp axpy.c -o axpy`
* icc, ifort, icx, ifx: `-qopenmp axpy.c -o axpy`
* nvc, nvfortran: `-mp axpy.c -o axpy`

Example:

```bash
gcc -fopenmp axpy.c -o axpy
```


## Tasks

1. Explore the NUMA LD topology of your machine.  Use the corresponding tools
   to get the number of NUMA LDs and the cores associated with them.

2. Measure the memory bandwidth with the axpy benchmark when using the cores of
   1 and 2 NUMA LDs. Use an array size (the first argument to the benchmark,
   in bytes) that is at least twice as large as the last level cache,
   preferably larger.

    * Compile the axpy benchmark.
    * Use thread affinity settings to bind the OpenMP threads accordingly.
    * Observe the reported memory bandwidth when increasing the number of cores
      from one NUMA LD to two.
    * Set the number of axpy kernel calls (second argument to the benchmark)
      to 200 and rerun it on two NUMA LDs.  Observe the bandwidth.

3. Fix the issue the axpy benchmark has and remeasure the memory bandwidth
   with two NUMA LDs.
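
For tasks 1 and 2, tools such as `numactl --hardware`, `lscpu`, or
`likwid-topology` report the NUMA LDs and their cores.  The sketch below shows
one way to obtain similar information from C.  It assumes a Linux system that
exposes NUMA nodes under `/sys/devices/system/node`; the function names are
hypothetical and the code is not part of the benchmark.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Print the CPU list of every NUMA node found in Linux sysfs.
 * Returns the number of nodes found (0 if sysfs has no node entries). */
static int print_numa_nodes(FILE *out)
{
    int count = 0;
    for (int node = 0; node < 1024; ++node) {
        char path[128], cpus[4096];
        snprintf(path, sizeof path,
                 "/sys/devices/system/node/node%d/cpulist", node);
        FILE *f = fopen(path, "r");
        if (f == NULL)
            continue;               /* node IDs need not be contiguous */
        if (fgets(cpus, sizeof cpus, f) != NULL) {
            fprintf(out, "NUMA node %d: CPUs %s", node, cpus);
            ++count;
        }
        fclose(f);
    }
    return count;
}

/* Print the OpenMP place every thread is bound to.
 * Returns the number of threads in the parallel region. */
static int print_thread_places(FILE *out)
{
    int nthreads = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        #pragma omp single
        nthreads = omp_get_num_threads();

        /* omp_get_place_num() returns -1 for an unbound thread. */
        #pragma omp critical
        fprintf(out, "thread %d runs in place %d of %d\n",
                omp_get_thread_num(), omp_get_place_num(),
                omp_get_num_places());
    }
#else
    fprintf(out, "compiled without OpenMP: one thread, no places\n");
#endif
    return nthreads;
}
```

Thread binding itself can be controlled with the standard OpenMP environment
variables, for example `OMP_NUM_THREADS`, `OMP_PLACES=cores`, and
`OMP_PROC_BIND=close` or `spread`.  Calling `print_thread_places(stdout)` under
different settings shows whether the threads ended up where intended.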
