Thread and process binding
You may notice that the performance of a parallel program fluctuates from run to run, sometimes significantly. There can be several reasons for this, but one of the main factors is missing thread or process binding: if you do not specify where your threads and processes should run within the compute node, the OS decides for you, which is almost always a bad idea in an HPC context.
A good general overview of the topic can be found in the HPC Wiki at: https://hpc-wiki.info/hpc/Binding/Pinning
Here we want to give you the information you need to implement pinning with the Intel compiler and MPI library on Fritz.
OpenMP thread affinity
There are various options for binding OpenMP threads to cores.
In the OpenMP standard, two environment variables are defined: OMP_PLACES and OMP_PROC_BIND. OMP_PLACES specifies the basic unit for thread pinning. It can take (among other things) the values "threads", "cores", and "sockets". This means that the "next" thread will always be bound to the "next" entity of the type specified in OMP_PLACES. For example, with OMP_PLACES=cores, OpenMP threads will be attached to cores (one thread per core unless there is oversubscription).
With OMP_PROC_BIND you can specify how the OpenMP threads are assigned to the places. The only relevant settings here are "close" and "spread". With "close", places are filled consecutively from "left to right," whereas with "spread" the OpenMP threads are spread out across the system.
Example for Fritz:
$ OMP_NUM_THREADS=36 OMP_PLACES=cores OMP_PROC_BIND=close ./a.out
starts 36 threads of a.out on the 36 cores of the first socket on the node. On the other hand,
$ OMP_NUM_THREADS=36 OMP_PLACES=cores OMP_PROC_BIND=spread ./a.out
starts 36 threads but places 18 of them on one socket and the other 18 on the other socket. If Hyper-Threading were enabled (it isn't on Fritz), the result would be the same since we specified "cores" and not "threads."
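To verify that the threads actually end up where you expect them, you can let each thread report the core it is currently running on. The following is a minimal sketch, assuming a Linux system (sched_getcpu() is glibc-specific); the file name pin_check.c and the compile line are only suggestions:

// pin_check.c: each OpenMP thread reports the core it is running on.
// Compile e.g. with: icx -qopenmp pin_check.c -o pin_check
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   // sched_getcpu() (Linux/glibc)
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        // With proper binding, this mapping is stable from run to run.
        printf("Thread %2d of %2d runs on core %2d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}

Running it with the "close" settings from the first example above should show each of the 36 threads sitting on its own core of the first socket.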
You can also restrict the set of cores by adding a number in parentheses, e.g. OMP_PLACES="cores(18)". This will only consider the first 18 cores of the system and ignore the rest. There are more flexible options for OMP_PLACES, but we will not need them in this course (they are described on the Wiki page above).
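For example (with the same a.out as before),
$ OMP_NUM_THREADS=18 OMP_PLACES="cores(18)" OMP_PROC_BIND=close ./a.out
pins the 18 threads to the first 18 cores of a Fritz node, i.e., to half of the first socket.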
MPI process affinity with Intel MPI
There is no standardized affinity mechanism for MPI; every implementation has its own way of doing this. Intel MPI's pinning scheme is, in my opinion, overly convoluted. The easiest way to use it is via the I_MPI_PIN_PROCESSOR_LIST environment variable:
$ I_MPI_PIN_PROCESSOR_LIST=0-4,36-40 mpirun -np 10 ./a.out
This will run 10 processes and bind them to the first five cores of each socket of a Fritz node.
$ I_MPI_PIN_PROCESSOR_LIST=0,18,36,54 mpirun -np 4 ./a.out
This will spread out 4 processes with maximum spacing in the node.
$ I_MPI_PIN_PROCESSOR_LIST=36-71 mpirun -np 36 ./a.out
This will run 36 processes on the second socket.
If you want to run fewer than 72 processes per node, mpirun has the option "-ppn":
$ I_MPI_PIN_PROCESSOR_LIST=0-9 mpirun -np 20 -ppn 10 ./a.out
This runs 10 processes per node (20 overall, so you must have allocated at least two nodes) and binds them to cores 0-9, i.e., the first ten cores of the first socket, of each node.
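As with OpenMP, it pays off to check the resulting pinning. Below is a minimal sketch in which every rank reports its host and current core; again it assumes a Linux system (sched_getcpu() is glibc-specific), and the file name rank_check.c and the compile line are only suggestions:

// rank_check.c: each MPI rank reports its host and the core it runs on.
// Compile e.g. with an Intel MPI wrapper: mpiicc rank_check.c -o rank_check
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   // sched_getcpu() (Linux/glibc)
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    // With correct pinning, each rank stays on the core assigned to it.
    printf("Rank %2d of %2d runs on core %2d of %s\n",
           rank, size, sched_getcpu(), host);

    MPI_Finalize();
    return 0;
}

Alternatively, Intel MPI prints its pinning decisions at startup if you set I_MPI_DEBUG=4 (or higher).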