



# **Hybrid Programming in HPC – MPI+X**

**Claudia Blaas-Schenner1) Georg Hager2) Rolf Rabenseifner3) claudia.blaas-schenner@tuwien.ac.at georg.hager@fau.de rabenseifner@hlrs.de**

<http://tiny.cc/MPIX-VSC>

the exercises

<http://tiny.cc/MPIX-LRZ>

1) VSC Research Center, TU Wien, Vienna, Austria (hands-on labs) <sup>2)</sup> Erlangen National High Performance Computing Center (NHR@FAU), FAU, Germany 3) High Performance Computing Center (HLRS), University of Stuttgart, Germany

**PTC ONLINE COURSE @ VSC Vienna, Dec 12-14, 2022**

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

### Warmup survey

- For quizzes and surveys,
	- Keep a browser tab open on [https://menti.com](https://menti.com/)

Links is also in Moodle

- To join the quizzes and surveys, enter the number given in the menti.com screen share on the top of the screen
- Alternatively, click on the link in the Zoom chat
- $\blacksquare$  Have fun  $\lightharpoonup$ )

### **General outline**

#### **Introduction**

#### **Programming Models [\(13\)](#page-12-0)**

- $\blacksquare$  MPI + OpenMP on multi/many-core  $(14)$  + Exercises
- $\blacksquare$  MPI + Accelerators [\(99\)](#page-98-0)
- $\blacksquare$  MPI + MPI-3 shared memory  $(115)$  + Exercise  $(143)$
- Pure MPI communication [\(175\)](#page-174-0)

**Conclusions** (Summary [\(232\),](#page-231-0) Acknowledgements [\(238\)](#page-237-0), Conclusions [\(239\)](#page-238-0))

**Appendix** [\(240\)](#page-239-0) **(**Abstract [\(240\),](#page-240-0) Authors [\(240\)](#page-241-0), Solutions [\(245\)](#page-244-0)**)**

Hybrid Programming – MPI+X Programming models 3/239

### **Introduction**

**Hardware and programming models Hardware Bottlenecks Questions addressed in this tutorial Remarks on Cost-Benefit Calculation**

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

# Hardware and programming models



- MPI + threading
	- OpenMP
	- $\overline{\phantom{a}}$  Cilk(+)
	- **TBB (Threading Building Blocks)**
- MPI + MPI shared memory
- MPI + accelerator
	- OpenACC
	- OpenMP accelerator support
	- CUDA
	- OpenCL, Kokkos, SYCL,...

#### Pure MPI communication

### Options for running code on multicore clusters



- Which programming model is fastest?
	- MPI everywhere?

Fully hybrid



• Something between? (Mixed model)

…

MPI & OpenMP?



 $\mathcal{L}_{\mathcal{A}}$ 

**?** • Often hybrid programming **slower** than pure MPI – Examples, Reasons,

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Motivation  $\rightarrow$  Options for running code 6/239

### More Options with accelerators



#### Hierarchical hardware

Many levels

#### Hierarchical parallel programming

- Many options for MPI+X: one MPI process per
	- node
	- CPU
	- ccNUMA domain
	- […]
	- core
	- hyper-thread



Hybrid Programming – MPI+X  $\rightarrow$  Motivation  $\rightarrow$  Options for running code 7/239

Where is the

#### Dual-CPU ccNUMA + accelerator node architecture

#### Actual topology of a modern compute node



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

ш

### Hardware bottlenecks

- Multicore cluster
	- Computation
	- Memory bandwidth
	- Intra-CPU communication (i.e., core-to-core)
	- Intra-node communication (i.e., CPU-to-CPU)
	- Inter-node communication
- Cluster with CPU+Accelerators
	- Within the accelerator
		- Computation
		- Memory bandwidth
		- Core-to-Core communication
	- Within the CPU and between the CPUs
		- See above
	- Link between CPU and accelerator



## Example: Hardware bottlenecks in SpMV

- Sparse matrix-vector-multiply with stored matrix entries
	- Bottleneck: memory bandwidth of each CPU
- SpMV with calculated matrix entries (many complex operations per entry)
	- Bottleneck: computational speed of each core
- SpMV with highly scattered matrix entries
	- Bottleneck: Inter-node communication





### Questions addressed in this tutorial

- What is the performance impact of system topology?
- How do I map my programming model on the system to my advantage?
	- $\blacksquare$  How do I do the split into MPI $+X$ ?
	- Where do my processes/threads run? How do I take control?
	- Where is my data?
	- **How can I minimize communication overhead?**
- How does hybrid programming help with typical HPC problems?
	- Can it reduce communication overhead?
	- Can it reduce replicated data?
- How can I leverage multiple accelerators?
	- What are typical challenges?

#### Remarks on Cost-Benefit Calculation

#### Costs – for optimization effort

- e.g., additional OpenMP parallelization
- e.g., 3 person month x 5,000  $\epsilon$  = -15,000  $\epsilon$  (full costs)

#### Benefit – from reduced CPU utilization

- e.g., Example 1: **100,000 € hardware costs** of the cluster
	- x 20% used by this application over whole lifetime of the cluster
	- x 7% performance win through the optimization
	- = +1,400 € **total loss = 13,600 €**
- e.g., Example 2: **10 Mio € system** x 5% used x 8% performance win = +40,000 € **total win = 25,000 €**

#### **Question: Do you want to spend work hours without a final benefit?**

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Introduction  $\rightarrow$  Cost-Benefit Calculation



# Programming models

- **MPI + OpenMP on multi/many-core + Exercise**
- **MPI + MPI-3.0 shared memory + Exercise**
- **Pure MPI communication + Exercise**
- **MPI + Accelerators**

<span id="page-12-0"></span>Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models

# **Programming models - MPI + OpenMP**



<span id="page-13-0"></span>Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas -Schenner (VSC, TU Wien)

# **Programming models - MPI + OpenMP**

#### **General considerations**

**> General considerations** How to compile, link, and run Hands-on: Hello hybrid! System topology, ccNUMA, and memory bandwidth Memory placement on ccNUMA systems Topology and affinity on multicore Hands-on: Pinning Case study: The Multi-Zone NAS Parallel Benchmarks Hands-on: Masteronly hybrid Jacobi Overlapping communication and computation Communication overlap with OpenMP taskloops Hands-on: Taskloop-based hybrid Jacobi Main advantages, disadvantages, conclusions

<span id="page-14-0"></span>Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  How to compile, link, and run

## Potential advantages of MPI+OpenMP

#### Simple level

- Leverage additional levels of parallelism
	- Scaling to higher number of cores
	- Adding OpenMP with incremental additional parallelization

#### **Enable flexible load balancing on OpenMP level**

- Fewer MPI processes leave room for assigning workload more evenly
- MPI processes with higher workload could employ more threads
- Cheap OpenMP load balancing (tasking, dynamic/guided loops)

#### **Lower communication overhead (possibly)**

- **Few "fat" MPI processes vs many "skinny" processes**
- Fewer messages and smaller amount of data communicated
- **EXECT** Lower memory requirements due to fewer MPI processes
	- Reduced amount of application halos & replicated data
	- Reduced size of MPI internal buffer space

#### Advanced level

Explicit communication/computation overlap

### MPI + any threading model

#### Special MPI init for multi-threaded MPI processes is required:



• returned **thread\_level\_provided** may be less or more than **thread\_level\_required**

**if (thread\_level\_provided < thread\_level\_required) MPI\_Abort(…);**

<sup>1)</sup> Main thread = thread that called MPI Init\_thread. Recommendation: Start MPI\_Init\_thread from OpenMP master thread  $\rightarrow$  OpenMP master = MPI main thread

recommended directly after MPI\_Init\_thread

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  General considerations

# Hybrid MPI+OpenMP masteronly style

```
for (iterations) {
  #pragma omp parallel 
     numerical code
  /*end omp parallel */
  /* on master only */
     MPI_Isend();
     MPI_Irecv();
     MPI_Waitall();
} /* end for loop */
```
#### Advantages

- Simplest possible hybrid model
- Thread-parallel execution and MPI communication strictly separate
- Minimally required MPI thread support level: **MPI\_THREAD\_FUNNELED**

#### Major Problems

 All other threads are sleeping while master thread communicates!

masteronly style: MPI only outside of parallel regions

 Only one thread per process communicating  $\rightarrow$  possible underutilization of network bandwidth

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  General considerations

# Masteronly style within large parallel region

```
#pragma omp parallel 
for(iterations) {
  #pragma omp for
```

```
for(i=0; ...) {
 // ... numerics
} // barrier here
#pragma omp single
{
  MPI_Isend();
  MPI_Irecv();
  MPI_Waitall();
} // Barrier here
```

```
} /* end iter loop */
```
- Barrier before MPI required
	- May be implicit
	- Prevent race conditions on communication buffer data
		- Between multi-threaded numerics
		- and MPI access by master thread
	- Enforce flush of variables
- **Barrier after MPI required** 
	- May be implicit
	- Numerical loop(s) may need communicated data

# **Programming models - MPI + OpenMP**

#### **How to compile, link, and run**

General considerations

- **> How to compile, link, and run** Hands-on: Hello hybrid!
	- System topology, ccNUMA, and memory bandwidth
		- Memory placement on ccNUMA systems
	- Topology and affinity on multicore
		- Hands-on: Pinning
- Case study: The Multi-Zone NAS Parallel Benchmarks Hands-on: Masteronly hybrid Jacobi
- Overlapping communication and computation Communication overlap with OpenMP taskloops
	- Hands-on: Taskloop-based hybrid Jacobi
- Main advantages, disadvantages, conclusions

#### <span id="page-19-0"></span>Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

## How to compile, link and run

- Use appropriate OpenMP compiler switch (-openmp, -fopenmp, -mp, -qsmp=openmp, …) and MPI compiler script (if available)
- Link with MPI library
	- Usually wrapped in MPI compiler script
	- If required, specify to link against thread-safe MPI library
		- Often automatic when OpenMP or auto-parallelization is switched on
- Running the code
	- Highly non-portable consult system docs (if available…)
	- Figure out how to start fewer MPI processes than cores per node
	- Pinning (who is running where?) is extremely important  $\rightarrow$  see later

ш

### Compiling from a single source

#### Make use of pre-defined symbols

```
#ifdef _OPENMP # _OPENMP defined with -qopenmp
      // all that is special for OpenMP
#endif
#ifdef USE_MPI # USE_MPI defined with -DUSE_MPI
      // all that is special for MPI
#endif
#ifdef USE_MPI
      MPI_Init(...);
      MPI_Comm_rank(..., &rank);
      MPI_Comm_size(..., &size);
#else # recommended for non-MPI
      rank = 0;
      size = 1;
#endif
```
Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  How to compile, link, and run 22  $\rightarrow$  2008 and run 22/239

### Compiling from a single source





Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  How to compile, link, and run

 $\mathcal{L}_{\mathcal{A}}$ 

### Examples for compilation and execution

- Cray XC40 (2 NUMA domains w/ 12 cores each):
	- **ftn -h omp ...**
	- **OMP\_NUM\_THREADS=12 aprun -n 4 -N 2 \ -d \$OMP\_NUM\_THREADS ./a.out**
- Intel Ice Lake (36-core 2-socket) cluster, Intel MPI/OpenMP
	- **mpiifort -qopenmp ...**
	- **mpirun –ppn 2 –np 4 \**

**-env OMP\_NUM\_THREADS 36** 

**-env I\_MPI\_PIN\_DOMAIN socket \**

- **-env KMP\_AFFINITY scatter ./a.out**
- Intel Ice Lake (36-core 2-socket) cluster, Intel MPI/OpenMP + likwid-mpirun
	- **likwid-mpirun –np 4 –pin S0:0-35\_S1:0-35 ./a.out**

### Learn about node topology

- A collection of tools is available
	- **numactl --hardware** (numatools)
	- **lstopo --no-io** (part of hwloc)
	- **cpuinfo –A** (part of Intel MPI)
	- **likwid-topology** (part of LIKWID tool suite [http://tiny.cc/LIKWID\)](http://tiny.cc/LIKWID)

```
$ likwid-topology -c -g
---------------------------------------------------------------------------------
CPU name: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
CPU type: Intel Xeon IvyBridge EN/EP/EX processor
CPU stepping: 4
        *********************************************************************************
Hardware Thread Topology
*********************************************************************************
Sockets: 2
Cores per socket: 8
Threads per core: 2
[... Some output omitted ...]
```


Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  How to compile, link, and run

### Learning about node topology



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  How to compile, link, and run

### Learning about node topology



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming - MPI+X → Programming models → MPI + OpenMP → How to compile, link, and run

# **Programming models - MPI + OpenMP**

#### **Hands-On #1** General considerations

#### <span id="page-27-0"></span>**Hello hybrid!**

- How to compile, link, and run
- **> Hands-on: Hello hybrid!**
- System topology, ccNUMA, and memory bandwidth Memory placement on ccNUMA systems Topology and affinity on multicore **Hands-on: Pinning** Case study: The Multi-Zone NAS Parallel Benchmarks Hands-on: Masteronly hybrid Jacobi Overlapping communication and computation Communication overlap with OpenMP taskloops Hands-on: Taskloop-based hybrid Jacobi Main advantages, disadvantages, conclusions

he-hy - Hello Hybrid! - compiling, starting

- 1. FIRST THINGS FIRST PART 1: find out about a (new) cluster login node
- 2. FIRST THINGS FIRST PART 2: find out about a (new) cluster batch jobs
- 3. MPI+OpenMP: :**TODO**: how to compile and start an application how to do conditional compilation
- 4. MPI+OpenMP: :**TODO**: get to know the hardware needed for pinning

# **Programming models - MPI + OpenMP**

#### **System topology, ccNUMA, and memory bandwidth**

- General considerations How to compile, link, and run Hands-on: Hello hybrid! **> System topology, ccNUMA, and memory bandwidth** Memory placement on ccNUMA systems Topology and affinity on multicore Hands-on: Pinning Case study: The Multi-Zone NAS Parallel Benchmarks Hands-on: Masteronly hybrid Jacobi Overlapping communication and computation Communication overlap with OpenMP taskloops Hands-on: Taskloop-based hybrid Jacobi
	- Main advantages, disadvantages, conclusions

#### <span id="page-29-0"></span>Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Topology and performance

### What is "topology"?

Where in the machine does core (or hardware thread) #n reside?



Why is this important?

- **Resource sharing (cache,** data paths)
- **Communication efficiency** (shared vs. separate caches, buffer locality)
- **Memory access locality** (ccNUMA!)

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)



**VSC-3**



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Topology and performance

 $\blacksquare$ 

#### Fat-tree Design VSC-3: **dual rail Intel QDR-80 = 3-level fat-tree (BF: 2:1 / 4:1)**

VSC-3: below numbers only, schematic figure

non-blocking: BF 1:1 blocking: BF down- : up-links introduces a latency:

packets that would otherwise follow separate paths would eventually have to wait



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

skipped

**VSC-3: 1 node** Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Topology and performance **VSC-3: 1 node** 33/239

 $\mathbf{r}$ 

# Ping-Pong Benchmark – Latency

#### Intra-node vs. inter-node on VSC-3

- $\blacksquare$  nodes = 2 sockets (Intel Ivy Bridge) with 8 cores  $+$  2 HCAs
- inter-node = IB fabric = dual rail Intel QDR-80 = 3-level fat-tree (BF: 2:1 / 4:1)

```
)<br>tII<br>tir
myID = get_process_ID()
if(myID.eq.0) then
  \text{targetID} = 1S = qet walltime()call Send_message(buffer,N,targetID)
  call Receive_message(buffer,N,targetID)
  E = qet walltime()GBYTES = 2*N/(E-S)/1.d9 ! Gbyte/s rate
  TIME = (E-S)/2*1.d6 ! transfer time
else
  \text{targetID} = 0call Receive_message(buffer,N,targetID)
  call Send_message(buffer,N,targetID)
endif
```




 $\rightarrow$  Avoiding slow data paths is the key to most performance optimizations!

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Topology and performance

 $\mathcal{L}_{\mathcal{A}}$ 

#### Ping-Pong 1-on-1 Benchmark – Effective Bandwidth



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Topology and performance

# Multiple communicating rings

Benchmark halo irecy send multiplelinks toggle.c

- Varying message size,
- number of *communication cores per CPU*, and
- **four communication schemes** (example with 5 *communicating cores per CPU*)

See HLRS online courses **http://www.hlrs.de/training/self-study-materials**  $\rightarrow$  Practical  $\rightarrow$  MPI.tar.gz  $\rightarrow$  subdirectory MPI/course/C/1sided/



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Topology and performance
# Duplex accumulated ring bandwidth per node



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)



37/239

# Duplex ring bandwidth per core



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

# Accumulated – scaling vs. asymptotic behavior



# OpenMP barrier synchronization cost

Comparison of barrier synchronization cost with increasing number of threads

- 2x Haswell 14-core (CoD mode)
- Optimistic measurements (repeated 1000s of times)
- No impact from previous activity in cache
- $\rightarrow$  Barrier sync time highly dependent on system topology & OpenMP runtime implementation



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

### Accumulated bandwidth saturation vs. # cores



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

# **Programming models - MPI + OpenMP**

## **Memory placement on ccNUMA systems**

General considerations How to compile, link, and run Hands-on: Hello hybrid! System topology, ccNUMA, and memory bandwidth **> Memory placement on ccNUMA systems** Topology and affinity on multicore Hands-on: Pinning Case study: The Multi-Zone NAS Parallel Benchmarks Hands-on: Masteronly hybrid Jacobi Overlapping communication and computation Communication overlap with OpenMP taskloops Hands-on: Taskloop-based hybrid Jacobi Main advantages, disadvantages, conclusions

# A short introduction to ccNUMA

- ccNUMA:
	- whole memory is transparently accessible by all processors
	- but physically distributed
	- with varying bandwidth and latency
	- and potential contention (shared memory paths)
	- Memory placement occurs with OS page granularity (often 4 KiB)



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Topology and performance

ш

### How much bandwidth does non-local access cost?

■ Example: AMD "Naples" 2-socket system (8 chips, 2 sockets, 48 cores): STREAM Triad bandwidth measurements [Gbyte/s] **Aloulaw** 



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Topology and performance

**Alowaw** 

# Avoiding locality problems

- How can we make sure that memory ends up where it is close to the CPU that uses it?
	- See next slides (first-touch initialization)
- How can we make sure that it stays that way throughout program execution?
	- See later in the tutorial (pinning)

### ■ Taking control is the key strategy!

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

## Solving Memory Locality Problems: First Touch

"Golden Rule" of ccNUMA:

**A memory page gets mapped into the local memory of the processor that first touches it!**



- Consequences
	- Process/thread-core affinity is decisive!
	- With OpenMP, data initialization code becomes important even if it takes little time to execute ("parallel first touch")
	- Parallel first touch is automatic for pure MPI
	- If thread team does not span across NUMA domains, memory mapping is not a problem

### Automatic page migration may help if memory is used long enough

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Topology and performance 46/239

## Solving Memory Locality Problems: First Touch

"Golden Rule" of ccNUMA:

**A memory page gets mapped into the local memory of the processor that first touches it!** Inportal

- Except if there is not enough local memory available
- Some OSs allow to influence placement in more direct ways
	- $\rightarrow$  libnuma (Linux)
- Caveat: "touch" means "write," not "allocate" or "read"

Example:

```
double *huge = (double*)malloc(N*sizeof(double));
// memory not mapped yet
for(i=0; i<N; i++) // or i+=PAGE_SIZE
  huge[i] = 0.0; // mapping takes place here!
```
Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Topology and performance

**T** 

# Most simple case: explicit initialization

**integer,parameter :: N=10000000 double precision A(N), B(N) A=0.d0 !\$OMP parallel do**  $d$ **o**  $i = 1, N$  $B(i) = function (A(i))$ **end do !\$OMP end parallel do**

```
integer,parameter :: N=10000000
double precision A(N),B(N)
!$OMP parallel 
!$OMP do schedule(static)
do i = 1, NA(i)=0.d0
end do
!$OMP end do
...
!$OMP do schedule(static)
do i = 1, NB(i) = function (A(i))end do
!$OMP end do
!$OMP end parallel
```
Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Topology and performance

**T** 

# Handling ccNUMA in practice

- Solution A
	- One (or more) MPI process(es) per ccNUMA domain
	- Pro: optimal page placement (perfectly local memory access) for free
	- Con: higher number (>1) of MPI processes on each node
- Solution B
	- One MPI process per node or one MPI process spans multiple ccNUMA domains
	- Pro: Smaller number of MPI processes compared to Solution A
	- Cons:
		- Explicitly parallel initialization needed to "bind" the data to each ccNUMA domain  $\rightarrow$  otherwise loss of performance
		- Dynamic/guided schedule or tasking  $\rightarrow$  loss of performance
- Thread binding is mandatory for A and B! Never trust the defaults!

### Conclusions from the observed topology effects

- Know your hardware characteristics:
	- Hardware topology (use tools such as likwid-topology)
	- Typical hardware bottlenecks
		- These are independent of the programming model!
	- Hardware bandwidths, latencies, peak performance numbers
- Know your software characteristics
	- Typical numbers for communication latencies, bandwidths
	- Typical OpenMP overheads
- **Learn how to take control** 
	- See next chapter on affinity control
- Leveraging topology effects is a part of code optimization!

**The State** 

# **Programming models - MPI + OpenMP**

### **Topology and affinity on multicore**

General considerations How to compile, link, and run Hands-on: Hello hybrid! System topology, ccNUMA, and memory bandwidth Memory placement on ccNUMA systems **> Topology and affinity on multicore** Hands-on: Pinning Case study: The Multi-Zone NAS Parallel Benchmarks Hands-on: Masteronly hybrid Jacobi Overlapping communication and computation Communication overlap with OpenMP taskloops Hands-on: Taskloop-based hybrid Jacobi Main advantages, disadvantages, conclusions

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

# Thread/Process Affinity ("Pinning")

- **Highly OS-dependent system calls** 
	- But available on all OSs
	- Non-portable
- Support for user-defined pinning for OpenMP threads in all compilers
	- Compiler specific
	- Standardized in OpenMP (places)
	- Generic Linux: **taskset**, **numactl, likwid-pin**
- Affinity awareness in all MPI libraries
	- Not defined by the MPI standard (as of 4.0)
	- Necessarily non-portable feature of the startup mechanism (mpirun, …)
- Affinity awareness in batch scheduler
	- Batch scheduler must work with MPI + OpenMP affinity
	- Difficult, non-portable, every combination is different

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Topology and affinity on multicore

**The Second** 

# Anarchy vs. affinity with OpenMP STREAM



- Making use of architectural features
- Avoiding resource contention

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Topology and affinity on multicore

50  $\blacksquare$ 

(fill first CMG first)

40

30

200 100

10

20

# cores



- Binds threads to specific cores without touching code
- Directly supports pthreads, gcc OpenMP, Intel OpenMP
- Allows user to specify "skip mask" (i.e., supports many different compiler/MPI combinations)
- Replacement for **taskset**
- Uses logical (contiguous) core numbering when running inside a restricted set of cores
- Supports logical core numbering inside node, socket, core
- Usage examples:

**env OMP\_NUM\_THREADS=6 likwid-pin -c 0-2,4-6 ./myApp parameters** 

**likwid-pin –c S0:0-2@S1:0-2 ./myApp**

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

### OMP PLACES and Thread Affinity (see OpenMP-4.0 page 7 lines 29-32, p. 241-243)



#### Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

### OMP\_PROC\_BIND variable / proc\_bind() clause

### Determines how places are used for pinning:



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

nes

# Some simple OMP\_PLACES examples

- Intel Xeon w/ SMT, 2x36 cores, 1 thread per physical core, fill 1 socket **OMP\_NUM\_THREADS=36 OMP\_PLACES=cores OMP\_PROC\_BIND=close**
- Intel Xeon Phi with 72 cores,

### 32 cores to be used, 2 threads per physical core

**OMP\_NUM\_THREADS=64 OMP\_PLACES=cores(32) OMP\_PROC\_BIND=close # spread will also do**

#### Intel Xeon, 2 sockets, 4 threads per socket (no binding within socket!) **OMP\_NUM\_THREADS=8 OMP\_PLACES=sockets OMP\_PROC\_BIND=close # spread will also do**

### Intel Xeon, 2 sockets, 4 threads per socket, binding to cores

**OMP\_NUM\_THREADS=8 OMP\_PLACES=cores OMP\_PROC\_BIND=spread**

Always prefer abstract places instead of HW thread IDs!

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Topology and affinity on multicore

 $\mathcal{L}_{\mathcal{A}}$ 

## OpenMP places and proc\_bind (see OpenMP-4.0 pages 49f, 239, 241-243)

setenv OMP PLACES " $\{0\}$ ,  $\{1\}$ ,  $\{2\}$ , …  $\{29\}$ ,  $\{30\}$ ,  $\{31\}$ " or setenv OMP PLACES threads (example with P=32 places)

skipped

- setenv OMP NUM THREADS "8, 2, 2" setenv OMP\_PROC\_BIND "spread,spread,close" =
- Master thread encounters nested parallel regions: #pragma omp parallel  $\rightarrow$  uses: num\_threads(8) proc\_bind(spread) #pragma omp parallel  $\rightarrow$  uses: num threads(2) proc bind(spread) #pragma omp parallel  $\rightarrow$  uses: num threads(2) proc\_bind(close)



- spread: Sparse distribution of the 8 threads among the 32 places; partitioned place lists.
- close: New threads as close as possible to the parent's place; same place lists.
- master: All new threads at the same place as the parent.

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

## Goals behind OMP\_PLACES and proc\_bind



Examples should be independent of vendor's numbering!

- Without nested parallel regions: #pragma omp parallel num\_threads( $4*6$ ) proc\_bind(spread)  $\rightarrow$  one thread per core
- With nested regions:

Skipped

#pragma omp parallel num threads(4) proc\_bind(spread)  $\rightarrow$  one thread per socket #pragma omp parallel num\_threads(6) proc\_bind(spread)  $\rightarrow$  one thread per core #pragma omp parallel num threads(2) proc\_bind(close)  $\rightarrow$  one thread per hyper-thread

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

# Pinning of MPI processes

- Highly system dependent!
- Intel MPI: env variable **I MPI PIN DOMAIN**
- OpenMPI: choose between several mpirun options, e.g., -bind-to-core, -bind-to-socket, -bycore, -byslot …
- Cray's aprun: pinning by default

 Platform-independent tools: likwid-mpirun (likwid-pin, numactl)

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

## Anarchy vs. affinity with a heat equation solver



Reasons for caring about affinity:

- **Eliminating performance variation**
- Making use of architectural features
- Avoiding resource contention



2x 10-core Intel Ivy Bridge, OpenMPI



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Topology and affinity on multicore 61/239

 $\blacksquare$ 



Topology ("mapping") with MPI+OpenMP: *Lots of choices – solutions are highly system specific!* 

One MPI process per node

One MPI process per socket

OpenMP threads pinned "round robin" across cores in node

Two MPI processes per socket

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner



Hybrid Programming – MPI+X → Programming models → MPI + OpenMF Conclusions and after the MPI+Accelerator

Tools

### likwid-mpirun: *1 MPI process per node*

**likwid-mpirun –np 2 -pin N:0-11 ./a.out**



**OMP\_NUM\_THREADS=12 mpirun –ppn 1 –np 2 –env KMP\_AFFINITY scatter ./a.out**

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Intel MPI+compiler:

Hybrid Programming – MPI+X → Programming models → MPI + OpenMP → Topology and affinity on multicore

### likwid-mpirun: *1 MPI process per socket*

**likwid-mpirun –np 4 –pin S0:0-5\_S1:0-5 ./a.out**



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X → Programming models → MPI + OpenMP → Topology and affinity on multicore

### MPI/OpenMP affinity: Take-home messages

- **EXECT** Learn how to take control of hybrid execution!
	- Almost all performance features depend on topology and thread placement! (especially if SMT/Hyperthreading is on)
- Always observe the topology dependence of
	- Intranode MPI performance
	- OpenMP overheads
	- Saturation effects / scalability behavior with bandwidth-bound code
- Enforce proper thread/process to core binding, using appropriate tools (whatever you use, but use SOMETHING)
- **Memory page placement on ccNUMA nodes** 
	- Automatic optimal page placement for one (or more) MPI processes per ccNUMA domain (solution A)
	- Explicitly parallel first-touch initialization only required for multi-domain MPI processes (solution B)

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Topology and affinity on multicore

 $\blacksquare$ 

# **Programming models - MPI + OpenMP**

### **Hands-On #2** General considerations

## **Pinning**

How to compile, link, and run Hands-on: Hello hybrid! System topology, ccNUMA, and memory bandwidth Memory placement on ccNUMA systems Topology and affinity on multicore **> Hands-on: Pinning** Case study: The Multi-Zone NAS Parallel Benchmarks Hands-on: Masteronly hybrid Jacobi Overlapping communication and computation Communication overlap with OpenMP taskloops Hands-on: Taskloop-based hybrid Jacobi Main advantages, disadvantages, conclusions

## **Programming models - MPI + OpenMP**

### **Case study: The Multi-Zone NAS Parallel Benchmarks**

General considerations How to compile, link, and run Hands-on: Hello hybrid! System topology, ccNUMA, and memory bandwidth Memory placement on ccNUMA systems Topology and affinity on multicore Hands-on: Pinning **> Case study: The Multi-Zone NAS Parallel Benchmarks** Hands-on: Masteronly hybrid Jacobi Overlapping communication and computation Communication overlap with OpenMP taskloops Hands-on: Taskloop-based hybrid Jacobi Main advantages, disadvantages, conclusions

#### Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

#### **Courtesy of Gabriele Jost**

# Load Balancing with hybrid programming

- On same or different level of parallelism
- OpenMP enables
	- cheap *dynamic* and *guided* load-balancing
	- via a parallelization option (clause on omp for / do directive)
	- without additional software effort
	- without explicit data movement
- On MPI level

```
#pragma omp parallel for schedule(dynamic)
for (i=0; i<n; i++) {
  /* poorly balanced iterations */ …
}
```
- Dynamic load balancing requires moving of parts of the data structure through the network
- Significant runtime overhead
- Complicated software  $\rightarrow$  rarely implemented
- MPI & OpenMP
	- Simple static load balancing on MPI level, I medium-quality, dynamic or quided on OpenMP level  $\bigcap$  cheap implementation

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  General considerations

## The Multi-Zone NAS Parallel Benchmarks





Multi-zone versions of the NAS Parallel Benchmarks LU,SP, and BT

- Two hybrid sample implementations
- Load balance heuristics part of sample codes
- <https://www.nas.nasa.gov/publications/npb.html>

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

#### **Courtesy of Gabriele Jost**

## MPI/OpenMP BT-MZ structure



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

**Courtesy of Gabriele Jost**

# Benchmark Characteristics

- Aggregate sizes:
	- Class D: 1632 x 1216 x 34 grid points
	- Class E:  $4224 \times 3456 \times 92$  grid points
- $\blacksquare$  BT-MZ: (Block tridiagonal simulated CFD application)
	- Alternative Directions Implicit (ADI) method
	- #Zones: 1024 (D), 4096 (E)
	- Size of the zones varies widely:
		- large/small about 20
		- requires multi-level parallelism to achieve a good load-balance
- SP-MZ: (Scalar Pentadiagonal simulated CFD application)
	- #Zones: 1024 (D), 4096 (E)
	- Size of zones identical
		- no load-balancing required

Expectations:

Pure MPI: Loadbalancing problems! Good candidate for MPI+OpenMP

Load-balanced on MPI level: Pure MPI should perform best

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

## NPB-MZ Class E Scalability on Lonestar



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

**Courtesy of Gabriele Jost**

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Case study: MZ NAS PBM

#### 72/239

Ш
#### skipped MPI+OpenMP memory usage of NPB-MZ



Using more OpenMP threads reduces the memory usage substantially, up to five times on Hopper Cray XT5 (eight-core nodes).

Hongzhang Shan, Haoqiang Jin, Karl Fuerlinger, Alice Koniges, Nicholas J. Wright: *Analyzing the Effect of Different Programming Models Upon Performance and Memory Usage on Cray XT5 Platforms*. Proceedings, CUG 2010, Edinburgh, GB, May 24-27, 2010.

> **Slide, courtesy of Alice Koniges NERSC, LBLN**

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Case study: MZ NAS PBM

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

# **Programming models - MPI + OpenMP**

#### **Hands-On #3** General considerations

#### **Masteronly hybrid Jacobi**

- How to compile, link, and run Hands-on: Hello hybrid! System topology, ccNUMA, and memory bandwidth Memory placement on ccNUMA systems Topology and affinity on multicore Hands-on: Pinning Case study: The Multi-Zone NAS Parallel Benchmarks **> Hands-on: Masteronly hybrid Jacobi** Overlapping communication and computation Communication overlap with OpenMP taskloops Hands-on: Taskloop-based hybrid Jacobi
	- Main advantages, disadvantages, conclusions

#### Example: MPI+OpenMP-Hybrid Jacobi solver

#### **Source code: See<http://tiny.cc/MPIX-VSC>**

VSC LRZ

- This is a Jacobi solver (2D stencil code) with domain decomposition and halo exchange
- $\blacksquare$  The given code is MPI-only. You can build it with make (take a look at the  $\text{Makefile}$ ) and run it with something like this (adapt to local requirements):

\$ <*mpirun-or-whatever*> -np <*numprocs*> ./jacobi.exe < input

Task: parallelize it with OpenMP to get a hybrid MPI+OpenMP code, and run it effectively on the given hardware.

- Notes:
	- The code is strongly memory bound at the problem size set in the input file
	- Learn how to take control of affinity with MPI and especially with MPI+OpenMP
	- Always run multiple times and observe performance variations
	- If you know how, try to calculate the maximum possible performance and use it as a "light speed" baseline

<http://tiny.cc/MPIX-VSC>

<http://tiny.cc/MPIX-LRZ> alternative for

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Example / Exercise

## Example cont'd

- Tasks (we assume  $N_c$  cores per CPU socket):
	- **Run the MPI-only code on one node with 1,...,**  $N_c$ **, ...,**  $2*N_c$  **processes (1 full node) and observe the** achieved performance behavior
	- Parallelize appropriate loops with OpenMP
	- **Run with OpenMP and 1 MPI process ("OpenMP-only") on 1,...,** $N_c$ **,...,2\*** $N_c$  **cores,** compare with MPI-only run
	- Run hybrid variants with different MPI vs. OpenMP ratios
- **Things to observe** 
	- Run-to-run performance variations
	- Does the OpenMP/hybrid code perform as well as the MPI code? If it doesn't, fix it!



<http://tiny.cc/MPIX-VSC>

<http://tiny.cc/MPIX-LRZ> alternative for

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Example / Exercise

# **Programming models - MPI + OpenMP**

#### **Overlapping Communication and Computation**

General considerations How to compile, link, and run Hands-on: Hello hybrid! System topology, ccNUMA, and memory bandwidth Memory placement on ccNUMA systems Topology and affinity on multicore Hands-on: Pinning Case study: The Multi-Zone NAS Parallel Benchmarks Hands-on: Masteronly hybrid Jacobi **> Overlapping communication and computation** Communication overlap with OpenMP taskloops Hands-on: Taskloop-based hybrid Jacobi Main advantages, disadvantages, conclusions

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Overlapping comm. & comp.

# Sleeping threads with masteronly style



- Problem:
	- Sleeping threads are wasting CPU time
- Solution:
	- Overlapping of computation and communication
- **Limited benefit:** 
	- Best case: reduces communication overhead from 50% to 0%  $\rightarrow$  speedup of 2x
	- Usual case of 20% to 0%
		- $\rightarrow$  speedup of 1.25x
	- Requires significant work  $\rightarrow$  later

 $\blacksquare$ 

#### Nonblocking vs. threading for overlapped comm.

- Why not use nonblocking calls?
	- Asynchronous progress not guaranteed
	- Options (implementation dependent):
		- Communication offload to NIC
		- Additional internal progress thread (MPI\_ASYNC… with MPICH)
	- Intranode and internode communication may be handled very differently
- Using threading for communication overlap
	- One or more threads/tasks handles communication, rest of team "do the work"
	- How to organize the work sharing among all threads?
		- Non-communicating threads
		- Communicating threads after communication is over
	- Not all of the work can usually be overlapped  $\rightarrow$  see next slide

## Using threading/tasking for comm. overlap



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

 $\overline{\phantom{a}}$ 

#### Explicit overlapping of communication and computation

#### The basic principle appears simple:



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Overlapping comm. & comp.

 $\blacksquare$ 

### Overlapping communication with computation

Three problems:

- Application problem: separate application into
	- code that can run before the halo data is received
	- code that needs halo data
	- **May be hard to do**
- Thread-rank problem: distinguish comm. / comp. via thread ID
	- Work sharing and load balancing is harder
	- Options
		- Fully manual work distribution
		- Nested parallelism
		- Tasking & taskloops
		- Partitioned comm (MPI-4.0)
- Optimal memory placement on ccNUMA may be difficult

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Overlapping comm. & comp.

**}**

**}**

error-prone & clumsy

error-prone & clumsy

```
if (my_thread_ID < 1) {
  MPI_Send/Recv
} else {
  my_thread_range=(high-low-1)/(num_threads-1)+1;
  my_thread_low=low+(my_thread_ID-1)*my_thread_range;
  my_thread_high=low+(my_thread_ID-1+1)
               *my_thread_range;
  my_thread_high=min(high, my_thread_high);
  for (i=my_thread_low; i<my_thread_high; i++) {
            ...
```
П

82/239

### Example: sparse matrix-vector multiply (spMVM)



G. Schubert, H. Fehske, G. Hager, and G. Wellein: *Hybrid-parallel sparse matrix-vector multiplication with explicit communication overlap on current multicore-based systems.* Parallel Processing Letters **21**(3), 339-358 (2011). [DOI: 10.1142/S0129626411000254](http://dx.doi.org/10.1142/S0129626411000254)

may be much improved!

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Overlapping comm. & comp.

П

# **Programming models - MPI + OpenMP**

# **Communication overlap with OpenMP taskloops**

General considerations How to compile, link, and run Hands-on: Hello hybrid! System topology, ccNUMA, and memory bandwidth Memory placement on ccNUMA systems Topology and affinity on multicore Hands-on: Pinning Case study: The Multi-Zone NAS Parallel Benchmarks Hands-on: Masteronly hybrid Jacobi Overlapping communication and computation **> Communication overlap with OpenMP taskloops** Hands-on: Taskloop-based hybrid Jacobi

Main advantages, disadvantages, conclusions

## OpenMP **taskloop** Directive – Syntax

- Immediately following loop executed in several tasks
- **Not a work-sharing directive!** 
	- Should be executed only by one thread!

■ Fortran:

```
!$OMP taskloop [ clause [ [ , ] clause ] ... ]
```
*do\_loop* [ **!\$OMP end taskloop** [ **nowait** ] ] A task can be run by any thread, across NUMA nodes **→ <sup>8</sup>** perfect first touch impossible!

Loop iterations must be independent, i.e., they can be executed in parallel

If used, the **end do** directive must appear immediately after the end of the loop

 $\blacksquare$   $C/C++$ :

#### **#pragma omp taskloop** [ *clause* [ [ , ] *clause* ] ... ] *new-line for-loop*

The corresponding *for-loop* must have canonical shape  $\rightarrow$  next slide

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Overlapping comm. & comp.

## OpenMP **taskloop** Directive – Details



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Overlapping comm. & comp.

 $\mathbb{R}^n$ 

### OpenMP **single** & **taskloop** Directives



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Overlapping comm. & comp. 87/239



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Overlapping comm. & comp.

#### Comm. overlap with task & taskloop Directives – C/C++





–

Extensions in OpenMP-4.0 and 4.5 [07]

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Overlapping comm. & comp.

 $\Box$ 

Tasking example: dense matrix-vector multiply with skipped communication overlap

Data distribution across processes:



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Overlapping comm. & comp.

Dense matrix-vector multiply with communication overlap via tasking

#### Computation/communication scheme:

Skipped



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Overlapping comm. & comp.

 $\mathbf{u}$ 

#### Dense matrix-vector multiply with communication overlap via Skipped tasking

```
#pragma omp parallel
  {
    int tid = omp_get_thread_num();
    int n_start=rank*my_size+min(rest,rank), cur_size=my_size;
    // loop over RHS ring shifts
    for(int rot=0; rot<ranks; rot++) {
      #pragma omp single
      {
        if(rot!=ranks-1) {
          #pragma omp task
          {
            MPI_Isend(buf[0], …, r_neighbor, …, &request[0]);
            MPI_Irecv(buf[1], …, l_neighbor, …, &request[1]);
            MPI_Waitall(2, request, status);
          }
        }
        for(int row=0; row<my_size; row+=4) {
          #pragma omp task
            do_local_mvm_block(a, y, buf, row, n_start, cur_size, n);
        }
      }
      #pragma omp single
        tmpbuf = buf[1]; buf[1] = buf[0]; buf[0] = tmpbuf;n_start += cur_size;
      if(n_start>=size) n_start=0; // wrap around
      cur_size = size_of_rank(l_neighbor,ranks,size);
    }
  }
                                                                       Asynchronous
                                                                       communication
                                                                       (ring shift)
                                                                           Current block of MVM
                                                                           (chunked by 4 rows)
```
Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Overlapping comm. & comp.

 $\blacksquare$ 

## Partitioned Point-to-Point Communication

#### $\blacksquare$  New in MPI-4.0:

*Partitioned communication is "partitioned" because it allows for multiple contributions of data to be made, potentially, from multiple actors (e.g., threads or tasks) in an MPI process to a single communication operation.*

- A point-to-point operation (i.e., send or receive)
	- can be split into partitions,
	- and each partition is filled and then "sent" with **MPI** Pready by a thread;
	- same for receiving

#### Technically provided as a new form of persistent communication.

# **Programming models - MPI + OpenMP**

### Hands-On #4

### **Taskloop-based hybrid Jacobi**

How to compile, link, and run Hands-on: Hello hybrid! System topology, ccNUMA, and memory bandwidth Memory placement on ccNUMA systems Topology and affinity on multicore Hands-on: Pinning Case study: The Multi-Zone NAS Parallel Benchmarks Hands-on: Masteronly hybrid Jacobi Overlapping communication and computation Communication overlap with OpenMP taskloops **> Hands-on: Taskloop-based hybrid Jacobi** Main advantages, disadvantages, conclusions

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

# **Programming models - MPI + OpenMP**

# Main advantages, disadvantages, conclusions

General considerations How to compile, link, and run Hands-on: Hello hybrid! System topology, ccNUMA, and memory bandwidth Memory placement on ccNUMA systems Topology and affinity on multicore Hands-on: Pinning Case study: The Multi-Zone NAS Parallel Benchmarks Hands-on: Masteronly hybrid Jacobi Overlapping communication and computation Communication overlap with OpenMP taskloops Hands-on: Taskloop-based hybrid Jacobi

**> Main advantages, disadvantages, conclusions**

## MPI+OpenMP: Main advantages

- **Increase parallelism** 
	- Scaling to higher number of cores
	- Adding OpenMP with incremental additional parallelization
- Lower memory requirements due to smaller number of MPI processes
	- Reduced amount of application halos & replicated data
	- Reduced size of MPI internal buffer space
	- Very important on systems with many cores per node
- **Lower communication overhead (possibly)** 
	- Few multithreaded MPI processes vs many single-threaded processes
	- Fewer number of calls and smaller amount of data communicated
	- Topology problems from pure MPI are solved (was application topology versus multilevel hardware topology)
- **Provide for flexible load-balancing on coarse and fine levels** 
	- Smaller #of MPI processes leave room for assigning workload more evenly
	- MPI processes with higher workload could employ more threads

#### Additional advantages when overlapping communication and computation:

■ No sleeping threads

#### MPI+OpenMP: Main disadvantages & challenges

#### Non-Uniform Memory Access:

- Not all memory access is equal: ccNUMA locality effects
- Penalties for access across NUMA domain boundaries
- First touch is needed for *more than one NUMA domain per MPI process*
- Alternative solution: *One MPI process on each NUMA domain (i.e., chip)*
- **Multicore / multisocket anisotropy effects** 
	- **Bandwidth bottlenecks, shared caches**
	- Intra-node MPI performance: Core  $\leftrightarrow$  core vs. socket  $\leftrightarrow$  socket
	- OpenMP loop overhead
- **Amdahl's law on both, MPI and OpenMP level**
- Complex thread and process pinning

Masteronly style (i.e., MPI outside of parallel regions)

**Sleeping threads** 

Additional disadvantages when overlapping communication and computation:

- **High programming overhead**
- **OpenMP** is only partially prepared for this programming style  $\rightarrow$  taskloop directive

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + OpenMP  $\rightarrow$  Conclusions 97/239

## Questions addressed in this tutorial

What is the performance impact of system topology?

It's massive

affinity

Problem dependent

touch placement

How do I map my programming model on the system to my advantage?

 $\blacksquare$  How do I do the split into MPI $+X$ ?

Where do my processes/threads run? How do I take control?

 Where is my data? Process/thread ccNUMA first-

**How can I minimize communication overhead?** 

- How does hybrid programming help with typical HPC problems?
	- Can it reduce communication overhead?
	- Can it reduce replicated data?
- How can I leverage multiple accelerators?
	- What are typical challenges?

# **Programming models - MPI + Accelerator**



**Parts Courtesy of Gabriele Jost**

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X → Programming models → MPI + Accelerator 99/239

#### Accelerator programming: Bottlenecks reloaded

Example: 2-socket Intel "Ice Lake" (2x36 cores) node with two NVIDIA A100 GPGPUs (PCIe 4)





 $\rightarrow$  Speedups can only be attained if communication overheads are under control

 $\rightarrow$  Basic estimates help

<span id="page-99-0"></span>Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + Accelerator  $\rightarrow$  General considerations 100 multiple 100/239

#### Accelerator + MPI: How does the data get from A to B?



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + Accelerator  $\rightarrow$  General considerations

## Questions to ask

- Is the MPI implementation CUDA aware?
	- Yes: Can use device pointers in MPI calls
	- No: Explicit DtoH/HtoD buffer transfers required
	- Copying to consecutive halo buffers may still be necessary
- $\blacksquare$  Is NVLink available?
	- Yes: Direct GPU-GPU MPI communication with MPI
		- Supported by: P100, V100, A100, H100
	- No: copies via host (even with NVIDIA GPUDirect)
- Unified Memory or explicit DtoH/HtoD transfers?
	- UM: Transparent sharing of host and device memory
- Actual bandwidths and latencies?
	- Highly system and implementation dependent!

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + Accelerator  $\rightarrow$  General considerations 102/239

See also.

## Never forget: hardware is not enough

- SpMV on NVIDIA A100:
	- Different data formats and libraries
	- 2800 matrices (SuiteSparse Matrix Collection)
- Optimal matrix storage format is highly matrix and system dependent!



H. Anzt, et al; *2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)*, DOI: [10.1109/PMBS51919.2020.00009.](https://dx.doi.org/10.1109/PMBS51919.2020.00009)

ш

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + Accelerator  $\rightarrow$  General considerations 103/239

## Options for hybrid accelerator programming



Which model/combination is the best???

 $\rightarrow$  the one that allows you to address the relevant hardware bottleneck(s)

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + Accelerator  $\rightarrow$  General considerations 104/239

**Programming models - MPI + Accelerator**

**OpenACC**

**General considerations > OpenACC Advantages & main challenges**

<span id="page-104-0"></span>Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + Accelerator  $\rightarrow$  OpenACC

## What is OpenACC?

- API that supports offloading of loops and regions of code (e.g. loops) from a host CPU to an attached accelerator in C, C++, and Fortran
- Managed by a nonprofit corporation formed by a group of companies:
	- CAPS Enterprise, Cray Inc., PGI and NVIDIA
- Set of compiler directives, runtime routines, and environment variables
- Simple programming model for using accelerators (focus on GPGPUs)
- **Memory model:** 
	- Host CPU + Device may have completely separate memory; Data movement between host and device performed by host via runtime calls; Memory on device may not support memory coherence between execution units or need to be supported by explicit barrier
- Execution model:
	- Compute intensive code regions offloaded to the device, executed as kernels ; Host orchestrates data movement, initiates computation, waits for completion; Support for multiple levels of parallelism, including SIMD (gangs, workers, vector)

**The State** 

#### A very simple OpenACC example (PGI 14.10): Vector Triad

```
int main ()
       {
        double a[N], b[N], c[N], d[N];
       ...
      #pragma acc data \
        copyin(b[0:N],c[0:N],d[0:N])
       #pragma acc data copyout (a[0:N])
        compute(a ,b , c ,d ,N);
       ...
       }
data 
mgmt
```

```
pgcc −ta=nvidia , cc35 −Minfo −fast −c triad.c
compute:
9 , Generating present or copyout (a [ :N])
Generating present or copyin (b [ :N])
Generating present or copyin (c [ :N])
Generating present or copyin (d [ :N])
Generating Tesla code
10 , Loop is parallelizable
Accelerator kernel generated
10 , #pragma acc loop gang , vector (1024)...
```

```
void compute (double *restrict a , double *b,...) {
        #pragma acc kernels
        #pragma acc loop vector (1024)
          for(int i=0; i<N ; ++i) {
            a[i] = b[i] + c[i] * d[i];}
        }
execution
```
Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X → Programming models → MPI + Accelerator → OpenACC 107/239

 $\overline{\phantom{a}}$ 

### Example: 2D Jacobi smoother

```
#pragma acc data copy(phi1[0:sizex*sizey],phi2[0:sizex*sizey])
{ 
 for(n=0; n<iter; n++) {
#pragma acc kernels
#pragma acc loop independent private(ofs)
    for(int i=1; i<sizex-1; ++i) {
      ofs = i*sizey;
#pragma acc loop independent
      for(int j=1; j<sizey-1; ++j) {
          phi1[ofs+j] = oos * (phi2[ofs+j-1] + 
                                phi2[ofs+j+1] +
                                phi2[ofs+j-sizey] + 
                                phi2[ofs+j+sizey]);
      }
    }
  }
  swap(phi1,phi2); 
}
```
Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X → Programming models → MPI + Accelerator → OpenACC 108/239
#### Example: Sparse MVM (std. CSR format) skipped

```
#pragma acc parallel present(val[0:numNonZeros], \
  colInd[0:numNonZeros], \
  rowPtr[0:numRows+1], \
  x[0:numRows], \
  y[0:numRows]) \
  loop
for (int rowID=0; rowID<numRows; ++rowID) { 
  double tmp = y[rowID];// loop over all elements in row 
  for (int rowEntry=rowPtr[rowID]; 
       rowEntry<rowPtr[rowID+1]; 
       ++rowEntry) {
    tmp += val[rowEntry] * x[ colInd[rowEntry] ]; 
  } 
  y[rowID] = tmp;}
```


Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X → Programming models → MPI + Accelerator → OpenACC 109/239

#### Example: Sparse MVM (SELL-C-σ format) skipped

```
#pragma acc parallel present(val[0 : capacity],colInd[0 : capacity],\
   chunkPtr[0 : numberOfChunks], chunkLength[0 : numberOfChunks], \
   x[0 : paddedRows],y[0 : paddedRows]) vector_length(chunkSize) loop 
// loop over all chunks 
for (int chunk=0; chunk < numberOfChunks; ++chunk) { 
 int chunkOffset = chunkPtr[chunk]; 
 int rowOffset = chunk*chunkSize; 
 #pragma acc loop vector 
 for (int chunkRow=0; chunkRow<chunkSize; ++chunkRow) {
   int globalRow = rowOffset + chunkRow; 
   // fill tempory vector with values from y 
   double tmp = y[globalRow]; 
   // loop over all row elements in chunk 
   for (int rowEntry=0; 
         rowEntry<chunkLength[chunk]; 
         ++rowEntry) { 
      tmp += val [chunkOffset + rowEntry*chunkSize + chunkRow] 
           * x[colInd[chunkOffset + rowEntry*chunkSize + chunkRow] ]; 
    } 
   // write back result of y = alpha Ax + beta y 
   y[globalRow] = tmp; 
  } 
}
```


M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop: *A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide SIMD units.* SIAM Journal on Scientific Computing **36**(5), C401–C423 (2014). [DOI: 10.1137/130930352](http://dx.doi.org/10.1137/130930352)

#### Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X → Programming models → MPI + Accelerator → OpenACC 1101/239

#### Example: Sparse MVM CRS vs. SELL-128-8192 on Kepler K20



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X → Programming models → MPI + Accelerator → OpenACC 1110 1110 1110 1111 1239

### MPI+Accelerators: Main advantages

- **Hybrid MPI/OpenMP and MPI/OpenACC can leverage accelerators and** yield performance increase over pure MPI on multicore
- Compiler/pragma-based API provides relatively easy way to use coprocessors
- OpenACC targeted toward GPU-type coprocessors
- OpenMP 4.0/4.5 extensions provide flexibility to use a wide range of heterogeneous coprocessors (GPU, APU, heterogeneous many-core types)

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X → Programming models → MPI + Accelerator → Conclusions 112/239

### MPI+Accelerators: Main challenges

- Considerable implementation effort for basic usage, depending on complexity of the application
- Efficient usage of pragmas requires good understanding of performance issues
	- Performance is not only about code; data structures can be decisive as well
- Support for accelerator pragmas still restricted to certain environments
	- NVIDIA GPUs have best support

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien) Hybrid Programming – MPI+X → Programming models → MPI + Accelerator → Conclusions 113/239

## Questions addressed in this tutorial

- What is the performance impact of system topology?
- How do I map my programming model on the system to my advantage?
	- $\blacksquare$  How do I do the split into MPI $+X$ ?
	- Where do my processes/threads run? How do I take control?
	- Where is my data?
	- How can I minimize communication overhead?
- How does hybrid programming help with typical HPC problems?
	- Can it reduce communication overhead?
	- Can it reduce replicated data?
- How can I leverage multiple accelerators?
	- What are typical challenges? -

Data structures are decisive, inter-device communication support varies

## **Programming models - MPI + MPI-3 shared memory**



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory 115/239

## Hybrid MPI + MPI-3 shared memory

#### **What is it?**

- Addon to pure message passing
- MPI processes can share memory segments within a node

#### **Use cases/advantages**

- A: Reducing replicated data  $\rightarrow$  Reduced memory requirements
- B: Reducing intra-node message passing  $\rightarrow$  Reduced intra-node communication time



<span id="page-115-0"></span>Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  General considerations

### Hybrid MPI + MPI-3 shared memory

#### **Further advantages**

- Using only one parallel programming model
- No OpenMP problems (e.g., thread-safety isn't an issue)
- Major Problems
	- Communicator must be split into shared memory islands
	- No increase in exploitable parallelism
	- None of the "automatic" advantages of MPI+OpenMP
		- Exploiting advantages requires programming effort -

See MPI+OpenMP summary

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  General considerations

### Use case A: Reducing memory requirements



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

∟പ്പ, Georg பage: സ്ഥ്രക്ഷവ, Ciaudia Biaas-Scrienner (vSC, TO Wien)<br>Hybrid Programming – MPI+X → Programming models → MPI + MPI-3.0 shared memory → General considerations

#### Use case B: Reducing intra-node message passing



- MPI on each core (not hybrid)
	- Halos between all cores
	- MPI uses internally shared memory and cluster communication protocols
- MPI+OpenMP
	- Multi-threaded MPI processes
	- **Halos communication only between MPI processes**
- MPI cluster communication + MPI shared memory communication
	- Same as "MPI on each core", but
	- within the shared memory nodes, halo communication through direct copying with C or Fortran statements
- MPI cluster comm. + MPI shared memory access
	- Similar to "MPI+OpenMP", but
	- shared memory programming through work-sharing between the MPI processes within each SMP node

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  General considerations

## **Programming models - MPI + MPI-3.0 shared memory**

#### **Re-cap**

- **MPI\_Comm\_split**
- **One-sided communication**

General considerations & uses cases

**> Re-cap: MPI\_Comm\_split & one-sided communication** How-to Exercise: MPI\_Bcast Quiz 1 MPI memory models & synchronization Shared memory problems Advantages & disadvantages, conclusions Quiz 2

<span id="page-119-0"></span>Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Re-cap

#### New sub-communicators with MPI\_Comm\_split

- New sub-communicators via MPI\_Comm\_split
	- Each process must specify a color
	- Processes with same color are put together in new sub-communicators





Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Re-cap  $\rightarrow$  MPI\_Comm\_split

ш

#### Example: MPI\_Comm\_split()



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Re-cap  $\rightarrow$  MPI\_Comm\_split

### Re-cap: One-sided Communication

- Communication parameters for both the sender and receiver are specified by one process (origin)
- User must impose correct ordering of memory accesses



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Re-cap  $\rightarrow$  One-sided communication

о

#### Typically, all processes are both, origin and target processes



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Re-cap  $\rightarrow$  One-sided communication 124/239

## One-sided Operations

#### Three major sets of routines:

- **Window creation or allocation** 
	- Each process in a group of processes (defined by a communicator)
	- defines a chunk of own memory named *window*,
	- which can be afterwards accessed by all other processes of the group.
- Remote Memory Access (RMA, nonblocking) routines
	- Access to remote windows: put, get, accumulate, …
- **Synchronization** 
	- The RMA routines are nonblocking and
	- must be surrounded by synchronization routines, which guarantee
		- that the RMA is locally and remotely finished
		- and that all necessary cache operation are implicitly done

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

 $\frac{1}{\frac{1}{2}}$  Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Re-cap  $\rightarrow$  One-sided communication 125/239

**Shared memory:** direct **loads** and **stores** instead of MPI\_Put/Get

### Sequence of One-sided Operations



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Re-cap  $\rightarrow$  One-sided communication

# Synchronization Calls (1)

- Active target communication
	- communication paradigm similar to message passing model
	- target process participates only in the synchronization
	- fence or post-start-complete-wait

MPI\_Win\_fence is like a barrier

- Passive target communication
	- communication paradigm closer to shared memory model
	- only the origin process is involved in the communication
	- **-** lock/unlock





Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Re-cap  $\rightarrow$  One-sided communication 127/239

п

## **Programming models - MPI + MPI-3.0 shared memory**

#### **How-to**

General considerations & uses cases Re-cap: MPI\_Comm\_split & one-sided communication **> How-to** Exercise: MPI\_Bcast Quiz 1 MPI memory models & synchronization Shared memory problems Advantages & disadvantages, conclusions Quiz 2

<span id="page-127-0"></span>Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Re-cap

# MPI shared memory

- Split main communicator into shared memory islands (automatically)
	- **MPI\_Comm\_split\_type**
- **-** Define a shared memory window on each island
	- **MPI\_Win\_allocate\_shared**
	- Result (by default): contiguous array, directly accessible by all processes of the island
- Accesses and synchronization
	- This is normal memory: Language-based expressions and assignments
	- **MPI\_PUT/GET** still allowed, but this is not the spirit!
	- Normal MPI one-sided synchronization, e.g., **MPI\_WIN\_FENCE**
- Caution:
	- Memory may be already completely pinned to the physical memory of the process with rank 0, i.e., the first touch rule (as in OpenMP) does **not** apply!

(First touch rule: a memory page is pinned to the physical memory of the processor that first writes a byte into the page)

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  How-to 129/239

# Splitting & shared memory allocation



a null pointer instead of pointing to next process' base.

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

M<sub>2</sub> This mapping is based on the ranking in comm\_all.

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  How-to

ш

#### Shared-memory allocation in Fortran uses C pointer!

#### New in MPI-3.0 **In all three Fortran support methods**

float \*buf; MPI\_Win win; int max\_length; max\_length =  $\dots$  /\* = array size in elements \*/; **MPI\_Win\_allocate\_shared**( (MPI\_Aint)(max\_length\*sizeof(float)), sizeof(float), MPI\_INFO\_NULL, comm\_shm, &buf, &win); // the window elements are buf[0]  $\ldots$  buf[max length-1]



### Within each shared-memory island: essentials



- The allocated shared memory is contiguous across process ranks,
- i.e., the first byte of rank i starts right after the last byte of rank i-1.
- Processes can calculate remote addresses' offsets with local information
- Remote accesses through load/store operations,
	- i.e., without MPI RMA operations (MPI\_Get/Put, …)
- Caution:

Although each process in comm\_sm accesses the same physical memory, the virtual start address of the whole array may be different in all processes!

#### $\rightarrow$  **linked lists** only with offsets in a shared array, but **not with binary pointer addresses!**

Following slides show only the shared memory accesses, i.e., communication between the SMP nodes is not presented.

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  How-to 1320 and 1320 and

## Splitting into smaller shared memory islands

e.g., splitting into NUMA nodes or sockets



• Subsets of shared memory nodes, e.g., one comm\_sm on each socket with size\_sm cores **(requires also sequential ranks in comm\_all for each socket!)**

MPI\_Comm\_split\_type (comm\_all, MPI\_COMM\_TYPE\_SHARED, 0, MPI\_INFO\_NULL, &*comm\_sm\_large*); MPI\_Comm\_rank (comm\_sm\_large, *&my\_rank\_sm\_large*); MPI\_Comm\_size (comm\_sm\_large, *&size\_sm\_large*); MPI\_Comm\_split (comm\_sm\_large, /\*color\*/ my\_rank\_sm\_large / size\_sm, 0, *&comm\_sm*); MPI Win allocate shared  $(..., comm sm, ...)$ ; or (size sm large /number of sockets) Here 1 or 2

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  How-to

# Splitting into smaller shared memory islands

- Most MPI libraries have an non-standardized method to split a communicator into NUMA nodes (e.g., sockets):
	- see also Current support for split types [in MPI implementations](https://github.com/mpiwg-hw-topology/hw-topology-issues/wiki/Current-support-for-split-types-in-MPI-implementations-or-MPI-based-libraries) or MPI based libraries
	- **OpenMPI:** choose split type as OMPI\_COMM\_TYPE\_NUMA
	- **HPE**: MPI\_Info\_create (&info); MPI\_Info\_set(info, "shmem\_topo", "numa"); // or "socket" MPI\_Comm\_split\_type(comm\_all, MPI\_COMM\_TYPE\_SHARED, 0, info, &*comm\_sm*); May not work with Intel-MPI
	- **mpich:** split\_type=MPIX\_COMM\_TYPE\_NEIGHBORHOOD, info\_key= "SHMEM\_INFO\_KEY" and value= "machine", "socket", "package", **"numa"**, "core", "hwthread", "pu", "l1cache", ..., or "l5cache"
- Two additional standardized split types: <u>△ New in MPI-4.0</u> May be fixed in MPI-4.1
	- MPI\_COMM\_TYPE\_HW\_GUIDED Drawback: no standardized key values
	- MPI COMM TYPE HW UNGUIDED -Drawback:
- See also Exercise 3.

■ two splits are needed • 1st with MPI\_COMM\_TYPE\_SHARED • 2<sup>nd</sup> with MPI\_COMM\_TYPE\_HW\_UNGUIDED

**Peroblematic if number of NUMA domains is not** identical in all shared memory islands of 1<sup>st</sup> split

### Shared memory access example



MPI course  $\rightarrow$  Chap.11-(1) Shared Memory One-sided Communication



## Alternative: Non-contiguous shared memory

- Using info key "alloc\_shared\_noncontig"
- MPI library can put processes' window portions
	- **into the local NUMA memory domain** 
		- (internally, e.g., each window portion is one OS shared memory segment)
	- **on page boundaries,** 
		- (internally, e.g., only one OS shared memory segment with some unused padding zones)
- **Pros:** Faster local data accesses especially on ccNUMA nodes
- **Cons:** Higher programming effort for neighbor accesses: MPI\_WIN\_SHARED\_QUERY



Grid Size (N)

NUMA effects? Significant impact of

Further reading:

Torsten Hoefler, James Dinan, Darius Buntinas, Pavan Balaji, Brian Barrett, Ron Brightwell, William Gropp, Vivek Kale, Rajeev Thakur:

**MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory.**

[http://link.springer.com/content/pdf/10.1](http://link.springer.com/content/pdf/10.1007/s00607-013-0324-2.pdf) 007%2Fs00607-013-0324-2.pdf

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

#### Skipped Non-contiguous shared memory allocation



```
MPI_Aint /*IN*/ local_window_count; double /*OUT*/ *base_ptr;
disp unit = sizeof(double); /* shared memory should contain doubles */MPI_Info_info_noncontig;
MPI Info create (&info_noncontig);
MPI Info set (info noncontig, "alloc shared noncontig", "true");
MPI_Win_allocate_shared ((MPI_Aint) local_window_count*disp_unit, disp_unit, info_noncontig,
                          comm_sm, &base_ptr, &win_sm );
```
Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  How-to 138/239

#### Neighbor access through MPI\_WIN\_SHARED\_QUERY

- Each process can retrieve each neighbor's base\_ptr with calls to MPI\_WIN\_SHARED\_QUERY
- Example: only pointers to the window memory of the left & right neighbor

If only one process allocates the whole window  $\rightarrow$  to get the base\_ptr, all processes call MPI\_WIN\_SHARED\_QUERY

#### local call

if (my rank  $sm > 0$ ) **MPI\_Win\_shared\_query** (win sm, my rank sm  $- 1$ , &*win\_size\_left*, &*disp\_unit\_left*, &*base\_ptr\_left*);

if (my\_rank\_sm < size\_sm-1) **MPI\_Win\_shared\_query** (win\_sm, my\_rank\_sm **+ 1**, &*win\_size\_right*, &*disp\_unit\_right*, &*base\_ptr\_right*);

MPI Win fence (0, win sm); /\* local stores are finished, remote load epoch can start  $*/$ if (my\_rank\_sm > 0) printf("left neighbor's rightmost value = %lf \n", **base ptr left[** win size left/disp unit left  $-1$  **]** ); if (my\_rank\_sm < size\_sm-1) printf("right neighbor's leftmost value = %lf \n",

**base\_ptr\_right[ 0 ]** );

Thanks to Steffen Weise (TU Freiberg) for testing and correcting the example codes.

**base\_ptr\_left base\_ptr\_right**

…

#### Whole shared memory allocation by rank 0 in comm\_sm



#### Caution: On some systems

- the number of shared memory windows, and
- the total size of shared memory windows

may be limited.

Some OS systems may provide options,

- e.g., at job launch, or
- MPI process start,

to enlarge restricting defaults.

If MPI shared memory support is based on POSIX shared memory:

- Shared memory windows are located in memory-mapped /dev/shm or /run/shm
- Default: 25% or 50% of the physical memory
- Root may change size with: mount -o remount,size=6G /dev/shm
- $\blacksquare$  Maximum of  $\sim$ 2043 windows!

due to default limit of context IDs in mpich

On some systems: No limits.

On a system without virtual memory you have to reserve a chunk of address space when the node is booted (at job script launch).

Thanks to Jeff Hammond and Jed Brown (ANL), Brian W Barrett (SANDIA), and Steffen Weise (TU Freiberg), for input and discussion.

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  How-to

# Questions addressed in this tutorial

- Where we are?
- What is the performance impact of system topology?
- How do I map my programming model on the system to my advantage?
	- $\blacksquare$  How do I do the split into MPI $+X$ ?
	- Where do my processes/threads run? How do I take control?
	- Where is my data?
	- How can I minimize communication overhead?
- How does hybrid programming help with typical HPC problems?
	- Can it reduce communication overhead?
	- Can it reduce replicated data? -MPI-3 shared memory as a real alternative to OpenMP shared memory, especially when OpenMP hard to be used
- How can I leverage multiple accelerators?
	- What are typical challenges?



<span id="page-142-0"></span>Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Exercise: MPI\_Bcast

### Exercise: MPI\_Bcast into shared memory

- **Now illustrated as in the previous slides**
- Each  $\Box$  represents such a replicated memory  $\Box$  within an island



- Application: We'll store numbers 1, 2, … into the green array by process 0
- And then bcast it to all other shared memory islands
- At the end, each process calculates the sum of all numbers within *its* shared memory

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Exercise: MPI\_Bcast


Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Exercise: MPI\_Bcast

 $\blacksquare$ 



4<sup>th</sup> exercise step (~5 lines of code +1 lines printing)

 $\mathcal{L}_{\mathcal{A}}$ 

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X → Programming models → MPI + MPI-3.0 shared memory → Exercise: MPI\_Bcast See login-slides

**1-slide Sol.**

146/239

### Exercise: MPI\_Bcast into shared

#### **Preparation**

**C**

- Directories in your personal account:
	- $\bigcup$  HY- $\bigvee_{1}^{VSC}$ /data-rep/C-data-rep: LRZ
		- data-rep\_base.c
		- data-rep\_exercise.c
		- data-rep\_base\_ $\frac{VSC}{LRZ}$   $\frac{2x16.sh / \frac{4x48}{4x28} sh$  (using 2 and 4 nodes)
		- VSC LRZ
		- data-rep\_solution\_ $\frac{VSC}{LRZ}$  -2x16.sh /  $\frac{4x48}{4x28}$ .sh (again with 2 and 4 nodes)
		- data-rep\_exercise\_orig.c (only for: diff data-rep\_exercise\_orig.c data-rep\_exercise.c )
		- (already together with all solution files)

- data-rep\_base\_30.f90
- data-rep\_exercise\_30.f90
- data-rep .............................sh (ditto., see above)
- data-rep\_exercise\_orig\_30.f90 (only for: diff data-rep\_exercise\_orig\_30.f90 data-rep\_exercise\_30.f90 )
- (already together with all solution files)
- data-rep\_base.c / \_30.f90 is the original MPI program
- data-rep\_exercise.c /  $\frac{30.190}{20.500}$  is the basis for this shared memory exercise

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Exercise: MPI\_Bcast

(using only 2 nodes during the exercise)



(Preparation, 10 Minutes)

- **data-rep\_base.c** / \_30.f90 is the original MPI program: **D**  $\frac{2}{3}$  Oo NOT edit
	- It copies data from the process rank 0 in MPI\_COMM\_WORLD to all processes.
	- On all processes it uses the data: in this example, just the sum is calculated.
	- Compile it and run it:
		- module load intel intel-mpi
		- mpiicc -o data-rep\_base data-rep\_base.c
		- mpiifort -o data-rep\_base data-rep\_base\_30.f90
		- sbatch data-rep\_base\_**VSC**\_2x16.sh (will use 2 nodes with only 16 processes [on 2 CPUs x 8 cores]

per node and 4 nodes with all 2x24 = 48 cores per node) sq sales and the set of the set of

- 
- 

sinfo | grep idle (if you do not have a reservation)

- Output will be written to: **slurm-\*.out**
- Output from only 2 nodes (each with 16 MPI processes):



- 1<sup>st</sup> time step
- output from 3 processes per communicator:
- ranks 0, 1 & last rank

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Exercise: MPI\_Bcast

**Parts of the software, courtesy of Irene Reichl (VSC, TU Wien)**

- data-rep\_exercise.c / \_30.f90 is the skeleton for all steps of this exercise
	- Please edit and change it from step to step!
	- Declare variables comm\_shm, size\_shm, rank\_shm (2 lines of code)
	- Split MPI\_COMM\_WORLD into shared memory island communicators comm\_shm (use key == 0) (1 line of code)
	- Query size\_shm, rank\_shm (2 lines of code)

■ Step 2a:

 After this splitting: print and stop (3 lines of code, copy print statement from end of your source file) /\*TEST\*/ // To minimize the output, we print only from 3 process per SMP node

```
/*TEST*/ if ( rank shm == 0 || rank shm == 1 || rank shm == size shm - 1 )
    printf("\t\t rank ( world: %i, shm: %i)\n", rank world, rank shm);
```
- /\*TEST\*/ if(rank world==0) printf("ALL finalize and return !!!.\n"); MPI Finalize(); return 0;
- Expected output from 2 islands, each with 16 processes:



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

#### ■ Steps 2b-d:

- Declare needed variables (5 LOC)
	- (2b) *if* (rank\_shm == 0) *then* individualShmSize = arrSize *else* individualShmSize = 0 (4 LOC)
	- (2c) MPI\_Win\_allocate\_shared (comm\_shm  $\rightarrow$  win & shm\_base\_ptr (but only if rank\_shm== 0)) (1 LOC)
	- (2d) MPI Win shared query ( win & rank  $0 \rightarrow \text{arr}$ , i.e., the base pointer on all processes); (1 LOC)
- After this splitting: print and stop (3 lines of code)
- **Expected output from 2 islands, each with 16 processes:**

rank ( world: 0/32, shm: 0/16) arrSize 100000000 arrSize\_ 800000000 shm\_buf\_ptr = 0x2b1738903000, arr\_ptr =0x2b1738903000 ALL finalize and return !!!. rank ( world: 16/32, shm:  $0/16$ ) arrSize 100000000 arrSize\_ 80000000 shm\_buf\_ptr = 0x2b2489dfb000, arr\_ptr =0x2b2489 tfb000

rank ( world: 1/32, shm: 1/16) arrSize 100000000 arrSize\_ 800000000 shm\_buf\_ptr = (nil), arr\_ptr = 0x2aef69d3a000 rank ( world: 31/32, shm: 15/16) arrSize 100000000 arrSize\_ 800000000 shm\_buf\_ptr =  $(nil)$ , arr\_ptr = 0x2b4dcb01e000 rank ( world: 15/32, shm: 15/16) arrSize 100000000 arrSize\_ 800000000 shm\_buf\_ptr = (nil), arr\_ptr = 0x2b56e7916000 rank ( world: 17/32, shm: 1/16) arrSize 100000000 arrSize\_ 800000000 shm\_buf\_ptr = (nil), arr\_ptr = 0x2b42516bb000

#### After ~20 Minutes:

- compare with solution: data-rep\_sol\_2d.c / \_30.f90
- In case of problems you may also look atthe solution slide: **Sol.**

Processes with **individualShmSize = 0**, do not get a buffer pointer from MPI\_Win\_allocate\_shared

Output from • 1<sup>st</sup> island • 2<sup>nd</sup> island

> Each process within an island has **different virtual addresses** for the **same** shared memory array

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

#### Steps 2e-f:

- Declare needed variables (3 LOC)
- (2e) *if* (rank\_shm == 0) *then* color=0 *else* color=MPI\_UNDEFINED (2 LOC)
- (2f) MPI Comm split(MPI COMM WORLD, key=0, color  $\rightarrow$  comm head )  $\rightarrow$  rank head (8 LOC) and in all processes with color==MPI\_UNDEFINED  $\rightarrow$  MPI\_COMM\_NULL
- After this splitting: print and stop  $(3 \text{ LOC})$
- Expected output from 2 islands, each with 16 processes:

rank ( world: 1/32, shm: 1/16, head: -1/-1) arrSize 100000000 arrSize\_ 800000000 shm\_buf\_ptr = (nil), arr\_ptr = 0x2abc98db8000 rank ( world: 0/32, shm: 0/16, head: 0/2) arrSize 100000000 arrSize\_ 800000000 shm\_buf\_ptr = 0x2ab…, arr\_ptr = 0x2ab4acc56000 ALL finalize and return !!!

rank ( world: 16/32, shm: 0/16, head: 1/2) arrSize 100000000 arrSize\_ 800000000 shm\_buf\_ptr = 0x2ad…, arr\_ptr = 0x2adbc5fe6000 rank ( world: 15/32, shm:  $15/16$ , head: -1/-1) arr $\frac{1}{2}$  is 100000000 arrSize 800000000 shm\_buf\_ptr = (nil), arr\_ptr = 0x2af4c52e5000 rank ( world: 17/32, shm:  $\hat{N}$ 16, head: -1/-1) arrSize 100000000 arrSize\_800000000 shm\_buf\_ptr = (nil), arr\_ptr = 0x2b702ad9b000 rank ( world: 31/32, shm: 15/16, head: -1/-1) arrSize 100000000 arrSize\_ 800000000 shm\_buf\_ptr = (nil), arr\_ptr = 0x2b6e54bdf000

Finished earlier?

#### After ~10 Minutes:

- compare with solution: data-rep\_sol\_2f.c / \_30.f90
- Incase of problems you may also look at the solution slide: **Sol.**
- **Ninute 1** Whole exercise steps 2a-f: 40 Minutes
	- Online course: please come back to the main room
- Go to **advanced exercise** on **next slide**
	- Advanced exercise on a **copy** of your data-rep\_exercise.c / \_30.f90: Split your shared memory islands into NUMA domains

#### Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

### Advanced Exe: Breaking the world into NUMA islands

- Steps 2a-f: We split MPI\_COMM\_WORLD into ccNNUMA islands, each with 2 CPUs
- Step 2a-f-NUMA:
	- Copy your result or data-rep\_sol\_2f.c / \_30.f90 into data-rep\_exercise**\_NUMA**.c / \_30.f90
	- For this advanced exercise, switch from Intel-MPI to OpenMPI Prepared for VSC only
		- module purge
		- module load openmpi
		- mpicc -o data-rep\_exercise\_openmpi data-rep\_exercise\_NUMA.c
		- mpifort -o data-rep\_exercise\_openmpi data-rep\_exercise\_NUMA\_30.f90
		- sbatch data-rep\_exercise\_VSC\_2x16\_OpenMPI.sh (or only 1x16  $\rightarrow$  splitting into the 2 CPUs)
	- Split MPI\_COMM\_WORLD into NUMA islands  $\rightarrow$  you expect the double amount of comm\_shm
		- Use the non-standardized method for OpenMPI
	- Expected result: 4 shared memory islands, each consisting of the MPI processes running on a CPU



Compare with solution: data-rep\_sol\_2f\_NUMA\_OpenMPI.c / \_30.f90

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

#### Steps 3-6 (6 lines of code)

- **(3-4) Store epoch:** we store the replicated data in all shared memories (don't forget **MPI\_Win\_fence()** within all comm\_shm/win before starting the store epoch for **arr**)
- (3) Process with rank\_world==0 **stores numbers** into its green **arr**
- (4) All processes in comm\_head **MPI\_Bcast()** the data from rank\_head==0 to all others
- **(5) Local load epoch:** each process reads the data and locally **calculates the sum** (don't forget **MPI\_Win\_fence()** within all comm\_shm / win before starting the local load epoch)
- **(6) Print** the results
- Expected output from 2 islands:

it: 0, rank ( world: 0/32, shm: 0/16, head: 0/2 ): sum(i=0...i=99999999) = 4999999950000000 it: 0, rank ( world:  $16/32$ , shm:  $0/16$ , head:  $1/2$  ): sum(i=0...i=999999999) = 49999999950000000 it: 0, rank ( world: 1/32, shm: 1/16, head:  $-1/21$  ): sum( $i=0...i=999999999$ ) = 49999999950000000 it: 0, rank ( world: 17/32, shm: 1/16, head:  $-1/-1$  ): sum( $i=0...i=999999999$ ) = 4999999950000000 it: 0, rank ( world: 31/32, shm: 15/16, head:  $-1/-1$  ): sum(i=0...i=999999999, = 49999999950000000 it: 0, rank ( world: 15/32, shm: 15/16, head: -1/-1 ): sum(i=0...i=99999999) = 4999999950000000 Same data in the shared memory arrays of both SMP nodes

**Sol.**

#### After ~10 Minutes:

- compare with solution: data-rep\_sol\_3-6.c / \_30.f90
- In case of problems you may also look at the solution slide:  $\boxed{\Gamma}$

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

#### Step 7 (6 lines of code)

Finish the local load epoch  $\rightarrow$  **MPI\_Win\_fence()** // free the window  $\rightarrow$  **MPI\_Win\_free()** 

#### Expected output from 2 islands **(same as after Step 6, but now without premature stop):**

it: 0, rank ( world:  $0/32$ , shm:  $0/16$ , head:  $0/2$  ): sum(i=0...i=999999999) = 49999999950000000 it: 0, rank ( world:  $16/32$ , shm:  $0/16$ , head:  $1/2$  ): sum( $i=0$ ... $i=999999999$  $i=4999999950000000$ it: 0, rank ( world: 1/32, shm: 1/16, head: -1/-1 ): sum(i=0...i=99999999) = 4999999950000000 it: 0, rank ( world: 17/32, shm: 1/16, head: -1/-1 ): sum(i=0...i=99999999) = 4999999950000000 it: 0, rank ( world: 31/32, shm: 15/16, head: -1/-1 ): sum(i=0...i=99999999) = 4999999950000000 it: 0, rank ( world:  $15/32$ , shm:  $15/16$ , head:  $-1/-1$  ): sum( $i=0...i=9999999999999999500000000$ 

#### **After ~5 Minutes, in the solution directory:**

- compare with solution: data-rep\_sol\_7.c / \_30.f90
- **In case of problems you may also look at the solution slide:**  $\Box$
- And add-on: data-rep\_solution.c /  $30.690$  with additional analysis and output:

The number of shared memory islands is: 2 islands

The size of each shared memory islands is: 48 processes

- **Whole exercise steps 3-6 & 7: approx. 20 Minutes**
- Q & A & Discussion

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

# **Quiz on Shared Memory**

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

- A. Before you call **MPI\_Win\_allocate\_shared**, what should you do?
- B. If your communicator within your shared memory island consists of 12 MPI processes, and each process wants to get an own window with 10 doubles (each 8 bytes),
	- a. which **window size** must you specify in **MPI\_Win\_allocate\_shared**?
	- b. And how long is the totally allocated shared memory?

c. The returned base ptr, will it be identical on all 12 processes?

- d. If all 12 processes want to have a pointer that points to the beginning of the totally allocated shared memory, which MPI procedure should you use and with which major argument?
- e. If you do this, do these 12 pointers have identical values, i.e., are identical addresses?
- C. Which is the major method to store data from one process into the shared memory window portion of another process?

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien) Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Quiz 1 155 239



# **Programming models - MPI + MPI-3.0 shared memory**

#### **MPI Memory Models & Synchronization**

General considerations & uses cases Re-cap: MPI\_Comm\_split & one-sided communication How-to Exercise: MPI\_Bcast Quiz 1 **> MPI memory models & synchronization** Shared memory problems Advantages & disadvantages, conclusions Quiz 2

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

### How to achieve even lower latencies

A key feature for strong scaling?

**Outlook** 

- Use of MPI shared memory without (slow) MPI one-sided synchronization methods (e.g., win\_fence)
- To do this, use memory variables for synchronization together with memory fences (C++11 or MPI based)

Alternative:

■ Fast MPI point-to-point sync together with memory fences

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

## Two memory models

 Query for new attribute to allow applications to tune for cache-coherent architectures



(MPI-3.1/-4.0 Section 11/12.7: "*Advice to users*. In the unified memory model…" in U5 on page 456/613f, and Section 11/12.8, Example 11/12.21 on pages 468f/626f)

or any other RMA synchronization:

*"A consistent view can be created in the unified memory model (see Section 11.4) by utilizing the window synchronization functions (see Section 11.5) or explicitly completing outstanding store accesses (e.g., by calling MPI\_WIN\_FLUSH)."*

**The Second** 

(MPI-3.1/-4.0, MPI\_Win\_allocate\_shared, page 408/560, lines 43-47/22-26)

## "eventually synchronized" – the problem

The problem with shared memory programming using libraries is:

X is a variable in a shared window initialized with 0.



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

### "eventually synchronized" – the Solution

A pair of local memory fences is needed:

X is a variable in a shared window initialized with 0.



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

# "eventually synchronized" – Last Question

- How to make the local memory fence ?
- C++11 atomic\_thread\_fence(order)
	- Advantage: one can choose appropriate order = memory\_order\_release, or …\_acquire to achieve minimal latencies
- MPI\_Win\_sync
	- Advantage: works also for Fortran
	- Disadvantage: may be slower than C11 atomic thread fence with appropriate order
- Using RMA synchronization with integrated local memory fence instead of MPI\_Send  $\rightarrow$  MPI\_Recv 5 sync methods, see next slide
	- Advantage:
		- May prevent double fences
	- Disadvantage: The synchronization itself may be slower





Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

#### General MPI shared memory synchronization rules

(based on MPI-3.1/4.0, MPI\_Win\_allocate\_shared, page 408/560, lines 43-47/22-26: *"A consistent view …"*)



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

#### "Any-process-sync" & MPI\_Win\_sync on shared memory



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memor

 $\blacksquare$ 

# Halo communication benchmarking

Goal:

See HLRS online courses **<http://www.hlrs.de/training/self-study-materials>**  $\rightarrow$  Practical  $\rightarrow$  MPI.tar.gz  $\rightarrow$  subdirectory MPI/course/C/1sided/

- Learn about the communication latency and bandwidth on your system
- Method:
	- **cp MPI/course/C/1sided/halo\* .**
	- On a shared or distributed memory, run and compare:
- Make a diff from one version to the next version of the source code
- Compare latency and bandwidth
- halo\_irecv\_send.c ] halo isend recv.c. halo neighbor alltoall.c halo\_1sided\_put.c **Example 2** halo\_1sided\_put\_alloc\_mem.c halo\_1sided\_put\_win\_alloc.c And run and compare on a shared memory only: Example 3 halo\_1sided\_store\_win\_alloc\_shared.c halo\_1sided\_store\_win\_alloc\_shared\_query.c (with alloc\_shared\_noncontig) halo\_1sided\_store\_win\_alloc\_shared\_pscw.c halo\_1sided\_store\_win\_alloc\_shared\_othersync.c halo\_1sided\_store\_win\_alloc\_shared\_signal.c Different communication methods Different memory allocation methods Different communication methods  $\sqrt{\mathsf{Example}1}$ Example  $5\overline{\phantom{a}}$ **Example 4**

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

#### MPI communication inside of SMP nodes: Benchmark results on a Cray XE6 – 1-dim ring communication on 1 node with 32 cores



On Cray XE6 Hermit at HLRS with aprun –n 32 –d 1 –ss, best values out of 6 repetitions, modules PrgEnv-cray/4.1.40 and cray-mpich2/6.2.1

#### Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

# **Programming models - MPI + MPI-3.0 shared memory**

### **Shared memory problems**

General considerations & uses cases Re-cap: MPI\_Comm\_split & one-sided communication How-to Exercise: MPI\_Bcast Quiz 1 MPI memory models & synchronization **> Shared memory problems** Advantages & disadvantages, conclusions Quiz 2

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

# Shared memory problems (1/2)

#### $\blacksquare$  Race conditions

- as with OpenMP or any other shared memory programming models
- Data-Race: *Two processes access the same shared variable and at least one process modifies the variable and the accesses are concurrent, i.e. unsynchronized, i.e., it is not defined which access is first*
- The outcome of a program depends on the detailed timing of the accesses
- This is often caused by unintended access to the same variable, or missing memory fences

# Shared memory problems (2/2)

#### ■ Cache-line false-sharing

- As with OpenMP or any other shared memory programming models
- The cache-line is the smallest entity usually accessible in memory



- Several processes are accessing shared data through the same cache-line.
- This cache-line has to be moved between these processes (cache coherence protocol).
- This is very time-consuming.

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Shared memory problems 168/239

m.

## **Programming models - MPI + MPI-3.0 shared memory**

#### **Advantages & disadvantages, conclusions**

General considerations & uses cases Re-cap: MPI\_Comm\_split & one-sided communication How-to Exercise: MPI\_Bcast Quiz 1 MPI memory models & synchronization Shared memory problems **> Advantages & disadvantages, conclusions** Quiz 2

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

### Questions addressed in this tutorial

- What is the performance impact of system topology?
- How do I map my programming model on the system to my advantage?
	- $\blacksquare$  How do I do the split into MPI $+X$ ?
	- Where do my processes/threads run? How do I take control?
	- Where is my data?
	- **How can I minimize communication overhead?**
- How does hybrid programming help with typical HPC problems?
	- Can it reduce communication overhead?
	- Can it reduce replicated data? -MPI-3 shared memory as a real alternative to OpenMP shared memory, especially when OpenMP hard to be used
- How can I leverage multiple accelerators?
	- What are typical challenges?

Where we are?

Fastest accesses between MPI processes on a shared memory

### MPI+MPI-3.0 shared mem: Main advantages

- A new method for reducing memory consumption for replicated data
	- To allow only one replication per shared-memory island
- Interesting method for direct access to neighbor data (without halos!)
- A new method for communicating between MPI processes within each shared-memory node
- On some platforms significantly better bandwidth than with send/recv
- Library calls need not be "thread safe" because we do not have threads

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Conclusions 1710 memory 171/239

## MPI+MPI-3.0 shared mem: Main challenges

- Synchronization is defined, but still under discussion:
	- The meaning of the assertions for shared memory is still undefined as of MPI 4.0
- Similar problems as with all shared memory (e.g., pthreads, OpenMP,…)
	- Race conditions, false sharing, memory fences
- Does not reduce the number of MPI processes

## MPI+MPI-3.0 shared mem: Conclusions

- **Add-on feature for pure MPI communication**
- Opportunity for reducing communication within shared-memory nodes
- Opportunity for reducing memory consumption (halos & replicated data)

### **Quiz on Shared Memory Model & Synchronization**

- A. Which MPI memory model applies to MPI shared memory? MPI\_WIN\_SEPARATE or MPI\_WIN\_UNIFIED ?
- B. "Public and private copies are . . . . ? . . . . synchronized without additional RMA calls."



C. Which process-to-process synchronization methods can be used that, e.g., a store to a shared memory variable gets visible to another process (within the processes of the shared memory window)?

D. That such a store gets visible in another process after the synchronization is named here as "*write-read-rule*". Which other rules are implied by such synchronizations and what do they mean?

 $\blacksquare$ 

 $\blacksquare$ 

E. How can you define a **race-condition** and which problems arise from **cache-line false-sharing**?

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Quiz 2 174/239



# **Programming models - pure MPI**



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI 175/239

#### Advantages

MPI library need not to support multiple threads (may have performance advantages)

Major problems

- Does application topology fit on hardware topology?
	- Want minimal communication between MPI processes AND between cluster nodes
- Does the MPI library employ shared memory protocols internally?
- Is the network prepared for massive numbers of messages?
- MPI communication inside of shared memory nodes also costs time
- Generally "a lot of" communicating processes per node
- <span id="page-175-0"></span>Memory consumption may be a problem (halos, replicated data, internal MPI buffers)

# **Programming models - pure MPI**

#### **The Topology Problem**

General considerations

**> The topology problem**

The topology problem: How-to / Virtual Toplogies Rank renumbering for optimization The Topology Problem: Unstructured Grids Quiz **Scalability** Advantages & disadvantages, conclusions

<span id="page-176-0"></span>Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  Topology problem

#### Re-numbering on a cluster of SMPs (cores / CPUs / nodes)

#### Example:

- 2-dim 6000 x 8080 data mesh points
- To be parallelized on 48 cores
- Minimal communication
	- Subdomains as quadratic as possible
		- $\rightarrow$  minimal circumference
		- minimal halo communication
	- virtual 2-dim process grid: 6 x 8 with 1000 x 1010 mesh points/core
- Hardware example: 48 cores:
	- 4 compute nodes



- each node with 2 CPUs
- each CPU with 6 cores



- How to locate the MPI processes on the hardware?
	- Using sequential ranks in MPI\_COMM\_WORLD
	- Optimized placement
	- $\rightarrow$  See next slides and example code





Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  Topology problem



### Hierarchical Cartesian Domain Decomposition



Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  Topology problem 179/239

#### Levels of communication & data access

- **Three levels:** 
	- **Between the SMP nodes**
	- Between the sockets inside of shared-memory node
	- Between the cores of a socket
- On all levels, the communication should be minimized:
	- With 3-dimensional sub-domains:
		- They should be as cubic as possible  $=$  minimal surface  $=$  minimal communication

data communicated to the

 "as cubic as possible" may be qualified due to different communication bandwidth in each direction

caused by sending (fast) non-strided or (slow) strided data

or accessed between the cores inside of a node



**Shared-memory node SMP node**

**Socket 1 Quad-core**

**Socket 1 Quad-core**


## Levels of communication & data access

- **Major goal: minimize inter-node communication**  $\rightarrow$ Minimize sum of all outer subdomain surfaces  $\rightarrow$ Whole node subdomain shape as cubic as possible
- Secondary goal: minimize intra-node communication  $\rightarrow$ Minimize sum of all inner subdomain surfaces  $\rightarrow$ Inner subdomain shape as cubic as possible

Next slides: MPI facilities to map topology to ranks in a communicator  $\rightarrow$  Virtual Topologies

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  Topology problem



## **Programming models - pure MPI**

#### How to  $\rightarrow$  MPI Virtual Topologies

General considerations

The topology problem

**> The topology problem: How-to / Virtual Toplogies** Rank renumbering for optimization The Topology Problem: Unstructured Grids

Quiz

**Scalability** 

Advantages & disadvantages, conclusions

#### Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  How-to: Virtual MPI topologies

Acknowledgement: Virtual topology course slides are<br>based on the MPI-1 course of EPCC.

182/239

## Domain decomposition example



- **process coordinates:** handled with **virtual Cartesian topologies**
- array decomposition: handled by the application program directly

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

MPI course  $\rightarrow$  Chap.9-(1) Virtual topologies

\*) Figure: similar to x,y-diagrams, first index is horizontal (i.e., not vertical as in a math matrix)

## Virtual Topologies

- **Convenient process naming.**
- **Naming scheme to fit the communication pattern.**
- **Simplifies writing of code.**
- Can allow MPI to optimize communications  $\rightarrow$  see course Chapter 9-(3)

## How to use a Virtual Topology

- **Creating a topology produces a new communicator.**
- **NPI provides mapping functions:** 
	- to compute process ranks, based on the topology naming scheme,
	- **and vice versa.**



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X → Programming models → pure MPI → How-to: Virtual MPI topologies 185/239

# Topology Types

- Cartesian Topologies
	- each process is *connected* to its neighbor in a virtual grid,
	- boundaries can be cyclic, or not,
	- processes are identified by Cartesian coordinates,
	- of course, communication between any two processes is still allowed.
- Graph Topologies
	- **general graphs,**
	- two interfaces:
		- MPI GRAPH CREATE (since MPI-1)
		- MPI\_DIST\_GRAPH\_CREATE\_ADJACENT & MPI\_DIST\_GRAPH\_CREATE (new scalable interface since MPI-2.2)
	- not covered here.

П

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X → Programming models → pure MPI → How-to: Virtual MPI topologies 186/239

# Creating a Cartesian Virtual Topology

**comm\_old = MPI\_COMM\_WORLD ndims = 2**  $dimS = (4, 4)$ **periods** =  $(1, 0)$  (in C) **periods = ( .true., .false. )** (in Fortran) **reorder =** *see next slide* int MPI\_Cart\_create(MPI\_Comm comm\_old, int ndims, int \*dims, int \*periods, int reorder, MPI\_Comm \**comm\_cart*) MPI\_CART\_CREATE(comm\_old, ndims, dims, periods, reorder, *comm\_cart*, *ierror*) mpi\_f08: TYPE(MPI Comm) :: comm\_old, comm\_cart INTEGER :: ndims, dims(\*) LOGICAL :: periods(\*), reorder INTEGER, OPTIONAL :: ierror **C/C++ Fortran 0 (0,0) 3 (1,0) 6 (2,0) 9 (3,0) 1 (0,1) 4 (1,1) 7 (2,1) 10 (3,1) 2 (0,2) 5 (1,2) 8 (2,2) 11 (3,2)** e.g., size==12 factorized with MPI Dims\_create(), see later the slide "**Typical usage of MPI\_Cart\_create & MPI\_Dims\_create**"

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)



**This reordering can allow MPI to optimize communications.** 

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

#### Typical use of MPI\_Cart\_create & MPI\_Dims\_create

#### **#define ndims 3**

**int i, nnodes, world\_myrank, cart\_myrank, dims[ndims], periods[ndims], my\_coords[ndims]; MPI\_Comm comm\_cart; MPI\_Init(NULL,NULL); MPI\_Comm\_size(MPI\_COMM\_WORLD,** *&numprocs***); MPI\_Comm\_rank(MPI\_COMM\_WORLD,** *&world\_myrank***); for (i=0; i<ndims; i++) { dims[i]=0; periods[i]=…; } MPI\_Dims\_create(numprocs, ndims,** *dims***); // computes factorization of numprocs MPI\_Cart\_create(MPI\_COMM\_WORLD, ndims, dims, periods,1, &***comm\_cart***); MPI\_Comm\_rank(comm\_cart,** *&cart\_myrank***); MPI\_Cart\_coords(comm\_cart, cart\_myrank, ndims,** *my\_coords***,** *ierror***)** With reorder

#### **From now on: all communication should be based on comm\_cart & cart\_myrank & my\_coords**



#### <span id="page-188-0"></span>Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)



## Cartesian Mapping Functions



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

## Cartesian Mapping Functions



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

### A process' own coordinates



**Each process gets its own coordinates with (example in Fortran) call MPI\_Comm\_rank(comm\_cart,** *my\_rank***,** *ierror***) call MPI\_Cart\_coords(comm\_cart, my\_rank, maxdims,** *my\_coords***,** *ierror***)**

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

## Ranks of neighboring processes



- Returns MPI\_PROC\_NULL if there is no neighbor.
- MPI\_PROC\_NULL can be used as source or destination rank in each communication  $\rightarrow$  Then, this communication will be a no-operation!

### MPI\_Cart\_shift – example



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)



- Cut a virtual process grid up into *slices*.
- A new communicator is produced for each slice.
- **Each slice can then perform its own collective communications.**



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)



**Ranks** and **Cartesian process coordinates** in **comm\_slice**



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

#### Sparse Collective Operations on Process Topologies

- Sparse neighbor communication New in MPI-3.0 within MPI process topologies (Cartesian and (distributed) graph):
	- MPI\_(I)NEIGHBOR\_ALLTOALL (V,W)
	- $MPI_{1}(I)NEIGHBOR_{1}(I)ALLGATHER (V)$  = perfect scalable !?
- If the topology is the full graph, then neighbor routine is identical to full collective communication routine
	- Exception: s/rdispls in MPI\_NEIGHBOR\_ALLTOALLW are MPI\_Aint
- Allows for optimized communication scheduling and scalable resource binding
- Cartesian topology:
	- Sequence of buffer segments is communicated with:
		- direction=0 source, direction=0 dest, direction=1 source, direction=1 dest, …
		- Defined only for disp=1 (direction, source, dest and disp are defined as in MPI\_CART\_SHIFT)
	- If a source or dest rank is MPI\_PROC\_NULL then the buffer location is still there but the content is not touched.

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X → Programming models → pure MPI → How-to: Virtual MPI topologies 197/239

П

#### Periodic MPI\_NEIGHBOR\_ALLTOALL in direction *d* with 4 processes

Clarified in MPI-4.0



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)



**After MPI\_NEIGHBOR\_ALLTOALL on a Cartesian communicator returned, the content of the recvbuf is as if the following code is executed:**

```
MPI_Cartdim_get(comm, &ndims); 
for( /*direction*/ d = 0; d < ndims; d++) {
  MPI Cart shift(comm, /*direction*/ d, /*disp*/ 1, &rank source, &rank dest);
  MPI_Sendrecv(sendbuf[d*2+0], sendcount, sendtype, rank_source, /*sendtag*/ d*2,
                recvbuf[d*2+1], recvcount, recvtype, rank_dest, /*recvtag*/ d*2,
                comm, &status); /* 1st communication in direction of displacment -1 */
  MPI Sendrecv(sendbuf[d*2+1], sendcount, sendtype, rank dest, /*sendtag*/ d*2+1,
                recvbuf[d*2+0], recvcount, recvtype, rank_source, /*recvtag*/ d*2+1,
                comm, &status); /* 2nd communication in direction of displacment +1 */
}
```
**The tags are chosen to guarantee that both communications (i.e., in negative and positive direction) cannot be mixed up, even if the MPI\_SENDRECV is substituted by nonblocking communication and the MPI\_ISEND and MPI\_IRECV calls are started in any sequence.**

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

As if …

Skipped

Hybrid Programming – MPI+X → Programming models → pure MPI → How-to: Virtual MPI topologies 199/239





Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)





Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

## **Programming models - pure MPI**

#### **Rank renumbering for optimization**

General considerations The topology problem The topology problem: How-to / Virtual Toplogies **> Rank renumbering for optimization** The Topology Problem: Unstructured Grids Quiz **Scalability** Advantages & disadvantages, conclusions

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  Topology problem – wrap up

# Rank renumbering for optimization

#### **When is it not needed?**

 $\rightarrow$  Hybrid MPI+OpenMP with 1, 2, or 3 MPI processes per shared-memory node

#### **When is it not helpful?**

 $\rightarrow$  Dynamic load balancing that changes the process-to-process communication pattern (typically only with graph topologies)

#### **When do we need it?**

- $\rightarrow$  Communication win with  $\geq$  4 MPI processes per shared-memory node
- Example with 6 or 8 processes per shared-memory node:
	- Sequential ranking  $6x1x1$  or  $8x1x1$  topology  $\rightarrow$  26 or 34 inter-node neighbors in MPI\_COMM\_WORLD
	- Renumbered  $3x2x1$  or  $2x2x2$  topology  $\rightarrow$  22 or 24 inter-node neighbors  $\rightarrow$  15% or 29% win via Cartesian topo.
- **How can we implement it?**  $\rightarrow$  MPI virtual topologies

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

#### Rank renumbering for optimization – problems

- 1. All MPI libraries provide the necessary interfaces **,** but **without** renumbering in some MPI-libraries
- 2. The existing MPI-4.0 interfaces are not optimal:
	- Application topology awareness: application-specific data mesh sizes or direction-dependent communication requirements are not accounted for  $\rightarrow$  next slide
	- Hardware topology awareness:

the factorization of the number of processes into several dimensions cannot leverage hardware topology information  $\rightarrow$  next slide

- 3. The application must be prepared for rank renumbering
	- Ideally,data distribution happens after renumbering (see slide  $\Box$

Typical use of Cart\_create & MPI\_Dims\_create

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

#### The existing MPI-4.0 interfaces are not optimal: examples

#### **EXPLO Application topology awareness**

- 2-D example with 12 MPI processes and data mesh size 1800x580
	- MPI Dims create  $\rightarrow$  4x3 data mesh aware  $\rightarrow$  6x2 processes





- **Hardware topology awareness** 
	- 2-D example with 25 nodes x 24 cores and data mesh size 3000x3000
		-
		- $\blacksquare$  MPI\_Dims\_create  $\rightarrow$  25 x 24  $\blacksquare$   $\blacksquare$  Hardware aware  $\rightarrow$  30 x 20 = (**5 nodes** x 6 cores) **X** (**5 nodes** x 4 cores)



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  How-to: Virtual MPI topologies

ш

### Goals of MPI\_Dims\_create + MPI\_Cart\_create

- Given: comm\_old (e.g., MPI\_COMM\_WORLD), ndims (e.g., 3 dimensions)
- **Provide** 
	- **a factorization** of #processes (of comm\_old) into the dimensions  $\dim s[i]_{i=1}$ . ndims
	- a Cartesian communicator **comm\_cart**
	- an **optimized reordering** of the ranks in comm\_old into the ranks of comm\_cart to minimize the Cartesian communication time, e.g., of
		- □ MPI\_Neighbor\_alltoall
		- □ Equivalent communication pattern implemented with
			- MPI Sendrecv
			- Nonblocking MPI point-to-point communication

Hybrid Programming – MPI+X → Programming models → pure MPI → How-to: Virtual MPI topologies 206/239

#### The limits of MPI\_Dims\_create + MPI\_Cart\_create

- Not application topology aware
	- MPI\_Dims\_create can **only** map **evenly balanced** Cartesian topologies
		- Factorization of 48,000 processes into 20 x 40 x 60 processes (e.g. for a mesh with 200 x 400 x 600 mesh points)  $\rightarrow$  no chance with current interface
- Only partially hardware topology aware
	- $\blacksquare$  MPI\_Dims\_create without comm arg.  $\rightarrow$  not hardware aware
		- An application mesh with 3000x3000 mesh points on 25 nodes x 24 cores (=600 MPI processes)
			- □ Answer from MPI\_Dims\_create:
				- 25 x 24 MPI processes
				- Mapped by most libraries to 25 x 1 nodes with 120 x 3000 mesh points per node
					- $\rightarrow$  too much node-to-node communication

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X → Programming models → pure MPI → How-to: Virtual MPI topologies 207/239

Major problems: •No weights,

- no info
- •Two separated interfaces for two common tasks:
- Factorization of #processes
- Mapping of the processes to the hardware

п

## Goals of Cartesian MPI\_Dims+Cart\_create

- Remark: On a hierarchical hardware,
	- **optimized factorization and reordering** typically means **minimal node-to-node** communication,
	- which typically means that the communicating surfaces of the data on each node is as quadratic as possible (or the subdomain as cubic as possible)
- The current API, i.e.,
	- due to the missing weights
	- and the non-hardware aware MPI Dims create,

does **not** allow such an optimized factorization & reordering in many cases.

## The new interface – proposed for MPI-4.1



- Arguments have same meaning as in MPI\_Dims\_create
- Goal (in absence of an info argument):
	- dims[i]•dim\_weights[i] should be as close as possible,
	- $\blacksquare$  i.e., the  $\sum_{i=0..(ndims-1)}$  dims[i]•dim\_weights[i] as small as possible (advice to implementors)

A new courtesy function: **Weighted factorization**

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  How-to: Virtual MPI topologies

**CONTRACTOR** 

#### The new interface – proposed for MPI-4.1, continued



■ Arguments: see existing MPI\_Dims\_create & MPI\_Cart\_create **/** dim\_weights[ndims]  $\rightarrow$  next slide

- Goals: Choose an ndims-dimensional factorization of #processes of comm old ( $\rightarrow$  dims)
	- **and** an appropriate reordering of the ranks ( $\rightarrow$  comm\_cart),

such that the execution time of a communication step along the virtual process grid is minimal (e.g., with MPI\_NEIGHBOR\_ALLTOALL, MPI\_SENDRECV, or nonblockuing MPI\_ISEND/IRECV)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  How-to: Virtual MPI topologies

ш

# How to specify the dim\_weights?

- Given: comm\_old (e.g., MPI\_COMM\_WORLD), ndims (e.g., 3 dimensions)
- This means, **the domain decomposition has not yet taken place**!
- Goals for dim\_weights and the API at all:
	- **Easy to understand**
	- **Easy to calculate**
	- Relevant for typical Cartesian communication patterns (MPI\_Neighbor\_alltoall or similar)
	- Rules fit to usual design criteria of MPI
		- E.g., reusing MPI\_UNWEIGHTED  $\rightarrow$  integer array
		- Can be enhanced by vendors for their platforms  $\rightarrow$  additional info argument for further specification
		- To provide also the less optimal two stage interface (in addition to the combined routine)

Hybrid Programming – MPI+X → Programming models → pure MPI → How-to: Virtual MPI topologies 211/239

### The dim\_weights[ $i$ ], example with 3 dimensions



The arguments  $\mathbf{dim\_weights}[i]$   $i$  =0::(ndims-1), abbreviated with  $\boldsymbol{w}_i,$ should be specified as the accumulated message size (in bytes) communicated in one communication step through each **cutting plane**  orthogonal to dimension  $d_i$  and in each of the two directions.

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien) Hybrid Programming – MPI+X → Programming models → pure MPI → How-to: Virtual MPI topologies 212/239

#### The dim\_weights[ $i$ ], example with 3 dimensions, continued



Example for the calculation of the accumulated communication size  $w_{i,i=0,2}$  in each dimension. Given:

- $g_i$  The data mesh sizes  $g_{i,i=0,2}$  express the three dimensions of the total application data mesh.
- $\blacksquare$   $h_i$  The value  $h_i$  represents the halo width in a given direction when the 2-dimensional side of a subdomain is communicated to the neighbor process in that direction.

Output from MPI\_Cart/Dims\_create\_weighted: The dimensions  $d_{i,i=0,2}$ 

#### Important:

• The definition of the dim\_weights (=  $w_i$  in this figure) is **independent** of the total number of processes and its factorization into the dimensions (=  $d_i$  in this figure)

• Result was

$$
w_i = h_i \frac{\prod_j g_j}{g_i}
$$

 $\mathbf{r}$ 

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X → Programming models → pure MPI → How-to: Virtual MPI topologies 213/239

## The new interfaces – a real implementation

#### Substitute for / enhancement to existing MPI-1

- MPI\_Dims\_create (size\_of\_comm\_old, ndims, *dims[ndims]* );
- MPI\_Cart\_create (comm\_old, ndims, dims[ndims], periods, reorder, *\*comm\_cart*);

New: (in MPI/tasks/C/Ch9/MPIX/)

**MPIX\_Cart\_weighted\_create** (



• MPIX Dims weighted create ( int nnodes, int ndims, double dim weights[ndims], */\*OUT\*/ int dims[ndims]* );

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  How-to: Virtual MPI topologies

ш

### Further Interfaces

- We proposed the algorithm in
	- Christoph Niethammer and Rolf Rabenseifner. 2018. Topology aware Cartesian grid mapping with MPI. EuroMPI 2018.
	- <https://eurompi2018.bsc.es/>  $\rightarrow$  Program  $\rightarrow$  Poster Session  $\rightarrow$  Abstract+Poster
	- <https://fs.hlrs.de/projects/par/mpi/EuroMPI2018-Cartesian/> $\rightarrow$  All info + slides + software
	- <http://www.hlrs.de/training/self-study-materials>  $\rightarrow$  Practical  $\rightarrow$  MPI31.tar.gz  $\rightarrow$  MPI/tasks/C/eurompi18/
	- More details, see this talk+slides "Hybrid Programming in HPC MPI+X"
- Full paper:
	- Christoph Niethammer, Rolf Rabenseifner:



An MPI interface for application and hardware aware cartesian topology optimization. EuroMPI 2019.

Proceedings 26th European MPI Users' Group Meeting, Sep. 2019, article No. 6, p. 1-8,<https://doi.org/10.1145/3343211.3343217>

- MPIX\_Dims\_weighted\_create() is based on the ideas in:
	- Jesper Larsson Träff and Felix Donatus Lübbe. 2015. Specification Guideline Violations by MPI Dims Create. In Proceedings of the 22nd European MPI Users' Group Meeting (EuroMPI '15). ACM, New York, NY, USA, Article 19, 2 pages.
- Another approach using the existing MPI\_Cart\_create() interface:
	- W. D. Gropp, Using Node [and Socket] Information to Implement MPI Cartesian Topologies, Parallel Computing, 2019. And Proceedings of the 25th European MPI User' Group Meeting, EuroMPI'18, ACM, New York, NY, USA, 2018, pp. 18:1-18:9. [doi:10.1145/3236367.3236377](https://doi.org/10.1145/3236367.3236377). Slides:<http://wgropp.cs.illinois.edu/bib/talks/tdata/2018/nodecart-final.pdf>

**Hipped**  The portable MPIX routines internally use MPI\_Comm\_split\_type(…, MPI\_COMM\_TYPE\_SHARED, …) to split comm\_old into ccNUMA nodes,

- plus (may be) additionally splitting into NUMA domains.
- With using hyperthreads, it *may be helpful*  to apply **sequential** ranking to the hyperthreads,
	- **EXECT:** i.e., in MPI\_COMM\_WORLD, ranks 0+1 should be
		- the first two hyperthreads
		- of the first core

Remarks

- of the first CPU
- of the first ccNUMA node
- **E** Especially with weights  $w_i$  based on  $\frac{G}{g_i}$ , it is important
	- that the data of the mesh points is **not** read in based on (**old**) ranks in MPI\_COMM\_WORLD,
	- because the domain decomposition must be done based on **comm\_cart** and its dimensions and (**new**) ranks

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X → Programming models → pure MPI → How-to: Virtual MPI topologies 216/239
# Questions addressed in this tutorial

Where we are?

- What is the performance impact of system topology? J Communication time Memory access time
	- How do I map my programming model on the system to my advantage?
		- $\blacksquare$  How do I do the split into MPI $+X$ ?
		- Where do my processes/threads run? How do I take control?
		- Where is my data?
		- How can I minimize communication overhead? -Through rank reordering
- How does hybrid programming help with typical HPC problems?
	- Can it reduce communication overhead?  $\qquad$  |
	- Can it reduce replicated data?
- How can I leverage multiple accelerators?
	- What are typical challenges?

rank reordering may still help if ≥ 4 MPI processes per SMP node

### Typical use of MPIX\_Cart\_weighted\_create

```
#define ndims 3
int i, nnodes, world_myrank, cart_myrank, dims[ndims], periods[ndims], my_coords[ndims];
int global array dim[ndims], halo width[ndims], local array dim[ndims], local array size=1;
double dim_weights[ndims], global_array_size=1.0;
MPI_Comm comm_cart;
MPI_Init(NULL,NULL);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &world_myrank);
for (i=0; i \leq n \leq n \leq j++) {
  dims[i]=0; periods[i]=…;
  global_array_dim[i]=…; halo_width[i]=…;
  global_array_size = global_array_size * (double)(global_array_dim[i]);
}
for (i=0; i<ndims; i++) { 
  dim_weights[i] = (double)(halo_width[i]) * global_array_size / (double)(global_array_dim[i]);
} 
MPIX_Cart_weighted_create(MPI_COMM_WORLD, ndims, dim_weights, dims, periods, MPI_INFO_NULL, dims, &comm_cart);
MPI_Comm_rank(comm_cart, &cart_myrank);
MPI_Cart_coords(comm_cart, cart_myrank, ndims, my_coords, ierror)
for (i=0; i<ndims; i++) { 
  local_array_dim[i] = global_array_dim[i] / dims[i];
  local_array_size = local_array_size * local_array_dim[i];
} 
local_data_array = malloc(sizeof(…) * local_array_size);
                                                                From now on: 
                                                                all communication should be based on 
                                                                comm_cart & cart_myrank & my_coords
```
Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  How-to: Virtual MPI topologies

## Virtual Cartesian MPI topologies – summary

- **Relevant for modern clusters comprising multicore nodes**
- **Optimizes only the communication**

If communication is irrelevant (in  $\epsilon$ )  $\rightarrow$  don't care about reordering (observe cost/benefit)

- The new (and weighted) optimizing routines are easy to use for Cartesian problems
- Be aware that the MPI\_Cart\_...\_create routines  $\text{ightharpoonup}$  of course with reorder=true renumber the communicator

# **Programming models - pure MPI**

### **The Topology Problem: Unstructured Grids**

General considerations The topology problem The topology problem: How-to / Virtual Toplogies Rank renumbering for optimization **> The Topology Problem: Unstructured Grids** Quiz **Scalability** 

Advantages & disadvantages, conclusions

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  Topology problem – wrap up

# Virtual MPI Topologies – unstructured grids

- See paper from Torsten Höfler and references in Bill Gropp's paper:
	- T. Hoefler and M. Snir. 2011. Generic Topology Mapping Strategies for Large-scale Parallel Architectures. In *Proceedings of the 2011 ACM International Conference on Supercomputing (ICS'11)*. ACM, 75–85.
	- Bill Gropp. 2018. Using Node Information to Implement MPI Cartesian Topologies. In *Proceedings of the 25nd European MPI Users' Group Meeting (EuroMPI '18)*, September 23–26, 2018, Barcelona, Spain. ACM, New York, NY, USA, 9 pages.
- Many MPI libraries still do not optimize the graph topologies ...
	- an (not too complicated) alternative is shown on next slides
- Additional application problem: your application may read data in before creating the virtual graph topology
	- The re-numbering of the processes may require that you
		- send such data to the new process (with the old rank), or
		- need to re-read such data from file system

# Hierarchical DD for unstructured grids

- Single-level DD (finest level)
	- Analysis of the communication pattern in a first run (with only a few iterations)
	- Optimized rank mapping to the hardware before production run
	- E.g., with CrayPAT + CrayApprentice (not verified by us authors)
- Multi-level DD:
	- Top-down: Several levels of (Par)Metis  $\rightarrow$  unbalanced communication



- Bottom-up: Low level DD
	- + higher level recombination
		- $\rightarrow$  based on DD of the grid of subdomains

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  Topology problem – wrap up

# Unstructured Grid / Data Mesh

- Mesh partitioning with special load balancing libraries
	- Metis (George Karypis, University of Minnesota)
	- ParMetis (internally parallel version of Metis)
		- <http://glaros.dtc.umn.edu/gkhome/views/metis/metis.html>
	- **Scotch & PT-Scotch** (Francois Pellegrini, LaBRI, France)
		- <https://www.labri.fr/perso/pelegrin/scotch/>
	- Goals:
		- Same work load in each sub-domain
		- Minimizing the maximal number of neighbor-connections between sub-domains
		- Minimizing the total number of neighbor sub-domains of each sub-domain

The weighted communication graph of the virtual process grid can be used as input for MPI\_Dist\_graph\_create(\_adjacent)



Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  Topology problem – wrap up

## Unstructured Grid / Data Mesh

### Multi-level Domain Decomposition through Recombination



e.g., with Metis / Scotch

- **Problem:** Recombination must **not** calculate patches that are smaller or larger than the average
- In this example the load-balancer **must** combine always
	- 6 cores, and
	- 4 numa-domains (i.e., sockets or dies)
- **Advantage:** Communication is balanced!

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

ш

# **Quiz on Virtual topologies**

- A. Which types of MPI topologies for virtual process grids exist?
- B. And for which use cases?



\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_

C. Where are limits for using virtual topologies, i.e., which use cases do not really fit?

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien) Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  Topology problem  $\rightarrow$  Quiz 225 239



# **Programming models - pure MPI**

### **Scalability**

General considerations The topology problem The topology problem: How-to / Virtual Toplogies Rank renumbering for optimization The Topology Problem: Unstructured Grids Quiz **> Scalability**

Advantages & disadvantages, conclusions

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  Scalability

# To overcome MPI scaling problems

- MPI has a few scaling problems with more than 10,000 MPI processes
	- MPI\_Alltoall<sup>\*</sup> is not scalable with longer messages
	- Irregular Collectives: MPI ....V, e.g. MPI\_Gatherv
		- $\triangleright$  Scaling applications should not use MPI\_....v routines
	- MPI Graph topology (MPI Graph create)
		- Use scalable interface MPI\_Dist\_graph\_create\_adjacent
	- Creation of many disjoint sub-communicators
		- Creation possible in a single call to MPI\_Comm\_split or MPI\_Comm\_create
	- MPI internal memory consumption for, e.g.,
		- $\triangleright$  Internal data structures for large communicators
		- $\triangleright$  Internal communication buffers
	- … see also P. Balaji, et al.: **MPI on a Million Processors.**

P. Balaji, D. Buntinas, D. Goodell, W. Gropp, T. Hoefler, S. Kumar, E. Lusk, R. Thakur, and J. L. Traff: MPI on Millions of Cores. Parallel Processing Letters, 21(01):45-60, 2011. Originally, Proceedings EuroPVM/MPI 2009.

Hybrid programming reduces all these problems (due to a smaller number of processes)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  Scalability

Protocol switches are implementation dependent

Current implementations consider this

# **Programming models - pure MPI**

### **Advantages & disadvantages, conclusions**

General considerations The topology problem The topology problem: How-to / Virtual Toplogies Rank renumbering for optimization The Topology Problem: Unstructured Grids Quiz **Scalability** 

**> Advantages & disadvantages, conclusions**

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  pure MPI  $\rightarrow$  Conclusions

# Pure MPI communication: Main advantages

- **Simplest programming model**
- Library calls need not to be thread-safe
- The hardware is typically prepared for many MPI processes per SMP node
- Only minor problems if pinning is not applied
- No first-touch problems as with OpenMP (in hybrid MPI+OpenMP)

## Pure MPI communication: Main disadvantages

- Unnecessary communication
- Too much memory consumption for
	- halo data for communication between MPI processes on same SMP node
	- other replicated data on same SMP node
	- MPI buffers due to the higher number of MPI processes
- Additional programming costs for minimizing node-to-node communication,
	- i.e., for optimizing the communication topology,
	- e.g., implementing the multi-level domain-decomposition
- No efficient use of hardware-threads (hyper-threads)

## Pure MPI communication: Conclusions

- Still a good programming model for small and medium size applications.
- **Major problem may be memory consumption**

## **Conclusions**

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X Conclusions 232/239

# Major advantages of hybrid MPI+OpenMP

### In principle, none of the programming models perfectly fits to clusters of SMP nodes

### Major advantages of MPI+OpenMP:

- Only one level of sub-domain "surface-optimization":
	- SMP nodes, or
	- Sockets or NUMA domains
- **Second level of parallelization** 
	- Application may scale to more cores
- Smaller number of MPI processes implies:
	- Reduced size of MPI internal buffer space
	- Reduced space for replicated user-data

**Most important arguments on many-core systems**

### Major advantages of hybrid MPI+OpenMP, continued

### **Reduced communication overhead**

- No intra-node communication
- Longer messages between nodes and fewer parallel links may imply better bandwidth
- **"Cheap" load-balancing methods** on OpenMP level
	- Application developer can split the load-balancing issues between coursegrained MPI and fine-grained OpenMP

# Disadvantages of MPI+OpenMP

- Using OpenMP
	- $\rightarrow$  may prohibit compiler optimization
	- **may cause significant loss of computational performance**
- **Thread fork / join overhead**
- On ccNUMA SMP nodes:
	- Loss of performance due to missing memory page locality or missing first touch strategy
	- E.g., with the MASTERONLY scheme:
		- One thread produces data
		- Master thread sends the data with MPI
		- $\rightarrow$  data may be internally communicated from one NUMA domain to the other one
- Amdahl's law for each level of parallelism
- Using MPI-parallel application libraries?  $\rightarrow$  Are they prepared for hybrid?
- Using thread-local application libraries?  $\rightarrow$  Are they thread-safe?

### **MPI+3.0 shared memory**

- **Pro:** Thread-safety is not needed for libraries.
- Con: No work-sharing support as with OpenMP directives.
- Pro: Replicated data can be reduced to one copy per node: May be helpful to save memory, **if pure MPI scales in time, but not in memory**
- Substituting intra-node communication by shared memory loads or stores has only limited benefit (and only on some systems), especially if the communication time is dominated by inter-node communication
- Con: No reduction of MPI ranks  $\rightarrow$  no reduction of MPI internal buffer space
- **Con:** Virtual addresses of a shared memory window may be different in each MPI process  $\rightarrow$  no binary pointers
	- $\rightarrow$  i.e., linked lists must be stored with offsets rather than pointers

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X Conclusions 236/239

### Lessons for pure MPI and ccNUMA-aware hybrid MPI+OpenMP

- MPI processes on an SMP node should form a cube and not a long chain
	- Reduces inter-node communication volume
- For structured or Cartesian grids:
	- Adequate renumbering of MPI ranks and process coordinates
- For unstructured grids:
	- Two levels of domain decomposition
		- First fine-grained on the core-level
		- Recombining cores to SMP-nodes

## Acknowledgements

- **We want to thank** 
	- Gabriele Jost, Supersmith, Maximum Performance Software, USA
		- Co-author of several slides and previous tutorials
	- **Irene Reichl, VSC, TU Wien, Vienna, Austria** 
		- Co-author of the date-rep exercise
	- Gerhard Wellein, RRZE
	- Alice Koniges, NERSC, LBNL
	- Rainer Keller, HLRS and ORNL
	- Jim Cownie, Intel
	- SCALASCA/KOJAK project at JSC, Research Center Jülich
	- HPCMO Program and the Engineer Research and Development Center Major Shared Resource Center, Vicksburg, MS
	- Steffen Weise, TU Freiberg
	- Vincent C. Betro et al., NICS access to beacon with Intel Xeon Phi
	- Christoph Niethammer, HLRS
	- Jesper Träff, TU Wien, Faculty of Informatics
	- Bill Gropp, National Center for Supercomputing Applications, University of Illinois Urbana-Champaign

## **Conclusions**

- Future hardware will be more complicated
	- Heterogeneous  $\rightarrow$  GPU, FPGA, ...
	- Node-level ccNUMA is here to stay, but will only be one of your problems
	- ….
- **High-end programming**  $\rightarrow$  **more complex**  $\rightarrow$  **many pitfalls**
- Medium number of cores  $\rightarrow$  more simple (**#cores / SMP-node** still grows)
- MPI + OpenMP  $\rightarrow$  workhorse on large systems
	- Major pros: **reduced memory needs** and **second level of parallelism**
- MPI + MPI shared memory → only for special cases and medium #processes
- Pure MPI communication  $\rightarrow$  still viable if it does the job
- OpenMP only  $\rightarrow$  on large ccNUMA nodes (almost gone in HPC)



# **Appendix**

**Abstract Authors Solutions**

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

 $R$  Hybrid Programming – MPI+X → Appendix 240

### Abstract

#### MPI+X – Introduction to Hybrid Programming in HPC

Tutorial (Content levels: 0:00h [=0%] Beginners, 1:30h [=10%] Intermediate, 13:30h [=90%] Advanced)

Authors: Claudia Blaas-Schenner, VSC Research Center, TU Wien, Vienna, Austria Georg Hager, Erlangen Regional Computing Center (RRZE), University of Erlangen, Germany Rolf Rabenseifner, High Performance Computing Center (HLRS), University of Stuttgart, Germany

Abstract: Most HPC systems are clusters of shared memory nodes. To use such systems efficiently both memory consumption and communication time has to be optimized. Therefore, hybrid programming may combine the distributed memory parallelization on the node interconnect (e.g., with MPI) with the shared memory parallelization inside of each node (e.g., with OpenMP or MPI-3.0 shared memory). This course analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes. Multi-socket-multi-core systems in highly parallel environments are given special consideration. MPI-3.0 has introduced a new shared memory programming interface, which can be combined with inter-node MPI communication. It can be used for direct neighbor accesses similar to OpenMP or for direct halo copies, and enables new hybrid programming models. These models are compared with various hybrid MPI+OpenMP approaches and pure MPI. Numerous case studies and micro-benchmarks demonstrate the performance-related aspects of hybrid programming.

Hands-on sessions are included on all days. Tools for hybrid programming such as thread/process placement support and performance analysis are presented in a "how-to" section. This course provides scientific training in Computational Science and, in addition, the scientific exchange of the participants among themselves.

URL: 2022-HY-VSC-Dec <https://vsc.ac.at/training/2022/HY-VSC-Dec> 2022-HY-LRZ <http://www.hlrs.de/training/2022/HY-LRZ> 2022-HY-VSC <http://vsc.ac.at/training/2022/HY-VSC>

2021-HY-VSC <http://vsc.ac.at/training/2021/HY-VSC>

2020-HY-VSC <http://vsc.ac.at/training/2020/HY-VSC> 2020-HY-S <http://www.hlrs.de/training/2020/HY-S>

2019-HY-G [https://www.lrz.de/services/compute/courses/archive/2019/2019-01-28\\_hhyp1w18/](https://www.lrz.de/services/compute/courses/archive/2019/2019-01-28_hhyp1w18/)

ISC 2017 <https://www.isc-hpc.com/agenda2017/sessiondetails23ac.html?t=session&o=510>

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X → Appendix → Abstract 241

## Claudia Blaas-Schenner



Claudia Blaas-Schenner holds a diploma (1992) and PhD (1996) in Technical Physics from TU Wien (Vienna, Austria). As a postdoc and later as a research fellow she continued to work in computational materials science, both at TU Wien and at the University of Vienna, with research stays at TU Dresden (Germany) and at the Academy of Sciences of the Czech Republic in Prague (Czech Republic). In 2014 she joined the VSC Research Center at TU Wien (Vienna, Austria), where she is responsible for developing a training and education program in HPC. She develops the curriculum of the training courses and teaches mainly parallel programming with MPI and OpenMP as well as hybrid programming techniques MPI+X. In addition, she is involved in performance optimization of user codes. Claudia is an active member of the MPI Forum, which is the standardization body for the Message Passing Interface (MPI; [https://www.mpi](https://www.mpi-forum.org)forum.org), and is acting as a chapter committee chair for "MPI Terms and Conventions", which is essential for the MPI standard as a whole [\(https://www.mpi-forum.org/mpi-41](https://www.mpi-forum.org/mpi-41)). For PRACE-6IP she is the project manager at TU Wien, leads the recently established PRACE Training Center (PTC) at the VSC Research Center of TU Wien, and acts as the Management Board representative of Austria in PRACE-6IP.

# Georg Hager



Georg Hager holds a PhD and a Habilitation degree in Computational Physics from the University of Greifswald. He leads the Training & Support Division at Erlangen National High Performance Computing Center (NHR@FAU) and is an associate lecturer at the Institute of Physics at the University of Greifswald. Recent research includes architecture-specific optimization strategies for current microprocessors, performance engineering of scientific codes on chip and system levels, and the analytic modeling of structure formation in large-scale parallel codes. Georg Hager has authored and co-authored more than 100 peerreviewed publications and was instrumental in developing and refining the Execution-Cache-Memory (ECM) performance model and energy consumption models for multicore processors. In 2018, he won the "ISC Gauss Award" (together with Johannes Hofmann and Dietmar Fey) for a paper on accurate analytic performance and power modeling. He received the "2011 Informatics Europe Curriculum Best Practices Award" (together with Jan Treibig and Gerhard Wellein) for outstanding contributions to teaching in computer science. His textbook "Introduction to High [Performance](https://blogs.fau.de/hager/hpc-book) Computing for Scientists and Engineers" is recommended or required reading in many HPC-related lectures and courses worldwide. Together with colleagues from FAU, HLRS Stuttgart, and TU Wien he develops and conducts successful international tutorials on node-level performance engineering and hybrid programming. A full list of publications, talks, and other things he is interested in can be found in his blog: [https://blogs.fau.de/hager.](https://blogs.fau.de/hager)

### Rolf Rabenseifner



Rolf Rabenseifner studied mathematics and physics at the University of Stuttgart. Since 1984, he is working at the High-Performance Computing-Center Stuttgart (HLRS). He led the projects DFN-RPC, a remote procedure call tool, and MPI-GLUE, the first metacomputing MPI combining different vendors' MPIs without losing the full MPI interface. In his dissertation he developed a controlled logical clock as global time for trace-based profiling of parallel and distributed applications. Since 1996, he is a member of the MPI-2 Forum and since Dec. 2007, he is in the steering committee of the MPI-3 Forum. Rolf was responsible for the MPI-2.1 version of the standard. From January to April 1999, he was an invited researcher at the Center for High-Performance Computing at the TU Dresden. Currently, he is head of Parallel Computing - Training and Application Services at HLRS. He is involved in MPI profiling and benchmarking, e.g., in the HPC Challenge Benchmark Suite. In recent projects, he studied parallel I/O, parallel programming models for clusters of SMP nodes, and optimization of MPI collective routines. In workshops and summer schools he teaches parallel programming models at many universities and labs in Germany, and also in Austria and Switzerland. In January 2012, the Gauss Centre of Supercomputing (GCS), with HLRS, LRZ in Garching and the Jülich Supercomputing Center as members, was selected as one of six PRACE Advanced Training Centers (PATCs). Rolf Rabenseifner was appointed as the GCS' PATC director.

- **Solution files:** 
	- data-rep\_sol\_2a.c
	- data-rep\_sol\_2d.c
	- data-rep\_sol\_2f.c
	- data-rep\_sol\_3-6.c
	- data-rep\_sol\_7.c
	- data-rep\_solution.c
- Quiz solution

#### $arr = (arrTwo +)$  malloc(arrSize \* sizeof(arrType)); MPI Comm split type (MPI COMM WORLD, MPI COMM TYPE SHARED,  $/*key =*/ 0,$ MPI<sup>INFO</sup>NULL, &comm\_shm); MPI Comm\_size(comm\_shm,  $\overline{ }$  &size\_shm); MPI<sup>C</sup>omm<sup>-</sup>rank(comm<sup>-shm, &rank<sup>-shm</sup>);</sup> if  $\overline{()}$  rank shm ==  $\overline{()}$   $\{$  individualShmSize = arrSize ;  $\}$ else  $\overline{\phantom{a}}$  { individualShmSize = 0 ; } MPI Win allocate shared(  $(\overline{MPI\text{ Aint}})(indivalShmsize) * (MPI\text{ Aint})(sizeof(arrType)),$ sizeof(arrType), MPI\_INFO\_NULL, comm\_shm, &shm\_buf\_ptr, &win ); MPI Win shared query( win, 0, &arrSize, &disp\_unit, &arr ); color=MPI\_UNDEFINED ; if (rank shm==0) { color =  $0;$  } MPI Comm  $\overline{s}$ plit(MPI COMM WORLD,  $\overline{color}$ , /\*key=\*/ 0, &comm head); if( $\overline{comm}$  head != MPI COMM NULL ) {MPI\_Comm\_size(comm\_head, &size\_head);MPI\_Comm\_rank(comm\_head, &rank\_head);} MPI Win fence(/\*workaround: no assertions:\*/ 0, win); <u>Starting write epoch</u> if(rank\_world==0) for( i=0; i<arrSize; i++) arr[i]=i+it; Filling arr **by process 0** if( comm head != MPI COMM NULL )  $\left\{\frac{\text{Only the heads of the shared memory islands fill arr by ...}\right\}$ **MPI\_Bcast(arr,** arrSize, arrDataType, 0, **comm\_head);** ... broadcasting to all heads } MPI\_Win\_fence(/\*workaround:no assertions:\*/0,win); Starting read epoch by all proc's sum=0; for( i=0; i<arrSize; i++) sum+= arr[i]; **Reading arr by all processes C** MPI/tasks/C/Ch11/data-rep/data-rep\_solution.c process is head of one of the shared memory islands **grey = original code** instead of **MPI\_COMM\_WORLD** (a 1-slide-solution-summary)

#### **The following slides show a step-by-step solving of this exercise**

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

MPI course  $\rightarrow$  Chap.11-(1) Shared Memory One-sided Communication  $\rightarrow$  Exercise 5 (advanced)

 $\frac{1}{2}$ 

### data-rep\_base.c

```
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
typedef long arrType ;
#define arrDataType MPI LONG /* !!!!! C A U T I O N : MPI Type must fit to arrType | !!!!! */
static const int arrSize=16*1.6E7 ;
int main (int argc, char *argv[])
{
  int it ;
  int rank world, size world;
  arrType \overline{x}arr;
  int i;
  long long sum ;
/* ===> 1 <=== */
 MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank_world);
  MPI_Comm_size(MPI_COMM_WORLD, &size_world);
/* ===> 2 <=== */
    arr = (arrType *) malloc(arrSize * sizeof(arrType));
    if (\text{arr} == \text{NULL})print(f("arr NOT allocated, not enough memory\n');
        MPI_Abort(MPI_COMM_WORLD, 0);
    }
...
                                                    In each process, allocating an array for the replicated
                                                    TODO: Allocating only once per shared memory node!
                                                           This will be done in 3 steps: 2a, 2b-d, 2e-f
                                        During the exercise steps, you may add additional declarations
```




Hybrid Programming – MPI+X  $\rightarrow$  Appendix  $\rightarrow$  Solutions  $\rightarrow$  datarep

#### data-rep\_sol\_2a.c

…

```
…
 MPI Comm comm shm;
 int size shm, rank shm;
…
/* ===> 2 <=== */
 /* Create --> shared memory islands and --> shared memory window inside *//* -- comm_shm and -- win */
 MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, /*key=*/ 0, MPI_INFO_NULL, &comm_shm);
 MPI_Comm_size(comm_shm, &size_shm);
 MPI_Comm_rank(comm_shm, &rank_shm);
 /*TEST*/ // To minimize the output, we print only from 3 process per SMP node
 /*TEST*/ if ( rank shm == 0 || rank shm == 1 || rank shm == size shm - 1 )
     printf("\t\trank ( world: %i/%i, shm: %i/%i)\n", rank world, size world, rank shm, size shm);
 /*TEST*/ if(rank world==0) printf("ALL finalize and return !!!.\n"); MPI Finalize(); return 0;
/* TO DO:
 * substitute the following malloc
 */
```


#### data-rep\_sol\_2d.c

**C**

…

```
…
 MPI Win win;
  int individualShmSize ;
  arrType *shm_buf_ptr;
 /* output MPI Win shared query */MPI Aint arrSize ;
 int disp unit ;
…
/* ===> 2 <=== */
…
 /* instead of: arr = (arrType * ) malloc(arrSize * sizeof(arrType)); */
  if ( rank_shm == 0 )
  { individualShmSize = arrSize ; }
  else
  { individualShmSize = 0 ; }
  MPI_Win_allocate_shared( (MPI_Aint)(individualShmSize) * (MPI_Aint)(sizeof(arrType)), 
                                     sizeof(arrType), MPI_INFO_NULL, comm_shm, &shm_buf_ptr, &win );
  /* shm buf ptr is not used because it is only available in process rank shm==0 */MPI_Win_shared_query( win, 0, &arrSize_, &disp_unit, &arr );
  /*TEST*/ // To minimize the output, we print only from 3 process per SMP node
  /*TEST*/ if ( rank shm == 0 || rank shm == 1 || rank shm == size shm - 1 )
    printf("\t\trank ( world: \frac{2}{3}i/\frac{2}{3}i, shm: \frac{2}{3}i/\frac{2}{3}i) arrSize \frac{2}{3}i arrSize \frac{2}{3}i shm buf ptr=\frac{2}{3}p arr ptr=\frac{2}{3}p \leq 1,
            rank world, size world, rank shm, size shm, arrSize, (int) (arrSize), shm buf ptr, arr );
  /*TEST*/ if(rank world==0) printf("ALL finalize and return !!!.\n"); MPI Finalize(); return 0;
/* TO DO: Create communicator comm head with MPI Comm_split \rightarrow including all the rank shm == 0 processes.
                                                                                data-rep_sol_2d_30.f90
                                                                                 …
                                                                                ! INTEGER*8, DIMENSION(:), ALLOCATABLE :: 
                                                                                arr
                                                                                  INTEGER*8, DIMENSION(:), POINTER :: arr
                                                                                 …
                                                                                 /* ===> 1 <=== */
                                                                                 … similar to C
                                                                                 /* ===> 2 <=== */
                                                                                 …
                                                                                 → See next slide
                                                                      Fortran
```
Hybrid Programming – MPI+X  $\rightarrow$  Appendix  $\rightarrow$  Solutions  $\rightarrow$  datarep

#### data-rep\_sol\_2d\_f90.c **Fortran**

…

```
! INTEGER*8, DIMENSION(:), ALLOCATABLE :: arr
 INTEGER*8, DIMENSION(:), POINTER :: arr
/* ===> 1 <=== */
TYPE(MPI_Win) :: win
 INTEGER :: individualShmSize
 TYPE(C_PTR) :: arr_ptr, shm_buf_ptr
 INTEGER(KIND=MPI_ADDRESS_KIND) :: arrDataTypeSize, lb, ShmByteSize
! /* output MPI Win shared query */
 INTEGER(kind=MPI_ADDRESS_KIND) :: arrSize_
 INTEGER :: disp_unit
/* ===> 2 <=== */
 ! instead of: ALLOCATE(arr(1:arrSize))
 IF (rank shm == 0) THEN
    individualShmSize = arrSize
 ELSE
    individualShmSize = 0
  ENDIF
 CALL MPI_Type_get_extent(arrDataType, lb, arrDataTypeSize)
 ShmByteSize = individualShmSize * arrDataTypeSize
 disp_unit = arrDataTypeSize
 CALL MPI Win allocate shared( ShmByteSize, disp_unit, MPI_INFO_NULL, comm_shm, shm_buf_ptr, win )
 ! /* shm buf ptr is not used because it is only available in process rank shm==0 */CALL MPI_Win_shared_query( win, 0, arrSize_, disp_unit, arr_ptr )
 CALL C_F_POINTER(arr_ptr, arr, (/arrSize/) )
 ! TEST: To minimize the output, we print only from 3 process per SMP node
 IF ( (rank shm == 0) .OR. (rank shm == 1) .OR. (rank shm == size shm - 1) ) THEN
   WRITE(*,*) 'rank( world=',rank_world,' shm=',rank_shm,')',' arrSize=',arrSize,' arrSize =',arrSize
 ENDIF
 IF (rank world == 0) WRITE(*,*) 'ALL finalize and return!!!'; CALL MPI_Finalize(); STOP
```


### data-rep\_sol\_2f.c

```
…
 int color ;
 MPI Comm comm head;
 int size head, rank head;
…
/* ===> 2 <=== */
…
 /* Create communicator including all the rank shm = 0 *//* with the MPI Comm split: in color 0 all the rank shm = 0 ,
  * all other ranks are color = 1
 color=MPI_UNDEFINED ;
  if (rank shm==0) color = 0 ;
 MPI_Comm_split(MPI_COMM_WORLD, color, /*key=*/ 0, &comm_head);
 rank head = -1; // only used in the print statements to differentiate unused rank=-1 from used rank==0
  if( comm_head != MPI_COMM_NULL ) // if( color == 0 ) // rank is element of comm_head, i.e., it is head of one of 
the islands in comm_shm
  {
    MPI_Comm_size(comm_head, &size_head);
    MPI_Comm_rank(comm_head, &rank_head);
  }
 /*TEST*/ // To minimize the output, we print only from 3 process per SMP node
 /*TEST*/ if ( rank shm == 0 || rank shm == 1 || rank shm == size shm - 1 )
     printf("\t\trank ( world: %i/%i, shm: %i/%i, head: %i/%i) arrSize %i arrSize %i shm buf ptr = %p, arr ptr = %p \n",
rank world,size world,rank shm,size shm,rank head,size head, arrSize, (int)(arrSize), shm buf ptr, arr);
 /*TEST*/ if(rank world==0) printf("ALL finalize and return !!!.\n"); MPI_Finalize(); return 0;...
```
Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Appendix  $\rightarrow$  Solutions  $\rightarrow$  datarep


### Solutions of MPI shared memory exercise: datarep

data-rep\_sol\_3-6.c (on this slide steps 3-4)

```
…
 /* ===> 3 <=== */
for( it = 0; it < 3; it++)
 {
/* only rank world=0 initializes the array arr *//* all rank_shm=0 start the write epoch: writing arr to their shm */
  MPI_Win_fence(/*workaround: no assertions:*/ 0, win);
  if( rank world == 0 ) /* from those rank shm=0 processes, only rank world==0 fills arr */
   {
    for( i = 0; i < \text{arrSize}; i++){ arr[i] = i + it ; }
   }
/* ===> 4 <=== */
/* Instead of all processes in MPI_COMM_WORLD, now only the heads of the
 * shared memory islands communicate (using comm_head).
 * Since we used key=0 in both MPI_Comm_split(...), process rank_world = 0
 * - is also rank 0 in comm_head
 * - and rank 0 in comm_shm in the color it belongs to. */
  if( comm head != MPI COMM NULL ) // if( color == 0 )
   {
    MPI Bcast(arr, arrSize, arrDataType, 0, comm head);
    /* with this Bcast, all other rank shm=0 processes write the data into their arr */
   }
```
Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

```
Hybrid Programming – MPI+X \rightarrow Appendix \rightarrow Solutions \rightarrow datarep
```
## Solutions of MPI shared memory exercise: datarep

data-rep\_sol\_3-6.c (on this slide steps 5-6)

```
…
/* ===> 5 <=== */
  MPI_Win_fence(/*workaround: no assertions:*/ 0, win);
                            // after the fence all processes start a read epoch
/* Now, all other ranks in the comm_sm shared memory islands are allowed to access their shared memory array. */
/* And all ranks rank sm access the shared mem in order to compute sum */sum = 0:
  for( i = 0; i < \text{arrSize}; i++){
     //sum+= * ( shm buf ptr - rank shm * shmSize + i ) ;
     sum+= arr \mid i \mid;
   }
/* ===> 6 <=== */
 /*TEST*/ // To minimize the output, we print only from 3 process per SMP node
 /*TEST*/ if ( rank shm == 0 || rank shm == 1 || rank shm == size shm - 1 )
     printf ("it: %i, rank ( world: %i/%i, shm: %i/%i, head: %i/%i ):\tsum(i=%d...i=%d) = %lld \n",
                  it,rank world,size world,rank shm,size shm,rank head,size head,it,arrSize-1+it,sum);
 }
  /*TEST*/ if(rank world==0) printf("ALL finalize and return !!!.\n"); MPI Finalize(); return 0;
…
```


## Solutions of MPI shared memory exercise: datarep

#### data-rep\_sol\_7.c …

```
/* ===> 7 <=== */
 MPI_Win_fence(/*workaround: no assertions:*/ 0, win);
                   // free destroys the shm. fence to guarantee that read epoch has been finished
 MPI_Win_free(&win);
…
data-rep_solution.c
…
/* ===> 2 <=== */
…
// ADD ON: calculates the minimum and maximum sille of size shm
  int mm[2], minmax[2]; mm[0] = \frac{1}{2}size shm ; mm[1] = size shm ;
  if( comm_head != MPI_COMM_NULL )
  {
   MPI Reduce( mm, minmax, 2, MPI INT, MPI MAX, 0, comm head) ;
  }
  if( rank world == 0 )
  {
    printf("\n\tThe number of shared memory islands is: %i islands \n", size head ) ;
    if ( minmax[0] + minmax[1] == 0 )
    printf("\tThe size of all shared memory islands is: %i processes\n", -minmax[0] ) ;
    else
    printf("\tThe size of the shared memory islands is between min = \Si and max = \Si processes \n",
                                                                    -minmax[0], minmax[1]);
  }
// End of ADD ON. Note that the following algorithm does not require same sizes of the shared memory islands
/* ===> 3 <=== */
…
                                  Trick:
                                  Calculate the minimum through 
                                  calculating the maximum for the negative values
```
Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)



# **Quiz on Shared Memory**

- A. Before you call **MPI\_Win\_allocate\_shared**, what should you do? **MPI\_Comm\_split\_type**(comm\_old, **MPI\_COMM\_TYPE\_SHARED,** …, &comm\_sm) will quarantee that comm sm contains only processes of the same shared memory island.
- B. If your communicator within your shared memory island consists of 12 MPI processes, and each process wants to get an own window with 10 doubles (each 8 bytes),
	- a. which **window size** must you specify in **MPI\_Win\_allocate\_shared**?

 $10 * 8 = 80$  bytes

- b. And how long is the totally allocated shared memory? 80 \* 12 = 960 bytes
- c. The returned base\_ptr, will it be identical on all 12 processes? No, within each process, the base\_ptr points to its own portion of the totally allocated shared mem.
- d. If all 12 processes want to have a pointer that points to the beginning of the totally allocated shared memory, which MPI procedure should you use and with which major argument?

MPI Win shared query with rank  $= 0$ 

- e. If you do this, do these 12 pointers have identical values, i.e., are identical addresses? No, they point to the same physical address, but each MPI process may use different virtual addresses for this.
- C. Which is the major method to store data from one process into the shared memory window portion of another process?

#### **Normal assignments** (with C/C++ or Fortran) to the correct location, i.e., **no** calls to MPI\_Put/Get.

 $\sim$ 

Rolf Rabenseifner (HLRS), Georg Hager (NHR@FAU), Claudia Blaas-Schenner (VSC, TU Wien)

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Quiz 1 solution 256



## **Quiz on Shared Memory Model & Synchronization**

- A. Which MPI memory model applies to MPI shared memory? MPI\_WIN\_SEPARATE or **MPI\_WIN\_UNIFIED** ?
- B. "Public and private copies are **eventually** synchronized without additional RMA calls."



- Which process-to-process synchronization methods can be used that, e.g., a store to a shared memory variable gets visible to another process (within the processes of the shared memory window)?
	- **Any MPI one-sided synchronization** (e.g., MPI\_Win\_fence, …\_post/start, …, …\_lock/unlock)
	- **Any** (MPI) **synchronization** together with a **pair of MPI\_Win\_sync**
	- **Any** (MPI) **synchronization** together with a **pair of C++11 atomic\_thread\_fence(order)**
- D. That such a store gets visible in another process after the synchronization is named here as "*write-read-rule*". Which other rules are implied by such synchronizations and what do they mean?
	- *Read-write-rule*: a **load** (=*read*) in one process before the synchronization cannot be affected by a **store** (=*write*) in another process after the synchronization.
	- *Write-write-rule*: a **store** (=*write*) in one process before the synchronization cannot overwrite a **store** (=*write*) in another process after the synchronization.
- E. How can you define a **race-condition** and which problems arise from **cache-line false-sharing**?
	- Two processes access the **same shared variable** and at **least one process modifies** the variable **and the accesses are concurrent**.
	- Significant performance problems if two or more processes **often access different portions of the same cache-line**.

Hybrid Programming – MPI+X  $\rightarrow$  Programming models  $\rightarrow$  MPI + MPI-3.0 shared memory  $\rightarrow$  Quiz 2 solution 257/2399 257/2399



ш

# **Quiz on Virtual topologies**

- A. Which types of MPI topologies for virtual process grids exist?
- B. And for which use cases?

### **1. Cartesian topologies**

- **For** Cartesian data meshes with identical compute time per mesh element
- **For** any Cartesian process grid with identical compute time per process and numerical epoch, and its communication mainly on the virtual Cartesian grid between the processes
- **2. Distributed graph topologies and graph topologies**
	- **For** applications with unstructured grids
- C. Where are limits for using virtual topologies, i.e., which use cases do not really fit?
	- Applications with mesh refinements, dynamic load balancing and diffusion of mesh elements to other processes
		- $\rightarrow$  all cases with **changing virtual process grids over time**;
	- Communication pattern not known in advance.

