Assignment 6: GPUs
Completion requirements
Opened: Thursday, 13 June 2024, 12:00 AM
Due: Thursday, 20 June 2024, 10:03 AM
- (5 crd) Each node of Alex, the NHR@FAU's GPU cluster, consists of 2x 64-core AMD Epyc CPUs and 8 A100 GPUs. Each core can execute 2 AVX2 FMAs per cycle and runs at 2 GHz. Each of the 108 SMs in an A100 GPU can execute 32 double-precision FMAs per cycle and runs at 1.41 GHz.
Compute the peak double-precision floating-point performance of the node using (a) only the CPUs, (b) a single GPU, and (c) all GPUs, without considering tensor units.
- (20 crd) Consider a CPU-based program that, among other things, executes a double-precision matrix-matrix multiplication.
Estimate the speedup of moving this DGEMM to the GPUs of Alex for the multiplication of two square matrices of size 2048.
Assume that for each DGEMM both input matrices have to be transferred to the GPU and the result matrix has to be transferred back for further use on the CPU via the PCIe 4.0 x16 interface at 25 GB/s (read or write). One PCIe interface is shared by 4 GPUs. Also assume that each matrix entry only has to be transferred once from host to device and is then distributed to the other GPUs as needed via NVLink. These GPU-to-GPU transfers can be fully overlapped and do not contribute to the performance estimate.
- (5 crd) In the nodes of Frontier, the current #1 system on the Top500, a 64-core AMD Epyc CPU is paired with 8 MI250 GPU dies. As on Alex, each core can execute 2 AVX2 FMA instructions per cycle. Each MI250 die features 110 Compute Units (CUs), each of which executes 64 FMAs per cycle at 1.7 GHz.
Compute the peak double-precision floating-point throughput of the CPU and GPU parts, without considering tensor units.
- (15 crd) Each GPU die in Frontier is connected to the CPU via a 36 GB/s read and 36 GB/s write link.
Estimate the maximum speedup of moving the matrix multiplication from the previous task from the CPUs to the GPUs of Frontier. How much faster is one Frontier node than one Alex node in this scenario?
- (15 crd) Consider the following kernel code:
__global__ void kernel(double* A, double* B, double* C, int size) {
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int tidy = threadIdx.y + blockDim.y * blockIdx.y;
    if (tidx >= size || tidy >= size)
        return;
    C[tidx + tidy*size] += A[tidx + tidy*size] * B[tidy];
}
On a GPU with a DRAM latency of 670 cycles, a clock frequency of 1.41 GHz, and a DRAM interface bandwidth of 2039 GB/s, (a) what is the minimum number of threads needed to saturate the DRAM interface? (b) How do you choose the thread block size to minimize cache traffic?
- Write a simple benchmark code that adds 1 to all elements of an array (i.e., implements
a[i] = a[i] + 1.0
) on an A100-SXM-40GB GPU in Alex using CUDA in double precision.
Logging in to the Alex cluster is straightforward: use alex.nhr.fau.de as the hostname.
On Alex you can request a node with a 40GB A100 GPU for interactive work by submitting the following command:
salloc -N 1 --gres=gpu:a100:1 -p a100 -C a100_40 --time 4:00:00
Of course, these options work just as well if you submit a batch script.
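The same resource options can go into a batch script. An illustrative sketch; the program name a1_bench and its arguments are placeholders for your own benchmark:

```shell
#!/bin/bash -l
#SBATCH -N 1
#SBATCH --gres=gpu:a100:1
#SBATCH -p a100
#SBATCH -C a100_40
#SBATCH --time=4:00:00

module load cuda
# size, repetitions, grid size (blocks), block size (threads per block)
./a1_bench 33554432 16 864 256
```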
You need to load the cuda module with
module load cuda
to make the NVIDIA tools (such as nvcc) available. For nvc++ the nvhpc module is required.
(a) (20 crd) Implement the benchmark on the GPU. Your application needs to take the following parameters:
• the problem size (number of elements in the array),
• the number of kernel repetitions,
• the CUDA grid size (number of blocks), and
• the CUDA block size (number of threads per block).
Since you specify the number of blocks as well as their size, you need to use a grid-stride loop. The number of kernel repetitions simply means launching your main kernel the specified number of times. The CUDA runtime ensures (via mechanics not discussed in detail here) that individual kernel executions do not overlap. Make sure to synchronize once before starting the kernel timing and once before stopping it.
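A minimal sketch of the grid-stride kernel and the timing pattern described above (synchronize before starting and before stopping the timer). The timer choice, names, and the omitted error checking and host-device transfers are illustrative, not a complete solution:

```cuda
#include <cstdio>
#include <cstdlib>
#include <chrono>

// Grid-stride loop: each thread handles elements tid, tid + stride, ...,
// so any combination of grid size and block size covers the whole array.
__global__ void add_one(double* a, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        a[i] += 1.0;
}

int main(int argc, char** argv) {
    size_t n   = atol(argv[1]);   // problem size (elements)
    int reps   = atoi(argv[2]);   // kernel repetitions
    int blocks = atoi(argv[3]);   // grid size
    int tpb    = atoi(argv[4]);   // block size

    double* d_a;
    cudaMalloc(&d_a, n * sizeof(double));
    // ... host allocation and the H->D copy, timed separately ...

    cudaDeviceSynchronize();                  // drain pending work first
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        add_one<<<blocks, tpb>>>(d_a, n);
    cudaDeviceSynchronize();                  // wait for all kernel launches
    auto t1 = std::chrono::steady_clock::now();

    double s = std::chrono::duration<double>(t1 - t0).count();
    // each repetition reads and writes every element once
    printf("kernel: %.3f GB/s\n", 2.0 * n * sizeof(double) * reps / s / 1e9);
    cudaFree(d_a);
}
```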
Your application needs to report
• the time spent in host-device transfers,
• the time spent in computing the actual vector add,
• the time spent in device-host transfers, and
• the bandwidths for these three sections.
(b) (10 crd) Experiment with different execution configurations. For a problem size of 32 million elements, 16 repetitions, and 864 blocks, plot the bandwidth of the computation (excluding transfers between host and device) over varying block sizes.
What is the minimum block size that achieves full bandwidth in this particular case? How many threads does this amount to in total?
Remember that the maximum block size is 1024 threads.
(c) (10 crd) Measure and plot
• the bandwidth of the kernel, and
• the bandwidth for the host-device transfer,
each over problem sizes varying from 1 to 1 billion elements. Assume a fixed execution configuration of 3456 blocks with 256 threads each, and 24 repetitions. What is the maximum bandwidth in each case?