NHR Learning Platform

Assignment 6: GPUs

Opened: Wednesday, 11 June 2025, 12:00 AM
Due: Thursday, 19 June 2025, 10:05 AM

You are given the specifications of a hypothetical HPC cluster. It features 192 compute nodes, each with 2 CPUs and 4 GPUs. Each CPU has 64 cores, each capable of executing 1 AVX512 FMA per cycle at 2.4 GHz. Each GPU has 168 Streaming Multiprocessors (SMs), each with 4 SM Subpartitions (SMSPs). Each SMSP can execute 1 double-precision FMA or 16 single-precision FMAs per cycle at 1.5 GHz. For the following exercises, assume the GPUs do not use their tensor cores. The CPUs and GPUs of a node are connected via 2 PCIe interfaces, each capable of transferring 32.0 GB/s in each direction; the interfaces can operate concurrently.

  1. (16 crd) Compute the theoretical peak floating-point performance (in FLOP/s) for both single and double precision in each of the following cases:
    • (a) The CPUs of a single node
    • (b) A single GPU
    • (c) All GPUs of a single node
    • (d) All GPUs in the entire cluster
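
    As a reminder of the standard approach (not part of the assignment statement), a peak number is assembled by multiplying the per-unit throughput up through the hierarchy, counting each FMA as 2 FLOPs:

    ```latex
    P_{\text{peak}} = n_{\text{units}} \cdot n_{\text{FMA/cycle}} \cdot 2\,\tfrac{\text{FLOP}}{\text{FMA}} \cdot n_{\text{SIMD}} \cdot f
    ```

    For AVX512, one register holds 8 double-precision or 16 single-precision operands; for the GPU, the SMSP throughputs given above already include the precision-dependent width.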

  2. (16 crd) Consider an application that, among other things, executes dense matrix-matrix multiplications and the following scenarios:
    • (a) Using the CPUs only
    • (b) Using only the GPUs; matrices are already on the device and no host-device transfer is required
    • (c) Using only the GPUs; input matrices need to be transferred from the host to the device, and the output matrix needs to be transferred back to the host. Overlapping both transfers is not possible.
    Estimate the time required for the multiplication of two square matrices of size 4096 × 4096 on one full compute node, for both single and double precision, in each of the given scenarios. What is the speed-up of scenario (c) over scenario (a) (for both single and double precision)?
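
    A dense N × N matrix multiplication performs 2N³ floating-point operations, so a first-order estimate (assuming the hardware runs at peak) is:

    ```latex
    T_{\text{comp}} \approx \frac{2N^3}{P_{\text{peak}}}, \qquad
    T_{\text{transfer}} = \frac{V}{B_{\text{PCIe}}}, \qquad V = 3N^2 \cdot s
    ```

    where s is the size of one element (4 or 8 bytes) and the three N² terms account for the two input matrices and the output matrix.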

  3. (8 crd) Consider the following kernel code:
    __global__ void kernel(float* A, float* B, float* C, int size) {
      int tidx = threadIdx.x + blockDim.x * blockIdx.x;
      int tidy = threadIdx.y + blockDim.y * blockIdx.y;
    
      if (tidx >= size || tidy >= size) return;
    
      C[tidx + tidy*size] += A[tidx + tidy*size] * B[tidy];
    }
    Assuming a DRAM latency of 800 cycles on the GPU, a clock frequency of 1.5 GHz, and a DRAM bandwidth of 900.0 GB/s, answer the following questions:
    • (a) What is the minimum number of threads to saturate the DRAM interface?
    • (b) How do you choose the thread block size to minimize L2 cache traffic?
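
    Part (a) is an application of Little's law: to hide a latency T_lat at bandwidth B, at least B · T_lat bytes must be in flight at any time. As a sketch:

    ```latex
    n_{\text{threads}} \;\ge\; \frac{B \cdot T_{\text{lat}}}{\text{bytes in flight per thread}},
    \qquad T_{\text{lat}} = \frac{800\ \text{cycles}}{1.5\ \text{GHz}}
    ```

    with the bytes in flight per thread read off from the loads and stores the kernel issues per element.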

  4. Write a benchmark application that scales all elements of an array by 2 (i.e., implements a[i] = 2 * a[i]). Array elements are of type double.
    Run the code on a single A40 GPU of the Alex cluster using the following steps:
    • Log in to the Alex cluster (alex.nhr.fau.de).
    • Allocate a GPU node for 90 minutes with the command:
      salloc -N 1 --gres=gpu:a40:1 -p a40 --time 1:30:00
      Submitting a batch script works as well.
    • Load the cuda module with the command
      module load cuda
      to make the nvcc compiler available. For the nvc++ compiler, load the nvhpc module.
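      A typical compile line might look as follows (the source file name scale.cu is illustrative; the A40 has compute capability 8.6):

      ```shell
      nvcc -O3 -arch=sm_86 scale.cu -o scale
      ```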

    (30 crd) Implement the benchmark application.
    It needs to run on a single GPU, be built on CUDA, and use double-precision floating-point numbers for the array elements. Your application also needs to take the following parameters from the command line:
    • Problem size (number of elements in the array)
    • Number of kernel repetitions
    • CUDA grid size (number of blocks)
    • CUDA block size (number of threads per block)
    Since you specify both the number of blocks and their size, you need to use a grid-stride loop. The number of kernel repetitions simply means launching your main kernel the specified number of times. Due to mechanics not discussed in detail here, the CUDA runtime ensures that individual kernel executions do not overlap.
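
    A grid-stride loop decouples the grid size from the problem size: each thread starts at its global index and advances by the total number of threads in the grid. A minimal sketch (kernel name and signature are illustrative):

    ```cuda
    __global__ void scale(double* a, size_t n) {
        // The stride is the total number of threads in the grid.
        size_t stride = (size_t)gridDim.x * blockDim.x;
        for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
             i < n; i += stride) {
            a[i] = 2.0 * a[i];  // scale each element by 2
        }
    }
    ```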
    Make sure to synchronize the device once before starting the kernel timing and again before stopping it.
    Copy the (input) array from the host to the device before starting the kernel timing, and copy the updated array back to the host after finishing it.
    Your application needs to report
    • the time spent in host-device transfers,
    • the time spent in computing,
    • the time spent in device-host transfers, and
    • estimated bandwidths for these three sections.
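
    One possible host-side structure for the measurements, as a sketch with error checking omitted (scale stands for your grid-stride kernel; the argument order is an assumption):

    ```cuda
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void scale(double* a, size_t n);  // your grid-stride kernel

    static double seconds() {
        return std::chrono::duration<double>(
            std::chrono::steady_clock::now().time_since_epoch()).count();
    }

    int main(int argc, char** argv) {
        // Assumed argument order: size, repetitions, grid size, block size.
        size_t n  = std::atoll(argv[1]);
        int reps  = std::atoi(argv[2]);
        int grid  = std::atoi(argv[3]);
        int block = std::atoi(argv[4]);
        size_t bytes = n * sizeof(double);

        double* h = (double*)std::malloc(bytes);
        for (size_t i = 0; i < n; ++i) h[i] = 1.0;
        double* d;
        cudaMalloc(&d, bytes);

        double t0 = seconds();
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  // blocking copy
        double t1 = seconds();

        cudaDeviceSynchronize();          // synchronize before kernel timing
        double t2 = seconds();
        for (int r = 0; r < reps; ++r)
            scale<<<grid, block>>>(d, n);
        cudaDeviceSynchronize();          // ...and again before stopping it
        double t3 = seconds();

        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
        double t4 = seconds();

        // The kernel reads and writes each element once: 2*bytes per repetition.
        std::printf("H2D:    %.4f s  %.2f GB/s\n", t1 - t0, 1e-9 * bytes / (t1 - t0));
        std::printf("Kernel: %.4f s  %.2f GB/s\n", t3 - t2, 1e-9 * 2.0 * bytes * reps / (t3 - t2));
        std::printf("D2H:    %.4f s  %.2f GB/s\n", t4 - t3, 1e-9 * bytes / (t4 - t3));

        cudaFree(d);
        std::free(h);
        return 0;
    }
    ```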

  5. (12 crd) Experiment with different execution configurations. For a problem size of 32 million elements, 64 repetitions, and 168 blocks, plot the bandwidth of the computation (excluding transfers between host and device) over varying block sizes.
    What is the minimum block size in this particular case for saturating the memory bandwidth? How many threads does this amount to in total?

    Remember that the maximum block size is 1024 threads.

  6. (18 crd) Measure and plot
    • the estimated bandwidth of the kernel, and
    • the bandwidth of the device-host transfer,
    each over problem sizes varying from 1 to 1 billion elements.
    Assume a fixed execution configuration of 1344 blocks with 256 threads each, and 32 repetitions.
    What is the maximum bandwidth you observed in both series and how does it relate to the theoretical maximum (DRAM and PCIe)?