Hands-on: ccNUMA (Part 2)
Detecting NUMA imbalance
Run the code with the MEM group and observe the memory data volume per CPU socket:
$ likwid-perfctr -C N:0-31 -g MEM -m ./perf 2500 40000
Measure the actual traffic that goes over the CPU socket interconnect on ICX systems:
$ likwid-perfctr -C N:0-31 -g UPI -m ./perf 2500 40000
Fix NUMA imbalance with memory policies
The system tool numactl provides command-line switches to manipulate the memory policy applied by the Linux kernel. The default policy is "first touch," which means that data gets mapped into the ccNUMA domain closest to the initializing (i.e., writing) hardware thread.
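To see why the initialization pattern matters, here is a minimal, self-contained sketch of the first-touch effect (illustrative only, not part of the hands-on code): under the default policy, a page is placed in the ccNUMA domain of the thread that writes it first, so touching the data in parallel with the same static schedule as the compute loop keeps later accesses local.
// first_touch.cpp -- compile with: g++ -O2 -fopenmp first_touch.cpp
// Illustrative sketch of the first-touch placement policy.
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t n = 100000000;
    double *a = new double[n];   // allocation alone does not place any pages

    // Parallel first touch: each thread writes "its" chunk, so those pages
    // are mapped into that thread's local ccNUMA domain (default policy).
#pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        a[i] = 0.0;

    // A compute loop with the same static schedule then accesses mostly
    // local memory instead of pulling data over the socket interconnect.
    double sum = 0.0;
#pragma omp parallel for schedule(static) reduction(+ : sum)
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];

    std::printf("sum = %f\n", sum);
    delete[] a;
    return 0;
}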
Try different memory policies and check the performance:
$ numactl <options> likwid-perfctr -C N:0-31 -g <group> -m ./perf 2500 40000
Try the following options (concrete command lines are given after the list):
- bind all allocations to CPU socket 0 (-m 0)
- bind all allocations to CPU socket 1 (-m 1)
- interleave allocations across CPU sockets 0 and 1 (-i 0,1)
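For example, with the MEM group (use UPI instead to check the interconnect traffic):
$ numactl -m 0 likwid-perfctr -C N:0-31 -g MEM -m ./perf 2500 40000
$ numactl -m 1 likwid-perfctr -C N:0-31 -g MEM -m ./perf 2500 40000
$ numactl -i 0,1 likwid-perfctr -C N:0-31 -g MEM -m ./perf 2500 40000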
Which policy gives the best performance? Check the interconnect traffic with the UPI group.
Add parallel initialization of the grid
The grid (i.e., the vectors in the algorithm) is allocated and then initialized by a single thread. Add parallel initialization to all three constructors of the Grid class (at the top of Grid::Grid() in src/Grid.cpp) and measure the traffic between the CPU sockets and the local memory bandwidth again.
$ make
$ likwid-perfctr -C N:0-31 -g <group> -m ./perf 2500 40000
Is the data traffic on the UPI interconnect reduced? Compare the results with the data you got when applying the memory policies. Does the performance scale across the sockets now?
Final question: which option gives the best performance?
- interleaving the memory pages across the domains and ignoring all ccNUMA issues, or
- doing proper parallel initialization and accepting bad scaling of the preconditioner?
What could be a (strategic) conclusion? What should we do next?