Hands-on: ccNUMA (Part 2)
Detecting NUMA imbalance
Run the code with the MEM group and observe the memory data volume per CPU socket:
$ likwid-perfctr -C N:0-31 -g MEM -m ./perf 2500 40000
Measure the actual traffic that goes over the CPU socket interconnect on ICX systems:
$ likwid-perfctr -C N:0-31 -g UPI -m ./perf 2500 40000
Fix NUMA imbalance with memory policies
The system tool numactl provides command-line switches to manipulate the memory policy applied by the Linux kernel. The default policy is "first touch," which means that data gets mapped into the ccNUMA domain closest to the initializing (i.e., writing) hardware thread.
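To see why the initialization pattern matters, here is a minimal, self-contained sketch of the first-touch effect (illustrative only, not part of the hands-on code): under the default policy, a page is placed in the ccNUMA domain of the thread that writes it first, so touching the data in parallel with the same static schedule as the compute loop keeps later accesses local.
// first_touch.cpp -- compile with: g++ -O2 -fopenmp first_touch.cpp
// Illustrative sketch of the first-touch placement policy.
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t n = 100000000;
    double *a = new double[n];   // allocation alone does not place any pages

    // Parallel first touch: each thread writes "its" chunk, so those pages
    // are mapped into that thread's local ccNUMA domain (default policy).
#pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        a[i] = 0.0;

    // A compute loop with the same static schedule then accesses mostly
    // local memory instead of pulling data over the socket interconnect.
    double sum = 0.0;
#pragma omp parallel for schedule(static) reduction(+ : sum)
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];

    std::printf("sum = %f\n", sum);
    delete[] a;
    return 0;
}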
Try different memory policies and check the performance:
$ numactl <options> likwid-perfctr -C N:0-31 -g <group> -m ./perf 2500 40000
Try the following options (concrete command lines are given after the list):
- bind all allocations to CPU socket 0 (-m 0)
- bind all allocations to CPU socket 1 (-m 1)
- interleave allocations across CPU sockets 0 and 1 (-i 0,1)
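For example, with the MEM group (use UPI instead to check the interconnect traffic):
$ numactl -m 0 likwid-perfctr -C N:0-31 -g MEM -m ./perf 2500 40000
$ numactl -m 1 likwid-perfctr -C N:0-31 -g MEM -m ./perf 2500 40000
$ numactl -i 0,1 likwid-perfctr -C N:0-31 -g MEM -m ./perf 2500 40000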
Which policy gives the best performance? Check the interconnect traffic with the UPI group.
Add parallel initialization of the grid
The grid (i.e., the vectors in the algorithm) is allocated and then initialized by a single thread. Add parallel initialization to all three constructors of the Grid class (at the top of Grid::Grid() in src/Grid.cpp) and measure the traffic between the CPU sockets and the local memory bandwidth again.
$ make
$ likwid-perfctr -C N:0-31 -g <group> -m ./perf 2500 40000
Is the data traffic on the UPI interconnect reduced? Compare the results with the data you got when applying the memory policies. Does the performance scale across the sockets now?
Final question: which option gives the best performance?
- interleaving the memory pages across the domains and ignoring all ccNUMA issues, or
- doing proper parallel initialization and accepting bad scaling of the preconditioner?
What could be a (strategic) conclusion? What should we do next?