NLPE@LRZ: Hands-on: Analyzing the MiniMD proxy app

A diagnostic performance analysis of the MiniMD proxy app

In this exercise you will quantify and compare the effectiveness of SIMD vectorization for a molecular dynamics (MD) benchmark. In MD, one solves the Newtonian equations of motion for a collection of particles that exert forces on each other. One important part of this is the force calculation: How much force is exerted on each individual atom? There are two basic ways to do this:

Full-neighbor method: Iterate through the particles. For each particle, consider all its relevant neighbors and add up all the forces (according to some interaction potential, like Lennard-Jones) they exert on the current particle. This calculates every pair force twice: by Newton's Third Law the force between two particles acts on both of them, but here only the force on the current particle is accumulated.
Half-neighbor method: Iterate through the particles. For each particle, consider all its relevant neighbors. For each neighbor, accumulate the force it exerts on the current particle and also accumulate the opposing force on the neighbor. This saves a lot of work (almost half), since each interacting pair needs to be evaluated only once.
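
To make the difference between the two methods concrete, here is a minimal Python sketch. It is not the miniMD implementation: it assumes Lennard-Jones forces in reduced units (epsilon = sigma = 1), a hypothetical cutoff of 2.5, no periodic boundaries, and it treats every other particle as a neighbor instead of using neighbor lists. All function names are made up for illustration.

```python
# Illustrative sketch only, not the miniMD code.
# Assumptions: reduced LJ units (epsilon = sigma = 1), hypothetical cutoff,
# no periodic boundaries, all other particles count as neighbors.

CUTOFF = 2.5  # hypothetical cutoff radius


def lj_pair_force(ri, rj):
    """Force on particle i exerted by particle j (zero beyond the cutoff)."""
    d = [a - b for a, b in zip(ri, rj)]
    r2 = sum(c * c for c in d)
    if r2 >= CUTOFF * CUTOFF:
        return [0.0, 0.0, 0.0]
    inv_r2 = 1.0 / r2
    inv_r6 = inv_r2 ** 3
    scale = 48.0 * inv_r6 * (inv_r6 - 0.5) * inv_r2
    return [scale * c for c in d]


def forces_full_neigh(pos):
    """Full-neighbor variant: every ordered pair (i, j) is evaluated."""
    n = len(pos)
    f = [[0.0] * 3 for _ in range(n)]
    evals = 0
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            fij = lj_pair_force(pos[i], pos[j])
            evals += 1
            for k in range(3):
                f[i][k] += fij[k]  # only the current particle i is written
    return f, evals


def forces_half_neigh(pos):
    """Half-neighbor variant: every unordered pair is evaluated once."""
    n = len(pos)
    f = [[0.0] * 3 for _ in range(n)]
    evals = 0
    for i in range(n):
        for j in range(i + 1, n):
            fij = lj_pair_force(pos[i], pos[j])
            evals += 1
            for k in range(3):
                f[i][k] += fij[k]
                f[j][k] -= fij[k]  # opposing force written to neighbor j
    return f, evals


# Four arbitrary particle positions, chosen only for this demonstration.
positions = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 1.2, 0.0), (1.0, 1.0, 0.9)]
_, n_full = forces_full_neigh(positions)
_, n_half = forces_half_neigh(positions)
print("pair evaluations (full, half):", n_full, n_half)
```

Both functions return the same forces; they differ in the number of pair evaluations and in which array entries the inner loop writes to. Keep the latter in mind when you look at the real loops in the source code.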

You will investigate which algorithm ("half-neigh" or "full-neigh") is best suited for SIMD vectorization and quantify how effectively the compiler can employ SIMD for each of them. While one could blindly try things out and be guided by time to solution only, the additional insight provided by hardware counter profiling gives data-based confidence about what is going on and which further optimization options exist.

Preparation

You can find the benchmark code in the MINIMD folder.

You may have a look at the instrumented force calculation variants. They are implemented in ./src/force_lj.cpp as the methods ForceLJ::compute_halfneigh (lines 79-139) and ForceLJ::compute_fullneigh (lines 148-204). Which of them do you think is better suited for SIMD vectorization?

Compile benchmark

Examine build settings in include_ICC.mk

Build:

$ module load intel intelmpi likwid
$ make

You can ignore the warnings.

If you change the build settings you need to "make clean" to make them take effect.

You need to generate four variants:

without SIMD vectorization (scalar)
with SSE SIMD vectorization
with AVX2 SIMD vectorization
with AVX-512 SIMD vectorization

E.g., for the scalar code it is recommended to:

edit include_ICC.mk and ensure only the OPTS line that you need (the one with -no-vec) is uncommented.
execute
```
$ make clean && make
```
```
$ mv miniMD-ICC miniMD-novec
```

Repeat for SSE, AVX2, and AVX-512 and move binaries to `miniMD-SSE`, `miniMD-AVX2`, and `miniMD-AVX512`.

Caveat: For proper SIMD vectorization, make sure to add the "-DUSE_SIMD" option to the compiler command line. For the non-vectorized version you should not use it. (This should already be set up in the include file.)

Run benchmark

Change to the ./data folder. To get an overview of available options, do:

$ cd data

$ likwid-mpirun --mpi intelmpi -np 1 ../miniMD-<VERSION> -h

To run the benchmark, use:

$ likwid-mpirun --mpi intelmpi -np 1 ../miniMD-<VERSION> --half_neigh <0|1>

The number selects the force algorithm: 1 switches the half-neigh variant on, 0 switches it off so that the full-neigh variant is used.

Hardware performance counter profiling

Use the FLOPS_DP performance group and note the event counts for every run:

INSTR_RETIRED_ANY
FP_ARITH_INST_RETIRED_SCALAR_DOUBLE
FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE
FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE
FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE

In addition note the following derived metrics:

Runtime (RDTSC)
CPI
Vectorization ratio

Do this for the following runs (it is necessary to use likwid-mpirun because miniMD is an MPI code):

$ likwid-mpirun --mpiopts "--cpu-freq=2400000-2400000:performance" --mpi slurm -np 1 -g FLOPS_DP -m ../miniMD-novec --half_neigh 1

$ likwid-mpirun --mpiopts "--cpu-freq=2400000-2400000:performance" --mpi slurm -np 1 -g FLOPS_DP -m ../miniMD-SSE --half_neigh 1

$ likwid-mpirun --mpiopts "--cpu-freq=2400000-2400000:performance" --mpi slurm -np 1 -g FLOPS_DP -m ../miniMD-AVX2 --half_neigh 1

$ likwid-mpirun --mpiopts "--cpu-freq=2400000-2400000:performance" --mpi slurm -np 1 -g FLOPS_DP -m ../miniMD-AVX512 --half_neigh 1

$ likwid-mpirun --mpiopts "--cpu-freq=2400000-2400000:performance" --mpi slurm -np 1 -g FLOPS_DP -m ../miniMD-novec --half_neigh 0

$ likwid-mpirun --mpiopts "--cpu-freq=2400000-2400000:performance" --mpi slurm -np 1 -g FLOPS_DP -m ../miniMD-SSE --half_neigh 0

$ likwid-mpirun --mpiopts "--cpu-freq=2400000-2400000:performance" --mpi slurm -np 1 -g FLOPS_DP -m ../miniMD-AVX2 --half_neigh 0

$ likwid-mpirun --mpiopts "--cpu-freq=2400000-2400000:performance" --mpi slurm -np 1 -g FLOPS_DP -m ../miniMD-AVX512 --half_neigh 0
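
The eight command lines above differ only in the binary name and the --half_neigh value. If you prefer to script the runs, the following Python sketch generates exactly these command lines as strings; it does not execute anything. The function and variable names are made up for illustration, and the options are copied from the commands above.

```python
# Sketch: generate the eight profiling command lines listed above.
# Nothing is executed here; the commands are only printed.

FREQ_OPTS = "--cpu-freq=2400000-2400000:performance"
VARIANTS = ["novec", "SSE", "AVX2", "AVX512"]


def build_commands():
    """Return the command lines, half-neigh runs first, then full-neigh."""
    cmds = []
    for half_neigh in (1, 0):
        for variant in VARIANTS:
            cmds.append(
                f'likwid-mpirun --mpiopts "{FREQ_OPTS}" --mpi slurm -np 1 '
                f"-g FLOPS_DP -m ../miniMD-{variant} --half_neigh {half_neigh}"
            )
    return cmds


for cmd in build_commands():
    print(cmd)
```

You could redirect the printed lines into a shell script and run it from the ./data folder.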

Analysis of the profiling results

Look at the following metrics for each algorithm:

Percentage of arithmetic floating point instructions (the useful work) to overall instructions (the processor work).
Vectorization ratio as reported by likwid-perfctr; check the help text of the FLOPS_DP group to see how it is calculated.
CPI as the central metric of execution efficiency for the given instruction mix.
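
The three metrics above can be cross-checked by hand from the raw event counts. The following Python sketch shows the arithmetic. It makes two assumptions you should verify against the FLOPS_DP help text: the vectorization ratio is taken as the share of packed instructions among all counted FP arithmetic instructions, and CPI is taken as core clock cycles divided by INSTR_RETIRED_ANY (the cycle count is not in the event list above and has to be read from the likwid output). The input numbers are placeholders, not measurements.

```python
# Sketch: derive per-run metrics from raw event counts.
# Assumed formulas (check the FLOPS_DP help text):
#   arithmetic share    = 100 * FP arithmetic instr / all retired instr
#   vectorization ratio = 100 * packed FP instr / all FP arithmetic instr
#   CPI                 = core clock cycles / retired instructions


def run_metrics(instr, cycles, scalar, p128, p256, p512):
    """instr: INSTR_RETIRED_ANY, cycles: core clock cycles (assumed input),
    scalar/p128/p256/p512: the four FP_ARITH_INST_RETIRED_* counts."""
    packed = p128 + p256 + p512
    arith = scalar + packed
    return {
        "arith_share_pct": 100.0 * arith / instr,
        "vectorization_ratio_pct": 100.0 * packed / arith if arith else 0.0,
        "cpi": cycles / instr,
    }


# Placeholder counts for illustration only, NOT measured values.
example = run_metrics(instr=1000, cycles=500, scalar=100, p128=0, p256=50, p512=50)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

Replace the placeholder counts with the numbers from your own runs and compare the result with the derived metrics that likwid prints.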

To compare the different versions, set up the following relations (HN = half-neigh, FN = full-neigh):

Total instructions of <SSE|AVX2|AVX512> version compared to novec (for HN / FN each)
Arithmetic instructions of <SSE|AVX2|AVX512> version compared to novec (for HN / FN each)
Total instructions of every version of FN compared to the same version of HN

You can do that with pen and paper, but we prepared an Excel sheet to speed things up. You can download it from the Moodle page.
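
If you would rather script the comparison, the relations listed above can be computed as in the following Python sketch. The data layout and the function name are made up for illustration, and the counts in the example are placeholders, not measurements; only two versions are filled in to keep the example short.

```python
# Sketch: compute the relations between versions from recorded counts.
# counts[algorithm][version] = (total instructions, arithmetic instructions)


def relations(counts):
    """Return (vs_novec, fn_vs_hn).

    vs_novec[alg][version] = (total instr ratio, arithmetic instr ratio),
    each relative to the novec run of the same algorithm.
    fn_vs_hn[version] = total instructions of FN / total instructions of HN.
    """
    vs_novec = {}
    for alg, versions in counts.items():
        base_instr, base_arith = versions["novec"]
        vs_novec[alg] = {
            ver: (instr / base_instr, arith / base_arith)
            for ver, (instr, arith) in versions.items()
            if ver != "novec"
        }
    fn_vs_hn = {
        ver: counts["FN"][ver][0] / counts["HN"][ver][0]
        for ver in counts["HN"]
        if ver in counts["FN"]
    }
    return vs_novec, fn_vs_hn


# Placeholder counts for illustration only, NOT measured values.
placeholder = {
    "HN": {"novec": (1000, 400), "AVX2": (800, 300)},
    "FN": {"novec": (1600, 800), "AVX2": (400, 200)},
}
vs_novec, fn_vs_hn = relations(placeholder)
print(vs_novec)
print(fn_vs_hn)
```

Fill in all four versions for both algorithms with your own counts before drawing any conclusions.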

Questions:

Is autovectorization effective for HN/FN?
Can you interpret the runtimes? What role does the CPI value play?
How do both versions compare with regard to actual user work and how does this change with vectorized code?

Last modified: Wednesday, 4 December 2024, 2:02 PM