In slide 20 we find the following formula for P_core.
But I was wondering under which assumptions this formula was derived.
Do we assume a core with only FMA units?
A single core ALWAYS has ADD, MULT, DIV, etc. units as well, right? If that is the case, how do we construct n_FMA?
Would that be n_FMA = (fraction of FP units that are FMA units) * (total number of FP units) * ("extra" efficiency of an FMA unit)?
For example: one ADD unit, one MULT unit, two FMA units; n_FMA = 1/2 * 2 * 2 = 2.
This seems far-fetched to me, but I don't understand how else the n_FMA factor could enter the P_core formula multiplicatively rather than additively.
Re: Question regarding slides of 13th of may, formula for P_core
This formula holds for architectures where the execution units for FMAs and other arithmetic operations like ADDs and MULs are bound to the same ports, i.e., on a given port you can issue *either* an FMA instruction *or* an ADD/MUL instruction in a given cycle. Consider, for example, the following port layout:
+------------------------------------------------------------------------+
|                         64 entry unified scheduler                      |
+------------------------------------------------------------------------+
    0    |    1    |   2   |   3   |   4   |    5    |    6     |    7
    \/        \/       \/      \/      \/       \/        \/        \/
+-------+ +-------+ +-----+ +-----+ +-----+ +-------+ +--------+ +------+
|  ALU  | |  ALU  | | LD  | | LD  | | ST  | |  ALU  | | ALU &  | |SIMPLE|
+-------+ +-------+ +-----+ +-----+ +-----+ +-------+ | Shift  | | AGU  |
+-------+ +-------+ +-----+ +-----+         +-------+ +--------+ +------+
|  2ND  | | Fast  | | AGU | | AGU |         | Fast  | +--------+
| BRANCH| |  LEA  | +-----+ +-----+         |  LEA  | | BRANCH |
+-------+ +-------+                         +-------+ +--------+
+-------+ +-------+                         +-------+
|AVX DIV| |AVX FMA|                         |AVX INT|
+-------+ +-------+                         |  ALU  |
+-------+ +-------+                         +-------+
|AVX FMA| |AVX MUL|
+-------+ +-------+
+-------+ +-------+
|AVX MUL| |AVX ADD|
+-------+ +-------+
In contrast, consider an architecture with the following port layout:

+-------------------------------------------------------+
| 2x32       FP0       FP2       FP1       FP3          |
+-------------------------------------------------------+
   8    |    9    |   10    |   11    |   12    |   13
   \/        \/        \/        \/        \/       \/
+------+ +-------+ +-------+ +-------+ +-------+ +------+
| iST  | |AVX MUL| |AVX ALU| |AVX MUL| |AVX ALU| |  ST  |
+------+ +-------+ +-------+ +-------+ +-------+ +------+
+------+ +-------+ +-------+ +-------+ +-------+
| F2I  | |AVX FMA| |AVX ADD| |AVX FMA| |AVX ADD|
+------+ +-------+ +-------+ +-------+ +-------+
         +-------+ +-------+ +-------+ +-------+
         |  DIV  | | CONV/ | |  DIV  | |  AVX  |
         +-------+ | SHUF  | +-------+ | SHUF  |
         +-------+ +-------+ +-------+ +-------+
         |  AVX  |           |  AVX  |
         | SHUF  |           | SHUF  |
         +-------+           +-------+

You can see that the superscalarity of this architecture would in fact allow us to issue two FMAs and two ADDs in parallel.
We can actually check this by running a benchmark with this mixture of instructions in our loop kernel:
vfmadd132pd ymm3, ymm1, ymm0
vfmadd132pd ymm4, ymm20, ymm21
vaddpd ymm6, ymm1, ymm0
vaddpd ymm9, ymm20, ymm21
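As a quick sanity check (my own counting, not from the slides), the Flop count of this four-instruction mix can be sketched as:

```python
# Back-of-the-envelope Flop count for the kernel above:
# each ymm register holds 4 doubles (AVX), an FMA counts as
# two Flops (multiply + add), a plain ADD as one Flop.
doubles_per_ymm = 4
flops_per_fma = 2
flops_per_add = 1

# two vfmadd132pd plus two vaddpd instructions
flops = 2 * flops_per_fma * doubles_per_ymm + 2 * flops_per_add * doubles_per_ymm
print(flops)  # 24 Flops, if all four instructions can issue in one cycle
```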
The likwid tool suite (which will be introduced later in this lecture) allows us to run such individual microbenchmarks; averaged over 10 runs, we get the following result:
MFlops/s: 54810.17
If we adjust our formula to consider all execution units, we would end up with something like this:
\( P_{\textrm{core}} = (n^{FP}_{super,FMA} + n^{FP}_{super,non-FMA}) \cdot n_{SIMD} \cdot f \)
with \( n^{FP}_{super,FMA} \) being the number of Flops the FMA units can execute per cycle (i.e., \( 2 \times 2 = 4 \), as we have two FMA units, each counting as two Flops) and \( n^{FP}_{super,non-FMA} \) being the number of Flops the other FP units can execute per cycle (i.e., \( 2 \times 1 = 2 \), as we have two ADD units), and we end up calculating
\( P_{\textrm{core}} = (4 + 2) \cdot 4 \cdot 2.4\,\textrm{GHz} = 57.6\,\textrm{Gflop/s} \)
which is close to what we measure (we reach 95% of this value, which is within the acceptable range).
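To make the arithmetic explicit, here is a small sketch (my own, using only the numbers from this thread) that recomputes the extended peak estimate and the achieved fraction:

```python
def peak_gflops(n_fma_units, n_other_units, n_simd, f_ghz):
    """Extended P_core: FMA ports and non-FMA FP ports counted separately.

    Each FMA unit contributes 2 Flops/cycle (multiply + add),
    each plain ADD unit contributes 1 Flop/cycle.
    """
    n_super_fma = n_fma_units * 2        # 2 FMA units -> 4 Flops/cycle
    n_super_non_fma = n_other_units * 1  # 2 ADD units -> 2 Flops/cycle
    return (n_super_fma + n_super_non_fma) * n_simd * f_ghz

# n_simd = 4 (AVX, 4 doubles per ymm register), f = 2.4 GHz
peak = peak_gflops(n_fma_units=2, n_other_units=2, n_simd=4, f_ghz=2.4)
measured = 54810.17 / 1000.0  # MFlops/s -> Gflops/s, from the likwid run above

print(round(peak, 1))             # 57.6
print(round(measured / peak, 2))  # 0.95
```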
So, long story short: the formula we teach in the lecture is a simplification, as we assume that no FP operations other than FMAs can be issued. This is true for most, but not all, modern microarchitectures. However, for the time being you are not required to know the different microarchitectures in such detail and can simply assume the formula holds in all cases.