In slide 20 we find the following formula for P_core.
But I was wondering under which assumptions this formula was derived.
Do we assume a core with only FMA units?
A single core ALWAYS has ADD, MULT, DIV, etc. units as well, right? If that is the case, how do we construct n_FMA?
Would that be n_FMA = (fraction of FP units that are FMA units) * (total number of FP units) * ("extra" efficiency of an FMA unit)?
For example: one ADD unit, one MULT unit, two FMA units; n_FMA = 1/2 * 2 * 2 = 2.
This seems far-fetched to me, but I don't understand how else the n_FMA factor could enter the P_core formula multiplicatively rather than additively.
Re: Question regarding slides of 13th of may, formula for P_core
This formula holds for architectures where the execution units for FMAs and other arithmetic operations like ADDs and MULs are bound to the same ports, i.e., on a given port you can issue *either* an FMA instruction *or* an ADD/MUL instruction in a given cycle. Consider, for example, the following port layout:
+------------------------------------------------------------------------+
|                         64 entry unified scheduler                      |
+------------------------------------------------------------------------+
    0    |    1    |   2   |   3   |   4   |    5    |    6     |    7
    \/        \/       \/      \/      \/       \/        \/        \/
+-------+ +-------+ +-----+ +-----+ +-----+ +-------+ +--------+ +------+
|  ALU  | |  ALU  | | LD  | | LD  | | ST  | |  ALU  | | ALU &  | |SIMPLE|
+-------+ +-------+ +-----+ +-----+ +-----+ +-------+ | Shift  | | AGU  |
+-------+ +-------+ +-----+ +-----+         +-------+ +--------+ +------+
|  2ND  | | Fast  | | AGU | | AGU |         | Fast  | +--------+
| BRANCH| |  LEA  | +-----+ +-----+         |  LEA  | | BRANCH |
+-------+ +-------+                         +-------+ +--------+
+-------+ +-------+                         +-------+
|AVX DIV| |AVX FMA|                         |AVX INT|
+-------+ +-------+                         |  ALU  |
+-------+ +-------+                         +-------+
|AVX FMA| |AVX MUL|
+-------+ +-------+
+-------+ +-------+
|AVX MUL| |AVX ADD|
+-------+ +-------+
In contrast, consider an architecture with the following port layout:

+-------------------------------------------------------+
| 2x32       FP0       FP2       FP1       FP3          |
+-------------------------------------------------------+
   8    |    9    |   10    |   11    |   12    |   13
   \/        \/        \/        \/        \/       \/
+------+ +-------+ +-------+ +-------+ +-------+ +------+
| iST  | |AVX MUL| |AVX ALU| |AVX MUL| |AVX ALU| |  ST  |
+------+ +-------+ +-------+ +-------+ +-------+ +------+
+------+ +-------+ +-------+ +-------+ +-------+
| F2I  | |AVX FMA| |AVX ADD| |AVX FMA| |AVX ADD|
+------+ +-------+ +-------+ +-------+ +-------+
         +-------+ +-------+ +-------+ +-------+
         |  DIV  | | CONV/ | |  DIV  | |  AVX  |
         +-------+ | SHUF  | +-------+ | SHUF  |
         +-------+ +-------+ +-------+ +-------+
         |  AVX  |           |  AVX  |
         | SHUF  |           | SHUF  |
         +-------+           +-------+

You can see that the superscalarity of this architecture would in fact allow us to issue two FMAs and two ADDs in parallel.
We can actually check this by running a benchmark with this mixture of instructions in our loop kernel:
vfmadd132pd ymm3, ymm1, ymm0
vfmadd132pd ymm4, ymm20, ymm21
vaddpd ymm6, ymm1, ymm0
vaddpd ymm9, ymm20, ymm21
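As a quick sanity check (my own counting, not from the slides), the Flop count of this four-instruction mix can be sketched as:

```python
# Back-of-the-envelope Flop count for the kernel above:
# each ymm register holds 4 doubles (AVX), an FMA counts as
# two Flops (multiply + add), a plain ADD as one Flop.
doubles_per_ymm = 4
flops_per_fma = 2
flops_per_add = 1

# two vfmadd132pd plus two vaddpd instructions
flops = 2 * flops_per_fma * doubles_per_ymm + 2 * flops_per_add * doubles_per_ymm
print(flops)  # 24 Flops, if all four instructions can issue in one cycle
```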
The likwid tool suite (which will be introduced later in this lecture) allows us to run such individual microbenchmarks; averaged over 10 runs, we get the following result:
MFlops/s: 54810.17
If we adjust our formula to consider all execution units, we would end up with something like this:
\( P_{\textrm{core}} = (n^{FP}_{super,FMA} + n^{FP}_{super,non-FMA}) \cdot n_{SIMD} \cdot f \)
with \( n^{FP}_{super,FMA} \) being the number of Flops the FMA units can execute per cycle (i.e., \( 2 \times 2 = 4 \), as we have two FMA units, each counting as two Flops) and \( n^{FP}_{super,non-FMA} \) being the number of Flops the other FP units can execute per cycle (i.e., \( 2 \times 1 = 2 \), as we have two ADD units), and we end up calculating
\( P_{\textrm{core}} = (4 + 2) \cdot 4 \cdot 2.4\,\textrm{GHz} = 57.6\,\textrm{Gflop/s} \)
which is close to what we measure (we reach 95% of this value, which is within the acceptable range).
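To make the arithmetic explicit, here is a small sketch (my own, using only the numbers from this thread) that recomputes the extended peak estimate and the achieved fraction:

```python
def peak_gflops(n_fma_units, n_other_units, n_simd, f_ghz):
    """Extended P_core: FMA ports and non-FMA FP ports counted separately.

    Each FMA unit contributes 2 Flops/cycle (multiply + add),
    each plain ADD unit contributes 1 Flop/cycle.
    """
    n_super_fma = n_fma_units * 2        # 2 FMA units -> 4 Flops/cycle
    n_super_non_fma = n_other_units * 1  # 2 ADD units -> 2 Flops/cycle
    return (n_super_fma + n_super_non_fma) * n_simd * f_ghz

# n_simd = 4 (AVX, 4 doubles per ymm register), f = 2.4 GHz
peak = peak_gflops(n_fma_units=2, n_other_units=2, n_simd=4, f_ghz=2.4)
measured = 54810.17 / 1000.0  # MFlops/s -> Gflops/s, from the likwid run above

print(round(peak, 1))             # 57.6
print(round(measured / peak, 2))  # 0.95
```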
So, long story short: the formula we teach in the lecture is a simplification, as we assume that no FP operations other than FMAs can be issued. This is true for most, but not all, modern microarchitectures. However, for the time being you are not required to know the different microarchitectures in such detail and can simply assume the formula holds in all cases.