Assignment 3: Pmax, pipelines, strides
Completion requirements
Opened: Thursday, 16 May 2024, 10:00 AM
Due: Thursday, 30 May 2024, 10:03 AM
- Another triad code. (20 credits) Calculate the maximum achievable in-core performance \( P_\max \) of the AVX512-vectorized double-precision Schönauer triad on an Ice Lake SP core at 2.1 GHz. The Schönauer triad kernel is
a[i] = b[i] + c[i] * d[i]
. What fraction of the theoretical core peak performance (at 2.1 GHz) is this? Also calculate \( P_\max \) for the AVX2 and scalar variants. - Pipelines and SIMD units. One way to increase the peak performance of a microprocessor is to use wider SIMD units and registers, just as Intel did when going from SSE (16 byte) to AVX (32 byte). 64-byte SIMD or AVX512 registers are available in the "Xeon Skylake-X" and later architectures. Consider the following loop (a scalar product) on a modern processor core:
double s=0.0;
double a[N],b[N];
// ... initialization omitted
for(i=0; i<N; ++i) s = s + a[i] * b[i];
The add and multiply instructions have a latency of four cycles and a maximum throughput of one per cycle. The load instruction has a latency of five cycles (if the data is in L1 cache) and a maximum throughput of one per cycle. FMA (if it exists) is not used.
(a) (5 credits) Assuming that the compiler applies no unrolling and no SIMD vectorization, calculate the maximum possible performance of the loop in Flops/cycle if the data comes from the L1 cache.
(b) (15 credits) Calculate the performance to expect at N=8.
(c) (10 credits) Now assume that the compiler can SIMD-vectorize the loop with AVX, but does no unrolling beyond that. Also assume that adding up the four slots ("horizontal add") of an AVX register takes eight additional cycles. Calculate how this changes the predictions from (a) and (b). How would the numbers change for 8-wide SIMD units (64-byte registers)? (Horizontal add takes 12 cycles in this case)
(d) (15 credits) Now assume that the compiler performs a 2-way modulo unrolling of the loop, on top of AVX. Calculate the maximum expected performance with AVX if the data comes from the L1 cache. Would 4-way unrolling change anything? Why (not)?
(e) (10 credits) Comparing the non-unrolled code from (a) with the AVX + 2-way MVE variant from (d), how many instructions (as a function of N) must be executed by the core to traverse the whole loop? Assume that the "loop mechanics" (handling the iteration count) always needs 3 instructions (add, compare, branch).
- Strided access. (25 credits) Write a benchmark code for the double-precision "vector update" kernel and modify it so that only each Mth element is used:
for(i=0; i<N; i+=M) a[i] = s * a[i];
This is called a "strided loop" with stride M. Plot the performance of this loop versus M∈{1,2,4,8} for N=108 on one Fritz core (do not forget to fix the clock frequency) and explain the change in performance with growing M. What happens if you increase M even further? Choose strides of M=8*1.2n, with n a positive integer and M<=106. Explain why this behavior is expected.