# Assignment 1: Code execution, peak performance, square root reloaded

**Opened:** Wednesday, 24 April 2024, 12:00 AM

**Due:** Thursday, 9 May 2024, 10:02 AM

*Pipelines*. Consider the following code:

```c
double a[...], b[...];        // a[] and b[] contain sensible data
double s = 0.0, t = 1.234;

for (int i = 0; i < N; ++i) {
    s += a[i]*a[i];
    b[i] *= t;
}
```

This code is run on a superscalar, out-of-order CPU core with the following properties:

- Floating-point ADD pipeline depth of 5 cy
- Floating-point MULT pipeline depth of 8 cy
- Capability of executing 1 ADD, 1 MULT, 1 LOAD, and 1 STORE instruction per cycle (no FMA)
- Overall instruction throughput limit of 4 instructions retired per cycle
- Register set of 16 floating-point registers and 16 integer registers
- No SIMD capability

We assume that the required data (i.e., the arrays `a[]` and `b[]`) resides in the L1 cache.

Answer the following questions:

(a) **(20 credits)** Assuming the compiler does not know about Modulo Variable Expansion (MVE) but otherwise produces perfect code, what is the hardware bottleneck that limits the performance of the code? Calculate the expected performance in flops/cy.

(b) **(20 credits)** Now assume that the compiler can employ MVE. How much unrolling must be applied at least? What is the hardware bottleneck for the "best possible" code? Calculate the best possible performance in flops/cy.

*Peak performance*. Floating-point computations are an important resource of a processor. Assume a CPU chip with the following properties:

- Clock frequency of 2.2 GHz
- 56 cores
- AVX-512 instruction set (512-bit wide SIMD registers); each core is capable of retiring two full-width double-precision FMA instructions per cycle

**(15 credits)** Calculate the theoretical single-precision floating-point peak performance of the chip in Tflop/s. You may assume that the FMA SIMD units are fully parallel, i.e., the computations across the SIMD lanes are done concurrently.
*Square roots reloaded*. Use your double-precision π code from Assignment 0 and compile it successively with the following compiler options (instead of `-O3 -xHost`):

`-O1 -fno-vectorize`

`-O3 -xSSE4.2`

`-O3 -xCORE-AVX2`

`-O3 -xCORE-AVX512 -qopt-zmm-usage=high`

This generates scalar and 16-byte, 32-byte, and 64-byte wide SIMD instructions, respectively, for all floating-point arithmetic (including the square root).

(a) **(10 credits)** Run the four code variants and report the number of cycles per SQRT operation for all four cases.

(b) **(15 credits)** Why is the numerical approximation for π different each time?

(c) **(20 credits)** Does anything strike you as odd? Discuss what might be happening here. Remember that SQRT (the dominant operation here) is a complicated operation that takes a lot of chip space.