# Programming Techniques for Supercomputers: Performance Modelling Motivation Roofline Model Prof. Dr. G. Wellein<sup>(a,b)</sup>, Dr. G. Hager<sup>(a)</sup> (a)HPC Services – Regionales Rechenzentrum Erlangen (b)Department für Informatik Friedrich-Alexander-Universität Erlangen-Nürnberg, Sommersemester 2024 A performance model brings together what you need (application requirements) and what you get (hardware capabilities) A series of measurements from benchmarks is NOT a performance model\* \*Bill Gropp, PASC2015 PTfS 2024 # Scope of the lecture – a typical example # How model-building works: Physics #### Newtonian mechanics $\vec{F} = m\vec{a}$ Fails @ small scales! Nonrelativistic quantum mechanics alls @ even smaller scales! Relativistic quantum field theory $U(1)_Y \otimes SU(2)_L \otimes SU(3)_c$ ### Code optimization/parallelization – no black boxes! PTfS 2024 # Questions to ask in high performance computing - Do I understand the performance behavior of my code? - Does the performance match a model I have made? - What is the optimal performance for my code on a given machine? - High Performance Computing == Computing at the bottleneck - Can I change my code so that the "optimal performance" gets higher? - Circumventing/ameliorating the impact of the bottleneck - My model does not work what's wrong? - This is the good case, because you learn something - Performance monitoring / microbenchmarking may help clear up the situation - Use your brain! Tools may help, but you do the thinking. "Simple" performance modeling: The Roofline Model Loop-based performance modeling: Execution vs. data transfer ### A simple performance model for loops ### Simplistic view of the hardware # **Execution units** Peak Performance Unit: flop / s Data path, bandwidth Unit: byte / s Data source/sink Performance Bottlenecks: Peak Performance: Data path: flop/s required by incomming data: ### Simplistic view of the software: ``` ! may be multiple levels do i = 1,<sufficient> <complicated stuff doing</pre> N flops causing V bytes of data transfer> enddo ``` ### Computational intensity $$I = \frac{N}{V} \rightarrow \text{Unit: flop/byte}$$ ("Compute bound") P<sub>peak</sub> byte/s \* flop/byte [=flop/s] ("Memory bound") PTfS 2024 June 18, 2024 ### Naïve Roofline Model What performance can the software achieve on a given hardware? P [flop/s] The performance bottleneck is either - The execution of work (flops): - The data path: (requested flops by incoming data) - $P_{\text{peak}}$ [flop/s] - $I \cdot b_S$ [flop/byte x byte/s] $$P = \min(P_{\text{peak}}, I \cdot b_S)$$ This is the "Naïve Roofline Model" - High intensity I: P limited by execution - Low intensity I: P limited by data transfer - "Knee" at $P_{max} = I \cdot b_S$ : Best use of resources - Roofline is an "optimistic" model ("light speed") Roofline Model (RLM) – Basics Consider two bottlenecks only ### The Roofline Model – Basics - Hardware $\rightarrow$ Peak performance: $P_{peak} \begin{bmatrix} \frac{F}{S} \end{bmatrix}$ - Hardware $\rightarrow$ Peak memory bandwidth: $b_S \left[\frac{B}{s}\right]$ - Application/SW $\rightarrow$ Computational Intensity: $I\left[\frac{F}{B}\right]$ #### Roofline Performance Model (RLM) - basics: #### Machine model: $$P_{peak} = 3\frac{GF}{\frac{GB}{S}}$$ $$b_S = 10\frac{GB}{\frac{S}{S}}$$ #### Application model: $$I = B_C^{-1} = 0.05 \frac{F}{B}$$ $$P = \min\left(\frac{P_{\text{peak}}, I * b_{\text{S}}}{s}\right) = \min\left(3\frac{GF}{s}, 0.05 * 10\frac{GF}{s}\right) = 0.5\frac{GF}{s}$$ # The Roofline Model: A graphical view Plot max. attainable performance P as a function of I (application) for a given hardware $\{P_{peak}, b_S\}$ Computational intensity I [F/B] log-scale Vector norm (single prec.) s=s+a[i]\*a[i]:I = 0.5 F/B ### The Roofline Model – Basics ### Compare capabilities of different machines PTfS 2024 June 18, 2024 13 # The Roofline Model – Basics: Summary $$P = \min(P_{peak}, I * b_S)$$ ### Determine machine model for full chip/node/device: - Peak performance $P_{peak} = P_{chip} = n_{core} \cdot n_{super}^{FP} \cdot n_{FMA} \cdot n_{SIMD} \cdot f$ - Peak memory bandwidth: See fact sheet, e.g. $b_S = \#Channels \times f_{MEM} \times 8 \xrightarrow{B}_{cycle}$ #### So far the model is very restricted: - Machine and application models are completely independent - RLM always provides upper bound but is it realistic? - Only two bottlenecks are considered - Peak Performance - Main memory transfers ``` double s=0, a[]; for(i=0; i<N; ++i) { s = s + a[i];}</pre> ``` - What if, e.g. there is no MULT and/or no SIMD vectorization? - $\rightarrow P_{peak}$ is not a realistic limit! Implementation may have lower "horizontal roof" ### **Roofline Model** ### Machine model with $P_{peak}$ =4.5 TF/s and $b_S$ =300 GB/s # Roofline Model: Application information Measure application performance P and calculate / measure application intensity I More realistic bounds for "bad" implementations ### More realistic bounds for "bad" implementations No SIMD, no pipelining, 1 FMA only $\rightarrow$ 64 x decrease in P<sub>Peak</sub> Reality: Lower horizontal roofs (P<sub>peak</sub>) are typically not known # The Roofline model: Extending more bottlenecks Choose time based view: Hardware bottlenecks impose upper (lower) performance (runtime) limits \*Williams, Waterman, Patterson (2009), DOI: <u>10.1145/1498765.1498785</u> Roofline Model (RLM) – Refined Consider multiple independent bottlenecks # The Roofline model: Extending more bottlenecks Extend towards mutiple (independent) bottlenecks → Model very successfull if bottleneck can be saturated → full CPU chip \*Williams, Waterman, Patterson (2009), DOI: <u>10.1145/1498765.1498785</u> ### The Roofline Model – refined - 1. $P_{\text{max}}$ = Applicable peak performance of a loop, assuming that data comes from the level 1 cache (this is not necessarily $P_{\text{peak}}$ ) - $\rightarrow$ e.g., $P_{\text{max}} = 176 \text{ GFlop/s}$ - 2. $I = \text{Computational intensity ("work" per byte transferred) over the slowest data path utilized (code balance <math>B_C = I^{-1}$ ) - $\rightarrow$ e.g., I = 0.167 Flop/Byte $\rightarrow B_C = 6$ Byte/Flop - 3. $b_S$ = Applicable (saturated) peak bandwidth of the slowest data path utilized $$\rightarrow$$ e.g., $b_S$ = 56 GByte/s Expected performance: $$P = \min(P_{\text{max}}, I \cdot b_S) = \min\left(P_{\text{max}}, \frac{b_S}{B_C}\right)$$ [Byte/Flop] [Byte/s] R.W. Hockney and I.J. Curington: $f_{1/2}$ : A parameter to characterize memory and communication bottlenecks. Parallel Computing 10, 277-286 (1989). DOI: 10.1016/0167-8191(89)90100-2 W. Schönauer: Scientific Supercomputing: Architecture and Use of Shared and Distributed Memory Parallel Computers. Self-edition (2000) S. Williams: <u>Auto-tuning Performance on Multicore Computers</u>. UCB Technical Report No. UCB/EECS-2008-164. PhD thesis (2008) # The Roofline Model – getting it right ### Applicable peak performance: $P_{max} = n_{core} * P_{max}^{core}$ P<sup>core</sup><sub>max</sub>: single core maximum performance from L1: determine according to slides 22-41@03b\_04\_30-2024\_PTfS.pdf #### Computational intensity: I • Determine data transfer volume over slowest data path – for main memory: $I = 1/B_C^{mem}$ (for $B_C$ see 05\_05\_08-2024\_PTfS.pdf) ### Applicable (saturated) peak bandwidth: $b_S$ - Determine with appropriate benchmark, e.g. for main memory choose the STREAM benchmark test that best matches your access pattern - See later for STREAM - Or write own microbenchmark if relevant access pattern not available, e.g. read-only ### Realistic baseline for memory bandwidth: STREAM - Assumption: STREAM (or similar, like vector triad) kernel benchmarks achieve an upper bandwidth limit from main memory - i.e., no code can draw more bandwidth - Theoretical BW limits are usually not achievable - Use STREAM as BW limit rather than the theoretical numbers! - STREAM: <a href="http://www.cs.virginia.edu/stream/">http://www.cs.virginia.edu/stream/</a> - Set of 4 standard benchmarks ``` COPY: A(:) = C(:) SCALE: A(:) = s * C(:) ADD: A(:) = B(:) + C(:) TRIAD: A(:) = B(:) + s * C(:) ``` - In practice, COPY & SCALE (ADD & TRIAD) draw the same bandwidth - Advantage of STREAM: Many results published, well-defined benchmark - Disadvantage of STREAM: Reported and actual BW numbers may differ # STREAM: write-allocate and efficiency Data transfer (including write-allocate) | | | • | | |-------|--------------------|-----------------------------------|-----------| | Туре | Kernel | Bytes/iteration assumed (with WA) | Flops/it. | | COPY | A(:) = B(:) | 16 (24) | 0 | | SCALE | A(:) = s*B(:) | 16 (24) | 1 | | ADD | A(:) = B(:) + C(:) | 24 (32) | 1 | | TRIAD | A(:) = B(:)+s*C(:) | 24 (32) | 2 | STREAM benchmark does not know about write-allocate | | with | write-allo | cate | w/o write | -allocate | |-------|--------------------|----------------|------------------|-----------|----------------| | Туре | reported | actual | $b_S/b_{ m max}$ | reported | $b_S/b_{\max}$ | | COPY | 34079 <u>**3/2</u> | <b>→</b> 51119 | 0.75 | 47281 | 0.69 | | SCALE | 33758 ×3/2 | →50637 | 0.74 | 48025 | 0.70 | | ADD | 38174 ×4/3 | → 50899 | 0.75 | 51068 | 0.75 | | TRIAD | 38866 ×4/3 | →51820 | 0.76 | 51107 | 0.75 | State of the art compilers recognize the benchmark and avoid the write-allocate automatically 70-75% efficiency PTfS 2024 Roofline Model (RLM) – Refined Arithmetic Intensity / Code Balance: Gymnastics # Arithmetic Intensity / Code Balance: Basic Examples ``` double a[], b[]; for(i=0; i<N; ++i) { a[i] = a[i] + b[i];}</pre> ``` ``` double a[], b[]; for(i=0; i<N; ++i) { a[i] = a[i] + s * b[i];}</pre> ``` ``` float s=0, a[]; for(i=0; i<N; ++i) { s = s + a[i] * a[i];}</pre> ``` ``` float s=0, a[], b[]; for(i=0; i<N; ++i) { s = s + a[i] * b[i];} ``` ``` B_{\rm C} = 24B / 1F = 24 B/F / = 0.042 F/B ``` ``` B_{\rm C} = 24B / 2F = 12 B/F / = 0.083 F/B ``` Scalar – can be kept in register $$B_{\rm C} = 4B/2F = 2 B/F$$ $I = 0.5 F/B$ → Scalar – can be kept in register $$B_{\rm C} = 8B / 2F = 4 B/F$$ $I = 0.25 F/B$ → Scalar – can be kept in register # Approaches to determine Computational Intensity 1. Analysis of loop body → determine all load / stores that go to memory ``` double a[N], b[N], c[N], d[N]; for(i=0; i<N; ++i) { a[i] = b[i] + c[i] * d[i]; }</pre> ``` - 3 LD (b,c,d) + 1 ST (a) + 1 WA (a) per iteration - Each LD / ST / WA is 8 Byte (double) - 2 FLOP • $$I = \frac{2 FLOP}{5*8 Byte} = \frac{1 FLOP}{20 Byte}$$ $(B_C = \frac{20 Byte}{1 FLOP})$ - Cache vs. Memory Access??!! → DMVM; stencils, SpMV - 2. Analysis of data structure → Assume each element is touched only once ``` double a[N], b[N], c[N], d[N]; for(i=0; i<N; ++i) { a[i] = b[i] + c[i] * d[i]; }</pre> ``` - 4 arrays (of size: N \* 8 Byte) + WA on a[] $\rightarrow$ 2x $\rightarrow$ 5 \* N \* 8 Byte = **40** \* N Byte - Total FLOP count: 2 \* N FLOP • $$I = \frac{2*N FLOP}{40*N Byte} = \frac{1 FLOP}{20 Byte} \quad (B_C = \frac{20 Byte}{1 FLOP})$$ Lower bound for memory traffic → Upper bound for I # Approaches to determine Computational Intensity ``` double precison A(R,C), x(C), y(R) ... do c = 1 , C tmp=x(c) do r = 1 , R y(r) = y(r) + A(r,c) * tmp enddo enddo ``` ### Loop body analysis: - LD A( r, c) to memory → 8 Byte x(c) ←→register → 0 Byte LD/ST y(r) ←→ Cache → 2 FLOP - $\rightarrow I = \frac{2 FLOF}{8 Byte}$ ### Data structure analysis: ``` • A(R,C) \rightarrow 8 * R * C Byte • X(C) \rightarrow 8 * C Byte • Y(R): LD/ST \rightarrow 2*8 * R Byte \rightarrow 2* R * C FLOP ``` Roofline Model (RLM) – Refined Vector triads ### The Roofline Model – refined: Vector triads: $P_{max}$ • Machine: 7 cores of Haswell@2.3GHz $(n_{core} = 7; f = 2.3 \frac{Gcy}{s})$ #### AVX performance on 1 core Haswell / Broadwell ### The Roofline Model – refined: Vector triads: $I \cdot b_S \& P$ - Machine: 7 cores of Haswell @2.3 (CoD) - STREAM triads BW: $b_S = 29 \frac{GB}{S}$ - do i = 1,N A(i)=B(i)+C(i)\*D(i) enddo - Computational Intensity (incl. WA; double precision): $I = \frac{2F}{5*8B} = 0.05 \frac{F}{B}$ ### Putting it together $$P_{max} = 7 * 2.3 \frac{Gcy}{s} * 5.33 \frac{F}{cy} = 85.8 \frac{GF}{s}$$ $$P = \min\left(85.8 \frac{GF}{s}, 0.05 \frac{F}{B} * 29 \frac{GB}{s}\right) = \min\left(85.8 \frac{GF}{s}, 1.45 \frac{GF}{s}\right) = \mathbf{1.45} \frac{\mathbf{GF}}{\mathbf{S}}$$ ### The Roofline Model – refined: Validate RLM PTfS 2024 Roofline Model (RLM) – Refined Dense Matrix Vector Multiplication ## The Roofline Model – refined: Dense MVM : $P_{max}$ • Machine: 7 cores of Haswell @2.3GHz $(n_{core} = 7; f = 2.3 \frac{Gcy}{s})$ ``` do c = 1 , C tmp=x(c) do r = 1 , R y(r)=y(r) + A(r,c)* tmp enddo enddo ``` For 1 AVX iteration (r:r+3) 2 AVX LDs + 1 AVX ST + 1 AVX FMA #### AVX performance on 1 core Haswell / Broadwell Bottleneck: LD iteration - 1 AVX iteration: $T_{max}^{inst} = 1cy$ - 1 AVX iteration → 4 loop iterations → 8 F • $$P_{max}^{core} = {}^{8F}/_{1cy} = 8 {}^{F}/_{cy}$$ # The Roofline Model – refined: Dense MVM: $I \cdot b_S \& P$ Machine: 7 cores of Haswell@2.3 GHz The second of th - Read-OnlyBW: $b_S = 32 \frac{GB}{s}$ - Computational Intensity (double precision): $I = 1/B_C^{mem} = \frac{1}{\frac{8B}{2F}} = 0.25 \frac{F}{B}$ Putting it together $$P_{max} = 7 * 2.3 \frac{Gcy}{s} * 8 \frac{F}{cy} = 128.8 \frac{GF}{s}$$ $$P = \min\left(128.8 \frac{GF}{S}, 0.25 \frac{F}{B} * 32 \frac{GB}{S}\right) = \min\left(128.8 \frac{GF}{S}, 8 \frac{GF}{S}\right) = 8 \frac{GF}{S}$$ ### The Roofline Model – refined: Dense MVM: Validate PTfS 2024 June 18, 2024 39 Roofline Model (RLM) – Refined Bad Code Implementation & Lower roofs # A not so simple Roofline example Example: do i=1,N; s=s+a(i); enddo in single precision on a 2.2 GHz Sandy Bridge (3-stage FP add pipeline) socket @ "large" N PTfS 2024 June 18, 2024 41 # Applicable peak for the summation loop ### Plain scalar code, no SIMD ``` LOAD r1.0 \leftarrow 0 i \leftarrow 1 loop: LOAD r2.0 \leftarrow a(i) ADD r1.0 \leftarrow r1.0+r2.0 ++i →? loop result ← r1.0 ``` PTfS 2024 # Applicable peak for the summation loop ## Scalar code, 3-way unrolling ``` LOAD r1.0 \leftarrow 0 LOAD r2.0 \leftarrow 0 LOAD r3.0 \leftarrow 0 i ← 1 loop: LOAD r4.0 \leftarrow a(i) LOAD r5.0 \leftarrow a(i+1) LOAD r6.0 \leftarrow a(i+2) ADD r1.0 \leftarrow r1.0 + r4.0 ADD r2.0 \leftarrow r2.0 + r5.0 ADD r3.0 \leftarrow r3.0 + r6.0 i+=3 \rightarrow ? loop result \leftarrow r1.0+r2.0+r3.0 ``` ADD pipes utilization: → 1/8 of ADD peak # Applicable peak for the summation loop ## SIMD-vectorized, 3-way unrolled ``` LOAD [r1.0,...,r1.7] \leftarrow [0,...,0] LOAD [r2.0,...,r2.7] \leftarrow [0,...,0] LOAD [r3.0,...,r3.7] \leftarrow [0,...,0] i ← 1 loop: LOAD [r4.0,...,r4.7] \leftarrow [a(i),...,a(i+7)] LOAD [r5.0,...,r5.7] \leftarrow [a(i+8),...,a(i+15)] LOAD [r6.0,...,r6.7] \leftarrow [a(i+16),...,a(i+23)] ADD r1 \leftarrow r1 + r4 ADD r2 \leftarrow r2 + r5 ADD r3 \leftarrow r3 + r6 i+=24 \rightarrow ? loop result \leftarrow r1.0+r1.1+...+r3.6+r3.7 ``` #### **ADD** pipes utilization: → ADD peak # Input to the roofline model ... on the example of do i=1,N; s=s+a(i); enddo in single precision PTfS 2024 June 18, 2024 45 Roofline Model (RLM) – Refined Summary # Prerequisites for the Roofline Model - Data transfer and core execution overlap perfectly! - Either the limit is core execution or it is data transfer - Slowest limiting factor "wins"; all others are assumed to have no impact - If two bottlenecks are "close", no interaction is assumed - Data access latency is ignored, i.e. perfect streaming mode - Achievable bandwidth is the limit - Chip must be able to saturate the bandwidth bottleneck(s) - Always model for full chip ### Factors to consider in the roofline model #### Bandwidth-bound (simple case) - Accurate traffic calculation (write-allocate, strided access, ...) → Intensity calculation - Attainable ≠ theoretical BW - Erratic access patterns may violate model assumptions #### Core-bound (may be complex) - Multiple bottlenecks: LD/ST, arithmetic, pipelines, SIMD, execution ports - Limit is linear in # of cores (or clock speed) PTfS 2024 ### Tracking code optimizations in the Roofline Model - Hit the BW bottleneck by good serial code (e.g., Ninja C++ → Fortran) - 2. Increase intensity to make better use of BW bottleneck (e.g., spatial loop blocking) - 3. Increase intensity and go from memory bound to core bound (e.g., temporal blocking) - 4. Hit the core bottleneck by good serial code (e.g., -fno-alias, SIMD intrinsics) # Monitoring jobs running on Fritz in the Roofline diagram ### Two cluster jobs... Rooflines: $P = min(P_{peak}, I * b_s)$ ClusterCockpit > LIKWID https://github.com/ClusterCockpit - LIKWID determines P and I regularly on each node - ClusterCockpit collects data and presents is Which is the "good" and the "bad" job? PTfS 2024 June 18, 2024 50 # Shortcomings of the roofline model - Saturation effects in multicore chips are not explained - Reason: Intra-Cache and memory transfers do (frequently) not overlap on a single core → Overlapp only between cores - Increase "pressure" on memory interface until it saturates $\rightarrow$ bottleneck: $b_s$ - It is not sufficient to measure single-core STREAM to make it work - In-cache performance is not correctly predicted - The ECM performance model gives more insight: G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power properties of modern multicore chips via simple machine models. Concurrency and Computation: Practice and Experience (2013). DOI: 10.1002/cpe.3180 Preprint: arXiv:1208.2908 Roofline Model (RLM) – Refined Code Balance and Machine Balance ### Machine balance for hardware characterization For quick comparisons the concept of machine balance is useful $$B_m = \frac{b_S}{P_{\text{peak}}}$$ - Machine Balance = How much input data can be delivered for each FP operation? ("Memory Gap characterization") - Assuming balanced MULT/ADD - Rough estimate: $B_m \ll B_c \rightarrow$ strongly memory-bound code - Typical values (main memory): Intel Haswell 14-core 2.3 GHz $B_m$ = 60 GB/s / (14 x 2.3 x 16) GF/s $\approx$ 0.12 B/F Intel Sandy Bridge 8-core 2.7 GHz $\approx$ 0.23 B/F Nvidia P100 $\approx$ 0.10 B/F Intel Xeon Phi Knights Landing (HBM) $\approx$ 0.16 B/F ### Machine balance over time PTfS 2024 # **RLM Case Study** Tall & Skinny Matrix-Transpose Times Tall & Skinny Matrix (TSMTTSM) Multiplication ## **TSMTTSM Multiplication** - Block of vectors → Tall & Skinny Matrix (e.g. 10<sup>7</sup> x 10<sup>1</sup> dense matrix) - Row-major storage format - Block vector subspace orthogonalization procedure requires, e.g. computation of scalar product between vectors of two blocks ■ TSMTTSM Mutliplication $K \gg N, M$ Assume: $\alpha = 1$ ; $\beta = 0$ $$M = \alpha + \beta$$ $$C = \alpha \qquad A^T \qquad * B + \beta C$$ ### TSMTTSM Multiplication General rule for dense matrix-matrix multiply: Use vendor-optimized GEMM, e.g. from Intel MKL<sup>1</sup>: $C_{mn} = \sum_{k=1}^{K} A_{mk} B_{kn}$ , m = 1..M, n = 1..N double | System | P <sub>peak</sub> [GF/s] | b <sub>s</sub> [GB/s] | Size | Perf. | Efficiency | |--------------------------------------|--------------------------|-----------------------|------|-----------|------------| | Intel Xeon E5 2660 v2<br>10c@2.2 GHz | 176 GF/s | 52 GB/s | SQ | 160 GF/s | 91% | | | | | TS | 16.6 GF/s | 6% | | Intel Xeon E5 2697 v3<br>14c@2.6GHz | 582 GF/s | 65 GB/s | SQ | 550 GF/s | 95% | | | | | TS | 22.8 GF/s | 4% | complex double Matrix sizes: Square (SQ): M=N= **K=15,000** Tall&Skinny (TS): M=N=16; **K=10,000,000** TS@MKL: Good or bad? <sup>1</sup>Intel Math Kernel Library (MKL) 11.3 ### **TSMTTSM** Roofline model #### Computational intensity $$I = \frac{\text{#flops}}{\text{#bytes (slowest data path)}}$$ N = M Optimistic model (minimum data transfer) assuming $M = N \ll K$ and double precision: $$I_{d} = \frac{2KMN}{8(2MN + KM + KN)} \frac{F}{B} \approx \frac{2MN}{8(M+N)} \frac{F}{B} = \frac{M}{8} \frac{F}{B}$$ complex double: $$I_{z} = \frac{8KMN}{16(2MN + KM + KN)} \frac{F}{B} \approx \frac{8MN}{16(M+N)} \frac{F}{B} = \frac{M}{4} \frac{F}{B}$$ ### TSMTTSM Roofline performance prediction Now choose $$M=N=16 \rightarrow I_d \approx \frac{16}{8} \frac{F}{B}$$ and $I_Z \approx \frac{16}{4} \frac{F}{B}$ , i.e. $B_d \approx 0.5 \frac{B}{F}$ , $B_Z \approx 0.25 \frac{B}{F}$ Intel Xeon E5 2660 v2 $$(b_S = 52\frac{GB}{s}) \rightarrow P = I_d \times b_S = 104\frac{GF}{s}$$ (double) Measured (MKL): 16.6 $\frac{GF}{s}$ Intel Xeon E5 2697 v3 $$(b_S = 65 \frac{GB}{s}) \rightarrow P = I_Z \times b_S = 240 \frac{GF}{s}$$ (double complex) Measured (MKL): 22.8 $\frac{GF}{s}$ → Potential speedup: 6–10x vs. MKL ### Can we implement a better TSMTTSM kernel than Intel? ``` Thread local copy of small (results) matrix 1 #pragma omp parallel 2 { double c_{tmp}[n*m] = \{0.\}; Long Loop (k): Parallel 5 #pragma omp for Outer Loop Unrolling for (int row = 0; row < k-1; row+=2) {_____ for (int bcol = 0; bcol < n; bcol++) { 8 #pragma simd Compiler directives for (int acol = 0; acol < m; acol++) {</pre> c_tmp[bcol*m+acol] += a[(row+0)*m + acol] * b[(row+0)*n + bcol] + Most operations a[(row+1)*m + acol] * b[(row+1)*n + bcol]; in cache 15 17 #pragma omp critical Reduction on for (int bcol = 0; bcol < n; bcol++) { small result matrix 19 #pragma simd for (int acol = 0; acol < m; acol++) {</pre> c[bcol*m+acol] += c_tmp[bcol*m+acol]; 23 24 } ``` Not shown: Inner Loop boundaries (n,m) known at compile time (kernel generation) k assumed to be even ## TSMTTSM MKL vs. "hand crafted" (OPT) TS: M=N=16; K=10,000,000 | System | P <sub>peak</sub> / b <sub>S</sub> | Version | Performance | RLM Efficiency | |-----------------------|------------------------------------|---------|-------------|----------------| | Intel Xeon E5 2660 v2 | 176 GF/s | TS OPT | 98 GF/s | 94 % | | 10c@2.2 GHz | 52 GB/s | TS MKL | 16.6 GF/s | 16 % | | Intel Xeon E5 2697 v3 | 582 GF/s | TS OPT | 159 GF/s | 66 % | | 14c@2.6GHz | 65 GB/s | TS MKL | 22.8 GF/s | 9.5 % |