Outline 1.) N-Body Methods 2.) Dynamic Programming 3.) Sparse Linear Algebra 4.) Unstructured Grids 5.) Conclusion 1 N-Body Methods Overview N-Body Methods • Particle simulation – Large number of simple entities – Each entity is dependent on each other – Regularly structured FP computations • Xeon Phi offers acceptable performance – Regularity leads to good auto-vectorization – Cache misses result in very high memory latency • L1 misses mostly lead to L2 misses N-Body platform comparison Source: [3] Dynamic Programming Overview Dynamic Programming • Problem is divided into smaller problems – Each smaller (and simpler) problem is solved first – Results are combined to compute larger problems – Often used for sequence alignment algorithms • Example: DNA Sequence Alignment – Xeon Phi 5110P vs. 2 GPU configurations ( GeForce GTX 480, K20c ) • Outperforms GTX 480 • Reaches 74,3% performance compared to the K20c – Computation heavily based on peak integer performance • Xeon Phi has ~ 57,3 % peak integer performance of the K20c Sparse Linear Algebra Overview Sparse Linear Algebra • Matrices with scattered non-zero entries • Linear system solvers, Eigen solvers • Different matrix structures – Patterns can arise – Irregularities • False predictions • Vector registers contain zero entries Case Study: Sparse MatrixVector Operations • Large sparse matrix A • Dense vector x • Multiply-Add • Simple instructions, large quantitiy • Parallel execution • Access patterns Image Source: [2] Bottleneck Considerations Bandwidth limits Memory bandwidth Computation bandwidth Memory latency SIMD efficiency Vector register utilization Core utilization Algorithm Showcase • CSR / SELLPACK format – Column blocks – Finite-Window sorting • Adaptive load balancing Image Source: [1] Evaluation: Effects of single improvements Source: [1] Evaluation: Platform comparison Source: [1] [Gflops/s] Evaluation: Platform comparison Matrices from the UFL Sparse Matrix Collection (/20) Matrix 8 has much more non-zero entries than the rest. Based on data from [2] Conclusion: Sparse Linear Algebra on Xeon Phi • High potential in sparse linear algebra – Wide vector registers – High number of cores • Main problem points: – Irregular access patterns – Sparse data structures – Memory latency high on L2 cache misses • Performance extremely dependent on data Unstructured Grids Image Source: [4] 1 6 Image Source: [5] Image Source: [6] Unstructured Grids ● ● ● ● Partitioning the grid (few connections) Irregular acces-patterns => bad for auto-vectorization Software prefetching grid cells Ideal prefetch amount: MM latency * MM bandwidth 1 9 Unstructured Grids Data races can not be determined statically „loop over edges and accessing data on edges“ => data races 2 0 Airfoil Benchmark ● 2D inviscid airfoil code ● considered bandwith bound ● 2,800,000 cell mesh Image Source: [7] 2 1 Competitors 3500 2 × Xeon E5-2680 Xeon Phi 5110P Tesla K40 3000 2500 2000 1500 1000 500 0 Price $ # Cores Mem-Bandwidth double GFlops Last Level Cache 2 2 Airfoil on Xeon Phi At the highest level: Message Passing Protocol (MPI) At the lower level: OpenMP + Vector intrinsics 2 3 Xeon Phi Benchmark graph Image Source: [8] 2 4 Benchmark graph Image Source: [8] 2 5 Unstructured Grids conclusion Benchmark probably not fair (price) DO manual Vectorization: Speedup 1.7-1.82 Use MPI + OpenMP 2 6 Who did what Philipp Bartels: Architecture introduction: slide 18 and 1 to 8 Presentation team x2: slide 1 and 16 to 27 Eugen Seljutin: Presentation team x2: slide 2 to 15 2 7 Credits [1] Xing Liu et al.: Efficient Sparse Matrix-Vector Multiplication on x86-Based Many-Core Processors [2] Erik Saule et al.: Performance Evaluation of Sparse Matrix Multiplication Kernels on Intel Xeon Phi [3] Konstantinos Krommydas et al.: On the Characterization of OpenCL Dwarfs on Fixed and Reconfigurable Platforms [4] http://en.wikipedia.org/wiki/Unstructured_grid [5] http://view.eecs.berkeley.edu/wiki/Unstructured_Grids [6] https://cfd.gmu.edu/~jcebral/gallery/vis04/index.html 2 8 Credits [7] http://en.wikipedia.org/wiki/File:Airfoil_with_flow.png [8] I. Z. Reguly et al.: Vectorizing Unstructured Mesh Computations for Many-core Architecture 2 9

© Copyright 2020