ON THE FUTURE OF HIGH PERFORMANCE COMPUTING: HOW TO THINK FOR PETA AND EXASCALE COMPUTING
JACK DONGARRA
UNIVERSITY OF TENNESSEE
OAK RIDGE NATIONAL LAB

What Is LINPACK?
• LINPACK is a package of mathematical software for solving problems in linear algebra, mainly dense systems of linear equations.
• LINPACK: “LINear algebra PACKage”
• Written in Fortran 66
• The project had its origins in 1974
• The project had four primary contributors: myself when I was at Argonne National Lab, Jim Bunch from the University of California-San Diego, Cleve Moler who was at New Mexico at that time, and Pete Stewart from the University of Maryland.
• LINPACK as a software package has been largely superseded by LAPACK, which has been designed to run efficiently on shared-memory, vector supercomputers.

Computing in 1974
• High Performance Computers: IBM 370/195, CDC 7600, Univac 1110, DEC PDP-10, Honeywell 6030
• Fortran 66
• Trying to achieve software portability
• Floating point operations were expensive (we didn’t think much about energy used)
• Run efficiently
• BLAS (Level 1): vector operations
• Software released in 1979, about the time of the Cray 1

LINPACK Benchmark?
• The Linpack Benchmark is a measure of a computer’s floating-point rate of execution. It is determined by running a computer program that solves a dense system of linear equations.
• Over the years the characteristics of the benchmark have changed a bit. In fact, there are three benchmarks included in the Linpack Benchmark report.
• LINPACK Benchmark: dense linear system solve with LU factorization using partial pivoting
  - Operation count is: $\tfrac{2}{3}n^3 + O(n^2)$
  - Benchmark measure: MFlop/s
  - The original benchmark measures the execution rate for a Fortran program on a matrix of size 100x100.

Accidental Benchmarker
• Appendix B of the Linpack Users’ Guide
• Designed to help users extrapolate execution time for the Linpack software package
• First benchmark report from 1977; Cray 1 to DEC PDP-10

Top500 List of Supercomputers
H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP (Ax=b, dense problem)
- Updated twice a year: SC‘xy in the States in November, meeting in Germany in June
- All data available from www.top500.org
[Figure: TPP performance, Rate versus Size]

Over Last 20 Years - Performance Development
[Figure: Top500 performance, 1993 to 2012, log scale from 100 Mflop/s to 100 Pflop/s. SUM: 1.17 TFlop/s (1993) to 123 PFlop/s (2012). N=1: 59.7 GFlop/s (1993) to 16.3 PFlop/s (2012). N=500: 400 MFlop/s (1993) to 60.8 TFlop/s (2012). Annotations: "6-8 years"; My Laptop (70 Gflop/s); My iPad2 & iPhone 4s (1.02 Gflop/s).]

June 2012: The TOP10
Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt
1 | DOE / NNSA L Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 16.3 | 81 | 8.6 | 1895
2 | RIKEN Advanced Inst for Comp Sci | K computer Fujitsu SPARC64 VIIIfx (8c) + custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 830
3 | DOE / OS Argonne Nat Lab | Mira, BlueGene/Q (16c) + custom | USA | 786,432 | 8.16 | 81 | 3.95 | 2069
4 | Leibniz Rechenzentrum | SuperMUC, Intel (8c) + IB | Germany | 147,456 | 2.90 | 90* | 3.52 | 823
5 | Nat. SuperComputer Center in Tianjin | Tianhe-1A, NUDT Intel (6c) + Nvidia GPU (14c) + custom | China | 186,368 | 2.57 | 55 | 4.04 | 636
6 | DOE / OS Oak Ridge Nat Lab | Jaguar, Cray AMD (16c) + custom | USA | 298,592 | 1.94 | 74 | 5.14 | 377
7 | CINECA | Fermi, BlueGene/Q (16c) + custom | Italy | 163,840 | 1.73 | 82 | .821 | 2099
8 | Forschungszentrum Juelich (FZJ) | JuQUEEN, BlueGene/Q (16c) + custom | Germany | 131,072 | 1.38 | 82 | .657 | 2099
9 | Commissariat a l'Energie Atomique (CEA) | Curie, Bull Intel (8c) + IB | France | 77,184 | 1.36 | 82 | 2.25 | 604
10 | Nat. Supercomputer Center in Shenzhen | Nebulae, Dawning Intel (6c) + Nvidia GPU (14c) + IB | China | 120,640 | 1.27 | 43 | 2.58 | 493
500 | Energy Comp | IBM Cluster, Intel + IB | Italy | 4096 | .061 | 93* | |

Accelerators (58 systems)
[Figure: number of Top500 systems with accelerators per year, 2006 to 2012, y-axis 0 to 60 systems.]
By accelerator type: NVIDIA 2090 (31), NVIDIA 2050 (12), NVIDIA 2070 (10), IBM PowerXCell 8i (2), ATI GPU (2), Intel MIC (1), Clearspeed CSX600 (0)
By country (2012): 27 US, 7 China, 4 Japan, 3 Russia, 2 France, 2 Germany, 2 India, 2 Italy, 2 Poland, 1 Australia, 1 Brazil, 1 Canada, 1 Singapore, 1 Spain, 1 Taiwan, 1 UK

Countries Share
Absolute Counts: US: 252, China: 68, Japan: 35, UK: 25, France: 22, Germany: 20

28 Systems at > Pflop/s (Peak)
9/20/2012

Top500 in France
rank | manufacturer | computer | installation_site_name | procs | r_peak [Gflop/s]
9 | Bull | Bullx B510, Xeon E5-2680 8C 2.700GHz, InfB | CEA/TGCC-GENCI | 77184 | 1667174
17 | Bull | Bull bullx super-node S6010/S6030 | CEA | 138368 | 1254550
29 | IBM | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | EDF R&D | 65536 | 838861
30 | IBM | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | IDRIS/GENCI | 65536 | 838861
45 | Bull | Bullx B510, Xeon E5 (Sandy Bridge - EP) 8C 2.70GHz, InfB | Bull | 20480 | 442368
69 | HP | HP POD - Cluster Platform 3000, X5675 3.06 GHz, InfB | Airbus | 24192 | 296110
75 | SGI | SGI Altix ICE 8200EX, Xeon E5472 3.0/X5560 2.8 GHz | (GENCI-CINES) | 23040 | 267878
95 | HP | Cluster Platform 3000 BL2x220, L54xx 2.5 Ghz, InfB | Government | 24704 | 247040
97 | Bull | Bullx B510, Xeon E5-2680 8C 2.700GHz, InfB | CEA | 9440 | 203904
104 | IBM | iDataPlex, Xeon X56xx 6C 2.93 GHz, InfB | EDF R&D | 16320 | 191270
115 | Bull | Bullx B505, Xeon E56xx (Westmere-EP) 2.40 GHz, InfB | CEA | 7020 | 274560
149 | IBM | Blue Gene/P Solution | IDRIS | 40960 | 139264
162 | Bull | Bullx B505, Xeon E5640 2.67 GHz, InfB | CEA/TGCC-GENCI | 5040 | 198162
168 | Bull | BULL Novascale R422-E2 | CEA | 11520 | 129998
171 | Bull | Bullx B510, Xeon E5 (Sandy Bridge - EP) 8C 2.70GHz, InfB | Bull | 5760 | 124416
173 | SGI | SGI Altix ICE 8200EX, Xeon quad core 3.0 GHz | Total | 10240 | 122880
201 | IBM | Blue Gene/P Solution | EDF R&D | 32768 | 111411
240 | HP | Cluster Platform 3000, Xeon X5570 2.93 GHz, InfB | Manufacturing | 8576 | 100511
244 | Bull | Bull bullx super-node S6010/S6030 | Bull | 11520 | 104602
245 | Bull | Bullx S6010 Cluster, Xeon 2.26 Ghz 8-core, InfB | CEA/TGCC-GENCI | 11520 | 104417
365 | IBM | xSeries x3550M3 Cluster, Xeon X5650 6C 2.66 GHz, GigE | Information | 13068 | 139044
477 | IBM | iDataPlex, Xeon E55xx QC 2.26 GHz, GigE | Financial | 13056 | 118026

Linpack Efficiency
[Figure: Linpack efficiency, 0% to 100%, plotted across the 500 systems.]

Performance Development in Top500
[Figure: N=1 and N=500 performance, 1994 to 2020, log scale from 100 Mflop/s to 1 Eflop/s and beyond.]

The High Cost of Data Movement
• Flop/s or percentage of peak flop/s become much less relevant
Approximate energy costs (in picojoules)
Operation | 2011 | 2018
DP FMADD flop | 100 pJ | 10 pJ
DP DRAM read | 4800 pJ | 1920 pJ
Local Interconnect | 7500 pJ | 2500 pJ
Cross System | 9000 pJ | 3500 pJ
Source: John Shalf, LBNL
• Algorithms & Software: minimize data movement; perform more work per unit data movement.
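To make the data-movement point concrete, the short sketch below divides each energy figure from the table (source: John Shalf, LBNL) by the energy of one DP FMADD flop in the same year. The dictionary layout and the function name `flops_per_access` are illustrative choices made here, not something from the talk.

```python
# Energy per operation in picojoules, copied from the table above
# (source given there: John Shalf, LBNL). Layout and names are illustrative.
COSTS_PJ = {
    "DP FMADD flop": {2011: 100, 2018: 10},
    "DP DRAM read": {2011: 4800, 2018: 1920},
    "Local Interconnect": {2011: 7500, 2018: 2500},
    "Cross System": {2011: 9000, 2018: 3500},
}


def flops_per_access(operation, year):
    """Number of DP FMADD flops that cost the same energy as one `operation`."""
    return COSTS_PJ[operation][year] / COSTS_PJ["DP FMADD flop"][year]


if __name__ == "__main__":
    for year in (2011, 2018):
        for operation in ("DP DRAM read", "Local Interconnect", "Cross System"):
            ratio = flops_per_access(operation, year)
            print(f"{year}: one {operation} costs as much as {ratio:.0f} flops")
```

With these figures a DRAM read costs as much as 48 flops in 2011 and 192 flops in 2018: every absolute cost falls, but moving data becomes about four times more expensive relative to arithmetic.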
Energy Cost Challenge
• At ~$1M per MW, energy costs are substantial
• 10 Pflop/s in 2011 uses ~10 MW
• 1 Eflop/s in 2018: > 100 MW
• DOE Target: 1 Eflop/s in 2018 at 20 MW

Potential System Architecture with a cap of $200M and 20MW
Systems | 2012 (BG/Q Computer) | 2022 | Difference Today & 2022
System peak | 20 Pflop/s | 1 Eflop/s | O(100)
Power | 8.6 MW (2 Gflops/W) | ~20 MW (50 Gflops/W) |
System memory | 1.6 PB (16*96*1024) | 32 - 64 PB | O(10)
Node performance | 205 GF/s (16*1.6GHz*8) | 1.2 or 15 TF/s | O(10) – O(100)
Node memory BW | 42.6 GB/s | 2 - 4 TB/s | O(1000)
Node concurrency | 64 Threads | O(1k) or 10k | O(100) – O(1000)
Total Node Interconnect BW | 20 GB/s | 200-400 GB/s | O(10)
System size (nodes) | 98,304 (96*1024) | O(100,000) or O(1M) | O(100) – O(1000)
Total concurrency | 5.97 M | O(billion) | O(1,000)
MTTI | 4 days | O(<1 day) | - O(10)

Critical Issues at Peta & Exascale for Algorithm and Software Design
• Synchronization-reducing algorithms
  - Break Fork-Join model
• Communication-reducing algorithms
  - Use methods which have lower bound on communication
• Mixed precision methods
  - 2x speed of ops and 2x speed for data movement
• Autotuning
  - Today’s machines are too complicated, build “smarts” into software to adapt to the hardware
• Fault resilient algorithms
  - Implement algorithms that can recover from failures/bit flips
• Reproducibility of results
  - Today we can’t guarantee this.
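A small illustration of why reproducibility is hard to guarantee: floating-point addition is not associative, so the same numbers reduced in a different grouping (for example, because the work was split differently across cores) can round to a different result. The sketch below is only an illustration under that assumption. The helpers `naive_sum` and `two_worker_sum` are hypothetical names, and `math.fsum` is shown as one order-independent alternative from the Python standard library, not as a technique proposed in the talk.

```python
import math


def naive_sum(values):
    """Plain left-to-right accumulation: one rounding per addition."""
    total = 0.0
    for v in values:
        total += v
    return total


def two_worker_sum(values, split):
    """Mimic two workers: each sums its own slice, then the partials are added."""
    return naive_sum(values[:split]) + naive_sum(values[split:])


if __name__ == "__main__":
    data = [0.1, 0.2, 0.3]
    print(naive_sum(data))           # (0.1 + 0.2) + 0.3
    print(two_worker_sum(data, 1))   # 0.1 + (0.2 + 0.3)
    # math.fsum tracks exact partial sums, so the ordering does not matter.
    print(math.fsum(data), math.fsum(reversed(data)))
```

The first two lines print values that differ in the last bit even though both are sums of the same three numbers; at scale, the grouping depends on how many threads or nodes take part in the reduction.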
We understand the issues, but some of our “colleagues” have a hard time with this.

Major Changes to Algorithms/Software
• Must rethink the design of our algorithms and software
  - Manycore and Hybrid architectures are disruptive technology
  - Similar to what happened with cluster computing and message passing
  - Rethink and rewrite the applications, algorithms, and software
  - Data movement is expensive
  - Flops are cheap

Dense Linear Algebra Software Evolution
• LINPACK (70's): vector operations; Level 1 BLAS
• LAPACK (80's): block operations; Level 3 BLAS
• ScaLAPACK (90's): block cyclic data distribution (message passing); PBLAS, BLACS
• PLASMA (00's): tile operations; tile layout, dataflow scheduling

PLASMA Principles
• Tile Algorithms: minimize capacity misses
• Tile Matrix Layout: minimize conflict misses
• Dynamic DAG Scheduling: minimizes idle time, more overlap, asynchronous ops
[Figure: LAPACK (one CPU, cache, MEM) next to PLASMA (four CPUs, each with its own cache, sharing MEM).]

Fork-Join Parallelization of LU and QR
Parallelize the update (dgemm):
• Easy and done in any reasonable software.
• This is the $\tfrac{2}{3}n^3$ term in the FLOPs count.
• Can be done efficiently with LAPACK + multithreaded BLAS
[Figure: activity per core ("Cores") under fork-join.]

PLASMA/MAGMA: Parallel Linear Algebra s/w for Multicore/Hybrid Architectures
Objectives
• High utilization of each core
• Scaling to large number of cores
• Synchronization reducing algorithms
Methodology
• Dynamic DAG scheduling (QUARK)
• Explicit parallelism
• Implicit communication
• Fine granularity / block data layout
Arbitrary DAG with dynamic scheduling
[Figure: fork-join parallelism compared with DAG scheduled parallelism.]

Communication Avoiding QR Example
[Figure: domains D0, D1, D2, D3 are each factored with Domain_Tile_QR, giving R0, R1, R2, R3, which are then combined into a single R.]
A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610–1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.
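The communication-avoiding QR idea in the example above can be sketched in a few lines: factor each row domain independently, then merge the small R factors up a reduction tree. The sketch below makes several assumptions that are not from the talk. It uses modified Gram-Schmidt as a short stand-in for the Householder-based tile kernels, it uses a binary tree, it returns only R, and the function names `mgs_r` and `tsqr_r` are hypothetical. Each domain must have at least as many rows as the matrix has columns, and full column rank.

```python
import random


def mgs_r(rows):
    """Upper-triangular R of a QR factorization of `rows` (m x n, m >= n).

    Modified Gram-Schmidt, chosen for brevity; R has a positive diagonal.
    """
    m, n = len(rows), len(rows[0])
    cols = [[rows[i][j] for i in range(m)] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        r[j][j] = sum(x * x for x in cols[j]) ** 0.5
        q = [x / r[j][j] for x in cols[j]]
        for k in range(j + 1, n):
            r[j][k] = sum(a * b for a, b in zip(q, cols[k]))
            cols[k] = [c - r[j][k] * a for c, a in zip(cols[k], q)]
    return r


def tsqr_r(rows, domains=4):
    """R factor of a tall-skinny matrix via a binary reduction tree."""
    size = -(-len(rows) // domains)  # ceiling division
    # Step 1: independent QR of each row domain (no communication needed).
    rs = [mgs_r(rows[i:i + size]) for i in range(0, len(rows), size)]
    # Step 2: stack pairs of n x n R factors and factor each 2n x n stack.
    while len(rs) > 1:
        merged = [mgs_r(rs[i] + rs[i + 1]) for i in range(0, len(rs) - 1, 2)]
        if len(rs) % 2:
            merged.append(rs[-1])
        rs = merged
    return rs[0]


if __name__ == "__main__":
    rng = random.Random(1)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(64)]
    for row in tsqr_r(a, domains=4):
        print(["%.6f" % x for x in row])
```

Only the small n x n factors travel between domains, so the data exchanged per merge is independent of the number of rows. That is the property that matters for a very tall and skinny matrix.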
PowerPack 2.0
The PowerPack platform consists of software and hardware instrumentation.
Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/

Power for QR Factorization
• LAPACK’s QR Factorization: Fork-join based
• MKL’s QR Factorization: Fork-join based
• PLASMA’s Conventional QR Factorization: DAG based
• PLASMA’s Communication Reducing QR Factorization: DAG based
Dual-socket quad-core Intel Xeon E5462 (Harpertown) processor @ 2.80GHz (8 cores total) w/ MKL BLAS; the matrix is very tall and skinny (m x n is 1,152,000 by 288).

The standard Tridiagonal reduction xSYTRD
LAPACK xSYTRD:
1. Apply left-right transformations Q A Q* to the panel $\begin{pmatrix} A_{22} \\ A_{32} \end{pmatrix}$
2. Update the remaining submatrix A33
[Figure: step k: Q A Q* then update; step k+1.]
For the symmetric eigenvalue problem, the first stage takes:
• 90% of the time if only eigenvalues
• 50% of the time if eigenvalues and eigenvectors

The standard Tridiagonal reduction xSYTRD
Characteristics
1. Phase 1 requires:
   o 4 panel vector multiplications,
   o 1 symmetric matrix vector multiplication with A33,
   o Cost $2(n-k)^2 b$ Flops.
2. Phase 2 requires:
   o Symmetric update of A33 using SYRK,
   o Cost $2(n-k)^2 b$ Flops.
Observations
• Too many Level 2 BLAS ops,
• Relies on panel factorization,
• Total cost $4n^3/3$,
• Bulk sync phases,
• Memory bound algorithm.

Symmetric Eigenvalue Problem
• Standard reduction algorithms are very slow on multicore.
• Step 1: Reduce the dense matrix to band.
  - Matrix-matrix operations, high degree of parallelism
• Step 2: Bulge chasing on the band matrix
  - by group and cache aware

[Figure: two performance plots, "Symmetric Eigenvalues (eigenvalues only)" and "Singular Values (singular values only)".]
Experiments on eight-socket six-core AMD Opteron 2.4 GHz processors with MKL V10.3.
Block DAG based to banded form, then pipelined group chasing to tridiagonal form.
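The $4n^3/3$ total quoted for xSYTRD can be recovered from the per-step costs listed in its characteristics. A rough derivation follows, assuming the panel offset $k$ advances in steps of the panel width $b$ and that $b$ is small relative to $n$. Both assumptions are made here for the sketch, not stated on the slides.

```latex
\begin{align*}
\text{cost of the step at offset } k
  &= \underbrace{2(n-k)^2 b}_{\text{Phase 1}}
   + \underbrace{2(n-k)^2 b}_{\text{Phase 2}}
   = 4(n-k)^2 b, \\
\text{total cost}
  &= \sum_{k \in \{0,\, b,\, 2b,\, \dots\}} 4(n-k)^2\, b
   \;\approx\; 4 \int_0^n (n-k)^2 \, dk
   = \frac{4n^3}{3}.
\end{align*}
```

Under the same approximation Phase 1 alone contributes $2n^3/3$, so about half of the flops are matrix-vector work, which fits the observation that the algorithm is memory bound.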
The reduction to condensed form accounts for the factor of 50 improvement over LAPACK.
Execution rates based on $\tfrac{4}{3}n^3$ ops.

Summary
• These are old ideas (today SMPss, StarPU, Charm++, ParalleX, Swarm,…)
• Major Challenges are ahead for extreme computing
  - Power
  - Levels of Parallelism
  - Communication
  - Hybrid
  - Fault Tolerance
  - … and many others not discussed here
• Not just a programming assignment.
• This opens up many new opportunities for applied mathematicians and computer scientists

Collaborators / Software / Support
• PLASMA http://icl.cs.utk.edu/plasma/
• MAGMA http://icl.cs.utk.edu/magma/
• Quark (RT for Shared Memory) http://icl.cs.utk.edu/quark/
• PaRSEC (Parallel Runtime Scheduling and Execution Control) http://icl.cs.utk.edu/parsec/
• Collaborating partners
  - University of Tennessee, Knoxville
  - University of California, Berkeley
  - University of Colorado, Denver
  - INRIA, France
  - KAUST, Saudi Arabia
These tools are being applied to a range of applications beyond dense LA: sparse direct methods, sparse iterative methods, and Fast Multipole Methods

© Copyright 2020