ON THE FUTURE OF HIGH
PERFORMANCE COMPUTING:
HOW TO THINK FOR PETA
AND EXASCALE COMPUTING
JACK DONGARRA
UNIVERSITY OF TENNESSEE
OAK RIDGE NATIONAL LAB
What Is LINPACK?
LINPACK is a package of mathematical software for solving problems in linear algebra, mainly dense systems of linear equations.
LINPACK: "LINear algebra PACKage"
• Written in Fortran 66
• The project had its origins in 1974
• The project had four primary contributors: myself when I was at Argonne National Lab, Jim Bunch from the University of California-San Diego, Cleve Moler who was at New Mexico at that time, and Pete Stewart from the University of Maryland.
• LINPACK as a software package has been largely superseded by LAPACK, which was designed to run efficiently on shared-memory vector supercomputers.
Computing in 1974
High Performance Computers:
• IBM 370/195, CDC 7600, Univac 1110, DEC PDP-10, Honeywell 6030
Fortran 66
Trying to achieve software portability
Floating point operations were expensive
We didn't think much about the energy used
Run efficiently
BLAS (Level 1)
• Vector operations (see the AXPY sketch below)
Software released in 1979
• About the time of the Cray 1
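As a minimal illustration of the kind of Level 1 BLAS operation LINPACK was built on, here is AXPY (y <- alpha*x + y), written once by hand and once through the standard CBLAS interface. The hand-written routine is purely illustrative; only cblas_daxpy is an actual library call, and a BLAS library is assumed to be linked.

```c
/* Level 1 BLAS style vector operation: y <- alpha*x + y (AXPY). */
#include <cblas.h>

void axpy_by_hand(int n, double alpha, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];            /* one multiply-add per element */
}

void axpy_blas(int n, double alpha, const double *x, double *y)
{
    cblas_daxpy(n, alpha, x, 1, y, 1);   /* strides of 1: contiguous vectors */
}
```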
LINPACK Benchmark?
The Linpack Benchmark is a measure of a computer's floating-point rate of execution.
• It is determined by running a computer program that solves a dense system of linear equations.
Over the years the characteristics of the benchmark have changed a bit.
• In fact, there are three benchmarks included in the Linpack Benchmark report.
LINPACK Benchmark
 Dense linear system solve with LU factorization using partial
pivoting
 Operation count is: 2/3 n3 + O(n2)
 Benchmark Measure: MFlop/s
 Original benchmark measures the execution rate for a
Fortran program on a matrix of size 100x100.
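As a concrete illustration of how the reported number is produced, the sketch below converts a measured solve time into MFlop/s using the benchmark's nominal operation count. Using 2n^2 for the O(n^2) term follows the usual Linpack/HPL convention; the timing harness around this function is assumed.

```c
/* Convert a measured dense-solve time into a Linpack-style MFlop/s figure.
 * ops = 2/3*n^3 + 2*n^2 is the nominal count charged by the benchmark,
 * regardless of how many flops the solver actually performed. */
double linpack_mflops(int n, double seconds)
{
    double dn  = (double)n;
    double ops = (2.0 / 3.0) * dn * dn * dn + 2.0 * dn * dn;
    return ops / seconds / 1.0e6;        /* MFlop/s */
}
```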
Accidental Benchmarker
Appendix B of the Linpack Users' Guide
• Designed to help users extrapolate execution time for the Linpack software package
First benchmark report from 1977:
• Cray 1 to DEC PDP-10
Top500 List of Supercomputers
H. Meuer, H. Simon, E. Strohmaier, & JD
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP
  Ax=b, dense problem, TPP performance
- Updated twice a year
  SC'xy in the States in November
  Meeting in Germany in June
- All data available from www.top500.org
Over Last 20 Years - Performance Development
[Chart: TOP500 performance, 1993-2012. The sum of all 500 systems grew from 1.17 TFlop/s to 123 PFlop/s; the #1 system (N=1) from 59.7 GFlop/s to 16.3 PFlop/s; the #500 system (N=500) from 400 MFlop/s to 60.8 TFlop/s, trailing N=1 by roughly 6-8 years. For scale: my laptop (70 Gflop/s), my iPad2 & iPhone 4s (1.02 Gflop/s).]
June 2012: The TOP10

Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt
1 | DOE/NNSA, Lawrence Livermore Nat Lab | Sequoia, BlueGene/Q (16c) + custom | USA | 1,572,864 | 16.3 | 81 | 8.6 | 1895
2 | RIKEN Advanced Inst for Comp Sci | K computer, Fujitsu SPARC64 VIIIfx (8c) + custom | Japan | 705,024 | 10.5 | 93 | 12.7 | 830
3 | DOE/OS, Argonne Nat Lab | Mira, BlueGene/Q (16c) + custom | USA | 786,432 | 8.16 | 81 | 3.95 | 2069
4 | Leibniz Rechenzentrum | SuperMUC, Intel (8c) + IB | Germany | 147,456 | 2.90 | 90* | 3.52 | 823
5 | Nat. Supercomputer Center in Tianjin | Tianhe-1A, NUDT, Intel (6c) + Nvidia GPU (14c) + custom | China | 186,368 | 2.57 | 55 | 4.04 | 636
6 | DOE/OS, Oak Ridge Nat Lab | Jaguar, Cray, AMD (16c) + custom | USA | 298,592 | 1.94 | 74 | 5.14 | 377
7 | CINECA | Fermi, BlueGene/Q (16c) + custom | Italy | 163,840 | 1.73 | 82 | 0.821 | 2099
8 | Forschungszentrum Juelich (FZJ) | JuQUEEN, BlueGene/Q (16c) + custom | Germany | 131,072 | 1.38 | 82 | 0.657 | 2099
9 | Commissariat a l'Energie Atomique (CEA) | Curie, Bull Intel (8c) + IB | France | 77,184 | 1.36 | 82 | 2.25 | 604
10 | Nat. Supercomputer Center in Shenzhen | Nebulae, Dawning, Intel (6c) + Nvidia GPU (14c) + IB | China | 120,640 | 1.27 | 43 | 2.58 | 493
500 | Energy Comp | IBM Cluster, Intel + IB | Italy | 4096 | 0.061 | 93* | - | -
Accelerators (58 systems)
[Chart: number of TOP500 systems with accelerators, 2006-2012. June 2012 breakdown: NVIDIA 2090 (31), NVIDIA 2050 (12), NVIDIA 2070 (10), ATI GPU (2), IBM PowerXCell 8i (2), Intel MIC (1), Clearspeed CSX600 (0).]
By country: 27 US, 7 China, 4 Japan, 3 Russia, 2 France, 2 Germany, 2 India, 2 Italy, 2 Poland, 1 Australia, 1 Brazil, 1 Canada, 1 Singapore, 1 Spain, 1 Taiwan, 1 UK

Countries Share
Absolute counts: US 252, China 68, Japan 35, UK 25, France 22, Germany 20
28 systems at > 1 Pflop/s (peak)
Top500 in France

Rank | Manufacturer | Computer | Installation site | Procs | Rpeak [GFlop/s]
9 | Bull | Bullx B510, Xeon E5-2680 8C 2.700GHz, InfB | CEA/TGCC-GENCI | 77184 | 1667174
17 | Bull | Bull bullx super-node S6010/S6030 | CEA | 138368 | 1254550
29 | IBM | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | EDF R&D | 65536 | 838861
30 | IBM | BlueGene/Q, Power BQC 16C 1.60GHz, Custom | IDRIS/GENCI | 65536 | 838861
45 | Bull | Bullx B510, Xeon E5 (Sandy Bridge-EP) 8C 2.70GHz, InfB | Bull | 20480 | 442368
69 | HP | HP POD - Cluster Platform 3000, X5675 3.06 GHz, InfB | Airbus | 24192 | 296110
75 | SGI | SGI Altix ICE 8200EX, Xeon E5472 3.0/X5560 2.8 GHz | GENCI-CINES | 23040 | 267878
95 | HP | Cluster Platform 3000 BL2x220, L54xx 2.5 GHz, InfB | Government | 24704 | 247040
97 | Bull | Bullx B510, Xeon E5-2680 8C 2.700GHz, InfB | CEA | 9440 | 203904
104 | IBM | iDataPlex, Xeon X56xx 6C 2.93 GHz, InfB | EDF R&D | 16320 | 191270
115 | Bull | Bullx B505, Xeon E56xx (Westmere-EP) 2.40 GHz, InfB | CEA | 7020 | 274560
149 | IBM | Blue Gene/P Solution | IDRIS | 40960 | 139264
162 | Bull | Bullx B505, Xeon E5640 2.67 GHz, InfB | CEA/TGCC-GENCI | 5040 | 198162
168 | Bull | BULL Novascale R422-E2 | CEA | 11520 | 129998
171 | Bull | Bullx B510, Xeon E5 (Sandy Bridge-EP) 8C 2.70GHz, InfB | Bull | 5760 | 124416
173 | SGI | SGI Altix ICE 8200EX, Xeon quad core 3.0 GHz | Total | 10240 | 122880
201 | IBM | Blue Gene/P Solution | EDF R&D | 32768 | 111411
240 | HP | Cluster Platform 3000, Xeon X5570 2.93 GHz, InfB | Manufacturing | 8576 | 100511
244 | Bull | Bull bullx super-node S6010/S6030 | Bull | 11520 | 104602
245 | Bull | Bullx S6010 Cluster, Xeon 2.26 GHz 8-core, InfB | CEA/TGCC-GENCI | 11520 | 104417
365 | IBM | xSeries x3550M3 Cluster, Xeon X5650 6C 2.66 GHz, GigE | Information | 13068 | 139044
477 | IBM | iDataPlex, Xeon E55xx QC 2.26 GHz, GigE | Financial | 13056 | 118026
Linpack Efficiency
[Chart: Linpack efficiency (Rmax as a percentage of Rpeak, 0-100%) plotted across the 500 ranked systems.]
Performance Development in Top500
[Chart: TOP500 performance trends (Sum, N=1, N=500) from 1994 extrapolated through 2020, with the trend lines reaching toward 1 Eflop/s at the top of the scale.]
The High Cost of Data Movement
• Flop/s, or percentage of peak flop/s, becomes much less relevant.

Approximate power costs (in picoJoules):
Operation          | 2011    | 2018
DP FMADD flop      | 100 pJ  | 10 pJ
DP DRAM read       | 4800 pJ | 1920 pJ
Local interconnect | 7500 pJ | 2500 pJ
Cross system       | 9000 pJ | 3500 pJ
Source: John Shalf, LBNL

• Algorithms & Software: minimize data movement; perform more work per unit of data moved. (A rough sketch of this accounting follows.)
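To make the imbalance concrete, here is a back-of-the-envelope sketch using the 2011 figures above: the energy to compute a long dot product versus the energy to stream its operands from DRAM. The pJ constants come from the table; the vector length and everything else are illustrative assumptions.

```c
/* Rough energy estimate for sum(x[i]*z[i]) of length n, using the
 * approximate 2011 costs quoted above (illustrative only). */
#include <stdio.h>

#define PJ_FLOP      100.0    /* DP fused multiply-add           */
#define PJ_DRAM_READ 4800.0   /* one DP operand read from DRAM   */

int main(void)
{
    double n = 1.0e9;                             /* one billion elements   */
    double flops_energy = n * PJ_FLOP;            /* one FMADD per element  */
    double dram_energy  = 2.0 * n * PJ_DRAM_READ; /* two operands per FMADD */

    printf("compute: %.2f J, DRAM traffic: %.2f J (%.0fx more)\n",
           flops_energy * 1e-12, dram_energy * 1e-12,
           dram_energy / flops_energy);
    return 0;
}
```

The arithmetic costs about 0.1 J while streaming the operands from DRAM costs nearly 10 J, which is the point of the slide: the flops are essentially free compared with moving the data.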
Energy Cost Challenge
At ~$1M per MW per year, energy costs are substantial:
• 10 Pflop/s in 2011 uses ~10 MW
• 1 Eflop/s in 2018 > 100 MW
• DOE target: 1 Eflop/s in 2018 at 20 MW
At that rate a 100 MW exascale system would cost on the order of $100M per year for power alone; the 20 MW target brings that down to roughly $20M per year.
Potential System Architecture
with a cap of $200M and 20 MW

Systems                    | 2012 (BG/Q Computer)   | 2022                 | Difference Today & 2022
System peak                | 20 Pflop/s             | 1 Eflop/s            | O(100)
Power                      | 8.6 MW (2 Gflops/W)    | ~20 MW (50 Gflops/W) |
System memory              | 1.6 PB (16*96*1024)    | 32 - 64 PB           | O(10)
Node performance           | 205 GF/s (16*1.6GHz*8) | 1.2 or 15 TF/s       | O(10) - O(100)
Node memory BW             | 42.6 GB/s              | 2 - 4 TB/s           | O(1000)
Node concurrency           | 64 threads             | O(1k) or 10k         | O(100) - O(1000)
Total node interconnect BW | 20 GB/s                | 200 - 400 GB/s       | O(10)
System size (nodes)        | 98,304 (96*1024)       | O(100,000) or O(1M)  | O(100) - O(1000)
Total concurrency          | 5.97 M                 | O(billion)           | O(1,000)
MTTI                       | 4 days                 | O(<1 day)            | -O(10)
Critical Issues at Peta & Exascale for Algorithm and Software Design
Synchronization-reducing algorithms
• Break the fork-join model
Communication-reducing algorithms
• Use methods which have a lower bound on communication
Mixed precision methods
• 2x speed of ops and 2x speed for data movement (see the iterative-refinement sketch after this list)
Autotuning
• Today's machines are too complicated; build "smarts" into software to adapt to the hardware
Fault resilient algorithms
• Implement algorithms that can recover from failures/bit flips
Reproducibility of results
• Today we can't guarantee this. We understand the issues, but some of our "colleagues" have a hard time with this.
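As a sketch of the mixed precision idea: factor the matrix once in single precision (the fast O(n^3) work) and recover double-precision accuracy through iterative refinement. Tuned versions of this idea appear in PLASMA/MAGMA; the code below is only an illustration, assuming standard LAPACKE/CBLAS interfaces, and the routine name, convergence test, and workspace handling are my own.

```c
/* Mixed-precision iterative refinement for Ax = b (sketch).
 * Factor/solve in single precision; residual and update in double. */
#include <stdlib.h>
#include <math.h>
#include <cblas.h>
#include <lapacke.h>

int mixed_precision_solve(int n, const double *A, const double *b,
                          double *x, int max_iter, double tol)
{
    float  *As = malloc(sizeof(float) * (size_t)n * n);
    float  *zs = malloc(sizeof(float) * n);
    double *r  = malloc(sizeof(double) * n);
    lapack_int *ipiv = malloc(sizeof(lapack_int) * n);
    int it, i, info;

    for (i = 0; i < n * n; i++) As[i] = (float)A[i];   /* demote A */
    for (i = 0; i < n; i++)     x[i] = 0.0;

    /* LU factorization in single precision: O(n^3) work at SP speed */
    info = LAPACKE_sgetrf(LAPACK_COL_MAJOR, n, n, As, n, ipiv);
    if (info != 0) goto done;

    for (it = 0; it < max_iter; it++) {
        /* r = b - A*x, accumulated in double precision */
        for (i = 0; i < n; i++) r[i] = b[i];
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                    -1.0, A, n, x, 1, 1.0, r, 1);

        double rnorm = 0.0;
        for (i = 0; i < n; i++) rnorm = fmax(rnorm, fabs(r[i]));
        if (rnorm < tol) break;

        /* correction z solves A z = r, reusing the single-precision factors */
        for (i = 0; i < n; i++) zs[i] = (float)r[i];
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', n, 1, As, n, ipiv, zs, n);

        for (i = 0; i < n; i++) x[i] += (double)zs[i];  /* update in double */
    }
done:
    free(As); free(zs); free(r); free(ipiv);
    return info;
}
```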
Major Changes to Algorithms/Software
• Must rethink the design of our algorithms and software
• Manycore and hybrid architectures are disruptive technology
  Similar to what happened with cluster computing and message passing
• Rethink and rewrite the applications, algorithms, and software
• Data movement is expensive
• Flops are cheap
Dense Linear Algebra Software Evolution
LINPACK (70's)
• vector operations
• Level 1 BLAS
LAPACK (80's)
• block operations
• Level 3 BLAS
ScaLAPACK (90's)
• block cyclic data distribution
• PBLAS
• BLACS (message passing)
PLASMA (00's)
• tile operations
• tile layout
• dataflow scheduling
PLASMA
Principles
• Tile Algorithms
  minimize capacity misses
• Tile Matrix Layout
  minimize conflict misses (see the tile-layout sketch below)
• Dynamic DAG Scheduling
  minimizes idle time
  more overlap
  asynchronous ops
[Diagram: LAPACK - one CPU with one cache over memory; PLASMA - four CPUs, each with its own cache, sharing memory.]
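A minimal sketch, under assumed conventions, of what "tile layout" means in practice: the column-major matrix is repacked so that each nb x nb tile is contiguous in memory, which is what lets a tile kernel work out of cache without conflict misses. The function name and tile-major ordering are illustrative, not PLASMA's actual data structures.

```c
/* Repack a column-major n x n matrix into nb x nb contiguous tiles,
 * stored tile-by-tile (tile-major order). */
#include <stdlib.h>

double *to_tile_layout(const double *A, int n, int nb)
{
    int nt = (n + nb - 1) / nb;                 /* tiles per dimension */
    double *T = calloc((size_t)nt * nt * nb * nb, sizeof(double));

    for (int tj = 0; tj < nt; tj++)
        for (int ti = 0; ti < nt; ti++) {
            /* each nb x nb tile is a contiguous block of memory */
            double *tile = T + ((size_t)tj * nt + ti) * nb * nb;
            for (int j = 0; j < nb && tj * nb + j < n; j++)
                for (int i = 0; i < nb && ti * nb + i < n; i++)
                    tile[j * nb + i] =
                        A[(size_t)(tj * nb + j) * n + (ti * nb + i)];
        }
    return T;   /* caller frees; kernels now touch one dense block per tile */
}
```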
Fork-Join Parallelization of LU and QR
Parallelize the update (dgemm) across the cores:
• Easy, and done in any reasonable software.
• This is the 2/3 n^3 term in the FLOP count (see the dgemm sketch below).
• Can be done efficiently with LAPACK + multithreaded BLAS.
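A sketch, under assumed storage conventions, of the update being parallelized: once a panel of width nb has been factored and the corresponding block row updated, the trailing submatrix is updated by one large dgemm, which a multithreaded BLAS forks across the cores. The routine name and blocking are illustrative, not the LAPACK source.

```c
/* Trailing-matrix ("Schur complement") update of a blocked right-looking
 * LU step.  A is column-major with leading dimension lda; the current
 * panel of width nb starts at row/column k. */
#include <cblas.h>

void trailing_update(double *A, int n, int lda, int k, int nb)
{
    int m = n - k - nb;                  /* size of the trailing block */
    if (m <= 0) return;
    double *A21 = &A[(size_t)k        * lda + (k + nb)];  /* below the panel */
    double *A12 = &A[(size_t)(k + nb) * lda +  k      ];  /* right of it     */
    double *A22 = &A[(size_t)(k + nb) * lda + (k + nb)];  /* trailing matrix */

    /* A22 <- A22 - A21 * A12 : one big dgemm, the 2/3 n^3 flops term */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, m, nb, -1.0, A21, lda, A12, lda, 1.0, A22, lda);
}
```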
PLASMA/MAGMA: Parallel Linear Algebra s/w for Multicore/Hybrid Architectures
Objectives
• High utilization of each core
• Scaling to large numbers of cores
• Synchronization-reducing algorithms
Methodology
• Dynamic DAG scheduling (QUARK)
• Explicit parallelism
• Implicit communication
• Fine granularity / block data layout
• Arbitrary DAG with dynamic scheduling
[Diagram: fork-join parallelism vs. DAG-scheduled parallelism; see the task-based sketch below.]
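To illustrate DAG-scheduled parallelism without reproducing the QUARK API, here is a sketch of a tile Cholesky factorization expressed with OpenMP task dependences as a stand-in runtime (Cholesky is used because its tile DAG is compact; the slides discuss LU and QR). The tile container, names, and use of OpenMP rather than QUARK are all assumptions of this sketch.

```c
/* DAG-scheduled tile Cholesky (lower triangular) using OpenMP tasks.
 * T is an nt x nt array of pointers to nb x nb column-major tiles,
 * with tile (i,j) at T[j*nt + i]; only the lower part is used. */
#include <cblas.h>
#include <lapacke.h>

void tile_cholesky(double **T, int nt, int nb)
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nt; k++) {
        double *Tkk = T[k * nt + k];
        #pragma omp task depend(inout: Tkk[0:nb*nb])
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, Tkk, nb);       /* diag tile */

        for (int i = k + 1; i < nt; i++) {
            double *Tik = T[k * nt + i];
            #pragma omp task depend(in: Tkk[0:nb*nb]) depend(inout: Tik[0:nb*nb])
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, nb, nb, 1.0, Tkk, nb, Tik, nb);
        }
        for (int i = k + 1; i < nt; i++) {
            double *Tik = T[k * nt + i], *Tii = T[i * nt + i];
            #pragma omp task depend(in: Tik[0:nb*nb]) depend(inout: Tii[0:nb*nb])
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        nb, nb, -1.0, Tik, nb, 1.0, Tii, nb);
            for (int j = k + 1; j < i; j++) {
                double *Tjk = T[k * nt + j], *Tij = T[j * nt + i];
                #pragma omp task depend(in: Tik[0:nb*nb], Tjk[0:nb*nb]) depend(inout: Tij[0:nb*nb])
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, nb, nb, nb,
                            -1.0, Tik, nb, Tjk, nb, 1.0, Tij, nb);
            }
        }
    }
}
```

Each BLAS/LAPACK kernel becomes one task, and the depend clauses let independent tiles proceed as soon as their inputs are ready instead of waiting at a global join, which is exactly the contrast the diagram draws with fork-join execution.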
Communication Avoiding QR
Example
[Diagram: the tall matrix is split row-wise into domains D0-D3; each domain performs its own Domain_Tile_QR, and the resulting R0-R3 factors are merged pairwise in a reduction tree to form the final R. A code sketch of the same idea follows.]
A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610-1620, Pasadena, CA, Jan. 1988. ACM. Penn. State.
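A sketch of the TSQR / communication-avoiding idea behind the picture: each domain factors its own block of a tall, skinny matrix independently, and only the small n x n R factors are combined. The tree reduction is collapsed here into a single flat combine for brevity; the function name is illustrative, LAPACKE is assumed available, and each domain is assumed to have at least n rows.

```c
/* Compute R of a tall skinny m x n matrix (column-major, m >> n) split
 * row-wise into p domains: independent local QRs, then QR of the stack
 * of local R factors.  Sketch only; CAQR uses a binary reduction tree. */
#include <stdlib.h>
#include <string.h>
#include <lapacke.h>

void caqr_R(const double *A, int m, int n, int p, double *R /* n x n */)
{
    int rows = m / p;                    /* rows per domain (m divisible) */
    double *Rstack = calloc((size_t)p * n * n, sizeof(double));
    double *work   = malloc(sizeof(double) * (size_t)rows * n);
    double *tau    = malloc(sizeof(double) * n);

    for (int d = 0; d < p; d++) {        /* independent local QRs */
        for (int j = 0; j < n; j++)      /* copy domain d, column j */
            memcpy(work + (size_t)j * rows,
                   A + (size_t)j * m + (size_t)d * rows,
                   sizeof(double) * rows);
        LAPACKE_dgeqrf(LAPACK_COL_MAJOR, rows, n, work, rows, tau);
        for (int j = 0; j < n; j++)      /* keep only the n x n upper R */
            for (int i = 0; i <= j; i++)
                Rstack[(size_t)j * (p * n) + d * n + i] =
                    work[(size_t)j * rows + i];
    }
    /* combine: QR of the stacked R factors gives R of the whole matrix */
    LAPACKE_dgeqrf(LAPACK_COL_MAJOR, p * n, n, Rstack, p * n, tau);
    for (int j = 0; j < n; j++)
        for (int i = 0; i <= j; i++)
            R[(size_t)j * n + i] = Rstack[(size_t)j * (p * n) + i];

    free(Rstack); free(work); free(tau);
}
```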
PowerPack 2.0
The PowerPack platform consists of software and hardware instrumentation.
Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/
Power for QR Factorization
[Power traces for four codes:]
• LAPACK's QR factorization (fork-join based)
• MKL's QR factorization (fork-join based)
• PLASMA's conventional QR factorization (DAG based)
• PLASMA's communication-reducing QR factorization (DAG based)
Dual-socket quad-core Intel Xeon E5462 (Harpertown) @ 2.80 GHz (8 cores total) w/ MKL BLAS.
The matrix is very tall and skinny (m x n = 1,152,000 by 288).
The standard tridiagonal reduction xSYTRD
LAPACK xSYTRD:
1. Apply left-right transformations Q A Q* to the panel [A22; A32]
2. Update the remaining submatrix A33
Step k: apply Q A Q*, then update; proceed to step k+1.
For the symmetric eigenvalue problem, this first stage takes:
• 90% of the time if only eigenvalues are needed
• 50% of the time if eigenvalues and eigenvectors are needed
The standard tridiagonal reduction xSYTRD
Characteristics
1. Phase 1 requires:
   o 4 panel-vector multiplications,
   o 1 symmetric matrix-vector multiplication with A33,
   o Cost: 2(n-k)^2 b flops.
2. Phase 2 requires:
   o symmetric update of A33 using SYRK,
   o Cost: 2(n-k)^2 b flops.
Observations
• Too many Level 2 BLAS ops,
• Relies on panel factorization,
• Total cost 4n^3/3 (see the rate sketch after this list),
• Bulk sync phases,
• Memory bound algorithm.
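As a small illustration of how execution rates for this kernel are reported, the sketch below times LAPACK's one-stage reduction to tridiagonal form (dsytrd) and charges it the conventional 4/3 n^3 flops, the same accounting used for the plots that follow. The matrix contents and size are arbitrary, and LAPACKE plus a POSIX clock are assumed.

```c
/* Time LAPACKE_dsytrd and report a rate based on the 4/3 n^3 count. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lapacke.h>

int main(void)
{
    int n = 4000;
    double *A   = malloc(sizeof(double) * (size_t)n * n);
    double *d   = malloc(sizeof(double) * n);
    double *e   = malloc(sizeof(double) * (n - 1));
    double *tau = malloc(sizeof(double) * (n - 1));

    for (long i = 0; i < (long)n * n; i++)   /* arbitrary data; dsytrd only */
        A[i] = 1.0 / (1.0 + i % 1000);       /* reads one triangle of A     */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    LAPACKE_dsytrd(LAPACK_COL_MAJOR, 'L', n, A, n, d, e, tau);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    double ops = 4.0 / 3.0 * (double)n * n * n;
    printf("dsytrd: n=%d  %.2f s  %.1f Gflop/s (4/3 n^3 count)\n",
           n, sec, ops / sec / 1e9);

    free(A); free(d); free(e); free(tau);
    return 0;
}
```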
Symmetric Eigenvalue Problem
• Standard reduction algorithms are very slow on multicore.
• Step 1: Reduce the dense matrix to band form.
  • Matrix-matrix operations, high degree of parallelism
• Step 2: Bulge chasing on the band matrix
  • done by groups, cache-aware
[Plots: symmetric eigenvalues (eigenvalues only) and singular values (singular values only).]
Experiments on eight-socket six-core AMD Opteron 2.4 GHz processors with MKL V10.3.
Block DAG-based reduction to banded form, then pipelined group chasing to tridiagonal form.
The reduction to condensed form accounts for the factor of 50 improvement over LAPACK.
Execution rates based on 4/3 n^3 ops.
Summary
These are old ideas (today: SMPss, StarPU, Charm++, ParalleX, Swarm, ...)
Major challenges are ahead for extreme computing:
• Power
• Levels of parallelism
• Communication
• Hybrid
• Fault tolerance
• ... and many others not discussed here
Not just a programming assignment.
This opens up many new opportunities for applied mathematicians and computer scientists.
Collaborators / Software / Support
• PLASMA
  http://icl.cs.utk.edu/plasma/
• MAGMA
  http://icl.cs.utk.edu/magma/
• QUARK (runtime for shared memory)
  http://icl.cs.utk.edu/quark/
• PaRSEC (Parallel Runtime Scheduling and Execution Control)
  http://icl.cs.utk.edu/parsec/
• Collaborating partners
  University of Tennessee, Knoxville
  University of California, Berkeley
  University of Colorado, Denver
  INRIA, France
  KAUST, Saudi Arabia
These tools are being applied to a range of applications beyond dense LA: sparse direct and sparse iterative methods, and fast multipole methods.