# Document 192069

```How to Avoid Global Synchronization by
Domino Scheme
Lung-Sheng Chien, NVIDIA
[email protected]
Outline
What should be done in sparse direct solver
- what we have now on CPU
- what is low hanging fruit on GPU
Domino scheme
Conclusions
We have: Sparse direct solver
Work done by professor Timothy A. Davis
- UMFPACK: unsymmetric multifrontal sparse LU factorization package
- KLU: sparse LU factorization, for circuit simulation
- CHOLMOD: supernodal sparse Cholesky factorization and update/downdate.
Now it supports GPU
- SPQR: Sparse QR
Sencer Nuri Yeralan, Timothy A. Davis, Sparse QR Factorization on GPU
Architectures, SIAM CSE 2013
S4204 - Multifrontal Sparse QR Factorization on the GPU, GTC2014
Low hanging fruit
Intercept each BLAS call and re-direct it to
GPU by
- re-designing the code and porting
cuBLAS, or
- using drop-in library, nvBLAS
Pros: easy, not error-prone
Cons: PCI bandwidth limits the
performance
12GB/s of PCI gen 3 versus 280GB/s of GTX TITAN
Sencer Nuri Yeralan, Timothy A. Davis, Sparse QR
Factorization on GPU Architectures, SIAM CSE 2013
Step back and rethink Cholesky
Cholesky factorization ( = ) does not need pivoting
Symbolic analysis can easily find out zero fill-in and figure out sparsity pattern of
Cholesky factor Reform original matrix with sparsity pattern of Cholesky factor Perform Cholesky factorization (the same as incomplete Cholesky factorization)
=
0
0
= + ()
B = 0()
ic0 = incomplete Cholesky factorization without zero fill-in
Benchmark: spd matrices
From Florida Matrix Collection
except Laplacian operator
Laplacian operator is standard
Finite Difference 3-pt, 5-pt and 7pt with Dirichlet boundary
condition
nnzA is # of nonzeros of A
nnzL is # of nonzero of L after
symamd
levels is # of levels in incomplete
Cholesky factorization
matrix name
m
nnzA
nnzL
nnzA/m nnzL/m levels
1138_bus.mtx
1138
2596
3264
2.3
2.9
40
aft01.mtx
8205
66886
304194
8.2
37.1
699
bcsstk13.mtx
2003
42943
289870
21.4
144.7
673
bcsstk34.mtx
588
11003
55120
18.7
93.7
364
bodyy4.mtx
17546
69742
584769
4.0
33.3
708
ex9.mtx
3363
51417
116942
15.3
34.8
662
LF10000.mtx
19998
59990
69988
3.0
3.5
19997
minsurfo.mtx
40806
122214 1038156
3.0
25.4
1177
nos6.mtx
675
1965
6200
2.9
9.2
101
s1rmt3m1.mtx
5489
112505
543103
20.5
98.9
1133
mhd4800b.mtx
4800
16160
16160
3.4
3.4
1198
Chem97ZtZ.mtx
2541
4951
4951
1.9
1.9
3
bcsstk26.mtx
1922
16129
42243
8.4
22.0
309
plat1919.mtx
1919
17159
66803
8.9
34.8
459
nasa1824.mtx
1824
20516
69926
11.2
38.3
299
ex33.mtx
1733
11961
33167
6.9
19.1
200
Muu.mtx
7102
88618
198606
12.5
28.0
323
sts4098.mtx
4098
38227
144225
9.3
35.2
371
bodyy5.mtx
18589
73935
627327
4.0
33.7
766
gyro_m.mtx
17361
178896
413709
10.3
23.8
489
Pres_Poisson.mtx
14822
365313 2447857
24.6
165.2
2758
Kuu.mtx
7102
173651
385828
24.5
54.3
612
shallow_water1.mtx
81920
204800 2448970
2.5
29.9
1428
finan512.mtx
74752
335872 2028289
4.5
27.1
1681
cant.mtx
62451 2034917 27456636
32.6
439.7
18072
lap1D_3pt_n1000000.mtx
1000000 1999999 1999999
2.0
2.0 1000000
lap2D_5pt_n100.mtx
10000
29800
25116
3.0
2.5
36
lap3D_7pt_n20.mtx
8000
30800
63281
3.9
7.9
141
http://www.cise.ufl.edu/research/sparse/matrices/
lap3D_7pt_n20
Original A
symamd(A)
L=chol(symamd(A))
domino-Cholesky versus CHOLMOD simplicial
Domino-Cholesky is 1.33x faster than CHOLMOD simplicial in average
Domino-Cholesky loses when
1) zero fill-in unevenly (load imbalance),
2) too many levels (more like sequential),
3) whole matrix can fit into CPU cache
GPU: K40
CPU: i7-3930K CPU @
3.20GHz
1.33x
SuiteSparse-3.6.0
/CHOLMOD/Demo
/cholmod_l_demo.c
domino-Cholesky versus CHOLMOD Supernodal
Domino-Cholesky is 4.7x faster than CHOLMOD supernodal in average
Domino-Cholesky wins if there are ONLY few big supernodes or it is NOT sequential
GPU: K40
CPU: i7-3930K CPU @
3.20GHz
SuiteSparse-3.6.0
/CHOLMOD/Demo
/cholmod_l_demo.c
Outline
What should be done in sparse direct solver
Domino scheme
- review scheduler of triangular solve
- how to trace DAG by a single kernel
Conclusions
Triangular solve: global sync
CUSPARSE library performs an
analysis phase, then a solve
phase
Maxim Naumov shows how to
extract parallelism and how to
keep utilization in GTC2012,
“On the Parallel Solution of
Sparse Triangular Linear System”
Global barrier between
consecutive two levels
Each level running in parallel by
atomic operations
Maxim Naumov, On the Parallel Solution of Sparse Triangular Linear System, GTC2012
Drawback of global synchronization
Critical path = max{level 1} + max{level 2} + max{level 3} = row 2 + row 4 + row 9
Avoid global synchronization
Critical path = row 1 + row 4 + row 9
Domino scheme
Motivation
Jonathan Hogg proposed a new trsv (dense triangular solve) in 2012
Applications
- csrsv (sparse triangular solve)
- csric0 (sparse incomplete Cholesky factorization)
- csrilu0 (sparse incomplete LU)
- sparse Cholesky
- sparse QR by householder reflection or Givens rotation
Idea
Each row is a domino. A domino waits until parents are done, and triggers next
dominos when it is done
Requirements
- to avoid deadlock, parents of a domino must either be done or be running
(a logical CTA id is required)
- whole matrix must fit into GPU
Jonathan’s trsv
DAG (implicit) is described by two constraints
- The work associated with a given off-diagonal block cannot begin until the solve for
the diagonal block in the column has completed
- The diagonal solve in a given row cannot commence until all matrix-vector multiplies
for blocks in that row have completed
One semaphore (lock) is used to schedule the DAG
- the semaphore is an integer indicating which row is done
One semaphore is used to query logical CTA id to avoid deadlock
CTA 0
CTA 0
time
S
CTA 1
CTA 1
mv
CTA 2
CTA 2
mv
mv
CTA 3
mv
mv
CTA 3
S
S
mv
S
Extend to sparse linear algebra
D1, D2 and D3 run in parallel and other
rows are waiting
Once D1 is done, it will trigger its
children, D4 and D5. Other rows are
still waiting
Each domino needs a semaphore (lock)
Pros and cons of Domino scheme
Pros
- hide global barrier inside the kernel
- overlap write phase and read phase of different dominos
- can run without levels
- reproducibility (determinism)
- quick analysis phase for csrsv/csric0/csrilu0
Cons
- long latency of L2 cache
- cannot adjust computational resources during the execution
- low utilization due to insufficient parallelism
Conclusions
Domino scheme can trace dependence graph, it works for dense/sparse linear algebra
Global synchronization is painful, there are several workarounds
- system level (across multiple nodes)
overlap MPI transfer and computation
- system level (single node)
overlap PCI transfer and computation
S4201 - GPU Acceleration of Sparse Matrix Factorization in CHOLMOD
- GPU level
domino-scheme
Domino scheme still suffers
- long latency of L2 cache (hardware issue)
- not enough parallelism (characteristic of a matrix)
Penalty of domino is significant on csrsv
Batched domino is much faster (9x) because
- more work to do for each level
- penalty of domino becomes small
New domino-based functions in r6.0
- csrsv2
- csric02
- csrilu02
- bsrsv2
- bsric02
- bsrilu02
domino
Non-domino
Reproducibility
v
x
Quick analysis phase
v
x
Memory efficient
v
x
Report structural zero or
numerical zero
v
x
Buffer provided by users
v
x
Disable level scheduling
v
x
Sparse direct solver: QR
Symbolic analysis to find zero fill-in and dependence graph, then do left-looking
householder QR by domino scheme
QR has 9x more zero fill-in than Cholesky and is 14x slower than Cholesky
161 37
GPU: K40
Future work for sparse direct solver
Sparse QR, including householder and Givens rotation
- symbolic analysis on GPU (now on CPU)
- improve numerical factorization
Sparse LU with partial pivoting
Thank you !
[1] umfpack, cholmod, KLU, http://www.cise.ufl.edu/research/sparse/
[2] Maxim Naumov, On the Parallel Solution of Sparse Triangular Linear System, GTC2012
[3] J. D. Hogg, A Fast Dense Triangular Solve in CUDA,
SIAM Journal on Scientific Computing 2013, Vol. 35, No. 3, pp. C303-C322
[4] Maxim Naumov, What Does It Take to Accelerate SPICE on the GPU?, GTC2013
Related talks GTC2014
S4201 - GPU Acceleration of Sparse Matrix Factorization in CHOLMOD
Wednesday, 03/26, 14:00 - 14:50, Room LL20D
S4243 - A GPU Sparse Direct Solver for AX=B
Wednesday, 03/26, 09:00 - 09:25, Room LL20D
S4524 - Sparse LU Factorization on GPUs for Accelerating SPICE Simulation
Wednesday, 03/26, 09:30 - 09:55, Room LL20D
Domino-csrsv versus MKL
Domino-csrsv is 4.2x faster than MKL in average
4.2x
GPU: K40
CPU: i73930K CPU
@ 3.20GHz
Benchmark: unsymmetric matrices
From Florida Matrix Collection
except dense2.mtx
dense2.mtx is 2000x2000 dense
matrix
nnz is # of nonzeros of A
levels is # of levels in incomplete
LU factorization
matrix name
cage14.mtx
shermanACb.mtx
Chebyshev4.mtx
dense2.mtx
webbase-1M.mtx
cage12.mtx
lung2.mtx
cryg10000.mtx
ASIC_680ks.mtx
FEM_3D_thermal2.mtx
thermomech_dK.mtx
ASIC_320ks.mtx
cage13.mtx
atmosmodd.mtx
m
nnz
nnz/m levels
1505785 27130349
18.0
100
18510
145149
7.8
34
68121 5377761
78.9 17424
2000 4000000 2000.0
2000
1000005 3105536
3.1
512
130228 2032536
15.6
66
109460
492564
4.5
479
10000
49699
5.0
198
682712 2329176
3.4
49
147900 3489300
23.6
3757
204316 2846228
13.9
644
321671 1827807
5.7
35
445315 7479343
16.8
83
1270432 8814880
6.9
352
http://www.cise.ufl.edu/research/sparse/matrices/
No zero fill-in
domino-ilu0 versus MKL
Domino-ilu0 is 5.4x faster than MKL in average (exclude dense2.mtx)
GPU: K40
CPU: i73930K CPU
@ 3.20GHz
5.4x
Batched domino-ilu0
Same sparsity pattern but
different value
Batch size is 32
Speedup =
(32 * single ilu0)/(batched ilu0)
Batched domino-ilu0 is 9x faster
than 32 domino-ilu0 in average
GPU: K40
```