MVPX: A Media-oriented Vector Processing Mechanism

Cyberscience Center, Tohoku University
Background and Purpose
MultiMedia Applications (MMAs)
    Music Players
    Games
    Animations
    Recognition
Features of Next Generation MMAs
    Next generation MMAs are required to have:
    • Higher quality of media processing
        Use more computation-intensive algorithms
        Process larger data sets
    • More varieties of MMAs
        Execute various MMAs on the same platform
    • Plenty of data level parallelism (DLP)
    • Various vector lengths
    • Large amounts of data transmission
Issues of Conventional Approaches
    • Difficult to efficiently execute MMAs of various vector lengths
        Inefficient for MMAs with short vectors
    • High power is consumed to achieve high data transmission ability
Proposal of this Research
    Research Targets
    • To achieve high computing power by using DLP
        Focus on Vector Architectures
        Improve the performance of vector architectures on short vector processing
        Out-of-order vector processing mechanism
    • To improve the data transmission ability
        Focus on Memory Sub-system
        Obtain a high capability of data transmission with lower power consumption
        Multi-banked cache memory
OVPM: an Out-of-Order (OoO) Vector Processing Mechanism
Behavior of In-order Issue and OVPM
[Chart legend: MVL128, MVL256, MVL512 (MVL = Maximum Vector Length = Vector Register Length)]
A Simple Example
for (i = 0; i < N; i++)
{
    vload  va0, addr1
    vload  va1, addr2
    vadd   va2, va0, va1
    vstore va2, addr3
}
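The simple example can be read as an element-wise addition carried out one vector instruction at a time. The Python sketch below mirrors it under the assumption that each loop iteration handles one strip of at most MVL elements of a longer array. The function name `vector_add_stripmined` and the list-based "registers" are illustrative only and do not come from the poster.

```python
def vector_add_stripmined(a, b, mvl=128):
    """Element-wise a + b, processed in strips of at most `mvl` elements.

    Returns the result list and the number of strips (loop iterations) used.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    c = [0] * len(a)
    strips = 0
    for start in range(0, len(a), mvl):
        vl = min(mvl, len(a) - start)            # Min(MVL, VL) elements per instruction
        va0 = a[start:start + vl]                # vload  va0, addr1
        va1 = b[start:start + vl]                # vload  va1, addr2
        va2 = [x + y for x, y in zip(va0, va1)]  # vadd   va2, va0, va1
        c[start:start + vl] = va2                # vstore va2, addr3
        strips += 1
    return c, strips
```

For example, a 300-element input with MVL = 128 is processed in three strips (128, 128, and 44 elements), while a 79-element input fits in a single short strip.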
[Chart axis: percentage scale; benchmarks: sphinx3, faceRec, raytrace, vips, MxM, VxM, avg.]
Benchmarks:     sphinx3  faceRec  raytrace  vips  MxM   VxM
Vector Length:  4096     173      1080      79    1000  1000
• Most modern vector architectures follow an in-order instruction issue policy
• In the case of executing MMAs with long vectors
    Stalls caused by the in-order issue policy are hidden by using large vector registers
    Long memory latencies are hidden by using large vector registers
    Pipeline latencies of functional units are hidden by using large vector registers
• In the case of executing MMAs with short vectors
[Figure: (a) Time-Space Diagram of vector extension of IVPM when Executing the Program with Long Vectors. Rows: VLSU Pipeline, VFUs Pipeline; axis: cycles; instructions: vload, vadd, vstore]
[Figure: (b) Time-Space Diagram of vector extension of IVPM when Executing the Program with Short Vectors. Rows: VLSU Pipeline, VFUs Pipeline; axis: cycles; instructions: vload, vadd, vstore]
[Figure labels: Vector Memory Instruction Buffer, Vector Ready Instruction Buffer, Vector Load & Store Queue]
Experimental Setup
• Simulator development
    Simplescalar Toolset
    Enhanced with vector extension
• Benchmarks
    Parsec Benchmark Suite
    ALPbench Benchmark Suite
Microarchitecture of OVPM
• Add two new instruction buffers to realize OVPM
    Vector Memory Instruction Buffer (VMIB): OoO processing for memory instructions
    Vector Arithmetic Instruction Buffer (VAIB): OoO processing for arithmetic instructions
[Figure: Microarchitecture of OVPM. Block labels: General Purpose Processor, Fetcher, Instruction Fetch Queue, Decoder, General Purpose Registers, FUs, LSU, Vector Arithmetic Instruction Buffer, Vector Registers, Vector Function Units, Vector Load Store Unit, D Cache, MVP-cache, Main Memory]
The Time-Space Diagram of the Simple Example
[Figure: simplified time-space diagram. Axes: Pipeline stage, cycles. Labels: Pipeline Stages, Pipeline Latency, Min(MVL, VL) / Number of parallel pipelines, Simplify]
[Figure: (b) Time-Space Diagram of vector extension of OVPM when Executing the Program with Short Vectors. Rows: VLSU Pipeline, VFUs Pipeline]
[Figure: (c) Time-Space Diagram of vector extension of OVPM when Executing the Program with Long Vectors. Rows: VLSU Pipeline, VFUs Pipeline]
Diagram annotations:
    A stall is caused due to the in-order issue policy
    These vloads overtake the vadd and vstore in the previous iteration, respectively, due to OoO processing
Impacts of OVPM
[Chart: Exe. Cycles Reduced (%) per benchmark, IVPM versus OVPM]
An OoO issue policy is required for vector architectures,
in order to execute MMAs with short vectors efficiently
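To illustrate why the issue policy matters for short vectors, here is a toy Python model of the simple example loop running on two pipelines (VLSU and VFU). It is only a sketch under stated assumptions, not the simulator used in this work: every iteration is assumed to use fresh vector registers (so only true data dependences exist), there is no chaining, the instruction buffers are unbounded, and an instruction occupies its pipeline for `occupancy` cycles and delivers its result `latency` cycles later. The names `build_loop` and `simulate` are hypothetical.

```python
VLSU, VFU = 0, 1  # load/store pipeline and arithmetic pipeline


def build_loop(iterations, mem_latency=100, alu_latency=10):
    """Return (unit, latency, dependencies) tuples for the simple example loop."""
    program = []
    for _ in range(iterations):
        base = len(program)
        program.append((VLSU, mem_latency, ()))               # vload  va0, addr1
        program.append((VLSU, mem_latency, ()))               # vload  va1, addr2
        program.append((VFU, alu_latency, (base, base + 1)))  # vadd   va2, va0, va1
        program.append((VLSU, mem_latency, (base + 2,)))      # vstore va2, addr3
    return program


def simulate(program, occupancy, out_of_order):
    """Return the cycle at which the last instruction completes."""
    done = [None] * len(program)  # completion cycle of each issued instruction
    unit_free = [0, 0]            # first cycle at which each pipeline is free
    issued = 0
    t = 0
    while issued < len(program):
        for k, (unit, latency, deps) in enumerate(program):
            if done[k] is not None:
                continue  # already issued
            ready = unit_free[unit] <= t and all(
                done[d] is not None and done[d] <= t for d in deps)
            if ready:
                unit_free[unit] = t + occupancy
                done[k] = t + occupancy + latency
                issued += 1
            elif not out_of_order:
                break  # in-order: a blocked instruction blocks all younger ones
        t += 1
    return max(done, default=0)
```

With four iterations and an occupancy of 10 cycles (a short vector on 8 parallel pipelines), this toy model finishes in 700 cycles with in-order issue and 310 cycles with out-of-order issue, because later vloads no longer wait behind the stalled vadd and vstore. These numbers belong to the sketch only and are not measurements from the evaluation.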
With short vectors, stalls caused by the in-order issue policy are exposed
With short vectors, memory latencies are exposed
[Figure labels: I Cache, Pipeline Stages, Data Parallel Pipelines, Pipeline stage, Cycles]
[Table header: Benchmarks: Sphinx3, faceRec, raytrace]
OVPM: OoO Vector Processing Mechanism
IVPM: In-order Vector Processing Mechanism
[Chart legend: MVL32, MVL64]
Vector ALU latency                  10 cycles
Vector Multiplier latency           15 cycles
Vector Division latency             50 cycles
Number of Parallel Pipelined VFUs   8
Main memory latency                 100 cycles
Entries per Vector Register         128 entries
Frequency                           3 GHz
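The parameters above can be combined into a rough per-instruction timing estimate. The Python sketch below assumes, based on the time-space diagram labels, that one vector instruction takes its pipeline latency plus ceil(Min(MVL, VL) / number of parallel pipelines) cycles. That formula, and the helper names `instruction_cycles` and `latency_share`, are assumptions made for illustration rather than definitions taken from the poster.

```python
import math


def instruction_cycles(vl, latency, mvl=128, pipelines=8):
    """Assumed duration of one vector instruction, in cycles.

    latency   : pipeline (or memory) latency in cycles
    vl        : requested vector length
    mvl       : entries per vector register (128 in the setup above)
    pipelines : number of parallel pipelined VFUs (8 in the setup above)
    """
    elements = min(mvl, vl)  # Min(MVL, VL)
    return latency + math.ceil(elements / pipelines)


def latency_share(vl, latency, mvl=128, pipelines=8):
    """Fraction of the instruction's duration that is pure latency."""
    return latency / instruction_cycles(vl, latency, mvl, pipelines)
```

Under these assumptions, a vector add on 79 elements takes 10 + 10 = 20 cycles, and a load of the same length at 100 cycles of memory latency takes 110 cycles, of which about 91% is latency. This is the kind of exposed latency that a short vector cannot hide by itself.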
[Panel title: Issues on Conventional Vector Processors]
Computational Efficiency
[Chart: Computational Efficiency (%) per benchmark: sphinx, face, ray, vips, MxM, VxM]
The computational efficiency of IVPM is 17%, while that of OVPM reaches 55.2%
Computational efficiency improves, especially for MMAs with short vectors
MMAs with both short and long vectors achieve high hardware utilization
MVP-Cache: A High Bandwidth Cache System
[Figure label: Interconnection]
Increasing the memory bandwidth incurs large overheads
A high-bandwidth cache system is needed to increase the effective memory bandwidth
[Chart axis: Number of Memory Ports]
[Figure: MVP-cache organization. Labels: bank 0 … bank m-1, Bus, Memory Channel 0; … ; bank (n-1)•m … bank n•m-1, Bus, Memory Channel n-1]
Achieve high bandwidth by accessing multiple independent banks concurrently
Hide the access latencies by using the interleaved memory access method
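The banked, interleaved organization can be sketched as an address-to-bank mapping. The Python example below assumes low-order interleaving at an 8-byte granularity (chosen to match the 8 byte/cycle per bank quoted elsewhere on the poster) and assumes that banks k•m to (k+1)•m - 1 sit behind memory channel k, as the figure labels suggest. The function names and the exact mapping are hypothetical; the poster does not specify them.

```python
def locate_block(addr, n_channels, banks_per_channel, block_bytes=8):
    """Map a byte address to (bank index, memory channel index)."""
    total_banks = n_channels * banks_per_channel
    bank = (addr // block_bytes) % total_banks  # consecutive blocks rotate over banks
    channel = bank // banks_per_channel         # m consecutive banks share one channel
    return bank, channel


def banks_touched(base, count, stride_bytes, n_channels, banks_per_channel,
                  block_bytes=8):
    """Set of distinct banks hit by `count` accesses starting at `base`."""
    return {
        locate_block(base + i * stride_bytes, n_channels, banks_per_channel,
                     block_bytes)[0]
        for i in range(count)
    }
```

With n = 2 channels and m = 4 banks per channel, eight unit-stride 8-byte accesses land on eight different banks and can proceed concurrently, whereas a 64-byte stride sends every access to the same bank and serializes them.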
SC13 Denver, Colorado
Impacts of Cache Bandwidth
[Chart: Speedup per benchmark (sphinx, face, ray, vips, MxM, VxM, avg.); data labels: 1.76, 1.64, 1.00, 0.99, 1.26, 1.33, 1.33]
[Legend entry: 1 bank (8 byte/cycle)]
Speedup = (Exe. Time of OVPM w/o MVP-cache) / (Exe. Time of OVPM w/ MVP-cache)
Cache bandwidth is twice as high as memory bandwidth
1.33x performance improvement ( 2x improvement in theory )
MVP-cache bridges the gap between main memory and OVPM
Costs of Increasing Memory Bandwidth
[Chart: Area of Memory Ports (mm^2) and Peak Dynamic Power of Memory Ports (W)]
Performance Evaluation of MVP-cache
[Chart: Relative Performance per benchmark (sphinx, face, ray, vips, MxM, VxM, avg.); legend: 2 banks (16 byte/cycle), 4 banks (32 byte/cycle), 8 banks (64 byte/cycle), 16 banks (128 byte/cycle), 32 banks (256 byte/cycle)]
Relative performance is normalized to the 1-bank cache
Most MMAs are sensitive to cache bandwidth
The higher the cache hit rate, the greater the performance improvement
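The bank configurations quoted in the evaluation follow a simple proportional rule (8 byte/cycle per bank), and both reported metrics are ratios of execution times. The Python sketch below restates that arithmetic. The helper names are illustrative, and the execution times used in the usage note are hypothetical placeholders, not measured results.

```python
BYTES_PER_BANK_PER_CYCLE = 8  # from the legend: 1 bank delivers 8 byte/cycle


def peak_cache_bandwidth(banks):
    """Peak cache bandwidth in byte/cycle for a given number of banks."""
    return banks * BYTES_PER_BANK_PER_CYCLE


def speedup(time_without_cache, time_with_cache):
    """Speedup = execution time without MVP-cache / execution time with it."""
    return time_without_cache / time_with_cache


def relative_performance(time_one_bank, time_n_banks):
    """Performance of an n-bank cache normalized to the 1-bank cache."""
    return time_one_bank / time_n_banks
```

For instance, `peak_cache_bandwidth(32)` gives 256 byte/cycle, matching the largest configuration in the legend, and a hypothetical run that drops from 4.0 to 3.0 time units corresponds to a speedup of about 1.33.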
(URL) http://www.sc.isc.tohoku.ac.jp/
(E-mail) [email protected]