Address-Value Delta (AVD) Prediction
Onur Mutlu, Hyesoon Kim, Yale N. Patt
What is AVD Prediction?
A new prediction technique used to break the data dependencies between dependent load instructions
Talk Outline
- Background on Runahead Execution
- The Problem: Dependent Cache Misses
- AVD Prediction
- Why Does It Work?
- Evaluation
- Conclusions
Background on Runahead Execution
- A technique to obtain the memory-level parallelism benefits of a large instruction window
- When the oldest instruction is an L2 miss:
  - Checkpoint architectural state and enter runahead mode
- In runahead mode:
  - Instructions are speculatively pre-executed
  - The purpose of pre-execution is to generate prefetches
  - L2-miss dependent instructions are marked INV and dropped
- Runahead mode ends when the original L2 miss returns:
  - Checkpoint is restored and normal execution resumes
Runahead Example

[Figure: timeline comparison. Small window: Compute, Load 1 Miss, stall on Miss 1, Compute, Load 2 Miss, stall on Miss 2, Compute. Runahead: Load 1 Miss triggers runahead mode, during which Load 2's miss is initiated in parallel; when normal execution resumes, both Load 1 and Load 2 hit, saving cycles. This works when Load 1 and Load 2 are independent.]
The Problem: Dependent Cache Misses

Runahead execution cannot parallelize dependent misses. This limitation results in:
- wasted opportunity to improve performance
- wasted energy (useless pre-execution)

Runahead performance would improve by 25% if this limitation were ideally overcome.

[Figure: runahead timeline in which Load 2 is dependent on Load 1. During runahead, Load 2 is marked INV because its address cannot be computed, so Miss 2 is not prefetched and only starts after Load 1's miss returns.]
The Goal
- Enable the parallelization of dependent L2 cache misses in runahead mode with a low-cost mechanism
- How: predict the values of L2-miss address (pointer) loads
- An address load loads an address into its destination register, which is later used to calculate the address of another load (as opposed to a data load)
Parallelizing Dependent Misses
[Figure: timeline comparison. Without value prediction, Load 2 is INV because its address cannot be computed, so Miss 2 only starts after Load 1's miss returns. With Load 1's value predicted, Load 2's address can be computed, Miss 2 overlaps Miss 1, and Load 2 hits during runahead, saving cycles and saving speculative instructions.]
A Question
How can we predict the values of address loads with low hardware cost and complexity?
The Solution: AVD Prediction
- The address-value delta (AVD) of a load instruction is defined as:

      AVD = Effective Address of Load - Data Value of Load

- For some address loads, the AVD is stable
- An AVD predictor keeps track of the AVDs of address loads
- When a load is an L2 miss in runahead mode, the AVD predictor is consulted
- If the predictor returns a stable (confident) AVD for that load, the value of the load is predicted:

      Predicted Value = Effective Address - Predicted AVD
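The two formulas above can be sketched directly in code. This is an illustrative sketch only; the 64-bit types and example addresses are assumptions, not from the talk:

```c
#include <assert.h>
#include <stdint.h>

/* AVD = effective address of the load - data value loaded by it */
int64_t compute_avd(uint64_t effective_addr, uint64_t data_value) {
    return (int64_t)(effective_addr - data_value);
}

/* Predicted value = effective address - predicted AVD */
uint64_t predict_value(uint64_t effective_addr, int64_t predicted_avd) {
    return effective_addr - (uint64_t)predicted_avd;
}
```

For example, a load at address A+8 that returns the pointer A has AVD = 8; if that AVD is stable, a later instance of the same static load at address C+8 is predicted to return C.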
Identifying Address Loads in Hardware
- Insight: if the AVD is too large, the value that is loaded is likely not an address
- Only keep track of loads that satisfy: -MaxAVD <= AVD <= +MaxAVD
- This identification mechanism eliminates many loads from consideration
- Enables the AVD predictor to be small
An Implementable AVD Predictor
- Set-associative prediction table
- A prediction table entry consists of:
  - Tag (program counter of the load)
  - Last AVD seen for the load
  - Confidence counter for the recorded AVD
- Updated when an address load retires in normal mode
- Accessed when a load misses in the L2 cache in runahead mode
- Recovery-free: there is no need to recover the state of the processor or the predictor on a misprediction, because runahead mode is purely speculative
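The update and prediction rules above can be sketched in software. This is a minimal direct-mapped sketch for clarity (the paper's table is set-associative), and the MaxAVD value and confidence threshold here are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE  16   /* the talk's predictor is also 16 entries */
#define MAX_AVD     64   /* illustrative MaxAVD filter bound */
#define CONF_THRESH  2   /* illustrative confidence threshold */

typedef struct {
    uint64_t tag;        /* program counter of the load */
    int64_t  last_avd;   /* last AVD seen for the load */
    int      conf;       /* confidence counter for the recorded AVD */
    int      valid;
} avd_entry_t;

static avd_entry_t table[TABLE_SIZE];

/* Called when an address load retires in normal mode. */
void avd_update(uint64_t pc, uint64_t eff_addr, uint64_t data_value) {
    int64_t avd = (int64_t)(eff_addr - data_value);
    if (avd < -MAX_AVD || avd > MAX_AVD)
        return;  /* value is likely not an address: do not track it */
    avd_entry_t *e = &table[pc % TABLE_SIZE];
    if (e->valid && e->tag == pc && e->last_avd == avd) {
        e->conf++;                        /* same AVD seen again */
    } else {                              /* allocate / reset entry */
        e->tag = pc; e->last_avd = avd; e->conf = 0; e->valid = 1;
    }
}

/* Called when a load misses in L2 during runahead mode.
 * Returns 1 and writes the predicted value if the AVD is confident. */
int avd_predict(uint64_t pc, uint64_t eff_addr, uint64_t *pred_value) {
    avd_entry_t *e = &table[pc % TABLE_SIZE];
    if (e->valid && e->tag == pc && e->conf >= CONF_THRESH) {
        *pred_value = eff_addr - (uint64_t)e->last_avd;
        return 1;
    }
    return 0;
}
```

Note that no recovery path exists: a wrong prediction only pollutes the purely speculative runahead mode, which is discarded anyway, so mispredictions never need to roll back predictor or processor state.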
AVD Update Logic

[Figure: hardware logic that updates a prediction table entry when an address load retires in normal mode.]
AVD Prediction Logic

[Figure: hardware logic that consults the prediction table on an L2-miss load in runahead mode and computes Predicted Value = Effective Address - Predicted AVD.]
Why Do Stable AVDs Occur?
- Regularity in the way data structures are allocated in memory AND traversed
- Two types of loads can have stable AVDs:
  - Traversal address loads: produce addresses consumed by address loads
  - Leaf address loads: produce addresses consumed by data loads
Traversal Address Loads
A regularly-allocated linked list places nodes at consecutive addresses A, A+k, A+2k, A+3k, A+4k, A+5k, ...

A traversal address load loads the pointer to the next node:

    node = node->next

AVD = Effective Addr - Data Value:

    Effective Addr   Data Value   AVD
    A                A+k          -k
    A+k              A+2k         -k
    A+2k             A+3k         -k
    A+3k             A+4k         -k
    A+4k             A+5k         -k

The AVD is stable even though the data value strides.
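The table above can be reproduced in software: if the nodes are carved out of one contiguous allocation, the AVD of the `node = node->next` load is the constant -k, where k = sizeof(node_t). The node layout and helper names below are hypothetical, chosen only to illustrate the effect:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct node { struct node *next; int data; } node_t;

/* Allocate n nodes from one contiguous chunk and link them in order,
 * mimicking a regular allocator: node i lives at base + i*k. */
node_t *make_regular_list(int n) {
    node_t *pool = malloc((size_t)n * sizeof(node_t));
    for (int i = 0; i < n - 1; i++) pool[i].next = &pool[i + 1];
    pool[n - 1].next = NULL;
    return pool;
}

/* AVD of the 'node->next' load at node n: the effective address is
 * &n->next, and the data value is the loaded pointer n->next. */
intptr_t next_load_avd(node_t *n) {
    return (intptr_t)&n->next - (intptr_t)n->next;
}
```

Every traversal step yields the same AVD of -sizeof(node_t); re-linking the nodes (e.g., after sorting) breaks this regularity.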
Properties of Traversal-based AVDs

- Stable AVDs can be captured with a stride value predictor
- Stable AVDs disappear with the re-organization of the data structure (e.g., sorting)
- Stability of AVDs is dependent on the behavior of the memory allocator
  - Allocation of contiguous, fixed-size chunks is useful

[Figure: sorting re-links the list A, A+k, A+2k, A+3k into the order A+3k, A+k, A, A+2k; the distance between consecutive nodes is no longer constant.]
Leaf Address Loads

The sorted dictionary in parser: nodes point to strings (words), and each string and its node are allocated consecutively.

[Figure: binary tree of dictionary nodes; each node at address X+k holds a pointer to its string at address X (pairs A+k/A through G+k/G).]

The dictionary is looked up for an input word. A leaf address load loads the pointer to the string of each node:

    lookup(node, input) {
        // ...
        ptr_str = node->string;
        m = check_match(ptr_str, input);
        if (m >= 0) lookup(node->right, input);
        if (m < 0)  lookup(node->left, input);
    }

AVD = Effective Addr - Data Value:

    Effective Addr   Data Value   AVD
    A+k              A            k
    C+k              C            k
    F+k              F            k

The AVD is stable even though there is no stride in the data values.
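The node-points-to-adjacent-string pattern can be mimicked by allocating each string and its node in one chunk. The layout below (a fixed-size string area of k bytes directly before the node) is a hypothetical simplification of what parser's allocator does:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define STR_AREA 32  /* illustrative fixed-size string area, k bytes */

typedef struct dict_node {
    char *string;                    /* leaf pointer to the word */
    struct dict_node *left, *right;
} dict_node_t;

/* Allocate the string and the node consecutively in one chunk:
 * string at address X, node at address X + k. */
dict_node_t *make_node(const char *word) {
    char *chunk = malloc(STR_AREA + sizeof(dict_node_t));
    strcpy(chunk, word);
    dict_node_t *n = (dict_node_t *)(chunk + STR_AREA);
    n->string = chunk;
    n->left = n->right = NULL;
    return n;
}

/* AVD of the leaf load 'ptr_str = node->string': the effective
 * address is &node->string, the data value is node->string. */
intptr_t string_load_avd(dict_node_t *n) {
    return (intptr_t)&n->string - (intptr_t)n->string;
}
```

The AVD equals k for every node no matter where the allocator places each chunk, which is why re-ordering the nodes (sorting) leaves it stable, and why a stride value predictor, which needs the loaded values themselves to stride, cannot capture it.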
Properties of Leaf-based AVDs

- Stable AVDs cannot be captured with a stride value predictor
- Stable AVDs do not disappear with the re-organization of the data structure (e.g., sorting)
- Stability of AVDs is dependent on the behavior of the memory allocator

[Figure: sorting re-orders the node/string pairs (A+k, A), (B+k, B), (C+k, C) into (C+k, C), (A+k, A), (B+k, B); the distance between each node and its string is still constant.]
Baseline Processor
- Execution-driven Alpha simulator
- 8-wide superscalar processor
- 128-entry instruction window, 20-stage pipeline
- 64 KB, 4-way, 2-cycle L1 data and instruction caches
- 1 MB, 32-way, 10-cycle unified L2 cache
- 500-cycle minimum main memory latency
- 32 DRAM banks, 32-byte wide processor-memory bus (4:1 frequency ratio), 128 outstanding misses
- Detailed memory model
- Pointer-intensive benchmarks from Olden and SPEC INT 2000
Performance of AVD Prediction
[Figure: AVD prediction improves the performance of the baseline runahead processor by 12.1% on average.]
Effect on Executed Instructions
[Figure: AVD prediction reduces the number of executed instructions by 13.3% on average.]
AVD Prediction vs. Stride Value Prediction
Performance:
- Both can capture traversal address loads with stable AVDs (e.g., treeadd)
- Stride VP cannot capture leaf address loads with stable AVDs (e.g., health, mst, parser)
- The AVD predictor cannot capture data loads with striding data values
  - Predicting these can be useful for the correct resolution of mispredicted L2-miss-dependent branches (e.g., parser)

Complexity:
- The AVD predictor requires far fewer entries (it tracks only address loads)
- AVD prediction logic is simpler (no stride maintenance)
AVD vs. Stride VP Performance
[Figure: runahead performance improvement of AVD prediction vs. stride value prediction with 16-entry and 4096-entry predictors; labeled data points are 5.1%, 2.7%, 6.5%, 5.5%, 4.7%, and 8.6%.]
Conclusions
- Runahead execution is unable to parallelize dependent L2 cache misses
- A very simple, 16-entry (102-byte) AVD predictor reduces this limitation on pointer-intensive applications:
  - It increases runahead execution performance by 12.1%
  - It reduces executed instructions by 13.3%
- AVD prediction takes advantage of the regularity in the memory allocation patterns of programs
- Software (programs, compilers, memory allocators) can be written to take advantage of AVD prediction
Backup Slides
The Potential: What if it Could?
[Figure: potential runahead performance if dependent cache misses could be parallelized ideally; labeled improvements are 25% and 27%.]
Effect of Confidence Threshold

[Figure: sensitivity of AVD prediction performance to the confidence threshold.]
Effect of MaxAVD

[Figure: sensitivity of AVD prediction performance to the MaxAVD value.]
Effect of Memory Latency
[Figure: AVD prediction performance improvement as memory latency varies; labeled improvements are 8%, 12.1%, 13.5%, 9.3%, and 13%.]