TY - GEN
T1 - Tolerating load miss latency by extending effective instruction window with low complexity
AU - Li, Walter Yuan Hwa
AU - Huang, Chin Ling
AU - Chung, Chung-Ping
PY - 2011
Y1 - 2011
N2 - An execute-ahead processor pre-executes instructions when a load miss would otherwise stall the processor. The typical design has several components that grow with the execute-ahead distance and must be carefully balanced for optimal performance. This paper presents a novel approach that unifies those components, making it easy to implement and eliminating the difficulty of balancing resource investment. When executing ahead, the processor enqueues (or preserves) all instructions, along with their known execution results (both register and memory), in a preserving buffer (PB). When the leading load miss is resolved, the processor dequeues the instructions and then either restores the known execution results or dispatches the instructions not yet executed. The implementation overheads comprise the PB and a runahead cache for forwarding memory data; only the PB grows with the execute-ahead distance. This method can be applied to both in-order and out-of-order processors. Our experiments show that a four-way superscalar out-of-order processor with a 1K-entry PB achieves 15% and 120% speedups over the baseline design for the SPEC INT2000 and SPEC FP2000 benchmark suites, assuming a 128-entry instruction window and a 300-cycle memory access latency.
UR - http://www.scopus.com/inward/record.url?scp=80155183501&partnerID=8YFLogxK
U2 - 10.1109/ICPP.2011.73
DO - 10.1109/ICPP.2011.73
M3 - Conference contribution
AN - SCOPUS:80155183501
SN - 9780769545103
T3 - Proceedings of the International Conference on Parallel Processing
SP - 83
EP - 92
BT - Proceedings - 2011 International Conference on Parallel Processing, ICPP 2011
T2 - 40th International Conference on Parallel Processing, ICPP 2011
Y2 - 13 September 2011 through 16 September 2011
ER -