TY - GEN
T1 - Early load
T2 - 13th IEEE Asia-Pacific Computer Systems Architecture Conference, ACSAC 2008
AU - Chang, Shun Chieh
AU - Li, Walter Yuan Hwa
AU - Kuo, Yuan Jung
AU - Chung, Chung-Ping
PY - 2008
Y1 - 2008
N2 - Load instructions usually have long execution latency in a deep processor pipeline and have a significant impact on overall performance. Hiding load latency is therefore a serious problem in processor design. The latency of a memory load can be separated into two parts: cache-miss latency and load-to-use latency. Previous work that tried to hide load latency in a deep processor pipeline has some limitations. In this paper, we propose a hardware-based method, called early load, to hide the load-to-use latency with little hardware overhead. The early load scheme allows a load instruction to load data from the cache system before it enters the execution stage. Meanwhile, a detection method verifies the correctness of the early operation before the load instruction enters the execution stage. Our experimental results show that our approach achieves an 11.64% performance improvement on the Dhrystone benchmark and 4.97% on average for the MiBench benchmark suite.
AB - Load instructions usually have long execution latency in a deep processor pipeline and have a significant impact on overall performance. Hiding load latency is therefore a serious problem in processor design. The latency of a memory load can be separated into two parts: cache-miss latency and load-to-use latency. Previous work that tried to hide load latency in a deep processor pipeline has some limitations. In this paper, we propose a hardware-based method, called early load, to hide the load-to-use latency with little hardware overhead. The early load scheme allows a load instruction to load data from the cache system before it enters the execution stage. Meanwhile, a detection method verifies the correctness of the early operation before the load instruction enters the execution stage. Our experimental results show that our approach achieves an 11.64% performance improvement on the Dhrystone benchmark and 4.97% on average for the MiBench benchmark suite.
UR - http://www.scopus.com/inward/record.url?scp=55849146638&partnerID=8YFLogxK
U2 - 10.1109/APCSAC.2008.4625440
DO - 10.1109/APCSAC.2008.4625440
M3 - Conference contribution
AN - SCOPUS:55849146638
SN - 9781424426836
T3 - 13th IEEE Asia-Pacific Computer Systems Architecture Conference, ACSAC 2008
BT - 13th IEEE Asia-Pacific Computer Systems Architecture Conference, ACSAC 2008
Y2 - 4 August 2008 through 6 August 2008
ER -