TY - GEN
T1 - Cache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs
AU - Kuo, Hsien-Kai
AU - Yen, Ta-Kan
AU - Lai, Bo-Cheng
AU - Jou, Jing-Yang
PY - 2013
Y1 - 2013
N2 - On-chip shared caches are effective in alleviating the memory bottleneck of modern many-core systems such as GPGPUs. However, when numerous concurrent threads are scheduled on a GPGPU, a cache-capacity-agnostic scheduling scheme can cause severe cache contention among threads and thus significant performance degradation. Moreover, the diverse working sets of irregular applications make cache contention an even more serious problem. Taking cache capacity into account has therefore become a critical scheduling concern on GPGPUs. This paper formulates the Cache Capacity Aware Thread Scheduling Problem to capture the impact of cache capacity as well as other architectural considerations. After proving the problem NP-hard, the paper proposes two algorithms to perform cache capacity aware thread scheduling. Simulation results on an NVIDIA Fermi configuration show that the proposed scheduling scheme effectively avoids cache contention, achieving an average cache miss reduction of 44.7% and an average runtime improvement of 28.5%. For more complex applications, the runtime improvement reaches up to 62.5%.
UR - http://www.scopus.com/inward/record.url?scp=84877739484&partnerID=8YFLogxK
U2 - 10.1109/ASPDAC.2013.6509618
DO - 10.1109/ASPDAC.2013.6509618
M3 - Conference contribution
AN - SCOPUS:84877739484
SN - 9781467330299
T3 - Proceedings of the Asia and South Pacific Design Automation Conference, ASP-DAC
SP - 338
EP - 343
BT - 2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013
T2 - 2013 18th Asia and South Pacific Design Automation Conference, ASP-DAC 2013
Y2 - 22 January 2013 through 25 January 2013
ER -