TY - GEN
T1 - Pagoda
T2 - 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2017
AU - Yeh, Tsung Tai
AU - Sabne, Amit
AU - Sakdhnagool, Putt
AU - Eigenmann, Rudolf
AU - Rogers, Timothy G.
PY - 2017/1/26
Y1 - 2017/1/26
N2 - Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU. GPUs face severe underutilization and their performance benefits vanish if the tasks are narrow, i.e., they contain < 500 threads. Latency-sensitive applications in network, signal, and image processing that generate a large number of tasks with relatively small inputs are examples of such limited parallelism. This paper presents Pagoda, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available, and are scheduled by the MasterKernel at the warp granularity. Experimental results demonstrate that Pagoda achieves a geometric mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of- the-art runtime GPU task scheduling system.
AB - Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU. GPUs face severe underutilization and their performance benefits vanish if the tasks are narrow, i.e., they contain < 500 threads. Latency-sensitive applications in network, signal, and image processing that generate a large number of tasks with relatively small inputs are examples of such limited parallelism. This paper presents Pagoda, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available, and are scheduled by the MasterKernel at the warp granularity. Experimental results demonstrate that Pagoda achieves a geometric mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of- the-art runtime GPU task scheduling system.
KW - GPU runtime system
KW - Task parallelism
KW - Utilization
UR - http://www.scopus.com/inward/record.url?scp=85014435478&partnerID=8YFLogxK
U2 - 10.1145/3018743.3018754
DO - 10.1145/3018743.3018754
M3 - Conference contribution
AN - SCOPUS:85014435478
T3 - Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP
SP - 221
EP - 234
BT - PPoPP 2017 - Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
PB - Association for Computing Machinery
Y2 - 4 February 2017 through 8 February 2017
ER -