TY - GEN
T1 - Deadline-Aware Offloading for High-Throughput Accelerators
AU - Yeh, Tsung Tai
AU - Sinclair, Matthew D.
AU - Beckmann, Bradford M.
AU - Rogers, Timothy G.
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/2
Y1 - 2021/2
AB - Contemporary GPUs are widely used for throughput-oriented, data-parallel workloads and are increasingly being considered for latency-sensitive applications in datacenters. Examples include recurrent neural network (RNN) inference, network packet processing, and intelligent personal assistants. These data-parallel applications have both high throughput demands and real-time deadlines (40μs-7ms). Moreover, the kernels in these applications have relatively few threads and do not fully utilize the device unless a large batch size is used. However, batching forces jobs to wait, which increases their latency, especially when realistic job arrival times are considered. Previously, programmers have managed the tradeoffs associated with concurrent, latency-sensitive jobs by using a combination of GPU streams and advanced scheduling algorithms running on the CPU host. Although GPU streams allow the accelerator to execute multiple jobs concurrently, prior state-of-the-art solutions use the relatively distant CPU host to prioritize latency-sensitive GPU tasks. Thus, these approaches are forced to operate at a coarse granularity and cannot quickly adapt to rapidly changing program behavior. We observe that fine-grain, device-integrated kernel schedulers can efficiently meet the deadlines of concurrent, latency-sensitive GPU jobs. To overcome the limitations of software-only, CPU-side approaches, we extend the GPU queue scheduler to manage real-time deadlines. We propose a novel laxity-aware scheduler (LAX) that uses information collected within the GPU to dynamically vary job priority based on how much laxity jobs have before their deadline. Compared to contemporary GPUs, 3 state-of-the-art CPU-side schedulers, and 6 other advanced GPU-side schedulers, LAX meets the deadlines of 1.7X-5.0X more jobs and provides better energy efficiency, throughput, and 99th-percentile tail latency.
KW - GPGPU
KW - job scheduling
KW - laxity
UR - http://www.scopus.com/inward/record.url?scp=85104991267&partnerID=8YFLogxK
U2 - 10.1109/HPCA51647.2021.00048
DO - 10.1109/HPCA51647.2021.00048
M3 - Conference contribution
AN - SCOPUS:85104991267
T3 - Proceedings - International Symposium on High-Performance Computer Architecture
SP - 479
EP - 492
BT - Proceedings - 27th IEEE International Symposium on High Performance Computer Architecture, HPCA 2021
PB - IEEE Computer Society
T2 - 27th Annual IEEE International Symposium on High Performance Computer Architecture, HPCA 2021
Y2 - 27 February 2021 through 1 March 2021
ER -