Deadline-Aware Offloading for High-Throughput Accelerators

Tsung Tai Yeh, Matthew D. Sinclair, Bradford M. Beckmann, Timothy G. Rogers

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

11 Scopus citations


Contemporary GPUs are widely used for throughput-oriented data-parallel workloads and increasingly are being considered for latency-sensitive applications in datacenters. Examples include recurrent neural network (RNN) inference, network packet processing, and intelligent personal assistants. These data-parallel applications have both high throughput demands and real-time deadlines (40 μs to 7 ms). Moreover, the kernels in these applications have relatively few threads that do not fully utilize the device unless a large batch size is used. However, batching forces jobs to wait, which increases their latency, especially when realistic job arrival times are considered.

Previously, programmers have managed the tradeoffs associated with concurrent, latency-sensitive jobs by using a combination of GPU streams and advanced scheduling algorithms running on the CPU host. Although GPU streams allow the accelerator to execute multiple jobs concurrently, prior state-of-the-art solutions use the relatively distant CPU host to prioritize the latency-sensitive GPU tasks. Thus, these approaches are forced to operate at a coarse granularity and cannot quickly adapt to rapidly changing program behavior.

We observe that fine-grain, device-integrated kernel schedulers efficiently meet the deadlines of concurrent, latency-sensitive GPU jobs. To overcome the limitations of software-only, CPU-side approaches, we extend the GPU queue scheduler to manage real-time deadlines. We propose a novel laxity-aware scheduler (LAX) that uses information collected within the GPU to dynamically vary job priority based on how much laxity jobs have before their deadline. Compared to contemporary GPUs, 3 state-of-the-art CPU-side schedulers, and 6 other advanced GPU-side schedulers, LAX meets the deadlines of 1.7X-5.0X more jobs and provides better energy efficiency, throughput, and 99th-percentile tail latency.
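To make the notion of laxity concrete, the following is a minimal sketch of textbook least-laxity-first selection, not the LAX hardware scheduler itself, whose details this abstract does not give. It assumes laxity is defined in the usual real-time sense (time to deadline minus estimated remaining execution time). The `Job` fields, the `pick_next` helper, the choice to deprioritize jobs that can no longer meet their deadline, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    deadline_us: float   # absolute deadline, in microseconds
    remaining_us: float  # estimated remaining execution time (assumed known)


def laxity(job: Job, now_us: float) -> float:
    """Slack the job can still afford to wait and finish by its deadline."""
    return job.deadline_us - now_us - job.remaining_us


def pick_next(jobs: List[Job], now_us: float) -> Optional[Job]:
    """Least-laxity-first: run the job with the least non-negative slack.

    Assumption for this sketch: jobs with negative laxity have already
    missed their deadline and are only chosen if nothing else is feasible.
    """
    if not jobs:
        return None
    feasible = [j for j in jobs if laxity(j, now_us) >= 0]
    candidates = feasible or jobs
    return min(candidates, key=lambda j: laxity(j, now_us))


# Hypothetical jobs with deadlines inside the 40 us to 7 ms range above.
jobs = [
    Job("rnn_inference", deadline_us=7000.0, remaining_us=500.0),
    Job("packet", deadline_us=40.0, remaining_us=10.0),
    Job("assistant", deadline_us=1000.0, remaining_us=900.0),
]

first = pick_next(jobs, now_us=0.0)   # "packet": laxity 30 is the smallest
later = pick_next(jobs, now_us=35.0)  # "assistant": "packet" is now infeasible
```

Because laxity shrinks while a job waits, the ordering changes over time, which is why a scheduler that recomputes priorities close to the hardware can react faster than one running on the CPU host.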

Original language: English
Title of host publication: Proceeding - 27th IEEE International Symposium on High Performance Computer Architecture, HPCA 2021
Publisher: IEEE Computer Society
Number of pages: 14
ISBN (Electronic): 9780738123370
State: Published - Feb 2021
Event: 27th Annual IEEE International Symposium on High Performance Computer Architecture, HPCA 2021 - Virtual, Seoul, Korea, Republic of
Duration: 27 Feb 2021 to 1 Mar 2021

Publication series

Name: Proceedings - International Symposium on High-Performance Computer Architecture
ISSN (Print): 1530-0897


Conference: 27th Annual IEEE International Symposium on High Performance Computer Architecture, HPCA 2021
Country/Territory: Korea, Republic of
City: Virtual, Seoul


  • job scheduling
  • laxity

