A Bin-Based Indexing for Scalable Range Join on Genomic Data

Aman Sinha*, Bo Cheng Lai, Jhih Yong Mai

*Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

2 Scopus citations

Abstract

Range-join is an operation for finding overlaps in interval-form genomic data. Range-join is widely used in various genome analysis processes such as annotation, filtering and comparison of variants in whole-genome and exome analysis pipelines. The quadratic complexity of current algorithms with sheer data volume has surged the design challenges. Existing tools have limitations on algorithm efficiency, parallelism, scalability and memory consumption. This paper proposes BIndex, a novel bin-based indexing algorithm and its distributed implementation to attain high throughput range-join processing. BIndex features near-constant search complexity while the inherently parallel data structure facilitates exploitation of parallel computing architectures. Balanced partitioning of dataset further enables scalability on distributed frameworks. The implementation on Message Passing Interface shows upto 933.5x speedup in comparison to state-of-the-art tools. Parallel nature of BIndex further enables GPU-based acceleration with 3.72x speedup than CPU implementations. The add-in modules for Apache Spark provides upto 4.65x speedup than the previously best available tool. BIndex supports wide variety of input and output formats prevalent in bioinformatics community and the algorithm is easily extendable to streaming data in recent Big Data solutions. Furthermore, the index data structure is memoryefficient and consumes upto two orders-of-magnitude lesser RAM, while having no adverse effect on speedup.

Original languageEnglish
Pages (from-to)2210-2222
Number of pages13
JournalIEEE/ACM Transactions on Computational Biology and Bioinformatics
Volume20
Issue number3
DOIs
StatePublished - May 2023

Keywords

  • Bioinformatics
  • distributed systems
  • genomics
  • interval join
  • parallel architectures
  • parallel systems
  • range join
  • tertiary analysis

Fingerprint

Dive into the research topics of 'A Bin-Based Indexing for Scalable Range Join on Genomic Data'. Together they form a unique fingerprint.

Cite this