A Bin-Based Indexing for Scalable Range Join on Genomic Data

Aman Sinha*, Bo Cheng Lai, Jhih Yong Mai

*Corresponding author for this work

Research output: Article, peer-reviewed

1 Citation (Scopus)

Abstract

Range-join is an operation for finding overlaps in interval-form genomic data. It is widely used in genome analysis processes such as annotation, filtering and comparison of variants in whole-genome and exome analysis pipelines. The quadratic complexity of current algorithms, combined with the sheer volume of data, has intensified the design challenges. Existing tools are limited in algorithmic efficiency, parallelism, scalability and memory consumption. This paper proposes BIndex, a novel bin-based indexing algorithm, and its distributed implementation to attain high-throughput range-join processing. BIndex features near-constant search complexity, while its inherently parallel data structure facilitates exploitation of parallel computing architectures. Balanced partitioning of the dataset further enables scalability on distributed frameworks. The implementation on Message Passing Interface shows up to 933.5x speedup over state-of-the-art tools. The parallel nature of BIndex further enables GPU-based acceleration, with a 3.72x speedup over CPU implementations. The add-in modules for Apache Spark provide up to 4.65x speedup over the previously best available tool. BIndex supports a wide variety of input and output formats prevalent in the bioinformatics community, and the algorithm is easily extendable to streaming data in recent Big Data solutions. Furthermore, the index data structure is memory-efficient and consumes up to two orders of magnitude less RAM, with no adverse effect on speedup.
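
The abstract does not describe the internal layout of BIndex, so the following is not the authors' algorithm. It is a minimal sketch of the general idea behind bin-based range-join, assuming fixed-width bins and closed intervals on a single chromosome. The names `BIN_SIZE`, `build_index` and `range_join` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

BIN_SIZE = 1000  # hypothetical bin width in base pairs (an assumption)


def build_index(intervals, bin_size=BIN_SIZE):
    """Map each bin number to the ids of the intervals that touch that bin."""
    index = defaultdict(list)
    for i, (start, end) in enumerate(intervals):
        for b in range(start // bin_size, end // bin_size + 1):
            index[b].append(i)
    return index


def range_join(queries, intervals, bin_size=BIN_SIZE):
    """Return sorted (query_id, interval_id) pairs whose closed ranges overlap."""
    index = build_index(intervals, bin_size)
    hits = set()  # a set, because a pair can share more than one bin
    for q, (qs, qe) in enumerate(queries):
        # Only intervals registered in the bins covered by the query are checked.
        for b in range(qs // bin_size, qe // bin_size + 1):
            for i in index.get(b, ()):
                s, e = intervals[i]
                if s <= qe and qs <= e:  # closed-interval overlap test
                    hits.add((q, i))
    return sorted(hits)


# Toy data: (start, end) positions on one chromosome.
annotations = [(100, 250), (900, 1100), (5000, 5200)]
variants = [(200, 300), (1050, 1060), (3000, 3100)]
pairs = range_join(variants, annotations)
```

Each query inspects only the intervals stored in the bins it spans, rather than every interval, which is how binning avoids the all-pairs comparison. In this sketch the lookup cost depends on the bin width and on how the intervals are distributed across bins.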

Original language: English
Pages (from-to): 2210-2222
Number of pages: 13
Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Volume: 20
Issue number: 3
Publication status: Published - May 2023
