An MDL-Based Genetic Algorithm for Genome Sequence Compression

M. Zohaib Nawaz, M. Saqib Nawaz, Philippe Fournier-Viger*, Vincent S. Tseng

*Corresponding author for this work

Research output: Conference contribution, peer-reviewed

1 citation (Scopus)

Abstract

The exponential growth of genomic data has posed significant challenges for lossless compression of genome sequences. While recent reference-free genome compressors have shown promising results, they often fail to fully leverage the inherent sequential structure of genome sequences, require substantial computational resources and lack (or have limited) interpretability. This paper presents a novel genome compression method that employs the Minimum Description Length (MDL) principle, which is based on the idea that the best model for a given dataset is the one that provides the shortest description of that dataset. The proposed compressor, called GMG (Genetic algorithm for MDL-based Genome compression), integrates a genetic algorithm to identify optimal k-mers (patterns) in a model to best compress the genome data. Experimental results across various datasets demonstrate that GMG outperforms state-of-the-art genome compressors in terms of bits-per-base compression and computational efficiency. Furthermore, it is demonstrated that the optimal patterns identified by GMG for compression can also be utilized for genome classification, offering a multifunctional advantage over previous compressors. GMG is freely available at github.com/MuhammadzohaibNawaz/GMG
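The MDL idea in the abstract can be illustrated with a toy sketch: score a set of k-mers by the bits needed to store the k-mers themselves plus the bits needed to encode the sequence with them, then let a small genetic algorithm search for the set with the lowest score. This is not GMG's implementation. The function names `description_length` and `evolve`, the 2-bits-per-base model cost, the fixed-length symbol code, the greedy left-to-right cover, and all parameter values are assumptions made for illustration only.

```python
import math
import random

BITS_PER_BASE = 2  # assumption: A, C, G, T each cost 2 bits when spelled out


def description_length(seq, patterns):
    """Two-part MDL cost in bits: L(model) + L(data | model)."""
    # Longest patterns first; ties broken alphabetically for determinism.
    ordered = sorted(set(patterns), key=lambda p: (-len(p), p))
    # Model cost: every pattern in the codebook is spelled out base by base.
    model_bits = sum(BITS_PER_BASE * len(p) for p in ordered)
    # Data cost: each emitted symbol is one of 4 bases or one codebook entry,
    # encoded with a fixed-length code (a simplifying assumption).
    symbol_bits = math.log2(4 + len(ordered))
    i = symbols = 0
    while i < len(seq):
        match = next((p for p in ordered if seq.startswith(p, i)), None)
        i += len(match) if match else 1
        symbols += 1
    return model_bits + symbols * symbol_bits


def evolve(seq, k=4, pop_size=20, generations=30, seed=0):
    """Toy genetic algorithm: search for the k-mer set with the lowest MDL cost."""
    rng = random.Random(seed)
    candidates = sorted({seq[i:i + k] for i in range(len(seq) - k + 1)})
    if not candidates:
        return [], description_length(seq, [])

    def decode(mask):
        # An individual is a 0/1 mask over the candidate k-mers.
        return [c for c, bit in zip(candidates, mask) if bit]

    def cost(mask):
        return description_length(seq, decode(mask))

    # Seed the population with the empty model so the result is never
    # worse than the plain 2-bits-per-base encoding.
    pop = [[0] * len(candidates)]
    pop += [[int(rng.random() < 0.1) for _ in candidates]
            for _ in range(pop_size - 1)]

    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[: pop_size // 2]  # elitist selection: keep the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(len(candidates) + 1)  # one-point crossover
            child = a[:cut] + b[cut:]
            child[rng.randrange(len(child))] ^= 1     # single bit-flip mutation
            children.append(child)
        pop = parents + children

    best = min(pop, key=cost)
    return decode(best), cost(best)


if __name__ == "__main__":
    sequence = "ACGT" * 50
    kmers, bits = evolve(sequence)
    print(kmers, round(bits / len(sequence), 3), "bits per base")
```

Because the empty model is part of the initial population and the best half always survives, the returned cost cannot exceed the 2-bits-per-base baseline in this sketch. A real compressor would replace the fixed-length code with an entropy coder and use a far richer model.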

Original language: English
Title of host publication: Proceedings - 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024
Editors: Mario Cannataro, Huiru Zheng, Lin Gao, Jianlin Cheng, Joao Luis de Miranda, Ester Zumpano, Xiaohua Hu, Young-Rae Cho, Taesung Park
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 6724-6731
Number of pages: 8
ISBN (electronic): 9798350386226
Publication status: Published - 2024
Event: 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024 - Lisbon, Portugal
Duration: 3 Dec 2024 to 6 Dec 2024

Publication series

Name: Proceedings - 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024

Conference

Conference: 2024 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2024
Country/Territory: Portugal
City: Lisbon
Period: 3/12/24 to 6/12/24
