TY - GEN
T1 - Language-Guided Negative Sample Mining for Open-Vocabulary Object Detection
AU - Tseng, Yu Wen
AU - Shuai, Hong Han
AU - Huang, Ching Chun
AU - Li, Yung Hui
AU - Cheng, Wen Huang
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - In the domain of computer vision, object detection serves as a fundamental perceptual task with critical implications. Traditional object detection frameworks are limited by their inability to recognize object classes not present in their training datasets, a significant drawback for practical applications where encountering novel objects is commonplace. To address the inherent lack of adaptability, more sophisticated paradigms such as zero-shot and open-vocabulary object detection have been introduced. Open-vocabulary object detection, in particular, often necessitates auxiliary image-text paired data to enhance model training. Our research proposes an innovative approach that refines the training process by mining potential unlabeled objects from negative sample pools. Leveraging a large-scale vision-language model, we harness the entropy of classification scores to selectively identify and annotate previously unlabeled samples, subsequently incorporating them into the training regimen. This novel methodology empowers our model to attain competitive performance benchmarks on the challenging MSCOCO dataset, matching state-of-the-art outcomes, while obviating the need for additional data or supplementary training procedures.
KW - Open-vocabulary detection
KW - large vision-language model
KW - negative sample mining
UR - http://www.scopus.com/inward/record.url?scp=85189240042&partnerID=8YFLogxK
U2 - 10.1109/ICEIC61013.2024.10457133
DO - 10.1109/ICEIC61013.2024.10457133
M3 - Conference contribution
AN - SCOPUS:85189240042
T3 - 2024 International Conference on Electronics, Information, and Communication, ICEIC 2024
BT - 2024 International Conference on Electronics, Information, and Communication, ICEIC 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 International Conference on Electronics, Information, and Communication, ICEIC 2024
Y2 - 28 January 2024 through 31 January 2024
ER -