TY - GEN
T1 - Capture Concept Through Comparison
T2 - 17th Asian Conference on Computer Vision, ACCV 2024
AU - Song, Yun Zhu
AU - Chen, Yi Syuan
AU - Lin, Tzu Ling
AU - Liu, Bei
AU - Fu, Jianlong
AU - Shuai, Hong Han
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
AB - Achieving alignment between vision and language semantics poses a critical challenge. Prior works have sought to enhance alignment by incorporating additional supervision, such as tags or object bounding boxes, as anchors between modalities. However, these methods predominantly concentrate on aligning tangible entities, disregarding other crucial abstract concepts that elude perception, such as side by side. To overcome this limitation, we propose a novel approach to Capture various Concepts through data Comparison (C3) for learning cross-modal representations. Specifically, we devise a data mining procedure to uncover intrinsic information within the database, avoiding the need for external annotations. Furthermore, we distinctly frame model inputs as triplets to better elucidate abstract semantics in images. Building upon this formulation, we propose two concept-centric pre-training objectives to signify concept learning. Extensive experiments show that models trained within the C3 framework consistently achieve significant enhancements across a wide range of comprehension and reasoning benchmarks, whether starting from scratch or fine-tuning from an existing model.
KW - Concept Learning
KW - Information Mining
KW - Vision-and-Language Learning
UR - http://www.scopus.com/inward/record.url?scp=85213057562&partnerID=8YFLogxK
U2 - 10.1007/978-981-96-0908-6_13
DO - 10.1007/978-981-96-0908-6_13
M3 - Conference contribution
AN - SCOPUS:85213057562
SN - 9789819609079
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 220
EP - 238
BT - Computer Vision – ACCV 2024 - 17th Asian Conference on Computer Vision, Proceedings
A2 - Cho, Minsu
A2 - Laptev, Ivan
A2 - Tran, Du
A2 - Yao, Angela
A2 - Zha, Hongbin
PB - Springer Science and Business Media Deutschland GmbH
Y2 - 8 December 2024 through 12 December 2024
ER -