Capture Concept Through Comparison: Vision-and-Language Representation Learning with Intrinsic Information Mining

Yun Zhu Song, Yi Syuan Chen, Tzu Ling Lin, Bei Liu, Jianlong Fu, Hong Han Shuai*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

Achieving alignment between vision and language semantics poses a critical challenge. Prior works have sought to enhance alignment by incorporating additional supervision, such as tags or object bounding boxes, as anchors between modalities. However, these methods predominantly concentrate on aligning tangible entities, disregarding other crucial abstract concepts that elude perception, such as "side by side". To overcome this limitation, we propose a novel approach to Capture various Concepts through data Comparison (C3) for learning cross-modal representations. Specifically, we devise a data mining procedure to uncover intrinsic information within the database, avoiding the need for external annotations. Furthermore, we distinctly frame model inputs as triplets to better elucidate abstract semantics in images. Building upon this formulation, we propose two concept-centric pre-training objectives to signify concept learning. Extensive experiments show that models trained within the C3 framework consistently achieve significant enhancements across a wide range of comprehension and reasoning benchmarks, whether starting from scratch or fine-tuning from an existing model.

Original language: English
Title of host publication: Computer Vision – ACCV 2024 - 17th Asian Conference on Computer Vision, Proceedings
Editors: Minsu Cho, Ivan Laptev, Du Tran, Angela Yao, Hongbin Zha
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 220-238
Number of pages: 19
ISBN (Print): 9789819609079
DOIs
State: Published - 2025
Event: 17th Asian Conference on Computer Vision, ACCV 2024 - Hanoi, Viet Nam
Duration: 8 Dec 2024 to 12 Dec 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15474 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th Asian Conference on Computer Vision, ACCV 2024
Country/Territory: Viet Nam
City: Hanoi
Period: 8/12/24 to 12/12/24

Keywords

  • Concept Learning
  • Information Mining
  • Vision-and-Language Learning
