CLIPREC: Graph-based Domain Adaptive Network for Zero-Shot Referring Expression Comprehension

Jingcheng Ke, Jia Wang, Jun Cheng Chen, I. Hong Jhuo, Chia Wen Lin, Yen Yu Lin

Research output: Contribution to journal › Article › peer-review

Abstract

Referring expression comprehension (REC) is a cross-modal matching task that aims to localize the target object in an image specified by a text description. Most existing approaches for this task focus on identifying only objects whose categories are covered by training data. This restricts their generalization to unseen categories and practical usage. To address this issue, we propose a domain adaptive network called CLIPREC for zero-shot REC, which integrates the Contrastive Language-Image Pretraining (CLIP) model for graph-based REC. The proposed CLIPREC is composed of a graph collaborative attention module with two directed graphs: one for objects in an image and the other for their corresponding categorical labels. To carry out zero-shot REC, we leverage the strong common image-text feature space from the CLIP model to correlate the two graphs. Furthermore, a multilayer perceptron is introduced to enable feature alignment so that the CLIP model is adapted to the expression representation from the language parser, resulting in effective reasoning from expressions involving both seen and unseen object categories. Extensive experimental and ablation results on several widely adopted benchmarks show that the proposed approach performs favorably against state-of-the-art approaches for zero-shot REC.
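
As a rough illustration of the two ideas named in the abstract (a shared image-text feature space that relates objects to their categorical labels, and a multilayer perceptron that aligns parser-side expression features with that space), a minimal NumPy sketch might look like the following. It is not the authors' implementation: the directed graphs and the graph collaborative attention module are omitted, random vectors stand in for CLIP embeddings, and every name, dimension, and the averaging rule in `score_objects` is a hypothetical choice made only for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length so dot products act as cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def mlp_align(x, w1, b1, w2, b2):
    """Hypothetical two-layer perceptron mapping parser features into the shared space."""
    hidden = np.maximum(x @ w1 + b1, 0.0)  # ReLU activation
    return hidden @ w2 + b2


def score_objects(object_feats, label_feats, expr_feat):
    """Score each candidate object against the expression.

    Assumption: the score is the mean of two cosine similarities, one from the
    object's visual embedding and one from its label embedding.
    """
    obj = l2_normalize(object_feats)
    lab = l2_normalize(label_feats)
    expr = l2_normalize(expr_feat)
    return 0.5 * (obj @ expr + lab @ expr)


# Stand-in data: random vectors in place of real CLIP and parser features.
rng = np.random.default_rng(0)
num_objects, shared_dim, parser_dim, hidden_dim = 4, 512, 300, 256
object_feats = rng.normal(size=(num_objects, shared_dim))  # one row per detected object
label_feats = rng.normal(size=(num_objects, shared_dim))   # one row per object label
parser_feat = rng.normal(size=(parser_dim,))               # expression feature from a parser

# Untrained, randomly initialized alignment weights (illustration only).
w1 = rng.normal(scale=0.05, size=(parser_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.05, size=(hidden_dim, shared_dim))
b2 = np.zeros(shared_dim)

expr_feat = mlp_align(parser_feat, w1, b1, w2, b2)
scores = score_objects(object_feats, label_feats, expr_feat)
predicted = int(np.argmax(scores))  # index of the best-matching object
```

Because the label embeddings live in the same space as the visual ones, a label for a category never seen during training can still be scored, which is the intuition behind the zero-shot setting described above.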

Original language: English
Pages (from-to): 1-13
Number of pages: 13
Journal: IEEE Transactions on Multimedia
State: Accepted/In press - 2023

Keywords

  • Adaptation models
  • Adaptive systems
  • CLIP
  • Cognition
  • domain adaptive network
  • Object detection
  • Referring expression comprehension
  • Task analysis
  • Training data
  • Visualization
  • zero-shot learning
