CLIPREC: Graph-Based Domain Adaptive Network for Zero-Shot Referring Expression Comprehension

Jingcheng Ke, Jia Wang, Jun-Cheng Chen*, I-Hong Jhuo, Chia-Wen Lin, Yen-Yu Lin

*Corresponding author of this work

Research output: Article, peer-reviewed

7 citations (Scopus)

Abstract

Referring expression comprehension (REC) is a cross-modal matching task that aims to localize the target object in an image specified by a text description. Most existing approaches for this task focus on identifying only objects whose categories are covered by the training data, which restricts their generalization to unseen categories and limits their practical use. To address this issue, we propose a domain adaptive network called CLIPREC for zero-shot REC, which integrates the Contrastive Language-Image Pretraining (CLIP) model into graph-based REC. CLIPREC is composed of a graph collaborative attention module with two directed graphs: one for the objects in an image and the other for their corresponding categorical labels. To carry out zero-shot REC, we leverage the strong common image-text feature space of the CLIP model to correlate the two graphs. Furthermore, a multilayer perceptron is introduced to align the CLIP features with the expression representation from the language parser, enabling effective reasoning over expressions involving both seen and unseen object categories. Extensive experimental and ablation results on several widely adopted benchmarks show that the proposed approach performs favorably against state-of-the-art approaches for zero-shot REC.
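The abstract describes two mechanisms: an MLP that aligns parser-derived expression features with CLIP's joint image-text space, and a correlation of the object graph with the label graph through shared CLIP features. The PyTorch sketch below illustrates both ideas under stated assumptions; the module names, dimensions (300-d parser features, 512-d CLIP features), single-head attention, and the final dot-product scoring are illustrative choices of ours, not the paper's exact design, and CLIP outputs are stood in by random tensors.

# Minimal sketch of (1) MLP feature alignment into CLIP space and
# (2) object-graph / label-graph correlation via attention in that space.
# Hypothetical names and dimensions; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAlignMLP(nn.Module):
    """Maps language-parser expression features into CLIP's joint space."""
    def __init__(self, parser_dim=300, clip_dim=512, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(parser_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, clip_dim),
        )

    def forward(self, parser_feats):            # (B, parser_dim)
        return F.normalize(self.net(parser_feats), dim=-1)


class GraphCorrelation(nn.Module):
    """Correlates object-graph nodes with label-graph nodes through
    attention computed in the shared CLIP feature space."""
    def __init__(self, clip_dim=512):
        super().__init__()
        self.q = nn.Linear(clip_dim, clip_dim)
        self.k = nn.Linear(clip_dim, clip_dim)
        self.v = nn.Linear(clip_dim, clip_dim)

    def forward(self, obj_nodes, label_nodes):
        # obj_nodes:   (B, N, clip_dim) CLIP image features of region proposals
        # label_nodes: (B, M, clip_dim) CLIP text features of category labels
        attn = torch.softmax(
            self.q(obj_nodes) @ self.k(label_nodes).transpose(1, 2)
            / obj_nodes.size(-1) ** 0.5, dim=-1)        # (B, N, M)
        return obj_nodes + attn @ self.v(label_nodes)   # label-aware object nodes


if __name__ == "__main__":
    B, N, M = 2, 5, 4                   # batch, object nodes, label nodes
    obj = F.normalize(torch.randn(B, N, 512), dim=-1)   # stand-in CLIP image feats
    lab = F.normalize(torch.randn(B, M, 512), dim=-1)   # stand-in CLIP text feats
    expr = torch.randn(B, 300)                          # stand-in parser features
    aligned_expr = FeatureAlignMLP()(expr)              # (B, 512)
    fused = GraphCorrelation()(obj, lab)                # (B, 5, 512)
    # Score each object node against the aligned expression; argmax picks the box.
    scores = (F.normalize(fused, dim=-1) @ aligned_expr.unsqueeze(-1)).squeeze(-1)
    print(scores.shape, scores.argmax(dim=1))

Because both graphs are embedded with the same frozen CLIP encoders, label nodes for categories never seen in REC training still land in the shared space, which is what makes the zero-shot correlation plausible in this sketch.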

Original language: English
Pages (from-to): 2480-2492
Number of pages: 13
Journal: IEEE Transactions on Multimedia
Volume: 26
Publication status: Published - 2024
