Referring Expression Comprehension Via Enhanced Cross-modal Graph Attention Networks

Jia Wang, Jingcheng Ke, Hong-Han Shuai, Yung-Hui Li, Wen-Huang Cheng*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Abstract

Referring expression comprehension aims to localize a specific object in an image according to a given language description. It remains challenging to bridge the gap between the heterogeneous kinds of information in the visual and textual domains. Generally, a model needs to extract the salient features from a given expression and match them to the features of the image. One challenge in referring expression comprehension is that the number of region proposals generated by object detection methods far exceeds the number of entities mentioned in the corresponding language description. Notably, candidate regions that are not described by the expression severely degrade referring expression comprehension. To tackle this problem, we first propose novel Enhanced Cross-modal Graph Attention Networks (ECMGANs) that strengthen the matching between the expression and the entity positions in an image. Then, an effective strategy named Graph Node Erase (GNE) is proposed to assist ECMGANs in eliminating the influence of irrelevant objects on the target object. Experiments on three public referring expression comprehension datasets show unambiguously that our ECMGANs framework achieves better performance than other state-of-the-art methods. Moreover, GNE effectively yields higher visual-expression matching accuracy.
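As a rough illustration of the two ideas summarized in the abstract, the sketch below implements a language-conditioned graph attention layer over region-proposal nodes together with a simple node-erasing step that discards proposals with low cross-modal relevance. It is only a minimal sketch under assumed PyTorch conventions; the module names, feature dimensions, erase ratio, and scoring functions are hypothetical and do not reproduce the authors' ECMGANs or GNE implementation.

```python
# Illustrative sketch only: a language-conditioned graph attention layer over
# region-proposal nodes, plus a simple "node erase" step that masks proposals
# with low cross-modal relevance. All names, dimensions, and the thresholding
# rule are hypothetical, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalGraphAttention(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=768, hid_dim=512, erase_ratio=0.3):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.txt_proj = nn.Linear(txt_dim, hid_dim)
        self.edge_score = nn.Linear(2 * hid_dim, 1)  # pairwise edge weights
        self.erase_ratio = erase_ratio               # fraction of nodes to erase

    def forward(self, region_feats, expr_feat):
        # region_feats: (N, vis_dim) features of N region proposals (graph nodes)
        # expr_feat:    (txt_dim,)   pooled feature of the referring expression
        v = self.vis_proj(region_feats)            # (N, hid)
        q = self.txt_proj(expr_feat).unsqueeze(0)  # (1, hid)

        # Cross-modal relevance of each node to the expression.
        rel = F.softmax((v * q).sum(-1) / v.size(-1) ** 0.5, dim=0)  # (N,)

        # "Node erase": drop the least relevant proposals so they cannot
        # propagate noise to the target node during message passing.
        k = max(1, int((1 - self.erase_ratio) * v.size(0)))
        keep = rel.topk(k).indices
        v = v[keep]

        # Graph attention over the remaining nodes (fully connected graph).
        n = v.size(0)
        pairs = torch.cat([v.unsqueeze(1).expand(n, n, -1),
                           v.unsqueeze(0).expand(n, n, -1)], dim=-1)
        att = F.softmax(self.edge_score(pairs).squeeze(-1), dim=-1)  # (N, N)
        v = att @ v                                                   # message passing

        # Matching score of each surviving proposal against the expression.
        scores = (v * q).sum(-1)
        return keep, scores
```

For example, feeding 36 proposal features of dimension 2048 and a 768-dimensional expression feature returns the indices of the surviving proposals and their matching scores, whose argmax gives the predicted region.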

Original language: English
Article number: 65
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Volume: 19
Issue number: 2
DOIs
State: Published - 6 Feb 2023

Keywords

  • Enhanced Cross-modal Graph Attention Networks
  • Graph Node Erase
  • Referring expression comprehension
  • object detection
