Referring Expression Comprehension Via Enhanced Cross-modal Graph Attention Networks

Jia Wang, Jingcheng Ke, Hong-Han Shuai, Yung-Hui Li, Wen-Huang Cheng*

*Corresponding author of this work

Research output: Article › peer-review

7 Citations (Scopus)

Abstract

Referring expression comprehension aims to localize a specific object in an image according to a given language description. Bridging the gap between the diverse types of information in the visual and textual domains remains challenging. In general, the task requires extracting the salient features from a given expression and matching them to the features of an image. One challenge in referring expression comprehension is that the number of region proposals generated by object detection methods far exceeds the number of entities in the corresponding language description. Notably, candidate regions not described by the expression severely impair referring expression comprehension. To tackle this problem, we first propose novel Enhanced Cross-modal Graph Attention Networks (ECMGANs) that strengthen the matching between the expression and the entity positions in an image. We then propose an effective strategy named Graph Node Erase (GNE) to assist ECMGANs in eliminating the effect of irrelevant objects on the target object. Experiments on three public referring expression comprehension datasets show unambiguously that our ECMGANs framework outperforms other state-of-the-art methods, and that GNE effectively yields higher visual-expression matching accuracy.
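The record contains no code, but to make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (a) a graph-attention update over region-proposal nodes conditioned on an expression feature and (b) a node-erase step that drops proposals weakly related to the expression. All names here (CrossModalGraphAttention, graph_node_erase, lang_gate, keep_ratio) are hypothetical illustrations, not the paper's actual ECMGANs/GNE implementation, which may differ substantially.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalGraphAttention(nn.Module):
    """One graph-attention layer over region-proposal nodes, with edge
    scores modulated by a pooled expression feature (illustrative only)."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)          # queries from region nodes
        self.k = nn.Linear(dim, dim)          # keys from region nodes
        self.v = nn.Linear(dim, dim)          # values from region nodes
        self.lang_gate = nn.Linear(dim, dim)  # language gate on the keys

    def forward(self, nodes, expr, mask=None):
        # nodes: (N, dim) region features; expr: (dim,) expression feature
        k = self.k(nodes) * torch.sigmoid(self.lang_gate(expr))  # gated keys
        att = self.q(nodes) @ k.t() / nodes.size(-1) ** 0.5      # (N, N) scores
        if mask is not None:
            # erased nodes contribute nothing: mask out their key columns
            att = att.masked_fill(~mask.unsqueeze(0), float('-inf'))
        att = F.softmax(att, dim=-1)
        return nodes + att @ self.v(nodes)                       # residual update

def graph_node_erase(nodes, expr, keep_ratio=0.5):
    """Hypothetical GNE-style step: keep only the region nodes whose
    features are most similar to the expression feature."""
    scores = F.cosine_similarity(nodes, expr.unsqueeze(0), dim=-1)  # (N,)
    k = max(1, int(keep_ratio * nodes.size(0)))
    keep = scores.topk(k).indices
    mask = torch.zeros(nodes.size(0), dtype=torch.bool)
    mask[keep] = True
    return mask

# Usage sketch: erase irrelevant proposals, then run cross-modal attention.
layer = CrossModalGraphAttention(dim=256)
nodes = torch.randn(20, 256)   # 20 region proposals from a detector
expr = torch.randn(256)        # pooled expression embedding
mask = graph_node_erase(nodes, expr, keep_ratio=0.4)
out = layer(nodes, expr, mask=mask)
```

The point of the erase step in this sketch is the one the abstract argues: detectors produce far more proposals than the expression mentions entities, so pruning expression-irrelevant nodes before graph attention keeps them from distracting the match to the target object.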

Original language: English
Article number: 65
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Volume: 19
Issue number: 2
DOIs
Publication status: Published - 6 Feb 2023
