Language-guided Residual Graph Attention Network and Data Augmentation for Visual Grounding

Jia Wang, Hong-Han Shuai, Yung-Hui Li, Wen-Huang Cheng*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Visual grounding is an essential task in understanding the semantic relationship between a given text description and the target object in an image. Due to the innate complexity of language and the rich semantic context of images, it remains challenging to infer the underlying relationship and to reason between the objects in an image and the given expression. Although existing visual grounding methods have achieved promising progress, cross-modal mapping between the two domains is still not well handled, especially when the expressions are long and complex. To address this issue, we propose a language-guided residual graph attention network for visual grounding (LRGAT-VG), in which residual connections between graph convolution layers enable deeper stacking, allowing LRGAT-VG to handle long and complex expressions better than other graph-based methods. Furthermore, we introduce a Language-guided Data Augmentation (LGDA) scheme based on copy-paste operations on pairs of source and target images, which increases the diversity of the training data while preserving the relationship between the objects in the image and the expression. Extensive experiments on three visual grounding benchmarks, RefCOCO, RefCOCO+, and RefCOCOg, show that LRGAT-VG with LGDA achieves performance competitive with state-of-the-art graph network-based referring expression approaches, demonstrating its effectiveness.
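The residual stacking idea described in the abstract can be illustrated with a minimal sketch: a GAT-style attention layer whose output is added back to its input, so that deeper stacks remain trainable. This is an illustrative NumPy sketch under assumed shapes and a standard single-head attention form, not the paper's actual implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def residual_gat_layer(h, adj, W, a):
    """One graph attention layer with a residual (skip) connection.

    h:   (N, d) node features (e.g., region features of detected objects)
    adj: (N, N) adjacency mask (nonzero = edge; include self-loops)
    W:   (d, d) shared linear projection
    a:   (2*d,) attention vector, split into source/target halves
    """
    z = h @ W                                   # project node features
    d = z.shape[1]
    # pairwise attention logits e_ij = a^T [z_i || z_j], via broadcasting
    logits = (z @ a[:d])[:, None] + (z @ a[d:])[None, :]
    logits = np.where(adj > 0, logits, -1e9)    # mask out non-edges
    alpha = softmax(logits, axis=1)             # normalize over neighbors
    out = alpha @ z                             # aggregate neighbor features
    return h + np.maximum(out, 0.0)             # residual connection + ReLU

def stack_layers(h, adj, params):
    # deeper graph convolution enabled by the residual connections
    for W, a in params:
        h = residual_gat_layer(h, adj, W, a)
    return h
```

With zero-initialized `W` and `a`, each layer reduces to the identity map, which is the property that lets gradients pass through deep stacks; the language guidance and multi-head details of the actual model are omitted here.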

Original language: English
Article number: 7
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Issue number: 1
State: Published - 24 Aug 2023


  • Data augmentation
  • Residual graph attention network
  • Visual grounding


