Language-guided Residual Graph Attention Network and Data Augmentation for Visual Grounding

Jia Wang, Hong Han Shuai, Yung Hui Li, Wen Huang Cheng*


Research output: Article, peer-reviewed


Visual grounding is an essential task in understanding the semantic relationship between a given text description and the target object in an image. Because of the innate complexity of language and the rich semantic context of images, inferring the underlying relationships and reasoning over the objects in an image and the given expression remains challenging. Although existing visual grounding methods have made promising progress, cross-modal mapping between the two domains is still not handled well, especially when the expressions are long and complex. To address this issue, we propose a language-guided residual graph attention network for visual grounding (LRGAT-VG), in which residual connections between graph convolution layers allow us to stack deeper layers. This lets the model handle long and complex expressions better than other graph-based methods. Furthermore, we introduce Language-guided Data Augmentation (LGDA), which applies copy-paste operations to pairs of source and target images to increase the diversity of the training data while preserving the relationship between the objects in the image and the expression. In extensive experiments on three visual grounding benchmarks, RefCOCO, RefCOCO+, and RefCOCOg, LRGAT-VG with LGDA achieves performance competitive with other state-of-the-art graph network-based referring expression approaches, demonstrating its effectiveness.
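
The core idea of residual connections between language-guided graph attention layers can be illustrated with a minimal NumPy sketch. This is not the authors' LRGAT-VG implementation: the function names, the tanh language gate, the scaled dot-product scoring, and the weight shapes are all assumptions chosen for illustration. The sketch only shows how a skip connection lets several attention layers be stacked while each layer's attention is modulated by an expression embedding.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def language_guided_layer(nodes, adj, lang, w_node, w_lang):
    """One hypothetical language-guided graph attention layer with a residual.

    nodes:  (n, d) object (node) features
    adj:    (n, n) 0/1 adjacency matrix, assumed to include self-loops
    lang:   (d,)   expression embedding (assumed to be precomputed)
    w_node: (d, d) node projection weights (assumed shape)
    w_lang: (d, d) language projection weights (assumed shape)
    """
    h = nodes @ w_node                                  # project node features
    gate = np.tanh(lang @ w_lang)                       # language gate, shape (d,)
    # Edge score i -> j: language-modulated query i against key j.
    scores = (h * gate) @ h.T / np.sqrt(h.shape[1])
    scores = np.where(adj > 0, scores, -1e9)            # mask out non-edges
    attn = softmax(scores, axis=1)                      # normalize over neighbors
    message = attn @ h                                  # aggregate neighbor features
    return nodes + np.maximum(message, 0.0)             # residual (skip) connection

def stack_layers(nodes, adj, lang, params):
    # Apply the layers in sequence; params is a list of (w_node, w_lang) pairs.
    for w_node, w_lang in params:
        nodes = language_guided_layer(nodes, adj, lang, w_node, w_lang)
    return nodes
```

Because each layer adds its aggregated message to its input, a layer whose weights are zero reduces to the identity, which is the property that makes deeper stacks easier to train in residual architectures generally.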

Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Publication status: Published - 24 Aug 2023

