TY - JOUR
T1 - Language-guided Residual Graph Attention Network and Data Augmentation for Visual Grounding
AU - Wang, Jia
AU - Shuai, Hong-Han
AU - Li, Yung-Hui
AU - Cheng, Wen-Huang
N1 - Publisher Copyright:
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2023/8/24
Y1 - 2023/8/24
N2 - Visual grounding is an essential task for understanding the semantic relationship between a given text description and the target object in an image. Due to the innate complexity of language and the rich semantic context of the image, it remains challenging to infer the underlying relationships and to reason between the objects in an image and the given expression. Although existing visual grounding methods have achieved promising progress, cross-modal mapping across different domains is still not well handled, especially when expressions are long and complex. To address this issue, we propose a language-guided residual graph attention network for visual grounding (LRGAT-VG), which enables us to apply deeper graph convolution layers with the assistance of residual connections between them and thus to handle long and complex expressions better than other graph-based methods. Furthermore, we perform Language-guided Data Augmentation (LGDA), which is based on copy-paste operations on pairs of source and target images, to increase the diversity of the training data while maintaining the relationship between the objects in the image and the expression. Extensive experiments on three visual grounding benchmarks, RefCOCO, RefCOCO+, and RefCOCOg, show that LRGAT-VG with LGDA achieves performance competitive with other state-of-the-art graph-network-based referring expression approaches, demonstrating its effectiveness.
AB - Visual grounding is an essential task for understanding the semantic relationship between a given text description and the target object in an image. Due to the innate complexity of language and the rich semantic context of the image, it remains challenging to infer the underlying relationships and to reason between the objects in an image and the given expression. Although existing visual grounding methods have achieved promising progress, cross-modal mapping across different domains is still not well handled, especially when expressions are long and complex. To address this issue, we propose a language-guided residual graph attention network for visual grounding (LRGAT-VG), which enables us to apply deeper graph convolution layers with the assistance of residual connections between them and thus to handle long and complex expressions better than other graph-based methods. Furthermore, we perform Language-guided Data Augmentation (LGDA), which is based on copy-paste operations on pairs of source and target images, to increase the diversity of the training data while maintaining the relationship between the objects in the image and the expression. Extensive experiments on three visual grounding benchmarks, RefCOCO, RefCOCO+, and RefCOCOg, show that LRGAT-VG with LGDA achieves performance competitive with other state-of-the-art graph-network-based referring expression approaches, demonstrating its effectiveness.
KW - data augmentation
KW - residual graph attention network
KW - visual grounding
UR - http://www.scopus.com/inward/record.url?scp=85173285136&partnerID=8YFLogxK
U2 - 10.1145/3604557
DO - 10.1145/3604557
M3 - Article
AN - SCOPUS:85173285136
SN - 1551-6857
VL - 20
JO - ACM Transactions on Multimedia Computing, Communications and Applications
JF - ACM Transactions on Multimedia Computing, Communications and Applications
IS - 1
M1 - 7
ER -