Dual Self-Guided Attention with Sparse Question Networks for Visual Question Answering

Xiang SHEN, Dezhi HAN, Chin-Chen CHANG, Liang ZONG

Summary:

Visual Question Answering (VQA) is a multi-task research problem that requires the simultaneous processing of vision and text. Recent VQA models employ co-attention mechanisms to model the interaction between the question and the image. However, modeling all question words and image regions forces irrelevant information into the computation, which degrades performance. This paper proposes a novel dual self-guided attention with sparse question networks (DSSQN) to address this issue. The aim is to keep irrelevant information out of the model when modeling the internal dependencies of both the question and the image, while also overcoming the coarse interaction between sparse question features and image features. First, a sparse question self-attention (SQSA) unit in the encoder computes the features with the highest weights: from the self-attention learning over question words, only the question features with larger weights are retained. Second, the sparse question features are used to guide attention over the image features, yielding fine-grained image features and preventing irrelevant information from entering the model; a dual self-guided attention (DSGA) unit is designed to improve the modal interaction between questions and images. Third, the sparsity parameter δ of the sparse question self-attention is tuned to select the question-related object regions. Experiments on the VQA 2.0 benchmark dataset demonstrate that DSSQN outperforms state-of-the-art methods; for example, the accuracy of the proposed model on test-dev and test-std is 71.03% and 71.37%, respectively. In addition, visualization results show that the model attends to important features more closely than other advanced models. We hope this work will promote the development of VQA in the field of artificial intelligence (AI).
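To make the two mechanisms named in the abstract concrete, the following is a minimal PyTorch sketch of (a) a self-attention over question words that zeroes out attention weights below a sparsity threshold δ and renormalizes, and (b) a question-guided attention in which the sparse question features act as queries over image region features. The class names, the threshold-and-renormalize sparsification, and all dimensions and hyperparameters are illustrative assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseQuestionSelfAttention(nn.Module):
    """Hypothetical SQSA-style unit: self-attention over question words
    that drops attention weights below a threshold delta."""

    def __init__(self, dim: int, delta: float = 0.05):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.delta = delta          # sparsity threshold (assumed usage of δ)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_words, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # Sparsify: zero out low weights, then renormalize each row.
        attn = torch.where(attn >= self.delta, attn, torch.zeros_like(attn))
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        return attn @ v


class QuestionGuidedAttention(nn.Module):
    """Hypothetical guided-attention unit: sparse question features are
    queries; image region features are keys and values."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, question: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # question: (batch, num_words, dim); image: (batch, num_regions, dim)
        q = self.q_proj(question)
        k, v = self.k_proj(image), self.v_proj(image)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v  # image features re-weighted by question relevance


if __name__ == "__main__":
    q_feats = torch.randn(2, 14, 512)    # 14 question words (assumed)
    img_feats = torch.randn(2, 36, 512)  # 36 detected regions (assumed)
    sqsa = SparseQuestionSelfAttention(dim=512, delta=0.05)
    guided = QuestionGuidedAttention(dim=512)
    sparse_q = sqsa(q_feats)
    attended_img = guided(sparse_q, img_feats)
    print(attended_img.shape)  # torch.Size([2, 14, 512])
```

In this reading, the threshold δ controls how many question words survive to guide the image attention; the paper's actual selection rule and the stacking of these units into the full DSSQN encoder-decoder are described in the article itself.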

Publication
IEICE TRANSACTIONS on Information Vol.E105-D No.4 pp.785-796
Publication Date
2022/04/01
Publicized
2022/01/06
Online ISSN
1745-1361
DOI
10.1587/transinf.2021EDP7189
Type of Manuscript
PAPER
Category
Natural Language Processing

Authors

Xiang SHEN
  Shanghai Maritime University
Dezhi HAN
  Shanghai Maritime University
Chin-Chen CHANG
  Feng Chia University
Liang ZONG
  Shaoyang University

Keyword