Visual Question Answering (VQA) is multi-task research that requires simultaneous processing of vision and text. Recent research on the VQA models employ a co-attention mechanism to build a model between the context and the image. However, the features of questions and the modeling of the image region force irrelevant information to be calculated in the model, thus affecting the performance. This paper proposes a novel dual self-guided attention with sparse question networks (DSSQN) to address this issue. The aim is to avoid having irrelevant information calculated into the model when modeling the internal dependencies on both the question and image. Simultaneously, it overcomes the coarse interaction between sparse question features and image features. First, the sparse question self-attention (SQSA) unit in the encoder calculates the feature with the highest weight. From the self-attention learning of question words, the question features of larger weights are reserved. Secondly, sparse question features are utilized to guide the focus on image features to obtain fine-grained image features, and to also prevent irrelevant information from being calculated into the model. A dual self-guided attention (DSGA) unit is designed to improve modal interaction between questions and images. Third, the sparse question self-attention of the parameter δ is optimized to select these question-related object regions. Our experiments with VQA 2.0 benchmark datasets demonstrate that DSSQN outperforms the state-of-the-art methods. For example, the accuracy of our proposed model on the test-dev and test-std is 71.03% and 71.37%, respectively. In addition, we show through visualization results that our model can pay more attention to important features than other advanced models. At the same time, we also hope that it can promote the development of VQA in the field of artificial intelligence (AI).
Xiang SHEN
Shanghai Maritime University
Dezhi HAN
Shanghai Maritime University
Chin-Chen CHANG
Feng Chia University
Liang ZONG
Shaoyang University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Copy
Xiang SHEN, Dezhi HAN, Chin-Chen CHANG, Liang ZONG, "Dual Self-Guided Attention with Sparse Question Networks for Visual Question Answering" in IEICE TRANSACTIONS on Information,
vol. E105-D, no. 4, pp. 785-796, April 2022, doi: 10.1587/transinf.2021EDP7189.
Abstract: Visual Question Answering (VQA) is multi-task research that requires simultaneous processing of vision and text. Recent research on the VQA models employ a co-attention mechanism to build a model between the context and the image. However, the features of questions and the modeling of the image region force irrelevant information to be calculated in the model, thus affecting the performance. This paper proposes a novel dual self-guided attention with sparse question networks (DSSQN) to address this issue. The aim is to avoid having irrelevant information calculated into the model when modeling the internal dependencies on both the question and image. Simultaneously, it overcomes the coarse interaction between sparse question features and image features. First, the sparse question self-attention (SQSA) unit in the encoder calculates the feature with the highest weight. From the self-attention learning of question words, the question features of larger weights are reserved. Secondly, sparse question features are utilized to guide the focus on image features to obtain fine-grained image features, and to also prevent irrelevant information from being calculated into the model. A dual self-guided attention (DSGA) unit is designed to improve modal interaction between questions and images. Third, the sparse question self-attention of the parameter δ is optimized to select these question-related object regions. Our experiments with VQA 2.0 benchmark datasets demonstrate that DSSQN outperforms the state-of-the-art methods. For example, the accuracy of our proposed model on the test-dev and test-std is 71.03% and 71.37%, respectively. In addition, we show through visualization results that our model can pay more attention to important features than other advanced models. At the same time, we also hope that it can promote the development of VQA in the field of artificial intelligence (AI).
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2021EDP7189/_p
Copy
@ARTICLE{e105-d_4_785,
author={Xiang SHEN, Dezhi HAN, Chin-Chen CHANG, Liang ZONG, },
journal={IEICE TRANSACTIONS on Information},
title={Dual Self-Guided Attention with Sparse Question Networks for Visual Question Answering},
year={2022},
volume={E105-D},
number={4},
pages={785-796},
abstract={Visual Question Answering (VQA) is multi-task research that requires simultaneous processing of vision and text. Recent research on the VQA models employ a co-attention mechanism to build a model between the context and the image. However, the features of questions and the modeling of the image region force irrelevant information to be calculated in the model, thus affecting the performance. This paper proposes a novel dual self-guided attention with sparse question networks (DSSQN) to address this issue. The aim is to avoid having irrelevant information calculated into the model when modeling the internal dependencies on both the question and image. Simultaneously, it overcomes the coarse interaction between sparse question features and image features. First, the sparse question self-attention (SQSA) unit in the encoder calculates the feature with the highest weight. From the self-attention learning of question words, the question features of larger weights are reserved. Secondly, sparse question features are utilized to guide the focus on image features to obtain fine-grained image features, and to also prevent irrelevant information from being calculated into the model. A dual self-guided attention (DSGA) unit is designed to improve modal interaction between questions and images. Third, the sparse question self-attention of the parameter δ is optimized to select these question-related object regions. Our experiments with VQA 2.0 benchmark datasets demonstrate that DSSQN outperforms the state-of-the-art methods. For example, the accuracy of our proposed model on the test-dev and test-std is 71.03% and 71.37%, respectively. In addition, we show through visualization results that our model can pay more attention to important features than other advanced models. At the same time, we also hope that it can promote the development of VQA in the field of artificial intelligence (AI).},
keywords={},
doi={10.1587/transinf.2021EDP7189},
ISSN={1745-1361},
month={April},}
Copy
TY - JOUR
TI - Dual Self-Guided Attention with Sparse Question Networks for Visual Question Answering
T2 - IEICE TRANSACTIONS on Information
SP - 785
EP - 796
AU - Xiang SHEN
AU - Dezhi HAN
AU - Chin-Chen CHANG
AU - Liang ZONG
PY - 2022
DO - 10.1587/transinf.2021EDP7189
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E105-D
IS - 4
JA - IEICE TRANSACTIONS on Information
Y1 - April 2022
AB - Visual Question Answering (VQA) is multi-task research that requires simultaneous processing of vision and text. Recent research on the VQA models employ a co-attention mechanism to build a model between the context and the image. However, the features of questions and the modeling of the image region force irrelevant information to be calculated in the model, thus affecting the performance. This paper proposes a novel dual self-guided attention with sparse question networks (DSSQN) to address this issue. The aim is to avoid having irrelevant information calculated into the model when modeling the internal dependencies on both the question and image. Simultaneously, it overcomes the coarse interaction between sparse question features and image features. First, the sparse question self-attention (SQSA) unit in the encoder calculates the feature with the highest weight. From the self-attention learning of question words, the question features of larger weights are reserved. Secondly, sparse question features are utilized to guide the focus on image features to obtain fine-grained image features, and to also prevent irrelevant information from being calculated into the model. A dual self-guided attention (DSGA) unit is designed to improve modal interaction between questions and images. Third, the sparse question self-attention of the parameter δ is optimized to select these question-related object regions. Our experiments with VQA 2.0 benchmark datasets demonstrate that DSSQN outperforms the state-of-the-art methods. For example, the accuracy of our proposed model on the test-dev and test-std is 71.03% and 71.37%, respectively. In addition, we show through visualization results that our model can pay more attention to important features than other advanced models. At the same time, we also hope that it can promote the development of VQA in the field of artificial intelligence (AI).
ER -