Visual Question Answering (VQA) models typically rely on deep attention mechanisms to learn fine-grained representations of image content and question text. However, deep attention captures only high-level semantic information and ignores the contribution of low-level semantic information to answer prediction. To address this, we design a High- and Low-Level Semantic Information Network (HLSIN) that employs two strategies to fuse high- and low-level semantic information. The first, adaptive weight learning, lets each semantic level learn its own fusion weight. The second, a gate-sum mechanism, suppresses invalid information within each level and fuses the valid information. We evaluate HLSIN quantitatively and qualitatively on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explain its effectiveness. Experimental results show that HLSIN significantly outperforms the previous state of the art, achieving an overall accuracy of 70.93% on test-dev.
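A minimal sketch may help make the two fusion strategies concrete. The PyTorch module below is illustrative only and is not the authors' HLSIN implementation: the hidden size dim, the module name GateSumFusion, and the use of sigmoid gates with softmax-normalized level weights are all assumptions; the abstract states only that each semantic level learns its own weight (adaptive weight learning) and that invalid information is gated out before the levels are fused by summation (gate-sum).

import torch
import torch.nn as nn

class GateSumFusion(nn.Module):
    """Hypothetical fusion of low- and high-level semantic features."""
    def __init__(self, dim: int = 512):
        super().__init__()
        # Strategy 1: adaptive weight learning -- one learnable scalar
        # per semantic level, normalized with softmax at fusion time.
        self.level_logits = nn.Parameter(torch.zeros(2))
        # Strategy 2: gate-sum -- a per-feature sigmoid gate that
        # suppresses invalid information in each level before summing.
        self.gate_low = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.gate_high = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # low, high: (batch, dim) pooled features from shallow and deep
        # attention layers, respectively (an assumption of this sketch).
        w = torch.softmax(self.level_logits, dim=0)   # adaptive level weights
        gated_low = self.gate_low(low) * low          # suppress invalid info
        gated_high = self.gate_high(high) * high
        return w[0] * gated_low + w[1] * gated_high   # weighted gate-sum

# Usage: fuse a batch of eight 512-d low- and high-level feature vectors.
fusion = GateSumFusion(dim=512)
fused = fusion(torch.randn(8, 512), torch.randn(8, 512))
print(fused.shape)  # torch.Size([8, 512])

In this sketch the softmax keeps the two level weights on a comparable scale, while the per-feature gates let the network zero out uninformative dimensions within each level before the weighted sum.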
Huimin LI (Shanghai Maritime University)
Dezhi HAN (Shanghai Maritime University)
Chongqing CHEN (Shanghai Maritime University)
Chin-Chen CHANG (Feng Chia University)
Kuan-Ching LI (Providence University)
Dun LI (Shanghai Maritime University)
Huimin LI, Dezhi HAN, Chongqing CHEN, Chin-Chen CHANG, Kuan-Ching LI, Dun LI, "A Visual Question Answering Network Merging High- and Low-Level Semantic Information" in IEICE TRANSACTIONS on Information, vol. E106-D, no. 5, pp. 581-589, May 2023, doi: 10.1587/transinf.2022DLP0002.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2022DLP0002/_p
@ARTICLE{e106-d_5_581,
author={Huimin LI and Dezhi HAN and Chongqing CHEN and Chin-Chen CHANG and Kuan-Ching LI and Dun LI},
journal={IEICE TRANSACTIONS on Information},
title={A Visual Question Answering Network Merging High- and Low-Level Semantic Information},
year={2023},
volume={E106-D},
number={5},
pages={581-589},
doi={10.1587/transinf.2022DLP0002},
ISSN={1745-1361},
month={May}
}
TY - JOUR
TI - A Visual Question Answering Network Merging High- and Low-Level Semantic Information
T2 - IEICE TRANSACTIONS on Information
SP - 581
EP - 589
AU - Huimin LI
AU - Dezhi HAN
AU - Chongqing CHEN
AU - Chin-Chen CHANG
AU - Kuan-Ching LI
AU - Dun LI
PY - 2023
DO - 10.1587/transinf.2022DLP0002
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E106-D
IS - 5
JA - IEICE TRANSACTIONS on Information
Y1 - May 2023
ER -