Visual Question Answering (VQA) models typically rely on deep attention mechanisms to learn fine-grained representations of image content and question text. However, deep attention captures only high-level semantic information and ignores the contribution of low-level semantic information to answer prediction. To address this, we design a High- and Low-Level Semantic Information Network (HLSIN) that employs two strategies to fuse high- and low-level semantic information. The first, adaptive weight learning, lets each semantic level learn its own fusion weight. The second, a gate-sum mechanism, suppresses invalid information within each level and fuses the valid information. We evaluate HLSIN quantitatively and qualitatively on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explain its effectiveness. Experimental results show that HLSIN significantly outperforms the previous state of the art, achieving an overall accuracy of 70.93% on test-dev.
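A minimal sketch may help make the two fusion strategies concrete. The PyTorch module below is illustrative only and is not the authors' HLSIN implementation: the hidden size dim, the module name GateSumFusion, and the use of sigmoid gates with softmax-normalized level weights are all assumptions; the abstract states only that each semantic level learns its own weight (adaptive weight learning) and that invalid information is gated out before the levels are fused by summation (gate-sum).

import torch
import torch.nn as nn

class GateSumFusion(nn.Module):
    """Hypothetical fusion of low- and high-level semantic features."""
    def __init__(self, dim: int = 512):
        super().__init__()
        # Strategy 1: adaptive weight learning -- one learnable scalar
        # per semantic level, normalized with softmax at fusion time.
        self.level_logits = nn.Parameter(torch.zeros(2))
        # Strategy 2: gate-sum -- a per-feature sigmoid gate that
        # suppresses invalid information in each level before summing.
        self.gate_low = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.gate_high = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # low, high: (batch, dim) pooled features from shallow and deep
        # attention layers, respectively (an assumption of this sketch).
        w = torch.softmax(self.level_logits, dim=0)   # adaptive level weights
        gated_low = self.gate_low(low) * low          # suppress invalid info
        gated_high = self.gate_high(high) * high
        return w[0] * gated_low + w[1] * gated_high   # weighted gate-sum

# Usage: fuse a batch of eight 512-d low- and high-level feature vectors.
fusion = GateSumFusion(dim=512)
fused = fusion(torch.randn(8, 512), torch.randn(8, 512))
print(fused.shape)  # torch.Size([8, 512])

In this sketch the softmax keeps the two level weights on a comparable scale, while the per-feature gates let the network zero out uninformative dimensions within each level before the weighted sum.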
Huimin LI (Shanghai Maritime University)
Dezhi HAN (Shanghai Maritime University)
Chongqing CHEN (Shanghai Maritime University)
Chin-Chen CHANG (Feng Chia University)
Kuan-Ching LI (Providence University)
Dun LI (Shanghai Maritime University)
Huimin LI, Dezhi HAN, Chongqing CHEN, Chin-Chen CHANG, Kuan-Ching LI, Dun LI, "A Visual Question Answering Network Merging High- and Low-Level Semantic Information" in IEICE TRANSACTIONS on Information, vol. E106-D, no. 5, pp. 581-589, May 2023, doi: 10.1587/transinf.2022DLP0002.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2022DLP0002/_p
@ARTICLE{e106-d_5_581,
author={Huimin LI and Dezhi HAN and Chongqing CHEN and Chin-Chen CHANG and Kuan-Ching LI and Dun LI},
journal={IEICE TRANSACTIONS on Information},
title={A Visual Question Answering Network Merging High- and Low-Level Semantic Information},
year={2023},
volume={E106-D},
number={5},
pages={581-589},
doi={10.1587/transinf.2022DLP0002},
ISSN={1745-1361},
month={May}
}
TY - JOUR
TI - A Visual Question Answering Network Merging High- and Low-Level Semantic Information
T2 - IEICE TRANSACTIONS on Information
SP - 581
EP - 589
AU - Huimin LI
AU - Dezhi HAN
AU - Chongqing CHEN
AU - Chin-Chen CHANG
AU - Kuan-Ching LI
AU - Dun LI
PY - 2023
DO - 10.1587/transinf.2022DLP0002
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E106-D
IS - 5
JA - IEICE TRANSACTIONS on Information
Y1 - May 2023
ER -