Automatic generation of textual stories from visual data, known as visual storytelling, is a recent advancement in the image-to-text problem. Instead of taking a single image as input, visual storytelling turns a sequential array of images into coherent sentences. A story contains non-visual concepts as well as literal descriptions of objects. While previous approaches have applied external knowledge, our approach regards the non-visual concept as the semantic correlation between the visual and textual modalities. This paper therefore presents a new feature representation based on a canonical correlation analysis between the two modalities. An attention mechanism, rather than a standard encoder-decoder model, is adopted as the underlying architecture for the image-to-text problem. The Canonical Correlation Attention Mechanism (CAAM), the proposed end-to-end architecture, extracts time series correlation by maximizing the cross-modal correlation. Extensive experiments on the VIST dataset ( http://visionandlanguage.net/VIST/dataset.html ) demonstrate the effectiveness of the architecture in terms of automatic metrics, with additional experiments showing the impact of the modality fusion strategy.
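The central quantity in the abstract is the cross-modal canonical correlation that the architecture maximizes. As an illustration only (not the authors' implementation), the sketch below computes the total canonical correlation between a matrix of visual features and a matrix of textual features, in the style of deep-CCA objectives (Andrew et al., 2013); the function name, the regularizer `reg`, the feature dimensions, and the toy data are assumptions made for this example.

```python
# A minimal sketch of a CCA-based cross-modal correlation objective.
# Hypothetical illustration only; dimensions and regularizer are assumed.
import numpy as np

def cca_correlation(H_img, H_txt, reg=1e-4):
    """Sum of canonical correlations between two views.

    H_img: (n, d1) visual features; H_txt: (n, d2) textual features.
    Maximizing this value during training aligns the two modalities.
    """
    n = H_img.shape[0]
    H1 = H_img - H_img.mean(axis=0)  # center each view
    H2 = H_txt - H_txt.mean(axis=0)

    # Regularized covariance and cross-covariance estimates.
    S11 = (H1.T @ H1) / (n - 1) + reg * np.eye(H1.shape[1])
    S22 = (H2.T @ H2) / (n - 1) + reg * np.eye(H2.shape[1])
    S12 = (H1.T @ H2) / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    # Singular values of T = S11^{-1/2} S12 S22^{-1/2} are the
    # canonical correlations between the two views.
    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    return np.linalg.svd(T, compute_uv=False).sum()

# Toy usage: two "modalities" generated from a shared latent factor.
rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))
H_img = z @ rng.normal(size=(8, 64)) + 0.1 * rng.normal(size=(256, 64))
H_txt = z @ rng.normal(size=(8, 32)) + 0.1 * rng.normal(size=(256, 32))
print(f"total canonical correlation: {cca_correlation(H_img, H_txt):.3f}")
```

In an end-to-end model such as the one the abstract describes, a quantity of this form would serve as (part of) the training objective, with the two feature matrices produced by the visual and textual encoders.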
Rizal Setya PERDANA
Toyohashi University of Technology, Universitas Brawijaya
Yoshiteru ISHIDA
Toyohashi University of Technology
Rizal Setya PERDANA, Yoshiteru ISHIDA, "Vision-Text Time Series Correlation for Visual-to-Language Story Generation" in IEICE TRANSACTIONS on Information,
vol. E104-D, no. 6, pp. 828-839, June 2021, doi: 10.1587/transinf.2020EDP7131.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2020EDP7131/_p
@ARTICLE{e104-d_6_828,
author={Rizal Setya PERDANA and Yoshiteru ISHIDA},
journal={IEICE TRANSACTIONS on Information},
title={Vision-Text Time Series Correlation for Visual-to-Language Story Generation},
year={2021},
volume={E104-D},
number={6},
pages={828-839},
doi={10.1587/transinf.2020EDP7131},
ISSN={1745-1361},
month={June},}
TY - JOUR
TI - Vision-Text Time Series Correlation for Visual-to-Language Story Generation
T2 - IEICE TRANSACTIONS on Information
SP - 828
EP - 839
AU - Rizal Setya PERDANA
AU - Yoshiteru ISHIDA
PY - 2021
DO - 10.1587/transinf.2020EDP7131
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E104-D
IS - 6
JA - IEICE TRANSACTIONS on Information
Y1 - June 2021
ER -