
Vision-Text Time Series Correlation for Visual-to-Language Story Generation

Rizal Setya PERDANA, Yoshiteru ISHIDA


Summary:

Automatic generation of textual stories from visual data, known as visual storytelling, is a recent advance in the image-to-text problem. Instead of using a single image as input, visual storytelling processes a sequential array of images into coherent sentences. A story contains non-visual concepts as well as descriptions of literal objects. While previous approaches have applied external knowledge, our approach regards the non-visual concept as the semantic correlation between the visual and textual modalities. This paper therefore presents a new feature representation based on a canonical correlation analysis between the two modalities. An attention mechanism is adopted as the underlying architecture for the image-to-text problem, rather than a standard encoder-decoder model. The Canonical Correlation Attention Mechanism (CAAM), the proposed end-to-end architecture, extracts time-series correlation by maximizing the cross-modal correlation. Extensive experiments on the VIST dataset ( http://visionandlanguage.net/VIST/dataset.html ) demonstrate the effectiveness of the architecture in terms of automatic metrics, with additional experiments showing the impact of the modality fusion strategy.
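The abstract's core idea is measuring cross-modal correlation between image features and text features via canonical correlation analysis (CCA). As a rough illustration of that computation (a minimal NumPy sketch, not the authors' CAAM implementation; the feature matrices, dimensions, and regularization constant below are illustrative assumptions), the canonical correlations between two modality feature matrices can be obtained from the singular values of the whitened cross-covariance:

```python
import numpy as np

def canonical_correlations(X, Y, reg=1e-6):
    """Canonical correlations between two modality feature matrices.

    X: (n_samples, dx) array, e.g. visual features per time step.
    Y: (n_samples, dy) array, e.g. textual features per time step.
    Returns the canonical correlations in descending order.
    """
    # Center each modality.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Sample (cross-)covariances, with a small ridge term for stability.
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # Singular values of the whitened cross-covariance are the
    # canonical correlations.
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(T, compute_uv=False)
```

Maximizing these correlations (e.g. as a training objective over learned projections) is the general mechanism the abstract alludes to; how CAAM couples this with attention is detailed in the paper itself.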

Publication
IEICE TRANSACTIONS on Information Vol.E104-D No.6 pp.828-839
Publication Date
2021/06/01
Publicized
2021/03/08
Online ISSN
1745-1361
DOI
10.1587/transinf.2020EDP7131
Type of Manuscript
PAPER
Category
Artificial Intelligence, Data Mining

Authors

Rizal Setya PERDANA
  Toyohashi University of Technology, Universitas Brawijaya
Yoshiteru ISHIDA
  Toyohashi University of Technology

Keyword