IEICE TRANSACTIONS on Information

Target-Oriented Deformation of Visual-Semantic Embedding Space

Takashi MATSUBARA

Summary :

Multimodal embedding is a crucial research topic for cross-modal understanding, data mining, and translation. Many studies have attempted to extract representations from given entities and align them in a shared embedding space. However, because entities in different modalities exhibit different abstraction levels and carry modality-specific information, simply embedding related entities close to each other is insufficient. In this study, we propose the Target-Oriented Deformation Network (TOD-Net), a novel module that continuously deforms the embedding space into a new space under a given condition, thereby providing conditional similarities between entities. Unlike methods based on cross-modal attention applied to words and cropped images, TOD-Net is a post-process applied to the embedding space learned by existing embedding systems, and it improves their retrieval performance. In particular, when combined with cutting-edge models, TOD-Net yields state-of-the-art image-caption retrieval results on the MS COCO and Flickr30k datasets. Qualitative analysis reveals that TOD-Net successfully emphasizes entity-specific concepts and retrieves diverse targets by handling higher levels of diversity than existing models.
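
To make the idea of a conditional similarity concrete, the following is a minimal sketch and not the TOD-Net architecture itself. It assumes that the condition is the embedding of the candidate target, and it uses a hypothetical residual scale-and-shift as the deformation. The names `ConditionalDeformation`, `deform`, and `conditional_similarity` and the randomly initialized weights are illustrative assumptions; a real module would be trained on top of a pretrained embedding system.

```python
import math
import random


def l2_normalize(v):
    """Return v scaled to unit length (unchanged if it is the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(u, v):
    """Plain (unconditional) cosine similarity in the shared embedding space."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class ConditionalDeformation:
    """Hypothetical stand-in for a deformation module (an assumption, not the paper's design).

    The embedding x is moved by a residual scale-and-shift whose parameters
    are computed from the condition vector.
    """

    def __init__(self, dim, seed=0, scale=0.1):
        rng = random.Random(seed)
        # Untrained random weights, used here only for illustration.
        self.W_gamma = [[rng.gauss(0, scale) for _ in range(dim)] for _ in range(dim)]
        self.W_beta = [[rng.gauss(0, scale) for _ in range(dim)] for _ in range(dim)]

    def deform(self, x, condition):
        gamma = matvec(self.W_gamma, condition)
        beta = matvec(self.W_beta, condition)
        # Residual form: with zero weights the embedding is left unchanged.
        return [xi + math.tanh(g) * xi + b for xi, g, b in zip(x, gamma, beta)]


def conditional_similarity(module, query, candidate):
    """Deform the query under the condition given by the candidate, then compare."""
    return cosine(module.deform(query, candidate), candidate)
```

Because the deformation is applied to embeddings that already exist, a module of this kind can be attached to an embedding system as a post-process; the residual form means the similarity reduces to plain cosine similarity when the weights are zero.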

Publication
IEICE TRANSACTIONS on Information Vol.E104-D No.1 pp.24-33
Publication Date
2021/01/01
Publicized
2020/09/24
Online ISSN
1745-1361
DOI
10.1587/transinf.2020MUP0003
Type of Manuscript
Special Section PAPER (Special Section on Enriched Multimedia — Multimedia Security and Forensics —)
Category

Authors

Takashi MATSUBARA
  Kobe University

Keyword