Target-Oriented Deformation of Visual-Semantic Embedding Space

Takashi MATSUBARA

doi:10.1587/transinf.2020MUP0003

IEICE TRANSACTIONS on Information

Summary:

Multimodal embedding is a crucial research topic for cross-modal understanding, data mining, and translation. Many studies have attempted to extract representations from given entities and align them in a shared embedding space. However, because entities in different modalities exhibit different abstraction levels and modality-specific information, simply embedding related entities close to each other is insufficient. In this study, we propose the Target-Oriented Deformation Network (TOD-Net), a novel module that continuously deforms the embedding space into a new space under a given condition, thereby providing conditional similarities between entities. Unlike methods based on cross-modal attention applied to words and cropped images, TOD-Net is a post-processing module applied to the embedding space learned by existing embedding systems, and it improves their retrieval performance. In particular, when combined with cutting-edge models, TOD-Net yields a state-of-the-art image-caption retrieval model on the MS COCO and Flickr30k datasets. Qualitative analysis reveals that TOD-Net successfully emphasizes entity-specific concepts and retrieves diverse targets by handling higher levels of diversity than existing models.
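
The idea of a conditional similarity can be illustrated with a toy sketch. The summary does not describe the TOD-Net architecture, so nothing below is the author's method: the function names (`deform`, `conditional_similarity`), the gating rule, and the `strength` parameter are all assumptions chosen only to show how deforming a shared embedding space under a condition changes the similarity between two fixed embeddings.

```python
import math


def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def deform(x, condition, strength=1.0):
    """Hypothetical deformation (an assumption, not TOD-Net itself):
    rescale each dimension of x by a gate that grows with the
    magnitude of the condition in that dimension."""
    return [(1.0 + strength * abs(c)) * a for a, c in zip(x, condition)]


def conditional_similarity(query, target, strength=1.0):
    """Cosine similarity measured after deforming both embeddings
    under the same condition (here, the target embedding itself)."""
    return cosine(deform(query, target, strength),
                  deform(target, target, strength))


# Made-up 3-dimensional embeddings from some pretrained embedding system.
caption = [0.8, 0.1, 0.6]
image = [0.9, 0.4, 0.1]

plain = cosine(caption, image)
conditioned = conditional_similarity(caption, image, strength=2.0)
print(f"plain={plain:.3f} conditioned={conditioned:.3f}")
```

With `strength=0.0` the deformation is the identity and the score reduces to ordinary cosine similarity, which mirrors the post-processing role described in the summary: the underlying embeddings are left untouched and only the space in which they are compared changes.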

Publication: IEICE TRANSACTIONS on Information Vol.E104-D No.1 pp.24-33

Publication Date: 2021/01/01

Publicized: 2020/09/24

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2020MUP0003

Type of Manuscript: Special Section PAPER (Special Section on Enriched Multimedia — Multimedia Security and Forensics —)

Takashi MATSUBARA, "Target-Oriented Deformation of Visual-Semantic Embedding Space" in IEICE TRANSACTIONS on Information, vol. E104-D, no. 1, pp. 24-33, January 2021, doi: 10.1587/transinf.2020MUP0003.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2020MUP0003/_p

@ARTICLE{e104-d_1_24,
author={Takashi MATSUBARA},
journal={IEICE TRANSACTIONS on Information},
title={Target-Oriented Deformation of Visual-Semantic Embedding Space},
year={2021},
volume={E104-D},
number={1},
pages={24-33},
abstract={Multimodal embedding is a crucial research topic for cross-modal understanding, data mining, and translation. Many studies have attempted to extract representations from given entities and align them in a shared embedding space. However, because entities in different modalities exhibit different abstraction levels and modality-specific information, it is insufficient to embed related entities close to each other. In this study, we propose the Target-Oriented Deformation Network (TOD-Net), a novel module that continuously deforms the embedding space into a new space under a given condition, thereby providing conditional similarities between entities. Unlike methods based on cross-modal attention applied to words and cropped images, TOD-Net is a post-process applied to the embedding space learned by existing embedding systems and improves their performances of retrieval. In particular, when combined with cutting-edge models, TOD-Net gains the state-of-the-art image-caption retrieval model associated with the MS COCO and Flickr30k datasets. Qualitative analysis reveals that TOD-Net successfully emphasizes entity-specific concepts and retrieves diverse targets via handling higher levels of diversity than existing models.},
keywords={},
doi={10.1587/transinf.2020MUP0003},
ISSN={1745-1361},
month={January},
}

TY  - JOUR
TI  - Target-Oriented Deformation of Visual-Semantic Embedding Space
T2  - IEICE TRANSACTIONS on Information
SP  - 24
EP  - 33
AU  - Takashi MATSUBARA
PY  - 2021
DO  - 10.1587/transinf.2020MUP0003
JO  - IEICE TRANSACTIONS on Information
SN  - 1745-1361
VL  - E104-D
IS  - 1
JA  - IEICE TRANSACTIONS on Information
Y1  - January 2021
AB  - Multimodal embedding is a crucial research topic for cross-modal understanding, data mining, and translation. Many studies have attempted to extract representations from given entities and align them in a shared embedding space. However, because entities in different modalities exhibit different abstraction levels and modality-specific information, it is insufficient to embed related entities close to each other. In this study, we propose the Target-Oriented Deformation Network (TOD-Net), a novel module that continuously deforms the embedding space into a new space under a given condition, thereby providing conditional similarities between entities. Unlike methods based on cross-modal attention applied to words and cropped images, TOD-Net is a post-process applied to the embedding space learned by existing embedding systems and improves their performances of retrieval. In particular, when combined with cutting-edge models, TOD-Net gains the state-of-the-art image-caption retrieval model associated with the MS COCO and Flickr30k datasets. Qualitative analysis reveals that TOD-Net successfully emphasizes entity-specific concepts and retrieves diverse targets via handling higher levels of diversity than existing models.
ER  - 
