In the video-text retrieval task, the mainstream framework consists of three parts: a video encoder, a text encoder, and similarity calculation. MMT (Multi-modal Transformer) achieves remarkable performance on this task; however, it suffers from insufficient training data. In this paper, an efficient multimodal aggregation network for video-text retrieval is proposed. Unlike prior work that uses MMT to fuse video features, NetVLAD is introduced in the proposed network. It has fewer parameters and can be trained effectively on small datasets. In addition, since CLIP (Contrastive Language-Image Pre-training) can be regarded as learning a language model from visual supervision, it is adopted as the text encoder in the proposed network to avoid overfitting. Moreover, to make full use of the pre-trained model, a two-step training scheme is designed. Experiments show that the proposed model achieves competitive results compared with the latest work.
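As a rough illustration of the pipeline the abstract describes, the PyTorch sketch below aggregates per-frame video features with a NetVLAD layer, projects the result into a joint embedding space, and scores it against caption embeddings (e.g. produced by CLIP's text encoder) with cosine similarity. The cluster count, feature dimensions, and the projection head are assumptions made for this example only; they are not taken from the paper.

```python
# Minimal sketch (not the authors' code): NetVLAD aggregation of frame features
# plus cosine-similarity scoring against text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NetVLAD(nn.Module):
    """Aggregates a variable number of frame features into one fixed-size vector."""

    def __init__(self, num_clusters: int = 32, dim: int = 512):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)                # soft-assignment weights
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) frame-level features from a video encoder
        soft_assign = F.softmax(self.assign(x), dim=-1)           # (B, N, K)
        residual = x.unsqueeze(2) - self.centroids                # (B, N, K, D)
        vlad = (soft_assign.unsqueeze(-1) * residual).sum(dim=1)  # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)                          # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)               # (B, K*D)


class RetrievalHead(nn.Module):
    """Hypothetical head: aggregate video frames, project, and score against text."""

    def __init__(self, num_clusters: int = 32, dim: int = 512, embed_dim: int = 512):
        super().__init__()
        self.vlad = NetVLAD(num_clusters, dim)
        self.proj = nn.Linear(num_clusters * dim, embed_dim)      # assumed projection head

    def forward(self, frame_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, N, dim) pre-extracted frame features
        # text_emb:    (B, embed_dim) caption embeddings, e.g. from CLIP's text encoder
        v = F.normalize(self.proj(self.vlad(frame_feats)), dim=-1)
        t = F.normalize(text_emb, dim=-1)
        return t @ v.t()                                          # (B, B) similarity matrix


# Example usage with random tensors standing in for real features
sims = RetrievalHead()(torch.randn(4, 30, 512), torch.randn(4, 512))
```

In such a setup, the two-step training mentioned in the abstract could correspond to first training the aggregation and projection layers with the text encoder frozen, then fine-tuning jointly; this reading is an assumption, as the abstract does not spell out the steps.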
Zhi LIU
North China University of Technology
Fangyuan ZHAO
North China University of Technology
Mengmeng ZHANG
North China University of Technology, Beijing Polytechnic College
Zhi LIU, Fangyuan ZHAO, Mengmeng ZHANG, "An Efficient Multimodal Aggregation Network for Video-Text Retrieval" in IEICE TRANSACTIONS on Information,
vol. E105-D, no. 10, pp. 1825-1828, October 2022, doi: 10.1587/transinf.2022EDL8018.
Abstract: In the video-text retrieval task, the mainstream framework consists of three parts: a video encoder, a text encoder, and similarity calculation. MMT (Multi-modal Transformer) achieves remarkable performance on this task; however, it suffers from insufficient training data. In this paper, an efficient multimodal aggregation network for video-text retrieval is proposed. Unlike prior work that uses MMT to fuse video features, NetVLAD is introduced in the proposed network. It has fewer parameters and can be trained effectively on small datasets. In addition, since CLIP (Contrastive Language-Image Pre-training) can be regarded as learning a language model from visual supervision, it is adopted as the text encoder in the proposed network to avoid overfitting. Moreover, to make full use of the pre-trained model, a two-step training scheme is designed. Experiments show that the proposed model achieves competitive results compared with the latest work.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2022EDL8018/_p
@ARTICLE{e105-d_10_1825,
author={Zhi LIU and Fangyuan ZHAO and Mengmeng ZHANG},
journal={IEICE TRANSACTIONS on Information},
title={An Efficient Multimodal Aggregation Network for Video-Text Retrieval},
year={2022},
volume={E105-D},
number={10},
pages={1825-1828},
abstract={In the video-text retrieval task, the mainstream framework consists of three parts: a video encoder, a text encoder, and similarity calculation. MMT (Multi-modal Transformer) achieves remarkable performance on this task; however, it suffers from insufficient training data. In this paper, an efficient multimodal aggregation network for video-text retrieval is proposed. Unlike prior work that uses MMT to fuse video features, NetVLAD is introduced in the proposed network. It has fewer parameters and can be trained effectively on small datasets. In addition, since CLIP (Contrastive Language-Image Pre-training) can be regarded as learning a language model from visual supervision, it is adopted as the text encoder in the proposed network to avoid overfitting. Moreover, to make full use of the pre-trained model, a two-step training scheme is designed. Experiments show that the proposed model achieves competitive results compared with the latest work.},
keywords={},
doi={10.1587/transinf.2022EDL8018},
ISSN={1745-1361},
month={October},}
TY - JOUR
TI - An Efficient Multimodal Aggregation Network for Video-Text Retrieval
T2 - IEICE TRANSACTIONS on Information
SP - 1825
EP - 1828
AU - Zhi LIU
AU - Fangyuan ZHAO
AU - Mengmeng ZHANG
PY - 2022
DO - 10.1587/transinf.2022EDL8018
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E105-D
IS - 10
JA - IEICE TRANSACTIONS on Information
Y1 - October 2022
AB - In the video-text retrieval task, the mainstream framework consists of three parts: a video encoder, a text encoder, and similarity calculation. MMT (Multi-modal Transformer) achieves remarkable performance on this task; however, it suffers from insufficient training data. In this paper, an efficient multimodal aggregation network for video-text retrieval is proposed. Unlike prior work that uses MMT to fuse video features, NetVLAD is introduced in the proposed network. It has fewer parameters and can be trained effectively on small datasets. In addition, since CLIP (Contrastive Language-Image Pre-training) can be regarded as learning a language model from visual supervision, it is adopted as the text encoder in the proposed network to avoid overfitting. Moreover, to make full use of the pre-trained model, a two-step training scheme is designed. Experiments show that the proposed model achieves competitive results compared with the latest work.
ER -