IEICE TRANSACTIONS on Information

An Efficient Multimodal Aggregation Network for Video-Text Retrieval

Zhi LIU, Fangyuan ZHAO, Mengmeng ZHANG


Summary

In the video-text retrieval task, the mainstream framework consists of three parts: a video encoder, a text encoder, and similarity calculation. MMT (Multi-modal Transformer) achieves remarkable performance on this task; however, it faces the problem of insufficient training data. In this paper, an efficient multimodal aggregation network for video-text retrieval is proposed. Unlike prior work that uses MMT to fuse video features, the proposed network introduces NetVLAD, which has fewer parameters and is feasible to train on small datasets. In addition, since CLIP (Contrastive Language-Image Pre-training) can be viewed as learning a language model from visual supervision, it is adopted as the text encoder in the proposed network to avoid overfitting. Meanwhile, to make full use of the pre-trained model, a two-step training scheme is designed. Experiments show that the proposed model achieves competitive results compared with the latest work.
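The NetVLAD aggregation mentioned in the summary can be illustrated with a minimal sketch: local (e.g. frame-level) features are softly assigned to a set of learnable cluster centers, their residuals are accumulated per cluster, and the result is normalized into a single fixed-length descriptor. The function below is a simplified NumPy illustration, not the authors' implementation; the feature dimensions, number of clusters, and the softness parameter `alpha` are hypothetical.

```python
import numpy as np

def netvlad(features, centers, alpha=10.0):
    """Minimal NetVLAD aggregation sketch.

    features: (N, D) array of N local descriptors of dimension D
    centers:  (K, D) array of K cluster centers (learnable in practice)
    alpha:    softness of the cluster assignment (hypothetical value)
    Returns an L2-normalized descriptor of shape (K*D,).
    """
    # Soft assignment: softmax over negative squared distances to centers
    dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K)
    logits = -alpha * dists
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)            # (N, K)

    # Residual aggregation: V[k] = sum_i a[i, k] * (x_i - c_k)
    V = (a[:, :, None] * (features[:, None, :] - centers[None, :, :])).sum(0)  # (K, D)

    # Intra-normalization per cluster, then global L2 normalization
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    v = V.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(0)
feats = rng.standard_normal((20, 8))  # e.g. 20 frame features, dim 8
cents = rng.standard_normal((4, 8))   # 4 cluster centers
desc = netvlad(feats, cents)
print(desc.shape)  # (32,)
```

Because the output size depends only on the number of clusters and the feature dimension (K*D), not on the number of input frames, this aggregation yields a compact video representation with far fewer parameters than a multi-layer transformer fusion module.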

Publication
IEICE TRANSACTIONS on Information Vol.E105-D No.10 pp.1825-1828
Publication Date
2022/10/01
Publicized
2022/06/27
Online ISSN
1745-1361
DOI
10.1587/transinf.2022EDL8018
Type of Manuscript
LETTER
Category
Image Processing and Video Processing

Authors

Zhi LIU
  North China University of Technology
Fangyuan ZHAO
  North China University of Technology
Mengmeng ZHANG
  North China University of Technology, Beijing Polytechnic College
