In the video-text retrieval task, the mainstream framework consists of three parts: a video encoder, a text encoder, and similarity calculation. MMT (Multi-modal Transformer) achieves remarkable performance on this task; however, it suffers from insufficient training data. In this paper, an efficient multimodal aggregation network for video-text retrieval is proposed. Unlike prior work that uses MMT to fuse video features, NetVLAD is introduced in the proposed network. It has fewer parameters and can be trained effectively on small datasets. In addition, since CLIP (Contrastive Language-Image Pre-training) can be regarded as learning a language model from visual supervision, it is adopted as the text encoder in the proposed network to avoid overfitting. Moreover, to make full use of the pre-trained model, a two-step training scheme is designed. Experiments show that the proposed model achieves competitive results compared with the latest work.
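As a rough illustration of the pipeline the abstract describes, the PyTorch sketch below aggregates per-frame video features with a NetVLAD layer, projects the result into a joint embedding space, and scores it against caption embeddings (e.g. produced by CLIP's text encoder) with cosine similarity. The cluster count, feature dimensions, and the projection head are assumptions made for this example only; they are not taken from the paper.

```python
# Minimal sketch (not the authors' code): NetVLAD aggregation of frame features
# plus cosine-similarity scoring against text embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NetVLAD(nn.Module):
    """Aggregates a variable number of frame features into one fixed-size vector."""

    def __init__(self, num_clusters: int = 32, dim: int = 512):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)                # soft-assignment weights
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) frame-level features from a video encoder
        soft_assign = F.softmax(self.assign(x), dim=-1)           # (B, N, K)
        residual = x.unsqueeze(2) - self.centroids                # (B, N, K, D)
        vlad = (soft_assign.unsqueeze(-1) * residual).sum(dim=1)  # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)                          # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)               # (B, K*D)


class RetrievalHead(nn.Module):
    """Hypothetical head: aggregate video frames, project, and score against text."""

    def __init__(self, num_clusters: int = 32, dim: int = 512, embed_dim: int = 512):
        super().__init__()
        self.vlad = NetVLAD(num_clusters, dim)
        self.proj = nn.Linear(num_clusters * dim, embed_dim)      # assumed projection head

    def forward(self, frame_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, N, dim) pre-extracted frame features
        # text_emb:    (B, embed_dim) caption embeddings, e.g. from CLIP's text encoder
        v = F.normalize(self.proj(self.vlad(frame_feats)), dim=-1)
        t = F.normalize(text_emb, dim=-1)
        return t @ v.t()                                          # (B, B) similarity matrix


# Example usage with random tensors standing in for real features
sims = RetrievalHead()(torch.randn(4, 30, 512), torch.randn(4, 512))
```

In such a setup, the two-step training mentioned in the abstract could correspond to first training the aggregation and projection layers with the text encoder frozen, then fine-tuning jointly; this reading is an assumption, as the abstract does not spell out the steps.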
Zhi LIU
North China University of Technology
Fangyuan ZHAO
North China University of Technology
Mengmeng ZHANG
North China University of Technology, Beijing Polytechnic College
Zhi LIU, Fangyuan ZHAO, Mengmeng ZHANG, "An Efficient Multimodal Aggregation Network for Video-Text Retrieval" in IEICE TRANSACTIONS on Information,
vol. E105-D, no. 10, pp. 1825-1828, October 2022, doi: 10.1587/transinf.2022EDL8018.
Abstract: In the video-text retrieval task, the mainstream framework consists of three parts: a video encoder, a text encoder, and similarity calculation. MMT (Multi-modal Transformer) achieves remarkable performance on this task; however, it suffers from insufficient training data. In this paper, an efficient multimodal aggregation network for video-text retrieval is proposed. Unlike prior work that uses MMT to fuse video features, NetVLAD is introduced in the proposed network. It has fewer parameters and can be trained effectively on small datasets. In addition, since CLIP (Contrastive Language-Image Pre-training) can be regarded as learning a language model from visual supervision, it is adopted as the text encoder in the proposed network to avoid overfitting. Moreover, to make full use of the pre-trained model, a two-step training scheme is designed. Experiments show that the proposed model achieves competitive results compared with the latest work.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2022EDL8018/_p
@ARTICLE{e105-d_10_1825,
author={Zhi LIU and Fangyuan ZHAO and Mengmeng ZHANG},
journal={IEICE TRANSACTIONS on Information},
title={An Efficient Multimodal Aggregation Network for Video-Text Retrieval},
year={2022},
volume={E105-D},
number={10},
pages={1825-1828},
abstract={In the video-text retrieval task, the mainstream framework consists of three parts: a video encoder, a text encoder, and similarity calculation. MMT (Multi-modal Transformer) achieves remarkable performance on this task; however, it suffers from insufficient training data. In this paper, an efficient multimodal aggregation network for video-text retrieval is proposed. Unlike prior work that uses MMT to fuse video features, NetVLAD is introduced in the proposed network. It has fewer parameters and can be trained effectively on small datasets. In addition, since CLIP (Contrastive Language-Image Pre-training) can be regarded as learning a language model from visual supervision, it is adopted as the text encoder in the proposed network to avoid overfitting. Moreover, to make full use of the pre-trained model, a two-step training scheme is designed. Experiments show that the proposed model achieves competitive results compared with the latest work.},
keywords={},
doi={10.1587/transinf.2022EDL8018},
ISSN={1745-1361},
month={October},}
TY - JOUR
TI - An Efficient Multimodal Aggregation Network for Video-Text Retrieval
T2 - IEICE TRANSACTIONS on Information
SP - 1825
EP - 1828
AU - Zhi LIU
AU - Fangyuan ZHAO
AU - Mengmeng ZHANG
PY - 2022
DO - 10.1587/transinf.2022EDL8018
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E105-D
IS - 10
JA - IEICE TRANSACTIONS on Information
Y1 - October 2022
AB - In the video-text retrieval task, the mainstream framework consists of three parts: a video encoder, a text encoder, and similarity calculation. MMT (Multi-modal Transformer) achieves remarkable performance on this task; however, it suffers from insufficient training data. In this paper, an efficient multimodal aggregation network for video-text retrieval is proposed. Unlike prior work that uses MMT to fuse video features, NetVLAD is introduced in the proposed network. It has fewer parameters and can be trained effectively on small datasets. In addition, since CLIP (Contrastive Language-Image Pre-training) can be regarded as learning a language model from visual supervision, it is adopted as the text encoder in the proposed network to avoid overfitting. Moreover, to make full use of the pre-trained model, a two-step training scheme is designed. Experiments show that the proposed model achieves competitive results compared with the latest work.
ER -