IEICE global.ieice.org Site

The search functionality is under construction.

The search functionality is under construction.

Author Search Result

[Author] Fangyuan ZHAO(1hit)

1-1hit

An Efficient Multimodal Aggregation Network for Video-Text Retrieval
Zhi LIU Fangyuan ZHAO Mengmeng ZHANG

LETTER-Image Processing and Video Processing

Pubricized:
2022/06/27
Vol:
E105-D No:10
Page(s):
1825-1828
In video-text retrieval task, mainstream framework consists of three parts: video encoder, text encoder and similarity calculation. MMT (Multi-modal Transformer) achieves remarkable performance for this task, however, it faces the problem of insufficient training dataset. In this paper, an efficient multimodal aggregation network for video-text retrieval is proposed. Different from the prior work using MMT to fuse video features, the NetVLAD is introduced in the proposed network. It has fewer parameters and is feasible for training with small datasets. In addition, since the function of CLIP (Contrastive Language-Image Pre-training) can be considered as learning language models from visual supervision, it is introduced as text encoder in the proposed network to avoid overfitting. Meanwhile, in order to make full use of the pre-training model, a two-step training scheme is designed. Experiments show that the proposed model achieves competitive results compared with the latest work.

Latest Issue

English

Links

Call for Papers

Call for Papers

Special Section

Submit to IEICE Trans.

Submit to IEICE Trans.

Information for Authors

Transactions NEWS

Transactions NEWS

Popular articles

Popular articles

Top 10 Downloads