Keyword Search Result

[Keyword] VLAD (3 hits)

1-3 of 3 hits
  • An Efficient Multimodal Aggregation Network for Video-Text Retrieval

    Zhi LIU  Fangyuan ZHAO  Mengmeng ZHANG  

     
    LETTER-Image Processing and Video Processing
    Publicized: 2022/06/27
    Vol: E105-D No:10
    Page(s): 1825-1828

    In the video-text retrieval task, the mainstream framework consists of three parts: a video encoder, a text encoder, and similarity calculation. MMT (Multi-modal Transformer) achieves remarkable performance on this task; however, it suffers when training data are insufficient. In this paper, an efficient multimodal aggregation network for video-text retrieval is proposed. Different from prior work that uses MMT to fuse video features, the proposed network introduces NetVLAD, which has fewer parameters and can feasibly be trained on small datasets. In addition, since CLIP (Contrastive Language-Image Pre-training) can be viewed as learning a language model from visual supervision, it is adopted as the text encoder to avoid overfitting. To make full use of the pre-trained models, a two-step training scheme is designed. Experiments show that the proposed model achieves competitive results compared with the latest work.
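    A minimal PyTorch sketch of the kind of NetVLAD soft-assignment aggregation this letter builds on is given below. It is a generic illustration rather than the authors' implementation; the class and parameter names (NetVLAD, num_clusters, dim) and the per-frame feature shapes are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NetVLAD(nn.Module):
        """Soft-assignment VLAD aggregation over a set of feature vectors."""
        def __init__(self, num_clusters=32, dim=512):
            super().__init__()
            self.centroids = nn.Parameter(torch.randn(num_clusters, dim))
            self.assign = nn.Linear(dim, num_clusters)  # soft-assignment logits

        def forward(self, x):
            # x: (batch, num_features, dim), e.g. per-frame video features
            soft_assign = F.softmax(self.assign(x), dim=-1)            # (B, N, K)
            residuals = x.unsqueeze(2) - self.centroids                # (B, N, K, D)
            vlad = (soft_assign.unsqueeze(-1) * residuals).sum(dim=1)  # (B, K, D)
            vlad = F.normalize(vlad, p=2, dim=-1)                      # intra-normalization
            return F.normalize(vlad.flatten(1), p=2, dim=-1)           # (B, K*D)

    # Aggregate 30 frame-level features of dimension 512 into one video vector.
    video_feats = torch.randn(4, 30, 512)
    print(NetVLAD()(video_feats).shape)  # torch.Size([4, 16384])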

  • Spatio-Temporal Self-Attention Weighted VLAD Neural Network for Action Recognition

    Shilei CHENG  Mei XIE  Zheng MA  Siqi LI  Song GU  Feng YANG  

     
    LETTER-Biocybernetics, Neurocomputing
    Publicized: 2020/10/01
    Vol: E104-D No:1
    Page(s): 220-224

    Characterizing videos simultaneously from spatial and temporal cues has been shown to be crucial for video processing, yet soft assignment discards temporal information, which makes the vector of locally aggregated descriptors (VLAD) a suboptimal framework for learning spatio-temporal video representations. Motivated by the success of attention mechanisms in natural language processing, we present a novel model in which VLAD follows spatio-temporal self-attention operations, named spatio-temporal self-attention weighted VLAD (ST-SAWVLAD). In particular, sequential convolutional feature maps extracted from two modalities, i.e., RGB and optical flow, are fed into the self-attention module to learn soft spatio-temporal assignment parameters, which enables aggregating not only detailed spatial information but also fine motion information from successive video frames. In experiments on competitive action recognition datasets, UCF101 and HMDB51, the results show outstanding performance. The source code is available at: https://github.com/badstones/st-sawvlad.
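    The pipeline described in this abstract, self-attention over sequential convolutional feature maps followed by VLAD-style aggregation, could be sketched roughly as below. The module names, shapes, and hyperparameters are illustrative assumptions only; the authors' actual implementation is in the repository linked above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class STSelfAttentionVLAD(nn.Module):
        """Self-attention over spatio-temporal features, then VLAD-style aggregation."""
        def __init__(self, dim=512, num_clusters=64, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.assign = nn.Linear(dim, num_clusters)
            self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

        def forward(self, feats):
            # feats: (B, T*H*W, D) - conv feature maps flattened over time and space
            attended, _ = self.attn(feats, feats, feats)          # spatio-temporal self-attention
            assign = F.softmax(self.assign(attended), dim=-1)     # attention-informed soft assignment
            residuals = feats.unsqueeze(2) - self.centroids       # (B, N, K, D)
            vlad = (assign.unsqueeze(-1) * residuals).sum(dim=1)  # (B, K, D)
            return F.normalize(vlad.flatten(1), p=2, dim=-1)

    # One stream (e.g. RGB): 8 frames of 7x7 feature maps with 512 channels.
    rgb = torch.randn(2, 8 * 7 * 7, 512)
    print(STSelfAttentionVLAD()(rgb).shape)  # torch.Size([2, 32768])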

  • Local Multi-Coordinate System for Object Retrieval

    Go IRIE  Yukito WATANABE  Takayuki KUROZUMI  Tetsuya KINEBUCHI  

     
    LETTER-Image Processing and Video Processing
    Publicized: 2016/07/06
    Vol: E99-D No:10
    Page(s): 2656-2660

    Encoding multiple SIFT descriptors into a single vector is a key technique for efficient object image retrieval. In this paper, we propose an extension of the local coordinate system (LCS) for image representation. Previous LCS approaches encode each SIFT descriptor with a single local coordinate, which is not adequate for localizing its position in the descriptor space. Instead, we use multiple local coordinates with PCA-based decorrelation to represent each descriptor. Experiments show that this simple modification can significantly improve retrieval performance.
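    A minimal NumPy sketch of the general idea, assigning each SIFT descriptor to several nearest visual words and rotating each residual by that word's PCA basis before aggregation, is shown below. The function name, the choice of k, and the identity rotations in the example are assumptions for illustration, not the authors' exact formulation.

    import numpy as np

    def encode_multi_lcs(descriptors, centers, rotations, k=2):
        """Aggregate SIFT descriptors using their k nearest local coordinates.

        descriptors: (N, 128) SIFT descriptors
        centers:     (C, 128) visual-word centers
        rotations:   (C, 128, 128) per-word PCA bases used for decorrelation
        """
        C, D = centers.shape
        enc = np.zeros((C, D), dtype=np.float32)
        dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=-1)
        for x, d in zip(descriptors, dists):
            for c in np.argsort(d)[:k]:                    # k nearest local coordinates
                enc[c] += rotations[c] @ (x - centers[c])  # decorrelated residual
        enc /= np.linalg.norm(enc) + 1e-12                 # global L2 normalization
        return enc.ravel()

    # Random stand-ins for descriptors, a learned codebook, and per-word PCA bases
    # (identity rotations here, just to keep the example self-contained).
    sift = np.random.rand(200, 128).astype(np.float32)
    codebook = np.random.rand(16, 128).astype(np.float32)
    pca = np.stack([np.eye(128, dtype=np.float32)] * 16)
    print(encode_multi_lcs(sift, codebook, pca).shape)  # (2048,)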