
Keyword Search Result

[Keyword] video representation (4 hits)

  • Spatio-Temporal Self-Attention Weighted VLAD Neural Network for Action Recognition

    Shilei CHENG  Mei XIE  Zheng MA  Siqi LI  Song GU  Feng YANG  

     
    LETTER-Biocybernetics, Neurocomputing

  Publicized:
    2020/10/01
      Vol:
    E104-D No:1
      Page(s):
    220-224

    As characterizing videos simultaneously from spatial and temporal cues has been shown to be crucial for video processing, and as soft assignment lacks temporal information, the vector of locally aggregated descriptors (VLAD) should be considered a suboptimal framework for learning spatio-temporal video representations. Motivated by the development of attention mechanisms in natural language processing, in this work we present a novel model that combines VLAD with spatio-temporal self-attention operations, named spatio-temporal self-attention weighted VLAD (ST-SAWVLAD). In particular, sequential convolutional feature maps extracted from two modalities, i.e., RGB and Flow, are respectively fed into the self-attention module to learn soft spatio-temporal assignment parameters, which enables aggregating not only detailed spatial information but also fine motion information from successive video frames. In experiments, we evaluate ST-SAWVLAD on competitive action recognition datasets, UCF101 and HMDB51; the results show outstanding performance. The source code is available at: https://github.com/badstones/st-sawvlad.
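
The abstract builds on soft-assignment VLAD aggregation. As a rough sketch of that base operation only (not the authors' ST-SAWVLAD model; the function name and the `alpha` temperature are illustrative assumptions), assuming `N` local descriptors of dimension `D` and `K` cluster centers:

```python
import numpy as np

def soft_assign_vlad(descriptors, centers, alpha=1.0):
    """Soft-assignment VLAD: aggregate residuals of local descriptors
    against cluster centers, weighted by a softmax over distances.

    descriptors: (N, D) local features; centers: (K, D) cluster centers.
    Returns a flattened (K*D,) vector, intra-normalized then L2-normalized.
    """
    # Residuals of every descriptor against every center: (N, K, D)
    residuals = descriptors[:, None, :] - centers[None, :, :]
    # Squared distances between descriptors and centers: (N, K)
    d2 = (residuals ** 2).sum(axis=-1)
    # Soft assignment weights: softmax over centers, sharper for larger alpha
    w = np.exp(-alpha * d2)
    w /= w.sum(axis=1, keepdims=True)
    # Weighted residual aggregation per center: (K, D)
    vlad = np.einsum('nk,nkd->kd', w, residuals)
    # Intra-normalization per center, then global L2 normalization
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    vlad = vlad.flatten()
    return vlad / (np.linalg.norm(vlad) + 1e-12)
```

In ST-SAWVLAD as described above, the assignment weights would come from the self-attention module rather than from descriptor-to-center distances alone.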

  • Video Retrieval System for Bridging the Semantic Gap

    Min Young JUNG  Sung Han PARK  

     
    LETTER-Database

      Vol:
    E92-D No:12
      Page(s):
    2516-2519

    We propose a video ontology system to overcome the semantic gap in video retrieval. The proposed video ontology is aimed at bridging the gap between the semantic nature of user queries and raw video content. In addition, semantic retrieval returns not only the concept named by the topic keyword but also its sub-concepts via semantic query expansion; through this process, our method is likely to achieve high recall. Experiments comparing the proposed scene-based indexing with keyframe-based indexing demonstrate better results on several kinds of videos.
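
As a toy illustration of the semantic query expansion described above (the ontology contents and function name are hypothetical, not the authors' actual ontology), a topic keyword is expanded with its sub-concepts before retrieval:

```python
# Hypothetical toy ontology: topic keyword -> sub-concepts (illustrative only)
ontology = {
    "vehicle": ["car", "bus", "truck"],
    "animal": ["dog", "cat", "bird"],
}

def expand_query(keyword, ontology):
    """Expand a topic keyword with its sub-concepts so retrieval also
    matches videos annotated at the finer level of the hierarchy."""
    return [keyword] + ontology.get(keyword, [])

print(expand_query("vehicle", ontology))  # ['vehicle', 'car', 'bus', 'truck']
```

Matching videos annotated with any term in the expanded list is what raises recall relative to exact-keyword retrieval.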

  • Representation of Dynamic 3D Objects Using the Coaxial Camera System

    Takayuki YASUNO  Jun'ichi ICHIMURA  Yasuhiko YASUDA  

     
    PAPER

      Vol:
    E79-B No:10
      Page(s):
    1484-1490

    3D model-based coding, which requires 3D reconstruction techniques, has been proposed for next-generation image coding. A method is presented that reconstructs the 3D shapes of dynamic objects from image sequences captured using two cameras, thus avoiding the stereo correspondence problem. A coaxial camera system consisting of one moving and one static camera was developed. The optical axes of both cameras are precisely adjusted to the same orientation using an optical system with true and half mirrors. The moving camera travels along a straight horizontal line. This method can reconstruct the 3D shapes of static as well as dynamic objects using motion vectors calculated from the moving camera images and revised using the static camera image. The method was tested successfully on real images by reconstructing a moving human shape.
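
The coaxial setup recovers shape from motion vectors induced by the laterally translating camera. As a simplified sketch of the underlying motion-parallax relation only (not the authors' full method with the static-camera revision step; the function name and numbers are illustrative), a camera translating laterally by baseline b shifts the image of a static point at depth Z by roughly u = f*b/Z pixels, so Z = f*b/u:

```python
def depth_from_lateral_motion(motion_px, baseline_m, focal_px):
    """Estimate the depth of a static point from its image motion when the
    camera translates laterally by `baseline_m` between two frames.

    motion_px: horizontal motion-vector magnitude in pixels (> 0).
    Returns depth along the optical axis in meters: Z = f * b / u.
    """
    if motion_px <= 0:
        raise ValueError("motion must be positive for a translating camera")
    return focal_px * baseline_m / motion_px

# e.g. a 10-pixel shift, a 0.05 m camera step, and an 800-pixel focal
# length correspond to a point about 4 m away.
print(depth_from_lateral_motion(10.0, 0.05, 800.0))  # 4.0
```

For dynamic objects this relation alone is ambiguous, which is why the method above revises the moving-camera motion vectors using the static camera image.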

  • Object Surface Representation Using Occlusion Analysis of Spatiotemporal Images*

    Takayuki YASUNO  Satoshi SUZUKI  Yasuhiko YASUDA  

     
    PAPER

      Vol:
    E79-D No:6
      Page(s):
    764-771

    Three-dimensional model-based coding methods have been proposed as next-generation image coding methods. These new representations require 3D reconstruction techniques. This paper presents a method that extracts the surfaces of static objects that occlude other objects from a spatiotemporal image captured with straight-line camera motion. We propose the concept of occlusion types and show that they are restricted to only eight patterns; furthermore, we show that pairs of occlusion types contain information that confirms the existence of surfaces. Occlusion information gives strong cues for segmentation and representation. The method can estimate not only the 3D positions of edge points but also the surfaces bounded by those edge points. The method was tested successfully on real images by reconstructing flat and curved surfaces. Videos can be hierarchically structured with the method, which makes various applications possible, such as object-selective image communication and object-selective video editing.