Keyword Search Result

[Keyword] human action recognition (9 hits)

1-9 of 9 hits
  • Spatio-Temporal Self-Attention Weighted VLAD Neural Network for Action Recognition

    Shilei CHENG  Mei XIE  Zheng MA  Siqi LI  Song GU  Feng YANG  

     
    LETTER-Biocybernetics, Neurocomputing

      Publicized:
    2020/10/01
      Vol:
    E104-D No:1
      Page(s):
    220-224

    Since characterizing videos simultaneously from spatial and temporal cues has been shown to be crucial for video processing, and since its soft assignment lacks temporal information, the vector of locally aggregated descriptors (VLAD) is a suboptimal framework for learning spatio-temporal video representations. Motivated by the development of attention mechanisms in natural language processing, we present a novel model that combines VLAD with spatio-temporal self-attention operations, named spatio-temporal self-attention weighted VLAD (ST-SAWVLAD). In particular, sequential convolutional feature maps extracted from two modalities, i.e., RGB and optical flow, are respectively fed into the self-attention module to learn soft spatio-temporal assignment parameters, which enables aggregating not only detailed spatial information but also fine motion information from successive video frames. In experiments on the competitive action recognition datasets UCF101 and HMDB51, ST-SAWVLAD shows outstanding performance. The source code is available at: https://github.com/badstones/st-sawvlad.
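
    To make the aggregation step concrete, the following is a minimal NumPy sketch of attention-weighted soft-assignment VLAD. It is an illustration only: the generic dot-product attention and the Gaussian soft assignment below stand in for the authors' modules, whose released implementation is linked above.

      import numpy as np

      def attention_weights(X):
          # generic dot-product self-attention over N spatio-temporal locations
          scores = X @ X.T / np.sqrt(X.shape[1])
          scores -= scores.max(axis=1, keepdims=True)   # numerical stability
          A = np.exp(scores)
          A /= A.sum(axis=1, keepdims=True)             # row-wise softmax
          return A.mean(axis=0)                         # (N,) location importance

      def sawvlad(X, centers):
          # attention-weighted VLAD: weighted, softly assigned residual sums
          w = attention_weights(X)
          resid = X[:, None, :] - centers[None, :, :]   # (N, K, D) residuals
          d = (resid ** 2).sum(-1)                      # squared distances (N, K)
          a = np.exp(-(d - d.min(axis=1, keepdims=True)))
          a /= a.sum(axis=1, keepdims=True)             # soft assignments
          V = np.einsum('n,nk,nkd->kd', w, a, resid)    # per-center aggregation
          V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12  # intra-normalize
          v = V.ravel()
          return v / (np.linalg.norm(v) + 1e-12)        # L2-normalized descriptor

      X = np.random.randn(196, 64)       # stand-in for a flattened feature map
      centers = np.random.randn(16, 64)  # VLAD cluster centers
      descriptor = sawvlad(X, centers)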

  • Action Recognition Using Low-Rank Sparse Representation

    Shilei CHENG  Song GU  Maoquan YE  Mei XIE  

     
    LETTER-Image Recognition, Computer Vision

      Publicized:
    2017/11/24
      Vol:
    E101-D No:3
      Page(s):
    830-834

    Human action recognition in videos draws huge research interest in computer vision. The bag-of-words (BoW) model is commonly used to obtain video-level representations; however, it roughly assigns each feature vector to its nearest visual word, and its collection of unordered words ignores the spatial information of interest points, inevitably causing nontrivial quantization errors and impairing classification rates. To address these drawbacks, we propose an approach for action recognition that encodes spatio-temporal log-Euclidean covariance matrix (ST-LECM) features within a low-rank and sparse representation framework. Motivated by low-rank matrix recovery, local descriptors in a spatio-temporal neighborhood have similar representations and should be approximately low rank. The learned coefficients can not only capture the global data structure but also preserve local consistency. Experimental results show that the proposed approach yields excellent recognition performance on synthetic video datasets and is robust to action variability, view variations, and partial occlusion.
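
    As a concrete anchor for the low-rank-and-sparse machinery, here is a minimal NumPy sketch of robust PCA via inexact ALM, a standard low-rank recovery routine; the ST-LECM feature construction and the classifier are not shown, and the default hyperparameters are common illustrative choices, not the paper's.

      import numpy as np

      def shrink(M, tau):
          # soft-thresholding: proximal operator of the l1 norm
          return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

      def rpca(D, lam=None, mu=None, iters=200):
          # decompose D into low-rank L plus sparse E (D ~ L + E)
          m, n = D.shape
          lam = lam or 1.0 / np.sqrt(max(m, n))
          mu = mu or 0.25 * m * n / (np.abs(D).sum() + 1e-12)
          L, E, Y = np.zeros_like(D), np.zeros_like(D), np.zeros_like(D)
          for _ in range(iters):
              # singular value thresholding updates the low-rank part
              U, s, Vt = np.linalg.svd(D - E + Y / mu, full_matrices=False)
              L = (U * shrink(s, 1.0 / mu)) @ Vt
              # elementwise shrinkage updates the sparse part
              E = shrink(D - L + Y / mu, lam / mu)
              Y += mu * (D - L - E)              # dual ascent on the constraint
          return L, E

      L, E = rpca(np.random.randn(60, 40))       # stand-in feature matrix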

  • Learning a Similarity Constrained Discriminative Kernel Dictionary from Concatenated Low-Rank Features for Action Recognition

    Shijian HUANG  Junyong YE  Tongqing WANG  Li JIANG  Changyuan XING  Yang LI  

     
    LETTER-Pattern Recognition

      Publicized:
    2015/11/16
      Vol:
    E99-D No:2
      Page(s):
    541-544

    Traditional low-rank features lose the temporal information of an action sequence. To retain this temporal information, we split an action video into multiple subsequences and concatenate the low-rank features of all subsequences in time order. We then recognize actions by learning a novel dictionary model from the concatenated low-rank features. However, traditional dictionary learning models usually neglect the similarity among coding coefficients and perform poorly on non-linearly separable data. To overcome these shortcomings, we present a novel similarity-constrained discriminative kernel dictionary learning method for action recognition. The effectiveness of the proposed method is verified on three benchmarks, and the experimental results demonstrate its promise for action recognition.
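
    The feature side of this pipeline can be sketched as follows, assuming per-frame feature vectors; the rank-r SVD truncation below is one simple reading of "low-rank feature", and the kernel dictionary learning stage is omitted.

      import numpy as np

      def lowrank_feature(block, r=5):
          # rank-r approximation of a (T, D) subsequence, flattened to a vector
          U, s, Vt = np.linalg.svd(block, full_matrices=False)
          return ((U[:, :r] * s[:r]) @ Vt[:r]).ravel()

      def concat_lowrank(video, n_subseq=4, r=5):
          # split in time order and concatenate per-subsequence features
          chunks = np.array_split(video, n_subseq, axis=0)
          return np.concatenate([lowrank_feature(c, r) for c in chunks])

      video = np.random.randn(64, 128)   # 64 frames, 128-dim frame features
      feature = concat_lowrank(video)    # temporal order is preserved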

  • Gradient-Flow Tensor Divergence Feature for Human Action Recognition

    Ngoc Nam BUI  Jin Young KIM  Hyoung-Gook KIM  

     
    LETTER-Vision

      Vol:
    E99-A No:1
      Page(s):
    437-440

    Current research in computer vision has increasingly targeted the recognition of human actions, owing to its potential utility in various applications. Among many approaches, combining Gaussian Mixture Model (GMM) supervectors with a Support Vector Machine (SVM) and a nonlinear GMM KL kernel has been shown to improve performance in recognizing human activities. In this study, based on tensor analysis, we develop and exploit an extended class of action features that we refer to as gradient-flow tensor divergence. The proposed method achieves a best recognition rate of 96.3% on the KTH dataset with reduced processing time.
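
    The GMM-supervector representation referenced above can be sketched with scikit-learn; this is a generic MAP mean-adaptation supervector, not the paper's tensor-divergence feature, and the relevance factor is illustrative.

      import numpy as np
      from sklearn.mixture import GaussianMixture

      def gmm_supervector(ubm, X, relevance=16.0):
          # MAP-adapt the UBM means to the descriptors X and stack them
          post = ubm.predict_proba(X)              # (N, K) responsibilities
          nk = post.sum(axis=0)                    # soft counts per component
          xk = post.T @ X                          # (K, D) weighted sums
          alpha = (nk / (nk + relevance))[:, None]
          mu = alpha * (xk / np.maximum(nk, 1e-8)[:, None]) \
               + (1.0 - alpha) * ubm.means_
          # scale by per-component std so distances approximate the KL kernel
          return (mu / np.sqrt(ubm.covariances_)).ravel()

      ubm = GaussianMixture(n_components=8, covariance_type='diag',
                            random_state=0)
      ubm.fit(np.random.randn(2000, 32))           # stand-in descriptor pool
      sv = gmm_supervector(ubm, np.random.randn(300, 32))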

  • Statistics on Temporal Changes of Sparse Coding Coefficients in Spatial Pyramids for Human Action Recognition

    Yang LI  Junyong YE  Tongqing WANG  Shijian HUANG  

     
    LETTER-Pattern Recognition

      Publicized:
    2015/06/01
      Vol:
    E98-D No:9
      Page(s):
    1711-1714

    Traditional sparse representation-based methods for human action recognition usually pool over the entire video to form the final feature representation, neglecting the spatio-temporal layout of features. To exploit this spatio-temporal information, we present a novel histogram representation computed from statistics on the frame-by-frame temporal changes of sparse coding coefficients within spatial pyramids constructed from videos. The histograms are then fed into a support vector machine with a spatial pyramid matching kernel for final action classification. We validate our method on two benchmarks, KTH and UCF Sports, and the experimental results show its effectiveness for human action recognition.
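
    The core descriptor can be sketched for a single pyramid cell; the dictionary below is random for illustration, and the spatial-pyramid partitioning and SPM-kernel SVM are omitted.

      import numpy as np
      from sklearn.decomposition import SparseCoder

      D = np.random.randn(64, 128)                 # 64 atoms, 128-dim features
      D /= np.linalg.norm(D, axis=1, keepdims=True)
      coder = SparseCoder(dictionary=D, transform_algorithm='lasso_lars',
                          transform_alpha=0.1)

      frames = np.random.randn(30, 128)            # one cell, frame by frame
      codes = coder.transform(frames)              # (30, 64) sparse coefficients
      deltas = np.abs(np.diff(codes, axis=0))      # frame-to-frame coefficient change
      hist = deltas.sum(axis=0)                    # per-atom temporal-change statistic
      hist /= np.linalg.norm(hist) + 1e-12         # normalized histogram feature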

  • Contextual Max Pooling for Human Action Recognition

    Zhong ZHANG  Shuang LIU  Xing MEI  

     
    LETTER-Image Recognition, Computer Vision

      Publicized:
    2015/01/19
      Vol:
    E98-D No:4
      Page(s):
    989-993

    The bag-of-words (BoW) model has been extensively adopted by recent human action recognition methods. The pooling operation, which aggregates local descriptor encodings into a single representation, is a key determiner of the performance of BoW-based methods. However, the spatio-temporal relationships among interest points have rarely been considered in the pooling step, which results in imprecise representations of human actions. In this paper, we propose a novel pooling strategy named contextual max pooling (CMP) to overcome this limitation. We add a constraint term to the objective function under the framework of max pooling, which forces the weights of interest points to be consistent with their probabilities. In this way, CMP explicitly considers the spatio-temporal contextual relationships among interest points while inheriting the positive properties of max pooling. Our method is verified on three challenging datasets (KTH, UCF Sports, and UCF Films), and the results demonstrate that it outperforms state-of-the-art methods for human action recognition.
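
    The idea can be rendered in toy form: write max pooling as a weighted sum with one-hot weights, then pull those weights toward the interest points' probabilities. This is an illustrative reading of the constraint, not the paper's exact objective; lam is a made-up blending parameter.

      import numpy as np

      def max_pool(codes):
          return codes.max(axis=0)                 # standard max pooling (N, K) -> (K,)

      def contextual_max_pool(codes, probs, lam=0.5):
          # max pooling as a weighted sum: one-hot weights on each word's argmax
          w = (codes == codes.max(axis=0, keepdims=True)).astype(float)
          # constraint term: pull weights toward the points' probabilities
          w = (1.0 - lam) * w + lam * probs[:, None]
          w /= w.sum(axis=0, keepdims=True)
          return (w * codes).sum(axis=0)

      codes = np.abs(np.random.randn(50, 64))      # 50 interest points, 64 words
      probs = np.random.dirichlet(np.ones(50))     # contextual probabilities
      pooled = contextual_max_pool(codes, probs)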

  • Topic-Based Knowledge Transfer Algorithm for Cross-View Action Recognition

    Changhong CHEN  Shunqing YANG  Zongliang GAN  

     
    LETTER-Pattern Recognition

      Vol:
    E97-D No:3
      Page(s):
    614-617

    Cross-view action recognition is a challenging research field in human motion analysis, since appearance-based features are not reliable when the viewpoint changes. In this paper, a new framework for cross-view action recognition is proposed based on topic-based knowledge transfer. First, spatio-temporal descriptors are extracted from the action videos, and each video is modeled as a bag of visual words (BoVW) over a codebook constructed by the k-means clustering algorithm. Second, Latent Dirichlet Allocation (LDA) is employed to assign topics to the BoVW representation, and the normalized topic distribution of visual words (ToVW) is taken as the feature vector. Third, to bridge different views, we transform ToVW into bilingual ToVW by constructing bilingual dictionaries, which guarantee that the same action has the same representation across views. We demonstrate the effectiveness of the proposed algorithm on the IXMAS multi-view dataset.
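
    The first two steps map directly onto scikit-learn; the sketch below builds the codebook, the BoVW counts, and the ToVW topic features, while the bilingual-dictionary transfer of the third step is not shown. All sizes are illustrative.

      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.decomposition import LatentDirichletAllocation

      descriptors = np.random.randn(5000, 96)      # pooled training descriptors
      km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(descriptors)

      def bovw(video_desc):
          # histogram of visual-word occurrences for one video
          return np.bincount(km.predict(video_desc), minlength=100)

      counts = np.stack([bovw(np.random.randn(200, 96)) for _ in range(20)])
      lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(counts)
      tovw = lda.transform(counts)                 # normalized topic distributions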

  • Selecting Effective and Discriminative Spatio-Temporal Interest Points for Recognizing Human Action

    Hongbo ZHANG  Shaozi LI  Songzhi SU  Shu-Yuan CHEN  

     
    PAPER-Image Processing and Video Processing

      Vol:
    E96-D No:8
      Page(s):
    1783-1792

    Many successful methods for recognizing human action are based on spatio-temporal interest points (STIPs). Given a test video sequence, a matching-based method with a voting mechanism lets each test STIP cast a vote for each action class based on its mutual information with respect to that class, measured in terms of class likelihood probability. Two issues must therefore be addressed to improve the accuracy of action recognition. First, effective STIPs in the training set must be selected as references for accurately estimating this probability. Second, discriminative STIPs in the test set must be selected for voting. This work uses ε-nearest neighbors as effective STIPs for estimating the class probability and a variance filter for selecting discriminative STIPs. Experimental results verify that the proposed method is more accurate than existing action recognition methods.
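
    The voting mechanism can be sketched as follows; k-nearest neighbors stand in for the paper's ε-nearest neighbors, and the variance threshold plays the role of the variance filter. All constants are illustrative.

      import numpy as np

      def vote(test_stips, train_stips, train_labels, n_classes,
               k=8, var_thresh=0.01):
          scores = np.zeros(n_classes)
          for x in test_stips:
              d = np.linalg.norm(train_stips - x, axis=1)
              nn = np.argsort(d)[:k]               # nearest reference STIPs
              # class likelihood estimated from neighbor label frequencies
              p = np.bincount(train_labels[nn], minlength=n_classes) / k
              if p.var() < var_thresh:             # variance filter: skip
                  continue                         # indiscriminative STIPs
              scores += np.log(p + 1e-8)           # accumulate the vote
          return int(scores.argmax())

      train = np.random.randn(500, 72)             # stand-in training STIPs
      labels = np.random.randint(0, 6, 500)
      pred = vote(np.random.randn(40, 72), train, labels, n_classes=6)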

  • A Vision-Based Emergency Response System with a Paramedic Mobile Robot

    Il-Woong JEONG  Jin CHOI  Kyusung CHO  Yong-Ho SEO  Hyun Seung YANG  

     
    PAPER

      Vol:
    E93-D No:7
      Page(s):
    1745-1753

    Detecting emergency situations is very important for surveillance systems that monitor people such as the elderly living alone. This paper presents a vision-based emergency response system with a paramedic mobile robot. The proposed system consists of a vision-based emergency detection system and a mobile robot acting as a paramedic. The detection system identifies emergencies by tracking people and recognizing their actions in image sequences acquired by a single surveillance camera. To recognize human actions, interest regions are segmented from the background using a blob extraction method and tracked continuously using a generic model. A Motion History Image (MHI) for each tracked person is then constructed from the silhouettes of the region blobs, and actions are modeled. The emergency situation is finally detected by feeding this information to a neural network. When an emergency is detected, the mobile robot can help diagnose the status of the person involved. To send the robot to the proper position, we implement a mobile-robot navigation algorithm based on the distance between the person and the robot. We validate our system by reporting the emergency detection rate and demonstrating emergency response with the mobile robot.
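
    The MHI construction is standard and easy to sketch from silhouette masks: recent motion is bright and older motion decays linearly. The history length tau is illustrative, and the tracking, blob extraction, and neural-network stages are omitted.

      import numpy as np

      def motion_history(silhouettes, tau=20):
          # silhouettes: (T, H, W) boolean masks of the tracked person
          H = np.zeros(silhouettes.shape[1:], dtype=np.float32)
          for mask in silhouettes:
              H = np.where(mask, float(tau), np.maximum(H - 1.0, 0.0))
          return H / tau                           # normalized MHI in [0, 1]

      masks = np.random.rand(30, 64, 64) > 0.95    # stand-in silhouette masks
      mhi = motion_history(masks)                  # fed downstream as a feature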