Keyword Search Result

[Keyword] scene classification (14 hits)

Hits 1-14
  • Dual-Path Convolutional Neural Network Based on Band Interaction Block for Acoustic Scene Classification Open Access

    Pengxu JIANG  Yang YANG  Yue XIE  Cairong ZOU  Qingyun WANG  

     
    LETTER-Engineering Acoustics
    Publicized: 2023/10/04  Vol: E107-A No:7  Page(s): 1040-1044

    The convolutional neural network (CNN) is widely used in acoustic scene classification (ASC). In most cases, local convolution is used to gather time-frequency information between spectrum nodes, but a finite convolution region struggles to adequately express the non-local links between frequency bands. In this paper, we propose a dual-path convolutional neural network based on band interaction blocks (DCNN-bi) for ASC, with the mel-spectrogram as the model's input. We build two parallel CNN paths to learn the high-frequency and low-frequency components of the input feature, and we design three band interaction blocks (bi-blocks), connected between the two paths, to explore the relevant nodes across different frequency bands. Combining the time-frequency information from the two paths, the bi-blocks, with three distinct designs, acquire non-local information and send it back to their respective paths. The experimental results indicate that the bi-block can substantially improve the baseline performance of the CNN: on the DCASE 2018 and DCASE 2020 datasets, the CNN's performance improves by 1.79% and 3.06%, respectively.
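
    A minimal PyTorch sketch of the dual-path idea, assuming a mel-spectrogram input of shape (batch, 1, n_mels, time): the spectrogram is split into low- and high-frequency halves, each half is processed by its own CNN path, and a band interaction block exchanges non-local information between the paths. The channel sizes and the simple 1x1 joint mixing used as the interaction are illustrative assumptions, not the paper's three bi-block designs.

    ```python
    import torch
    import torch.nn as nn

    class BandInteractionBlock(nn.Module):
        """Exchanges non-local information between the two frequency paths."""
        def __init__(self, channels):
            super().__init__()
            # 1x1 joint mixing is an assumption standing in for the bi-blocks
            self.mix = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

        def forward(self, low, high):
            fused = self.mix(torch.cat([low, high], dim=1))   # joint mixing
            low_upd, high_upd = fused.chunk(2, dim=1)         # back to each path
            return low + low_upd, high + high_upd

    class DualPathCNN(nn.Module):
        def __init__(self, n_classes=10, channels=32):
            super().__init__()
            def path():
                return nn.Sequential(
                    nn.Conv2d(1, channels, 3, padding=1), nn.BatchNorm2d(channels),
                    nn.ReLU(), nn.MaxPool2d(2))
            self.low_path, self.high_path = path(), path()
            self.bi = BandInteractionBlock(channels)
            self.head = nn.Linear(2 * channels, n_classes)

        def forward(self, x):                      # x: (B, 1, n_mels, T)
            n_mels = x.shape[2]
            low, high = x[:, :, : n_mels // 2], x[:, :, n_mels // 2 :]
            low, high = self.low_path(low), self.high_path(high)
            low, high = self.bi(low, high)
            feat = torch.cat([low.mean(dim=(2, 3)), high.mean(dim=(2, 3))], dim=1)
            return self.head(feat)

    logits = DualPathCNN()(torch.randn(4, 1, 128, 64))   # -> shape (4, 10)
    ```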

  • Research on Lightweight Acoustic Scene Perception Method Based on Drunkard Methodology

    Wenkai LIU  Lin ZHANG  Menglong WU  Xichang CAI  Hongxia DONG  

     
    PAPER-Artificial Intelligence, Data Mining
    Publicized: 2023/10/23  Vol: E107-D No:1  Page(s): 83-92

    The goal of acoustic scene classification (ASC) is to simulate human analysis of the surrounding environment and make accurate decisions promptly. Extracting useful information from audio signals in real-world scenarios is challenging and can lead to suboptimal classification performance, especially in environments with relatively homogeneous backgrounds. To address this problem, we model the sobering-up process of a “drunkard” in real life and the guiding behavior of sober people, and construct a high-precision, lightweight modeling methodology called the “drunkard methodology”. The core idea comprises three parts: (1) a special feature transformation module, based on the different information-perception mechanisms of drunkards and ordinary people, that simulates the process of gradually sobering up and the accompanying changes in feature-perception ability; (2) a lightweight “drunken” model that matches the normal model's perception process, using a multi-scale class-residual block structure to obtain finer feature representations by fusing information extracted at different scales; and (3) a guidance-and-fusion module that introduces the conventional model into the “drunken” model to speed up the sobering-up process and achieve iterative optimization and accuracy improvement. Evaluation on the official DCASE2022 Task 1 dataset shows that our baseline system achieves 40.4% accuracy and 2.284 loss with 442.67K parameters and 19.40M MACs (multiply-accumulate operations). After adopting the “drunkard” mechanism, accuracy improves to 45.2% and the loss falls by 0.634, with 551.89K parameters and 23.6M MACs.
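
    A minimal PyTorch sketch of the multi-scale class-residual idea in part (2): parallel convolution branches with different kernel sizes are fused and added back to the input. The branch count and kernel sizes are illustrative assumptions, and the guidance module of part (3) is omitted.

    ```python
    import torch
    import torch.nn as nn

    class MultiScaleResidualBlock(nn.Module):
        """Fuses features extracted at several scales into a residual update."""
        def __init__(self, channels, scales=(1, 3, 5)):   # kernel sizes assumed
            super().__init__()
            self.branches = nn.ModuleList(
                nn.Conv2d(channels, channels, k, padding=k // 2) for k in scales)
            self.fuse = nn.Conv2d(len(scales) * channels, channels, kernel_size=1)
            self.act = nn.ReLU()

        def forward(self, x):
            multi = torch.cat([b(x) for b in self.branches], dim=1)
            return self.act(x + self.fuse(multi))   # residual fusion of scales

    x = torch.randn(2, 16, 64, 44)                  # (batch, ch, mel, time)
    print(MultiScaleResidualBlock(16)(x).shape)     # torch.Size([2, 16, 64, 44])
    ```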

  • An Integrated Convolutional Neural Network with a Fusion Attention Mechanism for Acoustic Scene Classification

    Pengxu JIANG  Yue XIE  Cairong ZOU  Li ZHAO  Qingyun WANG  

     
    LETTER-Engineering Acoustics
    Publicized: 2023/02/06  Vol: E106-A No:8  Page(s): 1057-1061

    Acoustic scene classification (ASC) is one of the relevant research domains in human-computer interaction. In real life, recorded audio may include considerable noise and quiet clips, making it hard for earlier ASC research to isolate the crucial scene information in sound. Furthermore, scene information may be scattered across numerous audio frames, so selecting scene-related frames is crucial for ASC. In this context, an integrated convolutional neural network with a fusion attention mechanism (ICNN-FA) is proposed for ASC. First, segmented mel-spectrograms serve as the input of the ICNN, helping the model learn short-term time-frequency correlations. The designed ICNN model then learns these segment-level features. In addition, the proposed global attention layer gathers global information by integrating the segment features. Finally, the fusion attention layer fuses all segment-level features before the classifier distinguishes the various scenes. Experimental results on the DCASE 2018 and 2019 ASC datasets demonstrate the efficacy of the proposed method.
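
    A minimal PyTorch sketch of the fusion-attention idea, assuming each mel-spectrogram has already been split into S segments and encoded by the CNN into d-dimensional vectors. The single linear scoring layer is an illustrative assumption, not the exact ICNN-FA attention design.

    ```python
    import torch
    import torch.nn as nn

    class SegmentAttentionPool(nn.Module):
        """Weights segment features by learned scene relevance, then fuses."""
        def __init__(self, dim):
            super().__init__()
            self.score = nn.Linear(dim, 1)     # one relevance score per segment

        def forward(self, segments):           # segments: (B, S, d)
            weights = torch.softmax(self.score(segments), dim=1)   # (B, S, 1)
            return (weights * segments).sum(dim=1)                 # (B, d)

    pooled = SegmentAttentionPool(128)(torch.randn(4, 10, 128))    # -> (4, 128)
    ```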

  • Joint Analysis of Sound Events and Acoustic Scenes Using Multitask Learning

    Noriyuki TONAMI  Keisuke IMOTO  Ryosuke YAMANISHI  Yoichi YAMASHITA  

     
    PAPER-Speech and Hearing
    Publicized: 2020/11/19  Vol: E104-D No:2  Page(s): 294-301

    Sound event detection (SED) and acoustic scene classification (ASC) are important research topics in environmental sound analysis. Many research groups have addressed SED and ASC using neural-network-based methods such as the convolutional neural network (CNN), recurrent neural network (RNN), and convolutional recurrent neural network (CRNN). Conventional methods address SED and ASC separately, even though sound events and acoustic scenes are closely related. For example, in the acoustic scene “office,” the sound events “mouse clicking” and “keyboard typing” are likely to occur, so information on sound events and acoustic scenes should be mutually beneficial for SED and ASC. In this paper, we propose multitask learning for the joint analysis of sound events and acoustic scenes, in which the network layers holding information common to sound events and acoustic scenes are shared. Experimental results obtained using the TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets indicate that the proposed method improves the performance of SED and ASC by 1.31 and 1.80 percentage points in terms of the F-score, respectively, compared with the conventional CRNN-based method.
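
    A minimal PyTorch sketch of the multitask setup: a shared recurrent encoder feeds a frame-wise SED head and a clip-wise ASC head, so the shared layers hold information common to both tasks. The layer types and sizes are illustrative assumptions, not the authors' exact CRNN.

    ```python
    import torch
    import torch.nn as nn

    class SharedMultitaskNet(nn.Module):
        def __init__(self, n_mels=64, n_events=25, n_scenes=15, hidden=64):
            super().__init__()
            self.shared = nn.GRU(n_mels, hidden, batch_first=True)  # shared part
            self.sed_head = nn.Linear(hidden, n_events)   # per-frame event logits
            self.asc_head = nn.Linear(hidden, n_scenes)   # per-clip scene logits

        def forward(self, x):                     # x: (B, T, n_mels)
            h, _ = self.shared(x)                 # (B, T, hidden)
            sed = self.sed_head(h)                # (B, T, n_events)
            asc = self.asc_head(h.mean(dim=1))    # (B, n_scenes)
            return sed, asc

    sed_logits, asc_logits = SharedMultitaskNet()(torch.randn(2, 500, 64))
    # Joint training would sum a BCE loss on sed_logits and a CE loss on asc_logits.
    ```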

  • A Novel Discriminative Feature Extraction for Acoustic Scene Classification Using RNN Based Source Separation

    Seongkyu MUN  Suwon SHON  Wooil KIM  David K. HAN  Hanseok KO  

     
    LETTER-Artificial Intelligence, Data Mining
    Publicized: 2017/09/14  Vol: E100-D No:12  Page(s): 3041-3044

    Various classifiers and feature extraction methods for acoustic scene classification were proposed in the IEEE Detection and Classification of Acoustic Scenes and Events (DCASE) 2016 Challenge Task 1. The final evaluation results, however, showed that even the top-ten-ranked teams achieved extremely low accuracy on particular pairs of classes with similar sounds. Because such classes are difficult to distinguish even by human ears, the conventional deep-learning-based feature extraction methods used by most DCASE participants appear to face performance limitations. To address the low performance on similar class pairs, this letter proposes to employ recurrent neural network (RNN) based source separation for each class prior to the classification step. Because the RNN structure can effectively extract the sound components it was trained on, its mid-layer can be regarded as capturing discriminative information about the trained class. This letter therefore proposes to use this mid-layer information as a novel discriminative feature. The proposed feature yields an average classification rate improvement of 2.3% over the conventional method, which uses additional classifiers to handle the similar-class-pair issue.
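
    A minimal PyTorch sketch of the proposed feature, assuming a separation RNN has already been trained per class to reconstruct that class's sound components; its mid-layer activations are pooled over time and reused as a clip-level discriminative feature. The GRU depth and mean pooling are illustrative assumptions.

    ```python
    import torch
    import torch.nn as nn

    class SeparationRNN(nn.Module):
        """Trained (elsewhere) to reconstruct one class's sound components."""
        def __init__(self, n_feat=64, hidden=128):
            super().__init__()
            self.rnn = nn.GRU(n_feat, hidden, num_layers=2, batch_first=True)
            self.out = nn.Linear(hidden, n_feat)   # reconstruction layer

        def forward(self, x):                      # x: (B, T, n_feat)
            h, _ = self.rnn(x)
            return self.out(h), h                  # h is the mid-layer activation

    def midlayer_feature(model, x):
        """Pool the mid-layer over time into one clip-level feature vector."""
        with torch.no_grad():
            _, h = model(x)
        return h.mean(dim=1)                       # (B, hidden)

    feat = midlayer_feature(SeparationRNN(), torch.randn(1, 400, 64))  # (1, 128)
    ```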

  • LLC Revisit: Scene Classification with k-Farthest Neighbours

    Katsuyuki TANAKA  Tetsuya TAKIGUCHI  Yasuo ARIKI  

     
    PAPER-Image Recognition, Computer Vision
    Publicized: 2016/02/08  Vol: E99-D No:5  Page(s): 1375-1383

    This paper introduces a simple but effective way to boost scene classification performance through a novel approach to the LLC coding process. In the proposed method, a local descriptor is encoded not only with its k-nearest visual words but also with its k-farthest visual words, producing a more discriminative code. Since the proposed method is a simple modification of the image classification model, it can be easily integrated into existing BoF pipelines in areas such as coding and pooling to boost their scene classification performance. Experiments on three scene datasets (15-Scenes, MIT-Indoor67, and SUN397) show that adding k-farthest visual words enhances scene classification performance more than increasing the number of k-nearest visual words.
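
    A minimal NumPy sketch of the proposed coding step: each local descriptor is encoded with the standard analytical LLC solution over its k-nearest visual words, and additionally over its k-farthest visual words, yielding a code over the full codebook. Reusing the same LLC solver for the farthest words is an illustrative assumption.

    ```python
    import numpy as np

    def llc_solve(x, basis, reg=1e-4):
        """Analytical LLC: min ||x - c^T B||^2 subject to sum(c) = 1."""
        z = basis - x                          # shift basis to the descriptor
        cov = z @ z.T
        cov += reg * np.trace(cov) * np.eye(len(basis))   # regularize
        c = np.linalg.solve(cov, np.ones(len(basis)))
        return c / c.sum()

    def llc_near_far(x, codebook, k=5):
        """Sparse code with both nearest and farthest codebook entries."""
        dist = np.linalg.norm(codebook - x, axis=1)
        near, far = np.argsort(dist)[:k], np.argsort(dist)[-k:]
        code = np.zeros(len(codebook))
        code[near] = llc_solve(x, codebook[near])
        code[far] = llc_solve(x, codebook[far])   # extra discriminative part
        return code

    rng = np.random.default_rng(0)
    code = llc_near_far(rng.normal(size=128), rng.normal(size=(1024, 128)))
    ```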

  • Indoor Scene Classification Based on the Bag-of-Words Model of Local Feature Information Gain

    Rong WANG  Zhiliang WANG  Xirong MA  

     
    LETTER-Image Recognition, Computer Vision
    Vol: E96-D No:4  Page(s): 984-987

    For the problem of indoor home scene classification, this paper proposes a bag-of-words (BOW) model based on the information gain of local features. The experimental results show that the method not only improves classification performance but also reduces computation, and consequently it outperforms the state-of-the-art approach.
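
    A minimal NumPy sketch of selecting visual words by information gain, assuming a binary image-by-word occurrence matrix and scene labels. The number of retained words is an illustrative assumption.

    ```python
    import numpy as np

    def information_gain(occurs, labels):
        """IG of one binary word feature w.r.t. the scene label distribution."""
        def entropy(y):
            if len(y) == 0:
                return 0.0
            p = np.bincount(y) / len(y)
            p = p[p > 0]
            return -(p * np.log2(p)).sum()
        h = entropy(labels)
        on, off = labels[occurs], labels[~occurs]
        return h - (len(on) * entropy(on) + len(off) * entropy(off)) / len(labels)

    rng = np.random.default_rng(0)
    X = rng.random((200, 500)) > 0.5          # image x visual-word occurrence
    y = rng.integers(0, 5, size=200)          # scene labels
    ig = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    keep = np.argsort(ig)[-100:]              # retain the most informative words
    ```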

  • Global Selection vs Local Ordering of Color SIFT Independent Components for Object/Scene Classification

    Dan-ni AI  Xian-hua HAN  Guifang DUAN  Xiang RUAN  Yen-wei CHEN  

     
    PAPER-Pattern Recognition
    Vol: E94-D No:9  Page(s): 1800-1808

    This paper addresses the problem of ordering the independent components of color SIFT descriptors for image classification. Component ordering is of great importance for image classification, since it is the foundation of feature selection. To select distinctive and compact independent components (ICs) of the color SIFT descriptors, we propose two ordering approaches based on local variation, named localization-based IC ordering and sparseness-based IC ordering. We evaluate the proposed methods, the conventional IC selection method (component selection based on global variation), and the original color SIFT descriptors on object and scene databases, and obtain two main results. First, the proposed methods achieve acceptable classification results compared with the original color SIFT descriptors. Second, the global selection method yields the highest classification rate on the scene database, whereas the local ordering methods perform best on the object database.
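
    A minimal sketch of sparseness-based IC ordering, assuming color SIFT descriptors stacked row-wise in X. Kurtosis of the component responses is used here as an accessible sparseness score; it stands in for, and is not identical to, the paper's local-variation criteria.

    ```python
    import numpy as np
    from scipy.stats import kurtosis
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 384))            # color SIFT: 3 x 128 dims

    ica = FastICA(n_components=64, random_state=0)
    S = ica.fit_transform(X)                    # component responses per descriptor

    order = np.argsort(-kurtosis(S, axis=0))    # sparsest (most peaked) first
    S_compact = S[:, order[:32]]                # keep the top-ranked components
    ```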

  • Multi-Scale Multi-Level Generative Model in Scene Classification

    Wenjie XIE  De XU  Yingjun TANG  Geng CUI  

     
    LETTER-Image Recognition, Computer Vision
    Vol: E94-D No:1  Page(s): 167-170

    Previous work has shown that the probabilistic latent semantic analysis (pLSA) model is one of the best generative models for scene categorization and can obtain acceptable classification accuracy. However, it uses a fixed number of topics to construct the final image representation, restricting the image description to one level of visual detail and limiting the achievable accuracy. To solve this problem, we propose a novel generative model, referred to as the multi-scale multi-level probabilistic latent semantic analysis model (msml-pLSA). It consists of two parts: a multi-scale part, which extracts visual details from the image at diverse resolutions, and a multi-level part, which concatenates multiple levels of topic representation to model the scene. The msml-pLSA model thus describes fine and coarse local image detail in one framework. The proposed method is evaluated on the well-known scene classification dataset with 15 scene categories, and the experimental results show that msml-pLSA improves classification accuracy over typical classification methods.
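
    A minimal NumPy sketch of the multi-scale idea: topic distributions are inferred from bag-of-words histograms built at two image resolutions and concatenated into one representation. The tiny pLSA EM loop, the two scales, and the topic counts are illustrative assumptions, not the full msml-pLSA model.

    ```python
    import numpy as np

    def plsa(counts, n_topics, n_iter=50, rng=np.random.default_rng(0)):
        """Plain pLSA EM on a (documents x words) count matrix."""
        D, W = counts.shape
        p_z_d = rng.random((D, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
        p_w_z = rng.random((n_topics, W)); p_w_z /= p_w_z.sum(1, keepdims=True)
        for _ in range(n_iter):
            # E-step: topic responsibility for each (document, word) pair
            joint = p_z_d[:, :, None] * p_w_z[None, :, :]        # (D, Z, W)
            resp = joint / joint.sum(1, keepdims=True).clip(1e-12)
            # M-step: re-estimate from expected counts
            nz = counts[:, None, :] * resp
            p_w_z = nz.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True)
            p_z_d = nz.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True)
        return p_z_d

    rng = np.random.default_rng(0)
    coarse = rng.integers(0, 5, size=(100, 200))   # BoW counts, low resolution
    fine = rng.integers(0, 5, size=(100, 200))     # BoW counts, high resolution
    rep = np.hstack([plsa(coarse, 10), plsa(fine, 20)])   # multi-scale topics
    ```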

  • Color Independent Components Based SIFT Descriptors for Object/Scene Classification

    Dan-ni AI  Xian-hua HAN  Xiang RUAN  Yen-wei CHEN  

     
    PAPER-Pattern Recognition
    Vol: E93-D No:9  Page(s): 2577-2586

    In this paper, we present a novel color-independent-components-based SIFT descriptor (termed CIC-SIFT) for object/scene classification. We first learn an efficient color transformation matrix based on independent component analysis (ICA), adapted to each category in a database. The ICA-based color transformation enhances the contrast between objects and background in an image. We then compute CIC-SIFT descriptors over all three transformed color independent components. Since the ICA-based color transformation boosts the objects and suppresses the background, the proposed CIC-SIFT extracts more effective and discriminative local features for object/scene classification. A comparison among seven SIFT descriptors shows that the proposed CIC-SIFT is superior to the conventional SIFT descriptors in classification performance.
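
    A minimal sketch of the CIC-SIFT pipeline: a 3x3 ICA color transform is learned from RGB pixels, applied to produce three color independent components, and SIFT descriptors are computed on each component with OpenCV. Fitting the transform on a single hypothetical image ("scene.jpg"), rather than per category as the paper does, is a simplifying assumption.

    ```python
    import cv2
    import numpy as np
    from sklearn.decomposition import FastICA

    img = cv2.imread("scene.jpg")                      # hypothetical input image
    pixels = img.reshape(-1, 3).astype(np.float64)

    ica = FastICA(n_components=3, random_state=0)      # 3x3 color transform
    components = ica.fit_transform(pixels).reshape(img.shape)

    sift = cv2.SIFT_create()
    descs = []
    for c in range(3):                                 # SIFT per component
        chan = cv2.normalize(components[..., c], None, 0, 255,
                             cv2.NORM_MINMAX).astype(np.uint8)
        kp, des = sift.detectAndCompute(chan, None)
        descs.append(des)
    ```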

  • Discriminating Semantic Visual Words for Scene Classification

    Shuoyan LIU  De XU  Songhe FENG  

     
    PAPER-Pattern Recognition
    Vol: E93-D No:6  Page(s): 1580-1588

    The bag-of-visual-words representation has recently become popular for scene classification. However, learning the visual words in an unsupervised manner suffers when patches with similar appearances correspond to distinct semantic concepts. This paper proposes a novel supervised learning framework that takes full advantage of label information to address this problem. Specifically, Gaussian mixture modeling (GMM) is first applied to obtain a "semantic interpretation" of patches using scene labels: each scene induces a probability density on the low-level visual feature space, and patches are represented as vectors of posterior probabilities over scene semantic concepts. The Information Bottleneck (IB) algorithm is then introduced to cluster the patches into "visual words" in a supervised manner, from the perspective of these semantic interpretations, which maximizes the semantic information carried by the visual words. Once the visual words are obtained, the frequencies of the visual words appearing in a given image form a histogram, which is subsequently used for scene categorization with a support vector machine (SVM) classifier. Experiments on a challenging dataset show that the proposed visual words perform the scene classification task better than most existing methods.
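
    A minimal scikit-learn sketch of the first stage: patches are re-represented as posterior probabilities under per-scene Gaussian mixtures, and these semantic vectors are then clustered into visual words. KMeans is used here as a plain stand-in for the Information Bottleneck clustering, which scikit-learn does not provide; all sizes are illustrative.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    patches = rng.normal(size=(3000, 32))          # low-level patch features
    scene_of_patch = rng.integers(0, 5, size=3000) # scene label of source image

    # One density per scene; each patch becomes a vector of scene posteriors.
    gmms = [GaussianMixture(4, random_state=0).fit(patches[scene_of_patch == s])
            for s in range(5)]
    scores = np.stack([g.score_samples(patches) for g in gmms], axis=1)
    posteriors = np.exp(scores - scores.max(1, keepdims=True))
    posteriors /= posteriors.sum(1, keepdims=True)

    # Cluster semantic vectors into visual words (KMeans stands in for IB).
    words = KMeans(n_clusters=200, n_init=10, random_state=0).fit_predict(posteriors)
    ```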

  • Natural Scene Classification Based on Integrated Topic Simplex

    Tang YINGJUN  Xu DE  Yang XU  Liu QIFANG  

     
    LETTER-Image Recognition, Computer Vision
    Vol: E92-D No:9  Page(s): 1811-1814

    We present a novel model, named the Integrated Latent Topic Model (ILTM), to learn and recognize natural scene categories. Unlike previous work, which considered the discrepancies and common properties among categories separately, our approach combines universal topics shared by all categories with specific topics from each category. As a result, the model produces a few specific topics together with more generic topics shared among categories, and each category is represented in a different topic simplex, which correlates well with human scene understanding. We investigate the classification performance on varied scene category tasks. The experiments show that our model outperforms latent-space methods while using less training data.

  • Category Constrained Learning Model for Scene Classification

    Yingjun TANG  De XU  Guanghua GU  Shuoyan LIU  

     
    LETTER-Image Recognition, Computer Vision
    Vol: E92-D No:2  Page(s): 357-360

    We present a novel model, named Category-Constrained Latent Dirichlet Allocation (CC-LDA), to learn and recognize natural scene categories. Previous work had to resort to an additional classifier after obtaining the image's topic representation. Our model incorporates category information into topic inference, so every category is represented by its own topic simplex and topic size, which is consistent with human cognitive habits. The distinguishing feature of our model is that it can discriminate among categories without an additional classifier, at the same time as it derives the topic representation. We investigate the classification performance on varied scene category tasks. The experiments demonstrate that our learning model achieves better performance with less training data.
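
    A minimal scikit-learn sketch of the classify-while-inferring idea: one topic model is fit per category on that category's bag-of-words histograms, and a test histogram is assigned to the category whose model explains it best, with no separate classifier. scikit-learn's LDA and its likelihood bound are accessible stand-ins for CC-LDA, and all sizes are illustrative.

    ```python
    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    rng = np.random.default_rng(0)
    n_words = 300
    train = {c: rng.integers(0, 4, size=(50, n_words)) for c in range(3)}

    # One topic model per category, fit only on that category's histograms.
    models = {c: LatentDirichletAllocation(n_components=8, random_state=0).fit(X)
              for c, X in train.items()}

    def classify(bow):
        """Pick the category whose topic model best explains the histogram."""
        return max(models, key=lambda c: models[c].score(bow.reshape(1, -1)))

    print(classify(rng.integers(0, 4, size=n_words)))
    ```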

  • Adaptively Combining Local with Global Information for Natural Scenes Categorization

    Shuoyan LIU  De XU  Xu YANG  

     
    LETTER-Image Recognition, Computer Vision
    Vol: E91-D No:7  Page(s): 2087-2090

    This paper proposes the Extended Bag-of-Visterms (EBOV) representation for semantic scenes. Most previous representations are bag-of-visterms (BOV), where visterms refer to quantized local texture information. Our new representation extends the standard bag-of-visterms by introducing global texture information; in particular, we apply an adaptive weight to fuse the local and global information into a better visterm representation. Given these representations, scene classification is performed with the pLSA (probabilistic latent semantic analysis) model. The experimental results show that the appropriate use of global information improves scene classification performance compared with the BOV representation, which takes only local information into account.
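
    A minimal NumPy sketch of the EBOV representation: a local visterm histogram and a global texture descriptor are normalized and fused with a weight before topic modelling. The fixed weight shown here is an illustrative assumption; the paper adapts the weight per image.

    ```python
    import numpy as np

    def ebov(local_hist, global_feat, weight):
        """Weighted concatenation of local visterms and global texture."""
        local_hist = local_hist / max(local_hist.sum(), 1e-12)        # L1 norm
        global_feat = global_feat / max(np.linalg.norm(global_feat), 1e-12)
        return np.concatenate([weight * local_hist, (1.0 - weight) * global_feat])

    rng = np.random.default_rng(0)
    rep = ebov(rng.integers(0, 10, size=500).astype(float),
               rng.normal(size=60), weight=0.7)      # -> 560-dim EBOV vector
    ```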