
Keyword Search Result

[Keyword] acoustic (178 hits)

Hits 1-20 (of 178)

  • Dual-Path Convolutional Neural Network Based on Band Interaction Block for Acoustic Scene Classification Open Access

    Pengxu JIANG  Yang YANG  Yue XIE  Cairong ZOU  Qingyun WANG  

     
    LETTER-Engineering Acoustics

    Publicized: 2023/10/04  Vol: E107-A No:7  Page(s): 1040-1044

    Convolutional neural networks (CNNs) are widely used in acoustic scene classification (ASC) tasks. In most cases, local convolution is used to gather time-frequency information between spectrum nodes, but it is challenging to adequately express non-local links between frequency bands within a finite convolution region. In this paper, we propose a dual-path convolutional neural network based on band interaction blocks (DCNN-bi) for ASC, with the mel-spectrogram as the model’s input. We build two parallel CNN paths to learn the high-frequency and low-frequency components of the input feature. Additionally, we design three band interaction blocks (bi-blocks), connected between the two paths, to explore the pertinent nodes across frequency bands. Combining the time-frequency information from the two paths, the bi-blocks, with three distinct designs, acquire non-local information and send it back to the respective paths. The experimental results indicate that the bi-blocks can substantially improve the baseline performance of the CNN: on the DCASE 2018 and DCASE 2020 datasets, the CNN exhibits performance improvements of 1.79% and 3.06%, respectively.
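
    The dual-path idea can be pictured with a minimal PyTorch sketch: split the mel-spectrogram along the frequency axis, process each half in its own CNN path, and let a band interaction block exchange information between the paths. All module names and layer sizes below are illustrative assumptions, not the authors' implementation.

        import torch
        import torch.nn as nn

        class BandInteractionBlock(nn.Module):
            """Exchanges information between the low- and high-band paths."""
            def __init__(self, channels):
                super().__init__()
                self.mix = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

            def forward(self, low, high):
                # Concatenate along channels, mix with a 1x1 conv, split back.
                mixed = self.mix(torch.cat([low, high], dim=1))
                return mixed.chunk(2, dim=1)

        class DualPathCNN(nn.Module):
            def __init__(self, n_classes=10, ch=32):
                super().__init__()
                self.low_path = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
                self.high_path = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
                self.bi = BandInteractionBlock(ch)
                self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                          nn.Linear(2 * ch, n_classes))

            def forward(self, mel):                    # mel: (batch, 1, n_mels, time)
                n = mel.size(2) // 2
                low, high = mel[:, :, :n], mel[:, :, n:]   # split the frequency axis
                low, high = self.low_path(low), self.high_path(high)
                low, high = self.bi(low, high)             # non-local band interaction
                return self.head(torch.cat([low, high], dim=1))

        logits = DualPathCNN()(torch.randn(4, 1, 128, 100))   # -> shape (4, 10)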

  • Simultaneous Adaptation of Acoustic and Language Models for Emotional Speech Recognition Using Tweet Data

    Tetsuo KOSAKA  Kazuya SAEKI  Yoshitaka AIZAWA  Masaharu KATO  Takashi NOSE  

     
    PAPER

    Publicized: 2023/12/05  Vol: E107-D No:3  Page(s): 363-373

    Emotional speech recognition is generally considered more difficult than non-emotional speech recognition. The acoustic characteristics of emotional speech differ from those of non-emotional speech, and they vary significantly depending on the type and intensity of the emotion. Regarding linguistic features, emotional and colloquial expressions also appear in such utterances. To address these problems, we aim to improve recognition performance by adapting acoustic and language models to emotional speech. We used the Japanese Twitter-based Emotional Speech (JTES) corpus, which consists of tweets with an emotional label assigned to each utterance. The utterances in this corpus can be used directly for corpus adaptation; for the language model, however, the amount of adaptation data is insufficient. To solve this problem, we propose adapting the language model using tweet data downloaded from the internet. The sentences used for adaptation were extracted from the tweet data based on certain rules, yielding 25.86 M words of adaptation data. In the recognition experiments, the baseline word error rate was 36.11%, whereas that with acoustic and language model adaptation was 17.77%. These results demonstrate the effectiveness of the proposed method.

  • Research on Lightweight Acoustic Scene Perception Method Based on Drunkard Methodology

    Wenkai LIU  Lin ZHANG  Menglong WU  Xichang CAI  Hongxia DONG  

     
    PAPER-Artificial Intelligence, Data Mining

    Publicized: 2023/10/23  Vol: E107-D No:1  Page(s): 83-92

    The goal of Acoustic Scene Classification (ASC) is to simulate human analysis of the surrounding environment and make accurate decisions promptly. Extracting useful information from audio signals in real-world scenarios is challenging and can lead to suboptimal performance, especially in environments with relatively homogeneous backgrounds. To address this problem, we model the sobering-up process of a “drunkard” in real life and the guiding behavior of normal people, and construct a high-precision lightweight model implementation methodology called the “drunkard methodology”. The core idea comprises three parts: (1) designing a special feature transformation module, based on the different information-perception mechanisms of drunkards and ordinary people, to simulate the process of gradually sobering up and the accompanying changes in feature perception ability; (2) devising a lightweight “drunken” model that matches the normal model’s perceptual processing; the model uses a multi-scale class-residual block structure and obtains finer feature representations by fusing information extracted at different scales; (3) introducing a guiding and fusion module from the conventional model into the “drunken” model to speed up the sobering-up process and achieve iterative optimization and accuracy improvement. Evaluation results on the official dataset of DCASE2022 Task 1 demonstrate that our baseline system achieves 40.4% accuracy and 2.284 loss with 442.67K parameters and 19.40M MACs (multiply-accumulate operations). After adopting the “drunkard” mechanism, the accuracy improves to 45.2% and the loss falls by 0.634, with 551.89K parameters and 23.6M MACs.
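
    The “guiding” module is not specified in detail here; one plausible reading is a teacher-student (knowledge distillation) loss in which the conventional model's outputs guide the lightweight “drunken” model. A minimal PyTorch sketch under that assumption (function name, temperature, and weighting are illustrative):

        import torch
        import torch.nn.functional as F

        def guided_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
            """Hard-label loss plus soft guidance from the conventional (teacher) model."""
            hard = F.cross_entropy(student_logits, labels)
            soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                            F.softmax(teacher_logits / T, dim=1),
                            reduction="batchmean") * T * T
            return alpha * hard + (1.0 - alpha) * soft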

  • Multi-Segment Verification FrFT Frame Synchronization Detection in Underwater Acoustic Communications

    Guojin LIAO  Yongpeng ZUO  Qiao LIAO  Xiaofeng TIAN  

     
    PAPER-Wireless Communication Technologies

    Publicized: 2023/09/01  Vol: E106-B No:12  Page(s): 1501-1509

    Frame synchronization detection before data transmission is an important module that directly affects the lifetime and coexistence of underwater acoustic communication (UAC) networks, where a linear frequency modulation (LFM) signal is commonly used as the frame preamble. Unlike terrestrial wireless communications, UAC frequently suffers from strong bursty noise. Owing to the long transmission distance and low signal-to-noise ratio, strong short-distance bursty noise greatly reduces the accuracy of conventional fractional Fourier transform (FrFT) detection. We propose a multi-segment verification fractional Fourier transform (MFrFT) preamble detection algorithm to address this challenge. The proposed algorithm carries out four adjacent FrFT operations and identifies the LFM signal by observing the linear correlation between the two lines connecting three adjacent peak points, which we call the ‘dual-line-correlation mechanism’. The accurate starting time of the LFM signal can then be found from the peak frequencies of the adjacent FrFTs. More importantly, MFrFT does not increase the computational complexity. Experimental results show that, compared with the conventional FrFT detection method, the proposed algorithm effectively distinguishes signal starting points from bursty noise with a much lower detection error rate, which in turn minimizes the cost of retransmission.
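
    The dual-line-correlation mechanism reduces to a collinearity test on three adjacent FrFT peak points: for a genuine LFM chirp the peak frequency moves linearly in time, while bursty noise does not. A minimal sketch under that reading (the peak list and tolerance are illustrative assumptions):

        def looks_like_lfm(peaks, tol=0.05):
            """peaks: three (time, peak_frequency) points from adjacent FrFT segments."""
            (t0, f0), (t1, f1), (t2, f2) = peaks
            s01 = (f1 - f0) / (t1 - t0)      # slope of the first connecting line
            s12 = (f2 - f1) / (t2 - t1)      # slope of the second connecting line
            return abs(s01 - s12) <= tol * max(abs(s01), abs(s12), 1e-12)

        print(looks_like_lfm([(0.0, 100.0), (0.1, 150.0), (0.2, 200.0)]))  # True: chirp
        print(looks_like_lfm([(0.0, 100.0), (0.1, 400.0), (0.2, 120.0)]))  # False: burst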

  • An Integrated Convolutional Neural Network with a Fusion Attention Mechanism for Acoustic Scene Classification

    Pengxu JIANG  Yue XIE  Cairong ZOU  Li ZHAO  Qingyun WANG  

     
    LETTER-Engineering Acoustics

    Publicized: 2023/02/06  Vol: E106-A No:8  Page(s): 1057-1061

    Acoustic scene classification (ASC) is one of the relevant research domains in human-computer interaction. In real life, recorded audio may include a lot of noise and quiet clips, making it hard for earlier ASC research to isolate the crucial scene information in sound. Furthermore, scene information may be scattered across numerous audio frames, so selecting scene-related frames is crucial for ASC. In this context, an integrated convolutional neural network with a fusion attention mechanism (ICNN-FA) is proposed for ASC. First, segmented mel-spectrograms are used as the input of the ICNN, helping the model learn short-term time-frequency correlations. The ICNN is then employed to learn segment-level features. In addition, the proposed global attention layer gathers global information by integrating these segment features. Finally, the fusion attention layer fuses all segment-level features before the classifier distinguishes the various scenes. Experimental results on the DCASE 2018 and 2019 ASC datasets indicate the efficacy of the proposed method.
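
    The segment-fusion step can be pictured as attention pooling over segment embeddings: each segment feature is scored for scene relevance, and the weighted sum forms the utterance-level representation. A minimal PyTorch sketch with illustrative dimensions (not the authors' code):

        import torch
        import torch.nn as nn

        class AttentionPool(nn.Module):
            """Scores each segment embedding and returns their weighted sum."""
            def __init__(self, dim):
                super().__init__()
                self.score = nn.Linear(dim, 1)

            def forward(self, segs):                        # (batch, n_segments, dim)
                w = torch.softmax(self.score(segs), dim=1)  # scene-relevance weights
                return (w * segs).sum(dim=1)                # fused utterance embedding

        pooled = AttentionPool(64)(torch.randn(4, 10, 64))  # -> shape (4, 64)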

  • L0-Norm Based Adaptive Equalization with PMSER Criterion for Underwater Acoustic Communications

    Tian FANG  Feng LIU  Conggai LI  Fangjiong CHEN  Yanli XU  

     
    LETTER-Communication Theory and Signals

    Publicized: 2022/12/06  Vol: E106-A No:6  Page(s): 947-951

    Underwater acoustic (UWA) channels are usually sparse, which can be exploited in adaptive equalization to improve system performance. For shallow UWA channels, the adaptive equalization framework based on the proportional minimum symbol error rate (PMSER) criterion requires a sparsity measure to be selected. Since the L0 norm promotes sparsity more strongly than the L1 norm, we choose it to achieve better convergence. However, because the L0 norm leads to an NP-hard problem, an efficient solution is difficult to find; we therefore approximate the L0 norm with a Gaussian function. Simulation results show that the proposed scheme obtains better performance than its L1-based counterpart.
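
    The Gaussian surrogate replaces the non-differentiable L0 norm with sum(1 - exp(-w_i^2 / (2*sigma^2))), whose gradient supplies a zero-attracting term for the tap update. A small NumPy sketch (sigma is an illustrative choice, not taken from the letter):

        import numpy as np

        def l0_gaussian(w, sigma=0.1):
            # Smooth surrogate: ||w||_0 ~ sum(1 - exp(-w^2 / (2 sigma^2)))
            return np.sum(1.0 - np.exp(-w**2 / (2.0 * sigma**2)))

        def l0_gaussian_grad(w, sigma=0.1):
            # Gradient of the surrogate: a zero-attracting term for the tap update
            return (w / sigma**2) * np.exp(-w**2 / (2.0 * sigma**2))

        w = np.array([0.0, 0.01, 0.5, -1.0])
        print(l0_gaussian(w))   # ~2: taps well below sigma count as (near) zero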

  • Intelligent Tool Condition Monitoring Based on Multi-Scale Convolutional Recurrent Neural Network

    Xincheng CAO  Bin YAO  Binqiang CHEN  Wangpeng HE  Suqin GUO  Kun CHEN  

     
    PAPER-Smart Industry

    Publicized: 2022/06/16  Vol: E106-D No:5  Page(s): 644-652

    Tool condition monitoring is one of the core tasks of intelligent manufacturing in the digital workshop. This paper presents an intelligent tool-condition recognition method based on deep learning. First, an industrial microphone is used to collect the acoustic signal during machining; then, a central fractal decomposition algorithm is proposed to extract sensitive information; finally, a multi-scale convolutional recurrent neural network is used for deep feature extraction and pattern recognition. Multi-process milling experiments show that the proposed method is superior to existing methods, reaching a recognition accuracy of 88%.

  • Comparative Evaluation of Diverse Features in Fluency Evaluation of Spontaneous Speech

    Huaijin DENG  Takehito UTSURO  Akio KOBAYASHI  Hiromitsu NISHIZAKI  

     
    PAPER-Speech and Hearing

    Publicized: 2022/10/25  Vol: E106-D No:1  Page(s): 36-45

    There have been many previous studies on fluency evaluation of spontaneous speech. However, most of them focus on lexical cues, and little emphasis is placed on how diverse acoustic features and deep end-to-end models contribute to improving performance. In this paper, we describe a multi-layer neural network that investigates not only lexical features extracted from transcriptions but also utterance-level acoustic features from the audio data. We also conduct experiments to investigate the performance of end-to-end approaches using mel-spectrograms on this task. We evaluate the proposed method on two binary classification tasks: fluent speech detection and disfluent speech detection. Speech samples of around 10 seconds each, annotated with the three classes “fluent,” “neutral,” and “disfluent,” are used for evaluation. According to the two-way splits of these three classes, fluent speech detection is defined as the binary classification of fluent vs. neutral and disfluent, while disfluent speech detection is defined as the binary classification of fluent and neutral vs. disfluent. We then conduct experiments for a comparative evaluation of the multi-layer neural network with diverse features as well as the end-to-end models. For fluent speech detection, when comparing utterance-level disfluency-based, prosodic, and acoustic features with the multi-layer neural network, using only disfluency-based and prosodic features performs better: performance improved considerably when all acoustic features were removed from the full feature set, whereas it degraded considerably when filler-related features were removed. Overall, however, the end-to-end Transformer+VGGNet model with mel-spectrogram input achieves the best results. For disfluent speech detection, the multi-layer neural network using disfluency-based, prosodic, and acoustic features without fillers achieves the best results. The end-to-end Transformer+VGGNet architecture also obtains high scores, but is outperformed by the multi-layer neural network by a significant margin. Thus, unlike in fluent speech detection, disfluency-based and prosodic features other than fillers remain necessary for disfluent speech detection.
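
    The two task splits can be stated compactly in code; the class names are the three annotation labels quoted above:

        def fluent_detection_label(c):       # fluent vs. {neutral, disfluent}
            return 1 if c == "fluent" else 0

        def disfluent_detection_label(c):    # {fluent, neutral} vs. disfluent
            return 1 if c == "disfluent" else 0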

  • Polar Coding Aided by Adaptive Channel Equalization for Underwater Acoustic Communication

    Feng LIU  Qianqian WU  Conggai LI  Fangjiong CHEN  Yanli XU  

     
    LETTER-Communication Theory and Signals

    Publicized: 2022/07/01  Vol: E106-A No:1  Page(s): 83-87

    To improve the performance of underwater acoustic communications, this letter proposes a polar coding scheme with adaptive channel equalization that reduces the amount of feedback information. Furthermore, a hybrid automatic repeat request (HARQ) mechanism is provided to mitigate the impact of estimation errors. Simulation results show that the proposed scheme outperforms turbo equalization in terms of bit error rate. A computational complexity analysis is also provided for comparison.

  • Experimental Study on Synchronization of Van der Pol Oscillator Circuit by Noise Sounds

    Taiki HAYASHI  Kazuyoshi ISHIMURA  Isao T. TOKUDA  

     
    PAPER-Nonlinear Problems

    Publicized: 2022/05/16  Vol: E105-A No:11  Page(s): 1486-1492

    Toward realizing noise-induced synchronization in a natural environment, an experimental study is carried out using Van der Pol oscillator circuits. We focus on acoustic sounds as a potential source of noise that may exist in nature. To mimic such a natural environment, white noise sounds were generated from a loudspeaker and recorded as microphone signals, which were then injected into the oscillator circuits. We show that the oscillator circuits spontaneously give rise to synchronized dynamics when the microphone signals are highly correlated with each other. As the correlation among the input microphone signals decreases, the level of synchrony falls monotonically, implying that the input correlation is the key determinant of noise-induced synchronization. Our study provides an experimental basis for synchronizing clocks in distributed sensor networks, as well as other engineering devices, in natural environments.
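
    The underlying mechanism can be reproduced numerically: two identical Van der Pol oscillators driven by partially correlated noise drift toward common dynamics as the input correlation grows. A minimal Euler-Maruyama sketch (all parameter values are illustrative assumptions):

        import numpy as np

        rng = np.random.default_rng(0)
        mu, dt, steps = 1.0, 1e-3, 200_000
        D, rho = 0.5, 0.9                     # noise strength, input correlation
        s = D / np.sqrt(dt)                   # Euler-Maruyama noise scaling

        def step(x, y, noise):
            # Van der Pol dynamics with additive noise on the velocity equation
            return x + dt * y, y + dt * (mu * (1 - x**2) * y - x + noise)

        x1, y1, x2, y2 = 0.1, 0.0, -0.2, 0.0  # distinct initial conditions
        for _ in range(steps):
            c = rng.standard_normal()         # common noise component
            n1 = rho * c + np.sqrt(1 - rho**2) * rng.standard_normal()
            n2 = rho * c + np.sqrt(1 - rho**2) * rng.standard_normal()
            x1, y1 = step(x1, y1, s * n1)
            x2, y2 = step(x2, y2, s * n2)

        # States tend to converge for rho near 1 and stay apart for rho near 0.
        print(abs(x1 - x2), abs(y1 - y2))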

  • Analysis of Instantaneous Acoustic Fields Using Fast Inverse Laplace Transform Open Access

    Seiya KISHIMOTO  Naoya ISHIKAWA  Shinichiro OHNUKI  

     
    BRIEF PAPER

    Publicized: 2022/03/14  Vol: E105-C No:11  Page(s): 700-703

    In this study, a computational method is proposed for acoustic field analysis tasks that require lengthy observation times. The acoustic field at a given observation time is obtained by applying a fast inverse Laplace transform to solutions computed with a finite-difference complex-frequency-domain method. With the proposed method, the transient acoustic field can be evaluated at arbitrary sampling intervals by computing the instantaneous acoustic field at each desired observation time.
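
    “Fast inverse Laplace transform” in this setting usually refers to Hosono's FILT, which approximates f(t) from samples of F(s) at complex frequencies; assuming that variant, a minimal sketch (omitting the Euler-transform acceleration of the tail that practical implementations add) is:

        import numpy as np

        def filt(F, t, a=6.0, N=1000):
            # f(t) ~ (e^a / t) * sum_{n=1..N} (-1)^n Im F((a + j(n - 1/2) pi) / t)
            n = np.arange(1, N + 1)
            s = (a + 1j * (n - 0.5) * np.pi) / t
            return (np.exp(a) / t) * np.sum((-1.0) ** n * F(s).imag)

        # Check against a known pair: L{sin t} = 1/(s^2 + 1).
        print(filt(lambda s: 1.0 / (s**2 + 1.0), t=1.0), np.sin(1.0))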

  • An Underwater DOA Estimation Method under Unknown Acoustic Velocity with L-Shaped Array for Wide-Band Signals

    Gengxin NING  Yushen LIN  Shenjie JIANG  Jun ZHANG  

     
    PAPER-Digital Signal Processing

    Publicized: 2022/03/09  Vol: E105-A No:9  Page(s): 1289-1297

    The performance of conventional direction-of-arrival (DOA) estimation methods is susceptible to the uncertainty of the acoustic velocity in the underwater environment. To solve this problem, this paper proposes an underwater DOA estimation method for wide-band signals that uses an L-shaped array under unknown acoustic velocity. The proposed method draws on the ideas of the incoherent signal subspace method and Root-MUSIC to obtain two sets of average roots, one for each subarray of the L-shaped array, and employs the geometric relationship between the two perpendicular linear arrays to derive a DOA estimation expression in terms of the two average roots, from which the acoustic velocity variable is eliminated. Simulation results demonstrate that the proposed method is more accurate and robust than other methods in an unknown acoustic velocity environment.
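
    The per-subarray building block is Root-MUSIC on a uniform linear array. A hedged NumPy sketch for one subarray, assuming half-wavelength spacing and the steering convention exp(-j*pi*m*sin(theta)) (both assumptions, not taken from the paper):

        import numpy as np

        def root_music(X, n_sources):
            """X: (n_sensors, n_snapshots) complex snapshots. Returns DOAs in degrees."""
            M = X.shape[0]
            R = X @ X.conj().T / X.shape[1]          # sample covariance
            _, vecs = np.linalg.eigh(R)              # eigenvalues in ascending order
            En = vecs[:, : M - n_sources]            # noise subspace
            C = En @ En.conj().T
            # Polynomial coefficients are the diagonal sums of C.
            coeffs = [np.trace(C, offset=k) for k in range(M - 1, -M, -1)]
            roots = np.roots(coeffs)
            roots = roots[np.abs(roots) < 1.0]       # keep roots inside the unit circle
            roots = roots[np.argsort(1.0 - np.abs(roots))][:n_sources]
            return np.degrees(np.arcsin(-np.angle(roots) / np.pi))

        # Toy check: two sources at -20 and 30 degrees, 8 sensors, 200 snapshots.
        rng = np.random.default_rng(0)
        M, K, T = 8, 2, 200
        A = np.exp(-1j * np.pi * np.outer(np.arange(M), np.sin(np.radians([-20.0, 30.0]))))
        X = A @ (rng.standard_normal((K, T)) + 1j * rng.standard_normal((K, T)))
        X += 0.05 * (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T)))
        print(np.sort(root_music(X, K)))             # approximately [-20, 30]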

  • Fast Gated Recurrent Network for Speech Synthesis

    Bima PRIHASTO  Tzu-Chiang TAI  Pao-Chi CHANG  Jia-Ching WANG  

     
    LETTER-Speech and Hearing

    Publicized: 2022/06/10  Vol: E105-D No:9  Page(s): 1634-1638

    Recurrent neural networks (RNNs) have been used in audio and speech processing, such as language translation and speech recognition. Although RNN-based architectures can be applied to speech synthesis, their long computing time remains the primary concern. This research proposes a fast gated recurrent network, an RNN-based architecture for speech synthesis built on the minimal gated unit (MGU). Our architecture removes the unit state history from some of the MGU equations, and is about twice as fast as other MGU-based architectures while producing equally good sound quality.
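
    For reference, a standard MGU step is sketched below in NumPy; the letter's speed-up comes from removing the previous-state history terms from some of these equations, which this reference sketch does not do:

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def mgu_step(x, h, Wf, Uf, bf, Wh, Uh, bh):
            f = sigmoid(Wf @ x + Uf @ h + bf)               # forget gate
            h_cand = np.tanh(Wh @ x + Uh @ (f * h) + bh)    # candidate state
            return (1.0 - f) * h + f * h_cand               # blended hidden state

        rng = np.random.default_rng(0)
        d_in, d_h = 8, 16
        shapes = [(d_h, d_in), (d_h, d_h), (d_h,), (d_h, d_in), (d_h, d_h), (d_h,)]
        params = [0.1 * rng.standard_normal(s) for s in shapes]
        h = mgu_step(rng.standard_normal(d_in), np.zeros(d_h), *params)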

  • Label-Adversarial Jointly Trained Acoustic Word Embedding

    Zhaoqi LI  Ta LI  Qingwei ZHAO  Pengyuan ZHANG  

     
    LETTER-Speech and Hearing

    Publicized: 2022/05/20  Vol: E105-D No:8  Page(s): 1501-1505

    Query-by-example spoken term detection (QbE-STD) is the task of using speech queries to match utterances, and the acoustic word embedding (AWE) method of generating fixed-length representations for speech segments has shown high performance and efficiency in recent work. We propose an AWE training method that uses a label-adversarial network to reduce the interfering information learned during AWE training. Experiments demonstrate that our method achieves significant improvements on multilingual and zero-resource test sets.
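
    Label-adversarial training of this kind is typically implemented with a gradient reversal layer, as in domain-adversarial networks; assuming that construction, a minimal PyTorch sketch:

        import torch

        class GradReverse(torch.autograd.Function):
            @staticmethod
            def forward(ctx, x, lam):
                ctx.lam = lam
                return x.view_as(x)

            @staticmethod
            def backward(ctx, grad):
                # Reversed (and scaled) gradient: the embedding is pushed to
                # remove whatever the adversarial label classifier can exploit.
                return -ctx.lam * grad, None

        def grad_reverse(x, lam=1.0):
            return GradReverse.apply(x, lam)

        # Usage: adversarial_logits = label_classifier(grad_reverse(embedding))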

  • Preparation of Copper Sulfide Nanoparticles by Laser Ablation in Liquid and Their Optical Properties

    Kazuki ISODA  Ryuga YANAGIHARA  Yoshitaka KITAMOTO  Masahiko HARA  Hiroyuki WADA  

     
    BRIEF PAPER-Ultrasonic Electronics

    Publicized: 2021/02/08  Vol: E104-C No:8  Page(s): 390-393

    Copper sulfide nanoparticles were successfully prepared by laser ablation in liquid. CuS powders in deionized water were irradiated with a nanosecond pulsed laser (Nd:YAG, SHG) to prepare the nanoparticles, which were then investigated by scanning electron microscopy (SEM), dynamic light scattering (DLS), and fluorescence spectrometry. According to the SEM and DLS results, the primary and secondary particle sizes decreased as the laser fluence increased, while the Cu:S ratio of the prepared nanoparticles remained unchanged. The absorbance of the prepared copper sulfide nanoparticles in water increased with increasing laser fluence.

  • Prosodic Features Control by Symbols as Input of Sequence-to-Sequence Acoustic Modeling for Neural TTS

    Kiyoshi KURIHARA  Nobumasa SEIYAMA  Tadashi KUMANO  

     
    PAPER-Speech and Hearing

    Publicized: 2020/11/09  Vol: E104-D No:2  Page(s): 302-311

    This paper describes a method to control prosodic features using phonetic and prosodic symbols as input to attention-based sequence-to-sequence (seq2seq) acoustic modeling (AM) for neural text-to-speech (TTS). The method inserts a sequence of prosodic symbols between the phonetic symbols, which are then used to reproduce prosodic acoustic features, i.e., accents, pauses, accent breaks, and sentence endings, in several seq2seq AM methods. The proposed phonetic and prosodic labels have simple descriptions and a low production cost. By contrast, the labels of conventional statistical parametric speech synthesis methods are complicated, and the cost of time alignment, such as aligning phoneme boundaries, is high; the proposed method does not need phoneme boundary positions. We also propose an automatic conversion method for conventional labels and show how to automatically reproduce pitch accents and phonemes. The results of objective and subjective evaluations show the effectiveness of our method.
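
    The input format can be illustrated with a toy conversion that interleaves prosodic marks with phonetic symbols; the symbol inventory below is an illustrative assumption, not the paper's label set:

        def add_prosodic_symbols(phonemes, accent_after, break_after):
            """Interleave prosodic marks with phonetic symbols (inventory assumed)."""
            out = []
            for i, p in enumerate(phonemes):
                out.append(p)
                if i in accent_after:
                    out.append("'")          # pitch-accent position
                if i in break_after:
                    out.append("_")          # pause / accent break
            return out + ["."]               # sentence-ending symbol

        print(add_prosodic_symbols(["k", "o", "N", "n", "i", "ch", "i", "w", "a"],
                                   accent_after={1}, break_after={2}))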

  • Joint Analysis of Sound Events and Acoustic Scenes Using Multitask Learning

    Noriyuki TONAMI  Keisuke IMOTO  Ryosuke YAMANISHI  Yoichi YAMASHITA  

     
    PAPER-Speech and Hearing

    Publicized: 2020/11/19  Vol: E104-D No:2  Page(s): 294-301

    Sound event detection (SED) and acoustic scene classification (ASC) are important research topics in environmental sound analysis. Many research groups have addressed SED and ASC using neural-network-based methods such as the convolutional neural network (CNN), recurrent neural network (RNN), and convolutional recurrent neural network (CRNN). Conventional methods address SED and ASC separately, even though sound events and acoustic scenes are closely related. For example, in the acoustic scene “office,” the sound events “mouse clicking” and “keyboard typing” are likely to occur, so information on sound events and acoustic scenes can be expected to be mutually beneficial for SED and ASC. In this paper, we propose multitask learning for the joint analysis of sound events and acoustic scenes, in which the parts of the networks that hold information common to sound events and acoustic scenes are shared. Experimental results obtained using the TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets indicate that the proposed method improves the performance of SED and ASC by 1.31 and 1.80 percentage points in terms of the F-score, respectively, compared with the conventional CRNN-based method.
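
    The shared-network idea can be sketched as one common encoder feeding two task heads: a frame-level head for event detection and a clip-level head for scene classification. A minimal PyTorch sketch with illustrative layer sizes (not the authors' architecture):

        import torch
        import torch.nn as nn

        class SharedSEDASC(nn.Module):
            def __init__(self, n_events=25, n_scenes=15, ch=32):
                super().__init__()
                self.shared = nn.Sequential(                 # common encoder
                    nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
                self.sed_head = nn.Conv2d(ch, n_events, 1)   # event score map
                self.asc_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                              nn.Linear(ch, n_scenes))

            def forward(self, mel):                          # (batch, 1, n_mels, time)
                z = self.shared(mel)
                # Pool the SED map over frequency to get frame-wise event scores.
                return self.sed_head(z).mean(dim=2), self.asc_head(z)

        sed, asc = SharedSEDASC()(torch.randn(4, 1, 64, 100))  # (4,25,100), (4,15)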

  • Graph Cepstrum: Spatial Feature Extracted from Partially Connected Microphones

    Keisuke IMOTO  

     
    PAPER-Speech and Hearing

    Publicized: 2019/12/09  Vol: E103-D No:3  Page(s): 631-638

    In this paper, we propose an effective and robust method of spatial feature extraction for acoustic scene analysis utilizing partially synchronized and/or closely located distributed microphones. The proposed method introduces a new cepstrum feature that uses a graph-based basis transformation to extract spatial information from distributed microphones while taking into account which pairs of microphones are synchronized and/or closely located. Specifically, in the proposed graph-based cepstrum, the log-amplitude of a multichannel observation is converted to a feature vector using the inverse graph Fourier transform, a basis transformation of a signal on a graph. Results of experiments using real environmental sounds show that the proposed graph-based cepstrum robustly extracts spatial information with consideration of the microphone connections. Moreover, the results indicate that the proposed method classifies acoustic scenes more robustly than conventional spatial features when the observed sounds have a large synchronization mismatch between partially synchronized microphone groups.
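
    A toy version of the idea: build a graph whose edges link synchronized or closely located microphones, take the Laplacian eigenbasis as the graph Fourier basis, and transform the per-channel log-amplitudes with it. The adjacency matrix and the treatment of the log-amplitude vector as a graph-spectral signal are illustrative assumptions:

        import numpy as np

        A = np.array([[0, 1, 1, 0],          # edges join synchronized / nearby mics
                      [1, 0, 1, 0],
                      [1, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
        L = np.diag(A.sum(axis=1)) - A       # graph Laplacian
        _, U = np.linalg.eigh(L)             # graph Fourier basis (eigenvectors)

        amplitudes = np.array([0.8, 1.1, 0.9, 0.3])   # per-channel amplitudes
        graph_cepstrum = U @ np.log(amplitudes)       # inverse GFT of log-amplitude
        print(graph_cepstrum)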

  • Automatic Construction of a Large-Scale Speech Recognition Database Using Multi-Genre Broadcast Data with Inaccurate Subtitle Timestamps

    Jeong-Uk BANG  Mu-Yeol CHOI  Sang-Hun KIM  Oh-Wook KWON  

     
    PAPER-Speech and Hearing

    Publicized: 2019/11/13  Vol: E103-D No:2  Page(s): 406-415

    As deep-learning-based speech recognition systems gain attention, the need for large-scale speech databases for acoustic model training is increasing. Broadcast data can easily be used for database construction, since it contains transcripts for the hearing impaired. However, subtitle timestamps have not been used to extract speech data because they are often inaccurate due to the inherent characteristics of closed captioning. We therefore propose building a large-scale speech database from multi-genre broadcast data with inaccurate subtitle timestamps. The proposed method first extracts the most likely speech intervals by removing subtitle texts with a low subtitle quality index, concatenating adjacent subtitle texts into a merged subtitle text, and adding a margin to the timestamp of the merged subtitle text. Next, a speech recognizer is used to extract a hypothesis text for the speech segment corresponding to the merged subtitle text, and the hypothesis text obtained from the decoder is recursively aligned with the merged subtitle text. Finally, the speech database is constructed by selecting the sub-parts of the merged subtitle text that match the hypothesis text. Our method successfully refines a large amount of broadcast data with inaccurate subtitle timestamps, taking about half the time of previous methods. Consequently, our method is useful for broadcast data processing, where bulk speech data can be collected every hour.
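
    The first stage of the pipeline (filter by quality index, merge adjacent subtitles, pad with a margin) can be sketched as follows; the thresholds and data layout are illustrative assumptions, not the paper's values:

        def merge_subtitles(subs, min_quality=0.5, gap=1.0, margin=0.5):
            """subs: (start, end, text, quality) tuples sorted by start time."""
            kept = [s for s in subs if s[3] >= min_quality]
            merged = []
            for start, end, text, _ in kept:
                if merged and start - merged[-1][1] <= gap:     # close to previous
                    s0, _, t0 = merged[-1]
                    merged[-1] = (s0, end, t0 + " " + text)     # concatenate texts
                else:
                    merged.append((start, end, text))
            return [(max(0.0, s - margin), e + margin, t) for s, e, t in merged]

        print(merge_subtitles([(0.0, 1.2, "hello", 0.9), (1.5, 2.5, "world", 0.8),
                               (9.0, 9.5, "noise", 0.1)]))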

  • Acoustic Design Support System of Compact Enclosure for Smartphone Using Deep Neural Network

    Kai NAKAMURA  Kenta IWAI  Yoshinobu KAJIKAWA  

     
    PAPER-Engineering Acoustics

    Vol: E102-A No:12  Page(s): 1932-1939

    In this paper, we propose an automatic design support system for compact acoustic devices such as microspeakers inside smartphones. The proposed system outputs the dimensions of a compact acoustic device with the desired acoustic characteristic. It uses a deep neural network (DNN) to obtain the relationship between the frequency characteristic of the compact acoustic device and its dimensions, and the training data are generated by the acoustic finite-difference time-domain (FDTD) method so that a large amount of training data can be obtained easily. We demonstrate the effectiveness of the proposed system through comparisons between desired and designed frequency characteristics.

Hits 1-20 (of 178)