The search functionality is under construction.

Keyword Search Result

[Keyword] audio(101hit)

1-20hit(101hit)

  • An Intra- and Inter-Emotion Transformer-Based Fusion Model with Homogeneous and Diverse Constraints Using Multi-Emotional Audiovisual Features for Depression Detection

    Shiyu TENG  Jiaqing LIU  Yue HUANG  Shurong CHAI  Tomoko TATEYAMA  Xinyin HUANG  Lanfen LIN  Yen-Wei CHEN  

     
    PAPER

      Pubricized:
    2023/12/15
      Vol:
    E107-D No:3
      Page(s):
    342-353

    Depression is a prevalent mental disorder affecting a significant portion of the global population, leading to considerable disability and contributing to the overall burden of disease. Consequently, designing efficient and robust automated methods for depression detection has become imperative. Recently, deep learning methods, especially multimodal fusion methods, have been increasingly used in computer-aided depression detection. Importantly, individuals with depression and those without respond differently to various emotional stimuli, providing valuable information for detecting depression. Building on these observations, we propose an intra- and inter-emotional stimulus transformer-based fusion model to effectively extract depression-related features. The intra-emotional stimulus fusion framework aims to prioritize different modalities, capitalizing on their diversity and complementarity for depression detection. The inter-emotional stimulus model maps each emotional stimulus onto both invariant and specific subspaces using individual invariant and specific encoders. The emotional stimulus-invariant subspace facilitates efficient information sharing and integration across different emotional stimulus categories, while the emotional stimulus specific subspace seeks to enhance diversity and capture the distinct characteristics of individual emotional stimulus categories. Our proposed intra- and inter-emotional stimulus fusion model effectively integrates multimodal data under various emotional stimulus categories, providing a comprehensive representation that allows accurate task predictions in the context of depression detection. We evaluate the proposed model on the Chinese Soochow University students dataset, and the results outperform state-of-the-art models in terms of concordance correlation coefficient (CCC), root mean squared error (RMSE) and accuracy.

  • Deep Multiplicative Update Algorithm for Nonnegative Matrix Factorization and Its Application to Audio Signals

    Hiroki TANJI  Takahiro MURAKAMI  

     
    PAPER-Digital Signal Processing

      Pubricized:
    2023/01/19
      Vol:
    E106-A No:7
      Page(s):
    962-975

    The design and adjustment of the divergence in audio applications using nonnegative matrix factorization (NMF) is still open problem. In this study, to deal with this problem, we explore a representation of the divergence using neural networks (NNs). Instead of the divergence, our approach extends the multiplicative update algorithm (MUA), which estimates the NMF parameters, using NNs. The design of the extended MUA incorporates NNs, and the new algorithm is referred to as the deep MUA (DeMUA) for NMF. While the DeMUA represents the algorithm for the NMF, interestingly, the divergence is obtained from the incorporated NN. In addition, we propose theoretical guides to design the incorporated NN such that it can be interpreted as a divergence. By appropriately designing the NN, MUAs based on existing divergences with a single hyper-parameter can be represented by the DeMUA. To train the DeMUA, we applied it to audio denoising and supervised signal separation. Our experimental results show that the proposed architecture can learn the MUA and the divergences in sparse denoising and speech separation tasks and that the MUA based on generalized divergences with multiple parameters shows favorable performances on these tasks.

  • A Non-Intrusive Speech Quality Evaluation Method Based on the Audiogram and Weighted Frequency Information for Hearing Aid

    Ruxue GUO  Pengxu JIANG  Ruiyu LIANG  Yue XIE  Cairong ZOU  

     
    LETTER-Speech and Hearing

      Pubricized:
    2022/07/25
      Vol:
    E106-A No:1
      Page(s):
    64-68

    For a long time, the compensation effect of hearing aid is mainly evaluated subjectively, and there are fewer studies of objective evaluation. Furthermore, a pure speech signal is generally required as a reference in the existing objective evaluation methods, which restricts the practicality in a real-world environment. Therefore, this paper presents a non-intrusive speech quality evaluation method for hearing aid, which combines the audiogram and weighted frequency information. The proposed model mainly includes an audiogram information extraction network, a frequency information extraction network, and a quality score mapping network. The audiogram is the input of the audiogram information extraction network, which helps the system capture the information related to hearing loss. In addition, the low-frequency bands of speech contain loudness information and the medium and high-frequency components contribute to semantic comprehension. The information of two frequency bands is input to the frequency information extraction network to obtain time-frequency information. When obtaining the high-level features of different frequency bands and audiograms, they are fused into two groups of tensors that distinguish the information of different frequency bands and used as the input of the attention layer to calculate the corresponding weight distribution. Finally, a dense layer is employed to predict the score of speech quality. The experimental results show that it is reasonable to combine the audiogram and the weight of the information from two frequency bands, which can effectively realize the evaluation of the speech quality of the hearing aid.

  • Supervised Audio Source Separation Based on Nonnegative Matrix Factorization with Cosine Similarity Penalty Open Access

    Yuta IWASE  Daichi KITAMURA  

     
    PAPER-Engineering Acoustics

      Pubricized:
    2021/12/08
      Vol:
    E105-A No:6
      Page(s):
    906-913

    In this study, we aim to improve the performance of audio source separation for monaural mixture signals. For monaural audio source separation, semisupervised nonnegative matrix factorization (SNMF) can achieve higher separation performance by employing small supervised signals. In particular, penalized SNMF (PSNMF) with orthogonality penalty is an effective method. PSNMF forces two basis matrices for target and nontarget sources to be orthogonal to each other and improves the separation accuracy. However, the conventional orthogonality penalty is based on an inner product and does not affect the estimation of the basis matrix properly because of the scale indeterminacy between the basis and activation matrices in NMF. To cope with this problem, a new PSNMF with cosine similarity between the basis matrices is proposed. The experimental comparison shows the efficacy of the proposed cosine similarity penalty in supervised audio source separation.

  • Speaker-Independent Audio-Visual Speech Separation Based on Transformer in Multi-Talker Environments

    Jing WANG  Yiyu LUO  Weiming YI  Xiang XIE  

     
    PAPER-Speech and Hearing

      Pubricized:
    2022/01/11
      Vol:
    E105-D No:4
      Page(s):
    766-777

    Speech separation is the task of extracting target speech while suppressing background interference components. In applications like video telephones, visual information about the target speaker is available, which can be leveraged for multi-speaker speech separation. Most previous multi-speaker separation methods are mainly based on convolutional or recurrent neural networks. Recently, Transformer-based Seq2Seq models have achieved state-of-the-art performance in various tasks, such as neural machine translation (NMT), automatic speech recognition (ASR), etc. Transformer has showed an advantage in modeling audio-visual temporal context by multi-head attention blocks through explicitly assigning attention weights. Besides, Transformer doesn't have any recurrent sub-networks, thus supporting parallelization of sequence computation. In this paper, we propose a novel speaker-independent audio-visual speech separation method based on Transformer, which can be flexibly applied to unknown number and identity of speakers. The model receives both audio-visual streams, including noisy spectrogram and speaker lip embeddings, and predicts a complex time-frequency mask for the corresponding target speaker. The model is made up by three main components: audio encoder, visual encoder and Transformer-based mask generator. Two different structures of encoders are investigated and compared, including ResNet-based and Transformer-based. The performance of the proposed method is evaluated in terms of source separation and speech quality metrics. The experimental results on the benchmark GRID dataset show the effectiveness of the method on speaker-independent separation task in multi-talker environments. The model generalizes well to unseen identities of speakers and noise types. Though only trained on 2-speaker mixtures, the model achieves reasonable performance when tested on 2-speaker and 3-speaker mixtures. Besides, the model still shows an advantage compared with previous audio-visual speech separation works.

  • Master-Teacher-Student: A Weakly Labelled Semi-Supervised Framework for Audio Tagging and Sound Event Detection

    Yuzhuo LIU  Hangting CHEN  Qingwei ZHAO  Pengyuan ZHANG  

     
    LETTER-Speech and Hearing

      Pubricized:
    2022/01/13
      Vol:
    E105-D No:4
      Page(s):
    828-831

    Weakly labelled semi-supervised audio tagging (AT) and sound event detection (SED) have become significant in real-world applications. A popular method is teacher-student learning, making student models learn from pseudo-labels generated by teacher models from unlabelled data. To generate high-quality pseudo-labels, we propose a master-teacher-student framework trained with a dual-lead policy. Our experiments illustrate that our model outperforms the state-of-the-art model on both tasks.

  • User-Assisted QoS Control for QoE Enhancement in Audiovisual and Haptic Interactive IP Communications

    Toshiro NUNOME  Suguru KAEDE  Shuji TASAKA  

     
    PAPER-Network

      Pubricized:
    2020/04/21
      Vol:
    E103-B No:10
      Page(s):
    1107-1116

    In this paper, we propose a user-assisted QoS control scheme that utilizes media adaptive buffering to enhance QoE of audiovisual and haptic IP communications. The scheme consists of two modes: a manual mode and an automatic mode. It enables users to switch between these two modes according to their inclinations. We compare four QoS control schemes: the manual mode only, the automatic mode only, the switching scheme starting with the manual mode, and the switching scheme starting with the automatic mode. We assess the effects of the four schemes, user attributes, and tasks on QoE through a subjective experiment which provides information on users' behavior in addition to QoE scores. As a result of the experiment, we show that the user-assisted QoS control scheme can enhance QoE. Furthermore, we notice that the proper QoS control scheme depends on user attributes and tasks.

  • Transferring Adaptive Bit Rate Streaming Quality Models from H.264/HD to H.265/4K-UHD Open Access

    Pierre LEBRETON  Kazuhisa YAMAGISHI  

     
    PAPER-Network

      Pubricized:
    2019/06/25
      Vol:
    E102-B No:12
      Page(s):
    2226-2242

    In this paper the quality of adaptive bit rate video streaming is investigated and two state-of-the-art models, i.e., the NTT audiovisual quality-estimation and ITU-T P.1203 models, are considered. This paper shows how these models can be applied to new conditions, e.g., 4K ultra high definition (4K-UHD) videos encoded using H.265, considering that they were originally designed and trained for HD videos encoded with H.264. Six subjective evaluations involving up to 192 participants and a large variety of test conditions, e.g., durations from 10sec to 3min, coding-quality variation, and stalling events, were conducted on both TV and mobile devices. Using the subjective data, this paper addresses how models and coefficients can be transferred to new conditions. A comparison between state-of-the-art models is conducted, showing the performance of transferred and retrained models. It is found that other video-quality estimation models, such as VMAF, can be used as input of the NTT and ITU-T P.1203 long-term pooling modules, allowing these other video-quality-estimation models to support the specificities of adaptive bit-rate-streaming scenarios. Finally, all retrained coefficients are detailed in this paper allowing future work to directly reuse the results of this study.

  • Construction of Spontaneous Emotion Corpus from Indonesian TV Talk Shows and Its Application on Multimodal Emotion Recognition

    Nurul LUBIS  Dessi LESTARI  Sakriani SAKTI  Ayu PURWARIANTI  Satoshi NAKAMURA  

     
    PAPER-Speech and Hearing

      Pubricized:
    2018/05/10
      Vol:
    E101-D No:8
      Page(s):
    2092-2100

    As interaction between human and computer continues to develop to the most natural form possible, it becomes increasingly urgent to incorporate emotion in the equation. This paper describes a step toward extending the research on emotion recognition to Indonesian. The field continues to develop, yet exploration of the subject in Indonesian is still lacking. In particular, this paper highlights two contributions: (1) the construction of the first emotional audio-visual database in Indonesian, and (2) the first multimodal emotion recognizer in Indonesian, built from the aforementioned corpus. In constructing the corpus, we aim at natural emotions that are corresponding to real life occurrences. However, the collection of emotional corpora is notably labor intensive and expensive. To diminish the cost, we collect the emotional data from television programs recordings, eliminating the need of an elaborate recording set up and experienced participants. In particular, we choose television talk shows due to its natural conversational content, yielding spontaneous emotion occurrences. To cover a broad range of emotions, we collected three episodes in different genres: politics, humanity, and entertainment. In this paper, we report points of analysis of the data and annotations. The acquisition of the emotion corpus serves as a foundation in further research on emotion. Subsequently, in the experiment, we employ the support vector machine (SVM) algorithm to model the emotions in the collected data. We perform multimodal emotion recognition utilizing the predictions of three modalities: acoustic, semantic, and visual. When compared to the unimodal result, in the multimodal feature combination, we attain identical accuracy for the arousal at 92.6%, and a significant improvement for the valence classification task at 93.8%. We hope to continue this work and move towards a finer-grain, more precise quantification of emotion.

  • Enhancement of Video Streaming QoE by Considering Burst Loss in Wireless LANs

    Toshiro NUNOME  Yuta MATSUI  

     
    PAPER

      Pubricized:
    2018/01/22
      Vol:
    E101-B No:7
      Page(s):
    1653-1660

    In order to enhance QoE of audio and video IP transmission, this paper proposes a method for mitigating the spatial quality impairment during burst loss periods over the wireless networks in the video output scheme SCS, which is a QoE-based video output scheme. SCS switches between two common video output schemes: frame skipping and error concealment. The proposed method pauses video output with an undamaged frame during the burst loss period in order not to pause video output on a degraded frame. We perform an experiment with constant thresholds, the table-lookup method, and the proposed method under various network conditions. The result shows that the effect of the proposed method on QoE can differ with the contents and GOP structures.

  • QoE Enhancement of Audio-Video Reliable Groupcast with IEEE 802.11aa

    Toshiro NUNOME  Takuya KOMATSU  

     
    PAPER

      Pubricized:
    2018/01/22
      Vol:
    E101-B No:7
      Page(s):
    1645-1652

    This paper enhances the QoE of audio and video multicast transmission over a wireless LAN by means of reliable groupcast schemes. We use GCR (GroupCast with Retries) Unsolicited Retry and GCR Block ACK as reliable groupcast schemes; they are standardized by IEEE 802.11aa. We assume that a wireless access point transmits audio and video streams to several terminals connected to the access point by groupcast. We compare three schemes: Groupcast with EDCA (Enhanced Distributed Channel Access), GCR Unsolicited Retry and GCR Block ACK. We perform computer simulations under various network conditions to assess application-level QoS and evaluate QoE by a subjective experiment. As a result, we find that the most effective scheme depends on network conditions.

  • Multiple Matrix Rank Minimization Approach to Audio Declipping

    Ryohei SASAKI  Katsumi KONISHI  Tomohiro TAKAHASHI  Toshihiro FURUKAWA  

     
    LETTER-Speech and Hearing

      Pubricized:
    2017/12/06
      Vol:
    E101-D No:3
      Page(s):
    821-825

    This letter deals with an audio declipping problem and proposes a multiple matrix rank minimization approach. We assume that short-time audio signals satisfy the autoregressive (AR) model and formulate the declipping problem as a multiple matrix rank minimization problem. To solve this problem, an iterative algorithm is provided based on the iterative partial matrix shrinkage (IPMS) algorithm. Numerical examples show its efficiency.

  • Tolerance Evaluation of Audio Watermarking Method Based on Modification of Sound Pressure Level between Channels

    Harumi MURATA  Akio OGIHARA  Shigetoshi HAYASHI  

     
    LETTER

      Pubricized:
    2017/10/16
      Vol:
    E101-D No:1
      Page(s):
    68-71

    We have proposed an audio watermarking method based on modification of sound pressure level between channels. This method is focused on the invariability of sound localization against sound processing like MP3 and the imperceptibility about slightly change of sound localization. In this paper, we investigate about tolerance evaluation against various attacks in reference to IHC criteria.

  • A Single-Dimensional Interface for Arranging Multiple Audio Sources in Three-Dimensional Space

    Kento OHTANI  Kenta NIWA  Kazuya TAKEDA  

     
    PAPER-Music Information Processing

      Pubricized:
    2017/06/26
      Vol:
    E100-D No:10
      Page(s):
    2635-2643

    A single-dimensional interface which enables users to obtain diverse localizations of audio sources is proposed. In many conventional interfaces for arranging audio sources, there are multiple arrangement parameters, some of which allow users to control positions of audio sources. However, it is difficult for users who are unfamiliar with these systems to optimize the arrangement parameters since the number of possible settings is huge. We propose a simple, single-dimensional interface for adjusting arrangement parameters, allowing users to sample several diverse audio source arrangements and easily find their preferred auditory localizations. To select subsets of arrangement parameters from all of the possible choices, auditory-localization space vectors (ASVs) are defined to represent the auditory localization of each arrangement parameter. By selecting subsets of ASVs which are approximately orthogonal, we can choose arrangement parameters which will produce diverse auditory localizations. Experimental evaluations were conducted using music composed of three audio sources. Subjective evaluations confirmed that novice users can obtain diverse localizations using the proposed interface.

  • Low-Complexity Recursive-Least-Squares-Based Online Nonnegative Matrix Factorization Algorithm for Audio Source Separation

    Seokjin LEE  

     
    LETTER-Music Information Processing

      Pubricized:
    2017/02/06
      Vol:
    E100-D No:5
      Page(s):
    1152-1156

    An online nonnegative matrix factorization (NMF) algorithm based on recursive least squares (RLS) is described in a matrix form, and a simplified algorithm for a low-complexity calculation is developed for frame-by-frame online audio source separation system. First, the online NMF algorithm based on the RLS method is described as solving the NMF problem recursively. Next, a simplified algorithm is developed to approximate the RLS-based online NMF algorithm with low complexity. The proposed algorithm is evaluated in terms of audio source separation, and the results show that the performance of the proposed algorithms are superior to that of the conventional online NMF algorithm with significantly reduced complexity.

  • A Resilience Mask for Robust Audio Hashing

    Jin S. SEO  

     
    LETTER

      Pubricized:
    2016/10/07
      Vol:
    E100-D No:1
      Page(s):
    57-60

    Audio hashing has been successfully employed for protection, management, and indexing of digital music archives. For a reliable audio hashing system, improving hash matching accuracy is crucial. In this paper, we try to improve a binary audio hash matching performance by utilizing auxiliary information, resilience mask, which is obtained while constructing hash DB. The resilience mask contains reliability information of each hash bit. We propose a new type of resilience mask by considering spectrum scaling and additive noise distortions. Experimental results show that the proposed resilience mask is effective in improving hash matching performance.

  • Information Hiding and Its Criteria for Evaluation Open Access

    Keiichi IWAMURA  Masaki KAWAMURA  Minoru KURIBAYASHI  Motoi IWATA  Hyunho KANG  Seiichi GOHSHI  Akira NISHIMURA  

     
    INVITED PAPER

      Pubricized:
    2016/10/07
      Vol:
    E100-D No:1
      Page(s):
    2-12

    Within information hiding technology, digital watermarking is one of the most important technologies for copyright protection of digital content. Many digital watermarking schemes have been proposed in academia. However, these schemes are not used, because they are not practical; one reason for this is that the evaluation criteria are loosely defined. To make the evaluation more concrete and improve the practicality of digital watermarking, watermarking schemes must use common evaluation criteria. To realize such criteria, we organized the Information Hiding and its Criteria for Evaluation (IHC) Committee to create useful, globally accepted evaluation criteria for information hiding technology. The IHC Committee improves their evaluation criteria every year, and holds a competition for digital watermarking based on state-of-the-art evaluation criteria. In this paper, we describe the activities of the IHC Committee and its evaluation criteria for digital watermarking of still images, videos, and audio.

  • Investigation of DNN-Based Audio-Visual Speech Recognition

    Satoshi TAMURA  Hiroshi NINOMIYA  Norihide KITAOKA  Shin OSUGA  Yurie IRIBE  Kazuya TAKEDA  Satoru HAYAMIZU  

     
    PAPER-Acoustic modeling

      Pubricized:
    2016/07/19
      Vol:
    E99-D No:10
      Page(s):
    2444-2451

    Audio-Visual Speech Recognition (AVSR) is one of techniques to enhance robustness of speech recognizer in noisy or real environments. On the other hand, Deep Neural Networks (DNNs) have recently attracted a lot of attentions of researchers in the speech recognition field, because we can drastically improve recognition performance by using DNNs. There are two ways to employ DNN techniques for speech recognition: a hybrid approach and a tandem approach; in the hybrid approach an emission probability on each Hidden Markov Model (HMM) state is computed using a DNN, while in the tandem approach a DNN is composed into a feature extraction scheme. In this paper, we investigate and compare several DNN-based AVSR methods to mainly clarify how we should incorporate audio and visual modalities using DNNs. We carried out recognition experiments using a corpus CENSREC-1-AV, and we discuss the results to find out the best DNN-based AVSR modeling. Then it turns out that a tandem-based method using audio Deep Bottle-Neck Features (DBNFs) and visual ones with multi-stream HMMs is the most suitable, followed by a hybrid approach and another tandem scheme using audio-visual DBNFs.

  • Efficient Residual Coding Method of Spatial Audio Object Coding with Two-Step Coding Structure for Interactive Audio Services

    Byonghwa LEE  Kwangki KIM  Minsoo HAHN  

     
    LETTER-Speech and Hearing

      Pubricized:
    2016/04/08
      Vol:
    E99-D No:7
      Page(s):
    1949-1952

    In interactive audio services, users can render audio objects rather freely to match their desires and the spatial audio object coding (SAOC) scheme is fairly good both in the sense of bitrate and audio quality. But rather perceptible audio quality degradation can occur when an object is suppressed or played alone. To complement this, the SAOC scheme with Two-Step Coding (SAOC-TSC) was proposed. But the bitrate of the side information increases two times compared to that of the original SAOC due to the bitrate needed for the residual coding used to enhance the audio quality. In this paper, an efficient residual coding method of the SAOC-TSC is proposed to reduce the side information bitrate without audio quality degradation or complexity increase.

  • Method of Audio Watermarking Based on Adaptive Phase Modulation

    Nhut Minh NGO  Masashi UNOKI  

     
    PAPER

      Pubricized:
    2015/10/21
      Vol:
    E99-D No:1
      Page(s):
    92-101

    This paper proposes a method of watermarking for digital audio signals based on adaptive phase modulation. Audio signals are usually non-stationary, i.e., their own characteristics are time-variant. The features for watermarking are usually not selected by combining the principle of variability, which affects the performance of the whole watermarking system. The proposed method embeds a watermark into an audio signal by adaptively modulating its phase with the watermark using IIR all-pass filters. The frequency location of the pole-zero of an IIR all-pass filter that characterizes the transfer function of the filter is adapted on the basis of signal power distribution on sub-bands in a magnitude spectrum domain. The pole-zero locations are adapted so that the phase modulation produces slight distortion in watermarked signals to achieve the best sound quality. The experimental results show that the proposed method could embed inaudible watermarks into various kinds of audio signals and correctly detect watermarks without the aid of original signals. A reasonable trade-off between inaudibility and robustness could be obtained by balancing the phase modulation scheme. The proposed method can embed a watermark into audio signals up to 100 bits per second with 99% accuracy and 6 bits per second with 94.3% accuracy in the cases of no attack and attacks, respectively.

1-20hit(101hit)