Shiyu TENG Jiaqing LIU Yue HUANG Shurong CHAI Tomoko TATEYAMA Xinyin HUANG Lanfen LIN Yen-Wei CHEN
Depression is a prevalent mental disorder affecting a significant portion of the global population, leading to considerable disability and contributing to the overall burden of disease. Consequently, designing efficient and robust automated methods for depression detection has become imperative. Recently, deep learning methods, especially multimodal fusion methods, have been increasingly used in computer-aided depression detection. Importantly, individuals with depression and those without respond differently to various emotional stimuli, providing valuable information for detecting depression. Building on these observations, we propose an intra- and inter-emotional stimulus transformer-based fusion model to effectively extract depression-related features. The intra-emotional stimulus fusion framework aims to prioritize different modalities, capitalizing on their diversity and complementarity for depression detection. The inter-emotional stimulus model maps each emotional stimulus onto both invariant and specific subspaces using individual invariant and specific encoders. The emotional stimulus-invariant subspace facilitates efficient information sharing and integration across different emotional stimulus categories, while the emotional stimulus-specific subspace seeks to enhance diversity and capture the distinct characteristics of individual emotional stimulus categories. Our proposed intra- and inter-emotional stimulus fusion model effectively integrates multimodal data under various emotional stimulus categories, providing a comprehensive representation that allows accurate task predictions in the context of depression detection. We evaluate the proposed model on the Chinese Soochow University students dataset, and it outperforms state-of-the-art models in terms of concordance correlation coefficient (CCC), root mean squared error (RMSE), and accuracy.
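To make the invariant/specific subspace idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: one encoder shared across emotional stimulus categories produces the stimulus-invariant view, and one encoder per category produces the stimulus-specific view. All dimensions, module names, and the simple concatenation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InterStimulusEncoder(nn.Module):
    def __init__(self, feat_dim=256, sub_dim=128, num_stimuli=3):
        super().__init__()
        # One invariant encoder shared across all emotional stimulus categories.
        self.invariant = nn.Sequential(nn.Linear(feat_dim, sub_dim), nn.ReLU())
        # One specific encoder per emotional stimulus category.
        self.specific = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, sub_dim), nn.ReLU())
             for _ in range(num_stimuli)])

    def forward(self, feats):
        # feats: list of (batch, feat_dim) tensors, one per stimulus category.
        inv = [self.invariant(f) for f in feats]                  # shared subspace
        spec = [enc(f) for enc, f in zip(self.specific, feats)]   # specific subspaces
        # Concatenate both views for a downstream depression predictor.
        return torch.cat(inv + spec, dim=-1)

x = [torch.randn(4, 256) for _ in range(3)]   # fused features of 3 stimuli
rep = InterStimulusEncoder()(x)               # shape: (4, 6 * 128)
```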
Hiroki TANJI Takahiro MURAKAMI
The design and adjustment of the divergence in audio applications using nonnegative matrix factorization (NMF) remains an open problem. In this study, to deal with this problem, we explore a representation of the divergence using neural networks (NNs). Instead of designing the divergence directly, our approach extends the multiplicative update algorithm (MUA), which estimates the NMF parameters, by incorporating NNs; the new algorithm is referred to as the deep MUA (DeMUA) for NMF. Although the DeMUA represents the algorithm for the NMF, interestingly, the divergence can be obtained from the incorporated NN. In addition, we propose theoretical guidelines for designing the incorporated NN so that it can be interpreted as a divergence. By appropriately designing the NN, MUAs based on existing divergences with a single hyper-parameter can be represented by the DeMUA. To train the DeMUA, we applied it to audio denoising and supervised signal separation. Our experimental results show that the proposed architecture can learn the MUA and the divergences in sparse denoising and speech separation tasks, and that the MUA based on generalized divergences with multiple parameters shows favorable performance on these tasks.
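For reference, below is a minimal NumPy sketch of the classical MUA for NMF under the beta-divergence, the kind of single-hyper-parameter divergence family the DeMUA is designed to subsume. This is the standard update, not the proposed deep extension; the rank, iteration count, and initialization are assumptions.

```python
import numpy as np

def mua_beta_nmf(V, K=10, beta=1.0, n_iter=200, eps=1e-12):
    """Classical multiplicative updates for V ~= W @ H under the beta-divergence."""
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        Vh = W @ H
        H *= (W.T @ (Vh ** (beta - 2) * V)) / (W.T @ Vh ** (beta - 1) + eps)
        Vh = W @ H
        W *= ((Vh ** (beta - 2) * V) @ H.T) / (Vh ** (beta - 1) @ H.T + eps)
    return W, H

# beta = 2: Euclidean distance, beta = 1: KL divergence, beta = 0: Itakura-Saito.
```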
Ruxue GUO Pengxu JIANG Ruiyu LIANG Yue XIE Cairong ZOU
For a long time, the compensation effect of hearing aids has mainly been evaluated subjectively, and objective evaluation has received little study. Furthermore, a pure speech signal is generally required as a reference in existing objective evaluation methods, which limits their practicality in real-world environments. Therefore, this paper presents a non-intrusive speech quality evaluation method for hearing aids that combines the audiogram and weighted frequency information. The proposed model mainly consists of an audiogram information extraction network, a frequency information extraction network, and a quality score mapping network. The audiogram is the input of the audiogram information extraction network, which helps the system capture information related to hearing loss. In addition, the low-frequency bands of speech contain loudness information, while the medium- and high-frequency components contribute to semantic comprehension. The two frequency bands are fed to the frequency information extraction network to obtain time-frequency information. The high-level features of the two frequency bands and the audiogram are fused into two groups of tensors that distinguish the information of the different frequency bands, which serve as the input of an attention layer that computes the corresponding weight distribution. Finally, a dense layer is employed to predict the speech quality score. The experimental results show that combining the audiogram with the weighted information from the two frequency bands is reasonable and can effectively evaluate the speech quality of hearing aids.
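The following is a rough PyTorch sketch of the fusion step described above, assuming the high-level features for the two frequency bands and the audiogram embedding are already extracted; each band feature is paired with the audiogram, weighted by an attention layer, and passed to a dense regressor. Layer sizes and the exact attention form are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class QualityPredictor(nn.Module):
    def __init__(self, band_dim=128, audiogram_dim=32):
        super().__init__()
        d = band_dim + audiogram_dim
        self.attn = nn.Linear(d, 1)             # scores each band/audiogram tensor
        self.score = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, low_band, high_band, audiogram):
        # Two groups of tensors, one per frequency band, each fused with the audiogram.
        groups = torch.stack([torch.cat([low_band, audiogram], -1),
                              torch.cat([high_band, audiogram], -1)], dim=1)
        w = torch.softmax(self.attn(groups), dim=1)   # weight distribution over bands
        fused = (w * groups).sum(dim=1)
        return self.score(fused)                      # predicted quality score
```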
In this study, we aim to improve the performance of audio source separation for monaural mixture signals. For monaural audio source separation, semisupervised nonnegative matrix factorization (SNMF) can achieve higher separation performance by employing a small amount of supervised signals. In particular, penalized SNMF (PSNMF) with an orthogonality penalty is an effective method. PSNMF forces the two basis matrices for the target and non-target sources to be orthogonal to each other and improves the separation accuracy. However, the conventional orthogonality penalty is based on an inner product and does not properly affect the estimation of the basis matrix because of the scale indeterminacy between the basis and activation matrices in NMF. To cope with this problem, a new PSNMF with a cosine similarity penalty between the basis matrices is proposed. An experimental comparison shows the efficacy of the proposed cosine similarity penalty in supervised audio source separation.
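A small NumPy sketch of the contrast between the two penalties is given below: the cosine similarity penalty normalizes each basis column, so it is unaffected by the basis/activation scale indeterminacy, whereas a plain inner-product penalty grows or shrinks with the arbitrary scaling of the basis matrix. The squared-sum form is an illustrative assumption.

```python
import numpy as np

def cosine_penalty(W_target, W_nontarget, eps=1e-12):
    # Normalize each basis vector (column) to unit norm, then penalize squared
    # cosine similarities between all target / non-target basis pairs.
    Wt = W_target / (np.linalg.norm(W_target, axis=0, keepdims=True) + eps)
    Wn = W_nontarget / (np.linalg.norm(W_nontarget, axis=0, keepdims=True) + eps)
    return np.sum((Wt.T @ Wn) ** 2)

def inner_product_penalty(W_target, W_nontarget):
    # Conventional penalty: sensitive to the arbitrary scaling of the bases.
    return np.sum((W_target.T @ W_nontarget) ** 2)
```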
Jing WANG Yiyu LUO Weiming YI Xiang XIE
Speech separation is the task of extracting target speech while suppressing background interference components. In applications such as video telephony, visual information about the target speaker is available and can be leveraged for multi-speaker speech separation. Most previous multi-speaker separation methods are based on convolutional or recurrent neural networks. Recently, Transformer-based Seq2Seq models have achieved state-of-the-art performance in various tasks, such as neural machine translation (NMT) and automatic speech recognition (ASR). The Transformer has shown an advantage in modeling audio-visual temporal context via multi-head attention blocks that explicitly assign attention weights. Moreover, the Transformer has no recurrent sub-networks, which supports parallelization of sequence computation. In this paper, we propose a novel speaker-independent audio-visual speech separation method based on the Transformer, which can be flexibly applied to an unknown number and identity of speakers. The model receives both audio and visual streams, namely the noisy spectrogram and speaker lip embeddings, and predicts a complex time-frequency mask for the corresponding target speaker. The model consists of three main components: an audio encoder, a visual encoder, and a Transformer-based mask generator. Two encoder structures, ResNet-based and Transformer-based, are investigated and compared. The performance of the proposed method is evaluated in terms of source separation and speech quality metrics. The experimental results on the benchmark GRID dataset show the effectiveness of the method on the speaker-independent separation task in multi-talker environments. The model generalizes well to unseen speaker identities and noise types. Although trained only on 2-speaker mixtures, the model achieves reasonable performance when tested on both 2-speaker and 3-speaker mixtures. Moreover, the model still shows an advantage over previous audio-visual speech separation works.
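An illustrative PyTorch sketch of a Transformer-based mask generator of this kind is shown below: fused audio-visual embeddings pass through a Transformer encoder, and a linear head emits the real and imaginary parts of a complex time-frequency mask. The dimensions, layer counts, and the assumption that audio and lip embeddings are already fused per frame are mine, not the authors'.

```python
import torch
import torch.nn as nn

class AVMaskGenerator(nn.Module):
    def __init__(self, d_model=256, n_freq=257, n_heads=4, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(d_model, 2 * n_freq)   # real + imaginary mask parts

    def forward(self, av_embed):
        # av_embed: (batch, time, d_model), audio and lip embeddings already fused.
        h = self.encoder(av_embed)
        mask = self.proj(h)                           # (batch, time, 2 * n_freq)
        return mask.chunk(2, dim=-1)                  # (real mask, imaginary mask)

real_m, imag_m = AVMaskGenerator()(torch.randn(2, 100, 256))
```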
Yuzhuo LIU Hangting CHEN Qingwei ZHAO Pengyuan ZHANG
Weakly labelled semi-supervised audio tagging (AT) and sound event detection (SED) have become significant in real-world applications. A popular method is teacher-student learning, in which student models learn from pseudo-labels generated by teacher models from unlabelled data. To generate high-quality pseudo-labels, we propose a master-teacher-student framework trained with a dual-lead policy. Our experiments illustrate that our model outperforms the state-of-the-art model on both tasks.
Toshiro NUNOME Suguru KAEDE Shuji TASAKA
In this paper, we propose a user-assisted QoS control scheme that utilizes media adaptive buffering to enhance the QoE of audiovisual and haptic IP communications. The scheme consists of two modes: a manual mode and an automatic mode. It enables users to switch between these two modes according to their preferences. We compare four QoS control schemes: the manual mode only, the automatic mode only, the switching scheme starting with the manual mode, and the switching scheme starting with the automatic mode. We assess the effects of the four schemes, user attributes, and tasks on QoE through a subjective experiment that provides information on users' behavior in addition to QoE scores. As a result of the experiment, we show that the user-assisted QoS control scheme can enhance QoE. Furthermore, we find that the appropriate QoS control scheme depends on user attributes and tasks.
Pierre LEBRETON Kazuhisa YAMAGISHI
In this paper, the quality of adaptive bit rate video streaming is investigated and two state-of-the-art models, i.e., the NTT audiovisual quality-estimation and ITU-T P.1203 models, are considered. This paper shows how these models can be applied to new conditions, e.g., 4K ultra-high-definition (4K-UHD) videos encoded using H.265, considering that they were originally designed and trained for HD videos encoded with H.264. Six subjective evaluations involving up to 192 participants and a large variety of test conditions, e.g., durations from 10 s to 3 min, coding-quality variation, and stalling events, were conducted on both TV and mobile devices. Using the subjective data, this paper addresses how models and coefficients can be transferred to new conditions. A comparison between state-of-the-art models is conducted, showing the performance of transferred and retrained models. It is found that other video-quality-estimation models, such as VMAF, can be used as input to the NTT and ITU-T P.1203 long-term pooling modules, allowing these models to support the specificities of adaptive bit-rate-streaming scenarios. Finally, all retrained coefficients are detailed in this paper, allowing future work to directly reuse the results of this study.
Nurul LUBIS Dessi LESTARI Sakriani SAKTI Ayu PURWARIANTI Satoshi NAKAMURA
As interaction between humans and computers continues to develop toward the most natural form possible, it becomes increasingly urgent to incorporate emotion into the equation. This paper describes a step toward extending the research on emotion recognition to Indonesian. The field continues to develop, yet exploration of the subject in Indonesian is still lacking. In particular, this paper highlights two contributions: (1) the construction of the first emotional audio-visual database in Indonesian, and (2) the first multimodal emotion recognizer in Indonesian, built from the aforementioned corpus. In constructing the corpus, we aim at natural emotions that correspond to real-life occurrences. However, the collection of emotional corpora is notably labor-intensive and expensive. To reduce the cost, we collect the emotional data from recordings of television programs, eliminating the need for an elaborate recording setup and experienced participants. In particular, we choose television talk shows for their natural conversational content, which yields spontaneous emotion occurrences. To cover a broad range of emotions, we collect three episodes in different genres: politics, humanity, and entertainment. In this paper, we report analyses of the data and annotations. The acquired emotion corpus serves as a foundation for further research on emotion. Subsequently, in the experiment, we employ the support vector machine (SVM) algorithm to model the emotions in the collected data. We perform multimodal emotion recognition utilizing the predictions of three modalities: acoustic, semantic, and visual. Compared to the unimodal results, the multimodal feature combination attains identical accuracy for arousal, at 92.6%, and a significant improvement for the valence classification task, at 93.8%. We hope to continue this work and move toward a finer-grained, more precise quantification of emotion.
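A minimal scikit-learn sketch of the fusion idea follows: an SVM is trained per modality, and the per-modality predictions are combined for the final arousal/valence decision. The probability-averaging rule, feature layout, and function names are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np
from sklearn.svm import SVC

def train_modality_svms(features, labels):
    # features: dict mapping modality name ("acoustic", "semantic", "visual")
    # to an (n_samples, n_features) array; labels: (n_samples,) class labels.
    return {m: SVC(kernel="rbf", probability=True).fit(X, labels)
            for m, X in features.items()}

def fuse_predict(models, features):
    # Average the class probabilities of the per-modality SVMs, then pick the
    # most probable class as the multimodal prediction.
    probs = np.mean([models[m].predict_proba(X) for m, X in features.items()],
                    axis=0)
    return probs.argmax(axis=1)
```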
To enhance the QoE of audio and video IP transmission, this paper proposes a method for mitigating spatial quality impairment during burst loss periods over wireless networks in SCS, a QoE-based video output scheme. SCS switches between two common video output schemes: frame skipping and error concealment. The proposed method pauses video output on an undamaged frame during the burst loss period so that video output is not paused on a degraded frame. We perform an experiment with constant thresholds, the table-lookup method, and the proposed method under various network conditions. The results show that the effect of the proposed method on QoE can differ with the content and GOP structure.
This paper enhances the QoE of audio and video multicast transmission over a wireless LAN by means of reliable groupcast schemes. We use GCR (GroupCast with Retries) Unsolicited Retry and GCR Block ACK as reliable groupcast schemes; they are standardized by IEEE 802.11aa. We assume that a wireless access point transmits audio and video streams to several terminals connected to the access point by groupcast. We compare three schemes: Groupcast with EDCA (Enhanced Distributed Channel Access), GCR Unsolicited Retry and GCR Block ACK. We perform computer simulations under various network conditions to assess application-level QoS and evaluate QoE by a subjective experiment. As a result, we find that the most effective scheme depends on network conditions.
Ryohei SASAKI Katsumi KONISHI Tomohiro TAKAHASHI Toshihiro FURUKAWA
This letter deals with an audio declipping problem and proposes a multiple matrix rank minimization approach. We assume that short-time audio signals satisfy the autoregressive (AR) model and formulate the declipping problem as a multiple matrix rank minimization problem. To solve this problem, an iterative algorithm is provided based on the iterative partial matrix shrinkage (IPMS) algorithm. Numerical examples show its efficiency.
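Below is a rough NumPy sketch of the low-rank idea behind this approach: a short AR frame forms a nearly low-rank Hankel matrix, so clipped samples can be re-estimated by partially shrinking the smaller singular values while unclipped samples are kept fixed. This is a simplified stand-in for the IPMS-based algorithm; the rank, shrink factor, and window length are assumptions.

```python
import numpy as np

def hankel(x, L):
    # Stack length-L windows of x into a Hankel-structured matrix.
    return np.array([x[i:i + L] for i in range(len(x) - L + 1)])

def declip_frame(x, clipped, L=32, r=8, n_iter=50, shrink=0.9):
    # x: one audio frame (float array); clipped: boolean mask of clipped samples.
    y = x.copy()
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(hankel(y, L), full_matrices=False)
        s[r:] *= shrink                      # partially shrink small singular values
        H = (U * s) @ Vt
        # Average anti-diagonals back into a signal estimate of length len(x).
        est = np.array([np.mean(H[::-1, :].diagonal(k))
                        for k in range(-H.shape[0] + 1, H.shape[1])])
        y[clipped] = est[clipped]            # replace only the clipped samples
    return y
```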
Harumi MURATA Akio OGIHARA Shigetoshi HAYASHI
We have proposed an audio watermarking method based on modification of the sound pressure level between channels. This method focuses on the invariability of sound localization against sound processing such as MP3 compression and the imperceptibility of slight changes in sound localization. In this paper, we evaluate the tolerance of the method against various attacks with reference to the IHC criteria.
Kento OHTANI Kenta NIWA Kazuya TAKEDA
A single-dimensional interface which enables users to obtain diverse localizations of audio sources is proposed. In many conventional interfaces for arranging audio sources, there are multiple arrangement parameters, some of which allow users to control positions of audio sources. However, it is difficult for users who are unfamiliar with these systems to optimize the arrangement parameters since the number of possible settings is huge. We propose a simple, single-dimensional interface for adjusting arrangement parameters, allowing users to sample several diverse audio source arrangements and easily find their preferred auditory localizations. To select subsets of arrangement parameters from all of the possible choices, auditory-localization space vectors (ASVs) are defined to represent the auditory localization of each arrangement parameter. By selecting subsets of ASVs which are approximately orthogonal, we can choose arrangement parameters which will produce diverse auditory localizations. Experimental evaluations were conducted using music composed of three audio sources. Subjective evaluations confirmed that novice users can obtain diverse localizations using the proposed interface.
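One simple way to realize the "approximately orthogonal subset" selection is sketched below in NumPy: candidates are kept greedily whenever their ASV has low absolute cosine similarity to every vector already selected, so the retained arrangements produce diverse localizations. The greedy strategy and the threshold value are illustrative assumptions, not necessarily the selection procedure used in the paper.

```python
import numpy as np

def select_diverse_asvs(asvs, n_select, max_cos=0.3):
    # asvs: (n_candidates, dim) matrix, one auditory-localization space vector
    # per arrangement parameter setting.
    unit = asvs / np.linalg.norm(asvs, axis=1, keepdims=True)
    chosen = [0]                                 # start from the first candidate
    for i in range(1, len(unit)):
        if len(chosen) >= n_select:
            break
        # Keep the candidate only if it is nearly orthogonal to all chosen ASVs.
        if np.all(np.abs(unit[chosen] @ unit[i]) < max_cos):
            chosen.append(i)
    return chosen                                # indices of diverse arrangements
```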
An online nonnegative matrix factorization (NMF) algorithm based on recursive least squares (RLS) is described in matrix form, and a simplified algorithm with low-complexity calculation is developed for a frame-by-frame online audio source separation system. First, the online NMF algorithm based on the RLS method is described as solving the NMF problem recursively. Next, a simplified algorithm is developed to approximate the RLS-based online NMF algorithm with low complexity. The proposed algorithm is evaluated in terms of audio source separation, and the results show that its performance is superior to that of the conventional online NMF algorithm, with significantly reduced complexity.
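As a rough illustration of the frame-by-frame setting, the NumPy sketch below estimates the activations of the current frame with multiplicative updates and refines the bases from recursively accumulated statistics with a forgetting factor. This is a generic online-NMF stand-in under my own assumptions, not the RLS derivation or the simplified algorithm of the paper.

```python
import numpy as np

def online_nmf_step(v, W, A, B, lam=0.98, n_inner=20, eps=1e-12):
    # v: (F,) current magnitude-spectrum frame; W: (F, K) current bases.
    # A (K, K) and B (F, K) are running statistics kept by the caller
    # (e.g., initialized as a small identity matrix and zeros).
    h = np.full(W.shape[1], 1.0 / W.shape[1])
    for _ in range(n_inner):                        # Euclidean MU for activations
        h *= (W.T @ v) / (W.T @ (W @ h) + eps)
    A = lam * A + np.outer(h, h)                    # forgetting-factor statistics
    B = lam * B + np.outer(v, h)
    W = np.maximum(B @ np.linalg.pinv(A), eps)      # refit bases, keep nonnegative
    return W, h, A, B
```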
Audio hashing has been successfully employed for the protection, management, and indexing of digital music archives. For a reliable audio hashing system, improving hash matching accuracy is crucial. In this paper, we improve binary audio hash matching performance by utilizing auxiliary information, a resilience mask, which is obtained while constructing the hash DB. The resilience mask contains reliability information for each hash bit. We propose a new type of resilience mask that considers spectrum scaling and additive noise distortions. Experimental results show that the proposed resilience mask is effective in improving hash matching performance.
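The matching side of this idea can be sketched in a few lines of NumPy: hash bits flagged as unreliable by the resilience mask (e.g., prone to flipping under scaling or additive noise) are excluded from the Hamming-style distance. The binary mask layout and the normalization are illustrative assumptions.

```python
import numpy as np

def masked_hash_distance(query_bits, db_bits, resilience_mask):
    # All inputs: (n_bits,) 0/1 arrays; mask is 1 for reliable bits, 0 otherwise.
    diff = query_bits != db_bits
    return np.sum(diff & (resilience_mask == 1)) / max(resilience_mask.sum(), 1)

def match(query_bits, db, masks):
    # db: (n_items, n_bits) hash DB; masks: per-item reliability information
    # computed while the DB was constructed.
    dists = [masked_hash_distance(query_bits, b, m) for b, m in zip(db, masks)]
    return int(np.argmin(dists))     # index of the best-matching DB entry
```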
Keiichi IWAMURA Masaki KAWAMURA Minoru KURIBAYASHI Motoi IWATA Hyunho KANG Seiichi GOHSHI Akira NISHIMURA
Within information hiding technology, digital watermarking is one of the most important technologies for copyright protection of digital content. Many digital watermarking schemes have been proposed in academia. However, these schemes are not used in practice because they are not practical; one reason for this is that the evaluation criteria are loosely defined. To make the evaluation more concrete and improve the practicality of digital watermarking, watermarking schemes must use common evaluation criteria. To realize such criteria, we organized the Information Hiding and its Criteria for Evaluation (IHC) Committee to create useful, globally accepted evaluation criteria for information hiding technology. The IHC Committee improves its evaluation criteria every year and holds a competition for digital watermarking based on state-of-the-art evaluation criteria. In this paper, we describe the activities of the IHC Committee and its evaluation criteria for digital watermarking of still images, videos, and audio.
Satoshi TAMURA Hiroshi NINOMIYA Norihide KITAOKA Shin OSUGA Yurie IRIBE Kazuya TAKEDA Satoru HAYAMIZU
Audio-Visual Speech Recognition (AVSR) is one technique for enhancing the robustness of speech recognizers in noisy or real environments. Meanwhile, Deep Neural Networks (DNNs) have recently attracted much attention from researchers in the speech recognition field, because recognition performance can be drastically improved by using DNNs. There are two ways to employ DNN techniques for speech recognition: a hybrid approach and a tandem approach; in the hybrid approach, the emission probability of each Hidden Markov Model (HMM) state is computed using a DNN, while in the tandem approach, a DNN is incorporated into the feature extraction scheme. In this paper, we investigate and compare several DNN-based AVSR methods, mainly to clarify how audio and visual modalities should be incorporated using DNNs. We carried out recognition experiments using the CENSREC-1-AV corpus and discuss the results to find the best DNN-based AVSR modeling. It turns out that a tandem-based method using audio and visual Deep Bottle-Neck Features (DBNFs) with multi-stream HMMs is the most suitable, followed by a hybrid approach and another tandem scheme using audio-visual DBNFs.
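A schematic PyTorch sketch of the tandem idea follows: a DNN with a narrow bottleneck layer is trained to predict HMM state posteriors from audio (or visual) frames, and the bottleneck activations are then taken as Deep Bottle-Neck Features for the multi-stream HMMs. The layer sizes and state count are assumptions, and the downstream HMM stage is outside the sketch.

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, in_dim=39, hidden=1024, bottleneck=40, n_states=500):
        super().__init__()
        self.front = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, bottleneck))   # DBNF layer
        self.back = nn.Sequential(nn.ReLU(), nn.Linear(bottleneck, n_states))

    def forward(self, x):
        # State-posterior logits, used only to train the network.
        return self.back(self.front(x))

    def extract_dbnf(self, x):
        # Bottleneck activations passed on to the multi-stream HMMs as features.
        return self.front(x)
```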
Byonghwa LEE Kwangki KIM Minsoo HAHN
In interactive audio services, users can render audio objects rather freely to match their preferences, and the spatial audio object coding (SAOC) scheme performs fairly well in terms of both bitrate and audio quality. However, perceptible audio quality degradation can occur when an object is suppressed or played alone. To complement this, the SAOC scheme with Two-Step Coding (SAOC-TSC) was proposed. However, the bitrate of the side information doubles compared with that of the original SAOC because of the residual coding used to enhance the audio quality. In this paper, an efficient residual coding method for the SAOC-TSC is proposed to reduce the side information bitrate without audio quality degradation or complexity increase.
This paper proposes a method of watermarking for digital audio signals based on adaptive phase modulation. Audio signals are usually non-stationary, i.e., their characteristics are time-variant. Watermarking features are usually selected without accounting for this variability, which affects the performance of the whole watermarking system. The proposed method embeds a watermark into an audio signal by adaptively modulating its phase with the watermark using IIR all-pass filters. The frequency location of the pole-zero pair that characterizes the transfer function of each IIR all-pass filter is adapted on the basis of the signal power distribution over sub-bands in the magnitude spectrum domain. The pole-zero locations are adapted so that the phase modulation produces only slight distortion in the watermarked signals, achieving the best sound quality. The experimental results show that the proposed method can embed inaudible watermarks into various kinds of audio signals and correctly detect the watermarks without the aid of the original signals. A reasonable trade-off between inaudibility and robustness can be obtained by balancing the phase modulation scheme. The proposed method can embed a watermark into audio signals at up to 100 bits per second with 99% accuracy under no attack, and at 6 bits per second with 94.3% accuracy under attacks.
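To make the all-pass phase modulation concrete, here is an illustrative NumPy/SciPy sketch of embedding one watermark bit with a second-order IIR all-pass filter whose pole-zero frequency is placed in a chosen sub-band; the magnitude spectrum is unchanged while the phase is shifted. The bit-to-frequency mapping and pole radius are assumptions, and the adaptive pole placement from the power distribution is omitted.

```python
import numpy as np
from scipy.signal import lfilter

def allpass_coeffs(freq_hz, fs, r=0.9):
    # Second-order all-pass filter with pole pair at radius r and angle theta.
    theta = 2 * np.pi * freq_hz / fs
    a = np.array([1.0, -2 * r * np.cos(theta), r ** 2])   # denominator (poles)
    b = a[::-1]                                            # mirrored numerator: |H| = 1
    return b, a

def embed_bit(frame, bit, fs=44100, f0=4000.0, f1=8000.0):
    # Phase-modulate the frame at one of two sub-band frequencies per bit value.
    b, a = allpass_coeffs(f1 if bit else f0, fs)
    return lfilter(b, a, frame)
```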