Deokgyu YUN Hannah LEE Seung Ho CHOI
This paper proposes a deep learning-based non-intrusive objective speech intelligibility estimation method based on a recurrent neural network (RNN) with a long short-term memory (LSTM) structure. Conventional non-intrusive estimation methods such as the standard P.563 show poor estimation performance and lack consistency, especially in various noise and reverberation environments. The proposed method trains the LSTM RNN model parameters using STOI, the standard intrusive intelligibility estimation method, which requires a reference speech signal. The input and output of the LSTM RNN are the MFCC vector and the frame-wise STOI value, respectively. Experimental results show that the proposed objective intelligibility estimation method outperforms the conventional standard P.563 in various noisy and reverberant environments.
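As a rough illustration of the estimator described above, the following sketch maps per-frame MFCC vectors to frame-wise STOI values with an LSTM. It assumes PyTorch, a 13-dimensional MFCC, and dummy data; the paper's exact topology and training settings are not specified here.

```python
import torch
import torch.nn as nn

class StoiEstimator(nn.Module):
    """LSTM RNN mapping per-frame MFCC vectors to frame-wise STOI values."""
    def __init__(self, n_mfcc=13, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)      # one STOI value per frame

    def forward(self, mfcc):                 # mfcc: (batch, frames, n_mfcc)
        h, _ = self.lstm(mfcc)
        return self.out(h).squeeze(-1)       # (batch, frames)

model = StoiEstimator()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mfcc = torch.randn(8, 200, 13)               # dummy MFCC batch
stoi_target = torch.rand(8, 200)             # dummy frame-wise STOI targets,
                                             # computed intrusively in training
loss = nn.functional.mse_loss(model(mfcc), stoi_target)
opt.zero_grad(); loss.backward(); opt.step()
```

At test time only the MFCCs of the degraded signal are needed, which is what makes the trained estimator non-intrusive.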
Toshihiro SAKANO Yosuke KOBAYASHI Kazuhiro KONDO
We proposed and evaluated a speech intelligibility estimation method that does not require a clean speech reference signal. The proposed method uses the features defined in the ITU-T standard P.563, which estimates the overall quality of speech without a reference signal. We selected two sets of features from the P.563 features: the basic 9-feature set, which includes basic features that characterize both speech and background noise, e.g., cepstrum skewness and LPC kurtosis, and the extended 31-feature set with 22 additional features for a more accurate description of the degraded speech and noise, e.g., SNR, average pitch, and spectral clarity, among others. Four hundred noise samples were added to speech, and about 70% of these samples were used to train a support vector regression (SVR) model. The trained models were used to estimate the intelligibility of speech degraded by added noise. The proposed method showed a root mean square error (RMSE) of about 10% and a correlation with subjective intelligibility of about 0.93 for speech distorted with known noise types, and an RMSE of about 16% and a correlation of about 0.84 for speech distorted with unknown noise types, with either the 9- or the 31-dimensional feature set. These results were more accurate than estimation using the frequency-weighted SNR calculated in critical frequency bands, which requires the clean reference signal for its calculation. We believe this level of accuracy proves the proposed method to be applicable to real-time speech quality monitoring in the field.
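A minimal sketch of the regression stage, assuming scikit-learn and stand-in random data in place of the actual P.563 feature extraction; the 70% split and the metrics follow the abstract:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 31))       # stand-in for P.563 feature vectors
y = rng.uniform(0, 100, size=400)    # stand-in subjective intelligibility [%]

# About 70% of the noise-degraded samples train the regressor.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)
model = SVR(kernel="rbf").fit(X_tr, y_tr)

est = model.predict(X_te)
rmse = np.sqrt(np.mean((est - y_te) ** 2))
corr = np.corrcoef(est, y_te)[0, 1]
print(f"RMSE = {rmse:.1f}%, correlation = {corr:.2f}")
```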
Kazuhiro KONDO Naoya ANAZAWA Yosuke KOBAYASHI
We compared two audio output devices for augmented audio reality applications. In these applications, we plan to use speech annotations on top of the actual ambient environment. It is thus essential that these audio output devices deliver intelligible speech annotations along with transparent delivery of the environmental auditory scene. Two candidate devices were compared. The first was the bone-conduction headphone, which delivers speech signals by vibrating the skull while leaving the ear canals open, so normal hearing of the surrounding noise is left intact. The other was the binaural microphone/earphone combo, which has a form factor similar to a regular earphone but integrates a small microphone at the entry of the ear canal. The input from these microphones can be fed back to the earphones along with the annotation speech. We also compared these devices with normal hearing (i.e., without headphones or earphones) for reference. We compared speech intelligibility when competing babble noise was given simultaneously from the surrounding environment. We found that the binaural combo can generally deliver speech signals at comparable or higher intelligibility than the bone-conduction headphones. However, with the binaural combo, the ear canal transfer characteristics were altered significantly because the earphones seal the ear canals shut. Accordingly, when we employed a compensation filter to account for this deviation in the transfer function, the resultant speech intelligibility was significantly higher. Both devices were found to be acceptable as audio output devices for augmented audio reality applications, since both can deliver speech signals at high intelligibility even when a significant amount of competing noise is present. In fact, both speech output methods delivered speech at higher intelligibility than natural speech, especially when the SNR was low.
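The compensation idea can be sketched as follows, assuming NumPy and two hypothetical measured impulse responses (h_open, h_closed); the paper's actual filter design procedure is not reproduced, and a plain spectral inverse like this can be unstable for responses with deep spectral nulls:

```python
import numpy as np

def design_compensation(h_open, h_closed, n_fft=1024, eps=1e-8):
    """Restore the open-canal response that the inserted earphone altered."""
    H_open = np.fft.rfft(h_open, n_fft)
    H_closed = np.fft.rfft(h_closed, n_fft)
    C = H_open / (H_closed + eps)     # compensation transfer function
    return np.fft.irfft(C, n_fft)     # time-domain compensation filter

def compensate(x, c):
    """Filter the binaural-microphone feed before playback."""
    return np.convolve(x, c)[: len(x)]

h_open = np.r_[1.0, np.zeros(255)]        # dummy measured responses
h_closed = np.r_[0.6, 0.2, np.zeros(254)]
x = np.random.randn(16000)                # 1 s of microphone input at 16 kHz
y = compensate(x, design_compensation(h_open, h_closed))
```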
We attempted to estimate subjective scores of the Japanese Diagnostic Rhyme Test (DRT), a two-alternative forced-choice speech intelligibility test. We used automatic speech recognizers with language models that force the choice of one of the two words in each word pair, mimicking the human recognition process of the DRT. Initial testing was done using speaker-independent models, which showed significantly lower scores than the subjective scores. The acoustic models were then adapted to each of the speakers in the corpus, and then to noise at a specified SNR. Three types of noise were tested: white noise, multi-talker (babble) noise, and pseudo-speech noise. The match between subjective and estimated scores improved significantly with the noise-adapted models compared to the speaker-independent and speaker-adapted models when the adapted and tested noise levels matched. However, when the SNR conditions did not match, the recognition scores degraded, especially when the tested SNR was higher than the adapted noise level. Accordingly, we adapted the models to mixed levels of noise, i.e., multi-condition training. The resulting models showed estimated intelligibility that matched subjective intelligibility relatively well over all noise levels. The correlation between subjective and estimated intelligibility scores increased to 0.94 with multi-talker noise, 0.93 with white noise, and 0.89 with pseudo-speech noise, while the root mean square error (RMSE) decreased from more than 40 to 13.10, 13.05, and 16.06, respectively.
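The forced-choice decoding and scoring can be sketched as follows, assuming a hypothetical score(audio, word) function that returns the acoustic likelihood of an utterance under a one-word grammar; the chance correction is the standard two-alternative DRT scoring, not necessarily the paper's exact formula:

```python
import numpy as np

def drt_estimate(trials, score):
    """trials: (audio, spoken_word, rival_word) tuples; score(a, w) is the
    hypothetical acoustic likelihood of audio a under a one-word grammar."""
    correct = sum(score(a, w) > score(a, r) for a, w, r in trials)
    p = correct / len(trials)
    return 100.0 * (2.0 * p - 1.0)    # chance-corrected two-alternative score

def rmse_and_corr(subjective, estimated):
    """Agreement between subjective and estimated scores across conditions."""
    s, e = np.asarray(subjective), np.asarray(estimated)
    return np.sqrt(np.mean((e - s) ** 2)), np.corrcoef(s, e)[0, 1]
```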
Takeshi YAMADA Masakazu KUMAKURA Nobuhiko KITAWAKI
It is essential to ensure satisfactory QoS (Quality of Service) when offering a speech communication system with a noise reduction algorithm. In this paper, we propose a new objective test methodology for noise-reduced speech that estimates word intelligibility by using a distortion measure. Experimental results confirmed that the proposed methodology gives an accurate estimate independently of the noise reduction algorithm and noise type.
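Since the abstract does not specify the distortion measure, the following sketch uses a cepstral distance and a logistic mapping fitted to subjective scores purely as stand-ins for the general scheme:

```python
import numpy as np

def cepstral_distance(c_ref, c_deg):
    """Mean per-frame Euclidean distance between cepstral sequences."""
    return float(np.mean(np.linalg.norm(c_ref - c_deg, axis=1)))

def intelligibility_from_distortion(d, a, b):
    """Logistic mapping from distortion d to word intelligibility [%];
    (a, b) would be fitted to subjective scores beforehand."""
    return 100.0 / (1.0 + np.exp(a * (d - b)))

# Toy usage with random cepstra standing in for clean/noise-reduced speech.
c_ref = np.random.randn(100, 12)
c_deg = c_ref + 0.3 * np.random.randn(100, 12)
d = cepstral_distance(c_ref, c_deg)
print(intelligibility_from_distortion(d, a=4.0, b=1.5))
```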
Since an FFT-based speech encryption system retains considerable residual intelligibility in the encrypted speech, such as talk spurts and the original intonation, eavesdroppers can easily deduce the information content from the encrypted speech. In this letter, we propose a new technique based on the combination of an orthogonal frequency division multiplexing (OFDM) scheme and an appropriate QAM mapping method that removes the residual intelligibility from the encrypted speech by permuting several frequency components. In addition, the proposed OFDM-based speech encryption system needs only two FFT operations instead of the four required by the FFT-based system. Simulation results are presented to show the effectiveness of the proposed technique.
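The key-driven permutation of frequency components can be sketched as follows, assuming NumPy; the QAM mapping and OFDM framing of the proposed system are omitted, and keeping the DC and Nyquist bins in place is a detail of this sketch (it keeps the scrambled signal real), not necessarily of the paper:

```python
import numpy as np

def scramble(x, key, inverse=False):
    """Permute the frequency components of a real frame with a keyed
    permutation; DC and Nyquist bins stay put so the output stays real."""
    X = np.fft.rfft(x)
    perm = np.random.default_rng(key).permutation(np.arange(1, len(X) - 1))
    Y = X.copy()
    if inverse:
        Y[perm] = X[1:-1]             # undo the permutation
    else:
        Y[1:-1] = X[perm]             # apply the permutation
    return np.fft.irfft(Y, len(x))    # one FFT + one IFFT per frame

frame = np.random.randn(512)          # stand-in for one speech frame
enc = scramble(frame, key=42)
dec = scramble(enc, key=42, inverse=True)
assert np.allclose(frame, dec)
```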
Keisuke KINOSHITA Tomohiro NAKATANI Masato MIYOSHI
A speech signal captured by a distant microphone is generally smeared by reverberation, which severely degrades both speech intelligibility and Automatic Speech Recognition (ASR) performance. Previously, we proposed a single-microphone dereverberation method named "Harmonicity based dEReverBeration (HERB)." HERB estimates the inverse filter for an unknown room transfer function by utilizing an essential feature of speech, namely its harmonic structure. In previous studies, improvements in speech intelligibility were shown only with spectrograms, and improvements in ASR performance were confirmed only with a matched-condition acoustic model. In this paper, we investigate HERB's potential further with respect to these two factors. First, we examined speech intelligibility by means of objective indices and found that HERB can improve speech intelligibility to approximately the level of clean speech. Second, since HERB alone could not improve ASR performance sufficiently, we analyzed the HERB mechanism with a view to achieving further improvements. Taking the analysis results into account, we proposed an appropriate ASR configuration and conducted experiments. The experimental results confirmed that, when HERB is combined with an ASR adaptation scheme such as MLLR and a multi-condition acoustic model, it is very effective for improving ASR performance even in unknown, severely reverberant environments.
Steven GREENBERG Takayuki ARAI
Classical models of speech recognition assume that a detailed, short-term analysis of the acoustic signal is essential for accurately decoding the speech signal and that this decoding process is rooted in the phonetic segment. This paper presents an alternative view, one in which the time scales required to accurately describe and model spoken language are both shorter and longer than the phonetic segment, and are inherently wedded to the syllable. The syllable reflects a singular property of the acoustic signal -- the modulation spectrum -- which provides a principled, quantitative framework to describe the process by which the listener proceeds from sound to meaning. The ability to understand spoken language (i.e., intelligibility) vitally depends on the integrity of the modulation spectrum within the core range of the syllable (3-10 Hz) and reflects the variation in syllable emphasis associated with the concept of prosodic prominence ("accent"). A model of spoken language is described in which the prosodic properties of the speech signal are embedded in the temporal dynamics associated with the syllable, a unit serving as the organizational interface among the various tiers of linguistic representation.
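A minimal sketch of the modulation spectrum computation, assuming NumPy/SciPy: the amplitude envelope is extracted and downsampled, and its spectrum is examined in the 3-10 Hz syllable-rate band; the exact analysis parameters are illustrative:

```python
import numpy as np
from scipy.signal import hilbert

def modulation_spectrum(x, fs, hop_s=0.025):
    """Spectrum of the amplitude envelope (here sampled every 25 ms)."""
    env = np.abs(hilbert(x))                        # amplitude envelope
    hop = int(fs * hop_s)
    env = env[: len(env) // hop * hop].reshape(-1, hop).mean(axis=1)
    spec = np.abs(np.fft.rfft(env - env.mean()))
    freqs = np.fft.rfftfreq(len(env), d=hop_s)
    return freqs, spec

fs = 16000
x = np.random.randn(2 * fs)                         # stand-in for speech
freqs, spec = modulation_spectrum(x, fs)
core = spec[(freqs >= 3) & (freqs <= 10)].sum()     # syllable-rate band
```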
Erdenebat DASHTSEREN Shigeyoshi KITAZAWA Satoshi IWASAKI Shinya KIRIYAMA
Our study focuses on the evaluation of a novel speech processing strategy for multi-channel cochlear implant speech processors. Stimulation pulse trains for the Nucleus 24CI speech processor were generated in a way different from the speech processing strategies implemented in this processor. The distinctive features of the novel strategy are: 1) an electrode stimulation order driven by the location of the maximum instantaneous frequency amplitude; 2) variable stimulation rates on the electrodes; and 3) a variable number of selected channels within each cycle of the signal processing scheme. Within-subject tests on Japanese initial, medial, and final consonants in CV, VCV, and CV/N context tokens were carried out with cochlear implant patients using the Cochlear ACE™ strategy, and the results were compared with those of normal hearing listeners. The initial and medial consonant tests showed significantly better performance with the novel strategy than with the ACE strategy for both the cochlear implant and normal hearing listener groups. The final consonant tests showed slightly better performance with the ACE strategy for cochlear implant listeners, while showing slightly better performance with the novel strategy for normal hearing listeners.
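The maximum-amplitude-driven selection and ordering can be sketched as follows, given per-cycle filterbank amplitudes; the threshold and channel count are illustrative, not the strategy's actual parameters:

```python
import numpy as np

def select_stimulation(amplitudes, threshold):
    """Order electrodes from the maximum-amplitude channel downward and
    keep a variable number of channels, depending on the signal."""
    order = np.argsort(amplitudes)[::-1]            # strongest channel first
    return [int(ch) for ch in order if amplitudes[ch] >= threshold]

cycle = np.array([0.1, 0.8, 0.3, 0.05, 0.6])        # 5-channel toy cycle
print(select_stimulation(cycle, threshold=0.2))     # -> [1, 4, 2]
```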
This paper describes a complete time-domain Mandarin text-to-speech system. We take advantage of advances in memory technology, which offers ever-increasing capacity at ever-lower prices, and collect as many synthesis units as possible for a Mandarin text-to-speech system. With this effort, we developed simpler speech processing techniques and achieved faster processing speed using only an ordinary personal computer. We also developed careful methods to measure the intelligibility, comprehensibility, and naturalness of a Mandarin text-to-speech system, and our system performs very well compared with existing systems. We first developed a set of algorithms and methods to handle features of the syllables such as duration, amplitude, fundamental frequency, and pause. Based on these algorithms and methods, we then built a Mandarin text-to-speech system. Given any Chinese text in computerized form, e.g., in BIG-5 code representation, our system can pronounce the text in real time. The system runs on an IBM 80486-compatible PC with no special hardware for signal processing. The evaluation of the system is based on a proposed subjective evaluation method. An evaluation was made by 51 undergraduate students: the intelligibility of our system is 99.5%, the comprehensibility is 92.6%, and the naturalness is 81.512 points in a percentile grading system (the highest score is 100 points and the lowest is 0 points). Another 40 Ph.D. students performed the same naturalness evaluation; their result shows that the naturalness of our system is 82.8 points in a percentile grading system.
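A toy sketch of time-domain concatenation, assuming a precollected dictionary of syllable waveforms as NumPy arrays; the system's actual duration, F0, and pause algorithms are not reproduced here:

```python
import numpy as np

def synthesize(syllables, units, fs=16000, pause_s=0.05, gains=None):
    """Concatenate stored syllable waveforms, scaling amplitude and
    inserting a fixed pause between units."""
    gains = gains or [1.0] * len(syllables)
    pause = np.zeros(int(fs * pause_s))
    parts = []
    for syl, g in zip(syllables, gains):
        parts += [g * units[syl], pause]
    return np.concatenate(parts)

units = {"ma": np.random.randn(3200), "ba": np.random.randn(3200)}  # stand-ins
wave = synthesize(["ma", "ba"], units)
```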