Takeshi YAMADA Masakazu KUMAKURA Nobuhiko KITAWAKI
It is essential to ensure a satisfactory QoS (Quality of Service) when offering a speech communication system with a noise reduction algorithm. In this paper, we propose a new objective test methodology for noise-reduced speech that estimates word intelligibility by using a distortion measure. Experimental results confirmed that the proposed methodology gives accurate estimates independently of the noise reduction algorithm and noise type.
Nobuhiko KITAWAKI Takeshi YAMADA Futoshi ASANO
Appropriate test signals, defined by formula or generated by algorithm, are used for measuring the objective QoS (Quality of Service) of voice-operated telecommunication devices such as telephones and speech codecs (coder-decoders). However, the test signal for measuring residual echo characteristics in hands-free telecommunications equipped with an acoustic echo canceller is still under study in ITU-T Recommendation G.167. This paper describes a comparative assessment of test signals for measuring residual echo characteristics. In hands-free telecommunications, acoustic echo cancellers have been developed to remove the room echo signal that travels from the loudspeaker to the microphone at the receiving end. The performance of an echo canceller system is evaluated by its residual echo characteristics, expressed as echo return loss enhancement (ERLE). The ERLE is conventionally measured by feeding white noise into the echo canceller system. However, white noise is not adequate as a test signal, since the performance of the echo canceller may depend on the characteristics of the input test signal, and the characteristics of white noise differ from those of real voice. Therefore, this paper discusses which characteristics of real voice are required for objective quality evaluation of echo canceller systems. The test signals used in the verification tests were real voice (RV), white noise (WN), frequency-weighted noise (FWN), artificial voice (AV), and composite source signal (CSS), which differ in how closely they approximate real voice characteristics. The comparative assessment shows that the ERLE characteristics measured with artificial voice conforming to ITU-T Recommendation P.50, which has the average characteristics of real voices in the time and frequency domains, are almost equivalent to those measured with real voice and are the best among the test signals. It is concluded that the P.50 artificial voice is suitable for measuring residual echo characteristics.
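As a rough illustration of the ERLE figure discussed in this abstract, the sketch below computes ERLE in decibels as the ratio of echo power at the microphone to residual echo power after cancellation. The function name `erle_db` and the toy signals are assumptions made for illustration; they are not taken from the paper or from ITU-T G.167.

```python
import math

def erle_db(mic_signal, residual_signal):
    """ERLE in dB: echo power at the microphone over residual echo power."""
    if len(mic_signal) != len(residual_signal) or not mic_signal:
        raise ValueError("signals must be non-empty and of equal length")
    echo_power = sum(x * x for x in mic_signal) / len(mic_signal)
    residual_power = sum(e * e for e in residual_signal) / len(residual_signal)
    if residual_power == 0.0:
        return math.inf  # perfect cancellation
    return 10.0 * math.log10(echo_power / residual_power)

# Toy data (hypothetical): the canceller leaves 10% of the echo amplitude,
# i.e. 1% of the echo power, which corresponds to 20 dB of enhancement.
echo = [math.sin(0.1 * n) for n in range(1000)]
residual = [0.1 * x for x in echo]
print(round(erle_db(echo, residual), 1))
```

In a comparison like the one described above, the same computation would be repeated with each test signal (RV, WN, FWN, AV, CSS) as the far-end input, and the resulting ERLE curves compared against the one obtained with real voice.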
Satoshi NAKAMURA Kazuya TAKEDA Kazumasa YAMAMOTO Takeshi YAMADA Shingo KUROIWA Norihide KITAOKA Takanobu NISHIURA Akira SASOU Mitsunori MIZUMACHI Chiyomi MIYAJIMA Masakiyo FUJIMOTO Toshiki ENDO
This paper introduces an evaluation framework for Japanese noisy speech recognition named AURORA-2J. Speech recognition systems must still be improved to be robust to noisy environments, but this improvement requires development of the standard evaluation corpus and assessment technologies. Recently, the Aurora 2, 3 and 4 corpora and their evaluation scenarios have had significant impact on noisy speech recognition research. The AURORA-2J is a Japanese connected digits corpus and its evaluation scripts are designed in the same way as Aurora 2 with the help of European Telecommunications Standards Institute (ETSI) AURORA group. This paper describes the data collection, baseline scripts, and its baseline performance. We also propose a new performance analysis method that considers differences in recognition performance among speakers. This method is based on the word accuracy per speaker, revealing the degree of the individual difference of the recognition performance. We also propose categorization of modifications, applied to the original HTK baseline system, which helps in comparing the systems and in recognizing technologies that improve the performance best within the same category.
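The per-speaker analysis mentioned in this abstract can be sketched as follows, assuming the usual HTK-style definition of word accuracy, (N - S - D - I) / N, where N is the number of reference words and S, D, I are substitutions, deletions, and insertions. The speaker labels and error counts are hypothetical and are not AURORA-2J results.

```python
from statistics import mean

def word_accuracy(n_ref, subs, dels, ins):
    """Word accuracy in percent: (N - S - D - I) / N * 100."""
    return 100.0 * (n_ref - subs - dels - ins) / n_ref

# Hypothetical per-speaker counts: (reference words, substitutions, deletions, insertions).
counts = {
    "spk01": (200, 4, 2, 2),
    "spk02": (200, 10, 6, 4),
    "spk03": (200, 30, 10, 10),
}

per_speaker = {spk: word_accuracy(*c) for spk, c in counts.items()}
for spk, acc in sorted(per_speaker.items()):
    print(spk, acc)

# The spread across speakers exposes individual differences that a single
# pooled accuracy figure would hide.
print("mean:", mean(per_speaker.values()))
print("range:", max(per_speaker.values()) - min(per_speaker.values()))
```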
Kiyoshi YAMAMOTO Futoshi ASANO Takeshi YAMADA Nobuhiko KITAWAKI
In this paper, a method of detecting overlapping speech segments in meetings is proposed. It is known that the eigenvalue distribution of the spatial correlation matrix calculated from a multiple microphone input reflects information on the number and relative power of sound sources. However, in a reverberant sound field, the feature of the number of sources in the eigenvalue distribution is degraded by the room reverberation. In the Support Vector Machines approach, the eigenvalue distribution is classified into two classes (overlapping speech segments and single speech segments). In the Support Vector Regression approach, the relative power of sound sources is estimated by using the eigenvalue distribution, and overlapping speech segments are detected based on the estimated relative power. The salient feature of this approach is that the sensitivity of detecting overlapping speech segments can be controlled simply by changing the threshold value of the relative power. The proposed method was evaluated using recorded data of an actual meeting.
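To make the role of the eigenvalue distribution concrete, the following minimal sketch builds the spatial correlation matrix for a two-microphone array in a single frequency bin and compares its eigenvalues for one source and for two overlapping sources. It assumes an anechoic toy model with hypothetical steering phases and source signals, and it does not include the SVM or SVR stages of the method; all function names are illustrative.

```python
import cmath
import math

def spatial_correlation_2ch(frames):
    """Average x x^H over frames; each frame is a pair of complex coefficients."""
    r11 = r22 = 0.0
    r12 = 0j
    for x1, x2 in frames:
        r11 += abs(x1) ** 2
        r22 += abs(x2) ** 2
        r12 += x1 * x2.conjugate()
    n = len(frames)
    return r11 / n, r12 / n, r22 / n

def eigenvalues_2x2_hermitian(r11, r12, r22):
    """Closed-form eigenvalues of [[r11, r12], [conj(r12), r22]], largest first."""
    centre = 0.5 * (r11 + r22)
    radius = math.sqrt((0.5 * (r11 - r22)) ** 2 + abs(r12) ** 2)
    return centre + radius, max(centre - radius, 0.0)

def steering(phase):
    """Hypothetical steering vector: unit gain, inter-microphone phase shift."""
    return (1 + 0j, cmath.exp(1j * phase))

a1, a2 = steering(0.5), steering(-1.0)
s1 = [cmath.exp(0.7j * t) for t in range(200)]        # source 1, unit power
s2 = [0.5 * cmath.exp(1.9j * t) for t in range(200)]  # source 2, lower power

single = [(s * a1[0], s * a1[1]) for s in s1]
overlap = [(p * a1[0] + q * a2[0], p * a1[1] + q * a2[1]) for p, q in zip(s1, s2)]

for name, frames in (("single", single), ("overlap", overlap)):
    big, small = eigenvalues_2x2_hermitian(*spatial_correlation_2ch(frames))
    print(name, "second/first eigenvalue ratio:", round(small / big, 3))
```

With one source the matrix is rank one and the second eigenvalue vanishes; with two sources it grows with the power of the weaker source. Reverberation blurs this distinction, which is the degradation the abstract refers to.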
Yuya SUGIMOTO Shigeki MIYABE Takeshi YAMADA Shoji MAKINO Biing-Hwang JUANG
MUltiple SIgnal Classification (MUSIC) is a standard technique for direction of arrival (DOA) estimation with high resolution. However, MUSIC cannot estimate DOAs accurately in the case of underdetermined conditions, where the number of sources exceeds the number of microphones. To overcome this drawback, an extension of MUSIC using cumulants called 2q-MUSIC has been proposed, but this method greatly suffers from the variance of the statistics, given as the temporal mean of the observation process, and requires long observation. In this paper, we propose a new approach for extending MUSIC that exploits higher-order moments of the signal for the underdetermined DOA estimation with smaller variance. We propose an estimation algorithm that nonlinearly maps the observed signal onto a space with expanded dimensionality and conducts MUSIC-based correlation analysis in the expanded space. Since the dimensionality of the noise subspace is increased by the mapping, the proposed method enables the estimation of DOAs in the case of underdetermined conditions. Furthermore, we describe the class of mapping that allows us to analyze the higher-order moments of the observed signal in the original space. We compare 2q-MUSIC and the proposed method through an experiment assuming that the true number of sources is known as prior information to evaluate in terms of the bias-variance tradeoff of the statistics and computational complexity. The results clarify that the proposed method has advantages for both computational complexity and estimation accuracy in short-time analysis, i.e., the time duration of the analyzed data is short.
Shinnichiro YAMAMOTO Kennichi HATAKEYAMA Kenji YAMAUCHI Takeshi YAMADA
A new shielding evaluation setup for conductive O-rings is proposed. The setup consists of a holder with a groove that fixes the O-ring position. There are two ways to apply O-rings in narrow gaps: cylinder-fixing and plane-fixing. With this holder, the shielding effects of O-rings can be evaluated from 10 kHz to 1 GHz for both fixing types.
Nobuhiko KITAWAKI Kou NAGAI Takeshi YAMADA
Recently, wideband speech communication using 7 kHz wideband speech coding, as described in ITU-T Recommendations G.722, G.722.1, and G.722.2, has become increasingly necessary for advanced IP telephony using PCs. For this application, hands-free communication using separate microphones and loudspeakers is indispensable, and in this situation wideband speech is particularly helpful in enhancing the naturalness of communication. An objective quality measurement methodology for wideband speech coding has been studied, its essential components being an objective quality measure and an input test signal. This paper describes Wideband-PESQ conforming to the draft Annex to ITU-T Recommendation P.862, "Perceptual Evaluation of Speech Quality (PESQ)," as the objective quality measure, by evaluating the consistency between the subjectively evaluated MOS (Mean Opinion Score) and the objectively estimated MOS. This paper also describes the verification of artificial voice conforming to Recommendation P.50, "Artificial Voices," as the input test signal for such measurements, by evaluating the consistency between the objectively estimated MOS obtained using a real voice and that obtained using an artificial voice.
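One common way to quantify the consistency between subjective and objectively estimated MOS is the Pearson correlation coefficient together with the root-mean-square error. The abstract does not say which statistics the paper uses, so the choice of metrics here is an assumption, and the MOS values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rmse(xs, ys):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Hypothetical scores for five coding conditions.
subjective_mos = [1.5, 2.5, 3.0, 4.0, 4.5]
objective_mos = [1.7, 2.4, 3.2, 3.8, 4.4]

print("correlation:", round(pearson(subjective_mos, objective_mos), 3))
print("rmse:", round(rmse(subjective_mos, objective_mos), 3))
```

The same two statistics could be applied to the second comparison in the abstract, with the MOS estimated from real voice on one axis and the MOS estimated from artificial voice on the other.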
Kazumi SAITO Takeshi YAMADA Kazuhiro KAZAMA
To understand the structural and functional properties of large-scale complex networks, it is crucial to efficiently extract a set of cohesive subnetworks as communities. Several such community extraction methods have been proposed in the literature, including the classical k-core decomposition method and, more recently, the k-clique based community extraction method. The k-core method, although computationally efficient, is often not powerful enough to uncover a detailed community structure, and it produces only coarse-grained and loosely connected communities. The k-clique method, on the other hand, can extract fine-grained and tightly connected communities but imposes a substantial computational load for large-scale complex networks. In this paper, we present a new notion of a subnetwork called k-dense, and propose an efficient algorithm for extracting k-dense communities. We applied our method to three different types of networks assembled from real data, namely, blog trackbacks, word associations, and Wikipedia references, and demonstrated that the k-dense method could extract communities almost as efficiently as the k-core method, while the quality of the extracted communities is comparable to that obtained by the k-clique method.
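For reference, the classical k-core baseline mentioned in this abstract can be sketched by repeatedly peeling off nodes whose degree falls below k. The sketch covers only k-core; the k-dense definition is not given in the abstract and is not reproduced here. The toy graph is hypothetical.

```python
def k_core(adjacency, k):
    """Return the nodes of the k-core: the maximal subgraph with all degrees >= k."""
    alive = {node: set(neigh) for node, neigh in adjacency.items()}
    queue = [node for node, neigh in alive.items() if len(neigh) < k]
    while queue:
        node = queue.pop()
        if node not in alive:
            continue  # already peeled off
        for other in alive.pop(node):
            if other in alive:
                alive[other].discard(node)
                if len(alive[other]) < k:
                    queue.append(other)
    return set(alive)

# Toy undirected graph: a 4-clique {a, b, c, d} with a tail d - e - f.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"},
    "e": {"d", "f"},
    "f": {"e"},
}

print(sorted(k_core(graph, 1)))
print(sorted(k_core(graph, 2)))
print(sorted(k_core(graph, 3)))
```

Each node and edge is removed at most once, which is why the method is cheap; the cost is that the only requirement is a minimum degree, so loosely attached nodes can survive in large graphs.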
Takeshi YAMADA Yuki KASUYA Yuki SHINOHARA Nobuhiko KITAWAKI
This paper describes non-reference objective quality evaluation for noise-reduced speech. First, a subjective test is conducted in accordance with ITU-T Rec. P.835 to obtain the speech quality, the noise quality, and the overall quality of noise-reduced speech. Based on the results, we then propose an overall quality estimation model. The unique point of the proposed model is that the overall quality is estimated using only the previously estimated speech quality and noise quality, in contrast to conventional models, which use extracted acoustical features. Finally, we propose a non-reference objective quality evaluation method using the proposed model. The results of an experiment with different noise reduction algorithms and noise types confirmed that the proposed method gives more accurate estimates of the overall quality than the method described in ITU-T Rec. P.563.
Panikos HERACLEOUS Satoshi NAKAMURA Takeshi YAMADA Kiyohiro SHIKANO
This paper describes a method for hands-free speech recognition, and particularly for the simultaneous recognition of multiple sound sources. The method is based on the 3-D Viterbi search, extended to a 3-D N-best search that enables the recognition of multiple sound sources. The baseline system integrates two existing technologies, the 3-D Viterbi search and the conventional N-best search, into a complete system. A first evaluation of the 3-D N-best search-based system showed that new ideas are necessary to develop a system for the simultaneous recognition of multiple sound sources. That evaluation identified two factors that play important roles in the performance of the system, namely the different likelihood ranges of the sound sources and the direction-based separation of the hypotheses. In order to address these problems, we implemented likelihood normalization and a path distance-based clustering technique in the baseline 3-D N-best search-based system. The performance of our system was evaluated through experiments on simulated data for the case of two talkers. The experiments showed significant improvements from the two techniques. The best results were obtained by using both techniques together with a microphone array composed of 32 channels. More specifically, the Word Accuracy for the two talkers was higher than 80% and the Simultaneous Word Accuracy (where both sources are correctly recognized simultaneously) was higher than 70%, which are very promising results.
Takeshi YAMADA Hideo SAITO Shinji OZAWA
This paper proposes a new method for reconstructing the shape of a skin surface replica from a sequence of shaded images taken with different light source directions. Since the shaded images include shadows caused by surface height fluctuation, as well as specular reflections and interreflections, the conventional photometric stereo method is not suitable for reconstructing the surface accurately. In the proposed method, we select measured intensities that do not include specular reflections, interreflections, or self-shadows, so that an accurate normal vector can be calculated from the selected intensities using SVD (Singular Value Decomposition). Experimental results on real images demonstrate that the proposed method is effective for shape reconstruction from shaded images that include specular reflections, interreflections, and self-shadows.
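The idea of estimating a surface normal from only the trustworthy intensities can be sketched as below, under a Lambertian model where each intensity equals albedo times the dot product of light direction and normal. Two points are assumptions rather than the paper's procedure: the selection rule is a simple pair of thresholds (`low`, `high`), and the least-squares system is solved through the normal equations with Cramer's rule instead of SVD, to keep the example dependency-free.

```python
import math

def solve_3x3(a, b):
    """Solve a 3x3 linear system a x = b by Cramer's rule."""
    def det(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    d = det(a)
    if abs(d) < 1e-12:
        raise ValueError("selected light directions are degenerate")
    solution = []
    for col in range(3):
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        solution.append(det(m) / d)
    return solution

def estimate_normal(lights, intensities, low=0.05, high=0.95):
    """Least-squares normal and albedo from intensities strictly inside (low, high)."""
    chosen = [(l, i) for l, i in zip(lights, intensities) if low < i < high]
    if len(chosen) < 3:
        raise ValueError("need at least three usable measurements")
    # Normal equations (L^T L) g = L^T i, where g = albedo * normal.
    ltl = [[sum(l[r] * l[c] for l, _ in chosen) for c in range(3)] for r in range(3)]
    lti = [sum(l[r] * i for l, i in chosen) for r in range(3)]
    g = solve_3x3(ltl, lti)
    albedo = math.sqrt(sum(v * v for v in g))
    return [v / albedo for v in g], albedo

# Toy pixel: true normal (0, 0, 1), albedo 0.8. The first reading is a
# saturated highlight and the last one is in shadow; both get excluded.
lights = [(0.0, 0.0, 1.0), (0.6, 0.0, 0.8), (-0.6, 0.0, 0.8),
          (0.0, 0.6, 0.8), (0.0, -0.6, 0.8), (0.8, 0.0, 0.6)]
intensities = [1.0, 0.64, 0.64, 0.64, 0.64, 0.0]

normal, albedo = estimate_normal(lights, intensities)
print([round(v, 3) for v in normal], round(albedo, 3))
```

Including the two corrupted readings would bias the fit, which is the failure of conventional photometric stereo that the abstract describes.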