Tetsuji OGAWA Tetsunori KOBAYASHI
The accuracy of simulation-based assessments of speech recognition systems under noisy conditions is investigated, with a focus on the influence of the Lombard effect on speech recognition performance. This investigation was carried out under various recognition conditions: different sound pressure levels of ambient noise, different recognition tasks (continuous speech recognition and spoken word recognition), and different recognition systems, i.e., systems with and without adaptation of the acoustic models to ambient noise. Experimental results showed that accurate simulation was not always achieved when dry sources with a neutral talking style were used, but it could be achieved when the dry sources included the influence of the Lombard effect; in the latter case, the simulation was accurate irrespective of the recognition conditions.
Tadashi OKUBO Ryo MOCHIZUKI Tetsunori KOBAYASHI
We propose a hybrid voice conversion method that combines HMM-based unit selection and spectrum generation. In the proposed method, when candidates for the target unit exist in the target speaker's corpus, HMM-based unit selection selects the most likely unit for the required phoneme context. Unit selection is performed based on the sequence of spectral probability distributions obtained from the adapted HMMs. When a target unit does not exist in the corpus, a target waveform is instead generated from the adapted HMM sequence by maximizing the spectral likelihood. The proposed method also employs an HMM in which the spectral probability distribution is adjusted to the target prosody using weights defined by the prosodic probability of each distribution. To show the effectiveness of the proposed method, sound quality and speaker individuality tests were conducted. The results revealed that the proposed method produces high-quality speech whose individuality is closer to the target speaker than that of conventional methods.
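The select-or-generate decision at the core of the hybrid method can be illustrated compactly. The sketch below is a minimal Python illustration, assuming diagonal-covariance Gaussian state emissions and candidate units already time-aligned to the adapted HMM state sequence; the function names, the data layout, and the per-frame threshold are hypothetical, not the authors' implementation.

```python
import numpy as np

def log_likelihood(frames, means, variances):
    """Gaussian log-likelihood of a unit's spectral frames against the
    mean/variance sequence of the adapted HMM states (all (T, D) arrays)."""
    diff = frames - means
    return -0.5 * np.sum(np.log(2 * np.pi * variances) + diff ** 2 / variances)

def convert_unit(candidates, means, variances, threshold=-50.0):
    """Select the most likely corpus unit; if none scores well enough,
    fall back to maximum-likelihood generation (here: the Gaussian means)."""
    best_unit, best_score = None, -np.inf
    for unit in candidates:                     # candidate units from the corpus
        score = log_likelihood(unit, means, variances)
        if score > best_score:
            best_unit, best_score = unit, score
    if best_unit is not None and best_score / len(means) > threshold:
        return best_unit                        # unit-selection branch
    return means                                # spectrum-generation branch
```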
Kazuya UEKI Tetsunori KOBAYASHI
An age-group classification method based on a fusion of classifiers built on different two-dimensional feature extraction algorithms is proposed. Theoretically, integrating multiple classifiers can provide better performance than a single classifier. In this paper, we extract effective features from one sample image using different dimensionality reduction methods, construct a classifier in each subspace, and combine them to reduce age-group classification errors. As the dimensionality reduction methods, two-dimensional PCA (2DPCA) and two-dimensional LDA (2DLDA) are used. These algorithms are antisymmetric in their treatment of the rows and columns of the images, so we prepared row-based and column-based variants to obtain two classifiers with different error tendencies. By combining classifiers whose errors differ, performance can be improved. Experimental results show that our fusion-based age-group classification method achieves better performance than the existing two-dimensional algorithms alone.
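Because 2DPCA treats rows and columns antisymmetrically, swapping their roles yields two classifiers with different error tendencies. The following is a minimal Python sketch of that fusion, assuming nearest-centroid classification in each subspace and a simple sum of distances as the combination rule; all names and the fusion rule are illustrative assumptions.

```python
import numpy as np

def twodpca_axes(images, k, rowwise=True):
    """Top-k eigenvectors of the 2DPCA image scatter matrix; `rowwise`
    switches between the row-based and column-based variants."""
    mean = images.mean(axis=0)
    if rowwise:
        scatter = sum((a - mean).T @ (a - mean) for a in images)
    else:
        scatter = sum((a - mean) @ (a - mean).T for a in images)
    vals, vecs = np.linalg.eigh(scatter)
    return vecs[:, np.argsort(vals)[::-1][:k]]

def project(a, axes, rowwise=True):
    return a @ axes if rowwise else axes.T @ a

def fused_classify(a, centroids_row, centroids_col, axes_row, axes_col):
    """Combine the two classifiers by summing their per-class distances."""
    d_row = {c: np.linalg.norm(project(a, axes_row) - m)
             for c, m in centroids_row.items()}
    d_col = {c: np.linalg.norm(project(a, axes_col, rowwise=False) - m)
             for c, m in centroids_col.items()}
    return min(d_row, key=lambda c: d_row[c] + d_col[c])
```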
Tetsuji OGAWA Shintaro TAKADA Kenzo AKAGIRI Tetsunori KOBAYASHI
We propose a new speech enhancement method suitable for mobile devices used in the presence of various types of noise. To achieve high-performance speech recognition and auditory perception on mobile devices, various types of noise have to be removed under the constraints of a space-saving microphone arrangement and limited computational resources. The proposed method reduces both directional and diffuse noise under these constraints by employing a square microphone array and low-computational-cost processing that consists of multiple null beamforming, minimum-power channel selection, and Wiener filtering. The effectiveness of the proposed method is experimentally verified in terms of speech recognition accuracy and speech quality when directional and diffuse noise are observed simultaneously: the method reduces word errors and improves log-spectral distance compared with conventional methods.
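A rough frequency-domain sketch of the three stages follows. The delay-and-subtract beamformer, per-frame channel selection, and floored Wiener gain are generic textbook forms, not the paper's exact processing, and the array geometry is left abstract.

```python
import numpy as np

def null_beamform(X_a, X_b, delay, freqs):
    """Delay-and-subtract pair beamformer: places a spatial null in the
    direction corresponding to the inter-microphone delay (seconds).
    X_a, X_b: STFT matrices of shape (n_freq, n_frames)."""
    return X_a - X_b * np.exp(-2j * np.pi * freqs[:, None] * delay)

def select_min_power(channels):
    """Per frame, keep the beamformer output with the smallest power,
    i.e., the one that best cancelled the directional noise."""
    stacked = np.stack(channels)                    # (n_ch, n_freq, n_frames)
    powers = np.mean(np.abs(stacked) ** 2, axis=1)  # (n_ch, n_frames)
    idx = np.argmin(powers, axis=0)                 # winning channel per frame
    frames = np.arange(stacked.shape[2])
    return stacked[idx, :, frames].T                # (n_freq, n_frames)

def wiener_postfilter(X, noise_psd, floor=0.1):
    """Suppress the remaining diffuse noise with a floored Wiener gain."""
    gain = 1.0 - noise_psd[:, None] / (np.abs(X) ** 2 + 1e-12)
    return np.maximum(gain, floor) * X
```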
Shoei SATO Akio KOBAYASHI Kazuo ONOE Shinichi HOMMA Toru IMAI Tohru TAKAGI Tetsunori KOBAYASHI
We present a novel method of integrating the likelihoods of multiple feature streams, representing different acoustic aspects, for robust speech recognition. The integration algorithm dynamically calculates a frame-wise stream weight so that a higher weight is given to a stream that is robust to a variety of noisy environments and speaking styles; such a robust stream is expected to show high discriminative ability. A conventional method proposed for the recognition of spoken digits calculates the weights from the entropy of the whole set of HMM states. This paper extends dynamic weighting to a real-time large-vocabulary continuous speech recognition (LVCSR) system. The proposed weight is calculated in real time from the mutual information between an input stream and the active HMM states in the search space, without additional likelihood calculations. Furthermore, the mutual information takes the width of the search space into account by calculating the marginal entropy from the number of active states. In this paper, we integrate three features that are extracted through auditory filters, motivated by the human auditory system's ability to extract amplitude and frequency modulations: features representing energy, amplitude drift, and resonant-frequency drift, which are expected to provide complementary cues for speech recognition. Speech recognition experiments on field reports and spontaneous commentary from Japanese broadcast news showed that the proposed method reduced word errors by 9.2% for field reports and 4.7% for spontaneous commentaries relative to the best result obtained from a single stream.
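Following the description above, the frame-wise weight can be sketched as the mutual information between a stream and the active states: the marginal entropy, log N for N active states, minus the entropy of the stream's posterior over those states. The normalization into weights below is an assumption for illustration.

```python
import numpy as np

def stream_weights(stream_logliks):
    """stream_logliks: one array of state log-likelihoods per stream,
    each over the currently active HMM states of one frame."""
    infos = []
    for ll in stream_logliks:
        p = np.exp(ll - ll.max())
        p /= p.sum()                              # posterior over active states
        entropy = -np.sum(p * np.log(p + 1e-12))
        infos.append(np.log(len(ll)) - entropy)   # marginal minus posterior entropy
    infos = np.asarray(infos)
    return infos / (infos.sum() + 1e-12)          # normalized frame-wise weights

def combined_loglik(stream_logliks):
    """Weighted sum of the per-stream log-likelihoods for one frame."""
    w = stream_weights(stream_logliks)
    return sum(wi * ll for wi, ll in zip(w, stream_logliks))
```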
Shigeki OKAWA Takashi ENDO Tetsunori KOBAYASHI Katsuhiko SHIRAI
In this paper, a new scheme for phrase recognition in conversational speech is proposed, in which prosodic and phonemic information processing are usefully combined. This approach is employed both to produce candidates for phrase boundaries and to discriminate phonemes. The fundamental frequency patterns of continuous utterances are statistically analyzed, and the likelihood of the occurrence of a phrase boundary is calculated for every frame. At the same time, the likelihood of the phonemic characteristics of each frame is obtained using a hierarchical clustering method. These two scores, along with lexical and grammatical constraints, can be effectively utilized to construct possible word sequences or word lattices corresponding to the continuous speech utterances. Our preliminary experiment shows the feasibility of applying prosody to continuous speech recognition, especially for conversational-style utterances.
Yosuke SATO Tetsuji OGAWA Tetsunori KOBAYASHI
We propose a modified Hidden Markov Model (HMM) with a view to improving gesture recognition using a moving camera. The conventional HMM is formulated to deal with only one feature candidate per frame. For a mobile robot, however, the background and lighting conditions are always changing, which makes feature extraction difficult: it is almost impossible to extract a single reliable feature vector under such conditions. In this paper, we define a new gesture recognition framework in which multiple candidate feature vectors are generated with confidence measures, and the HMM is extended to deal with these multiple feature vectors. Experimental results comparing the proposed system with feature vectors based on DCT and with a method that selects only one candidate feature point verify the effectiveness of the proposed technique.
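One natural reading of the extension is a forward algorithm whose emission term marginalizes over the frame's candidate feature vectors, weighted by their confidence measures. The sketch below assumes exactly that form; the `emit` callable and the per-frame scaling are hypothetical details.

```python
import numpy as np

def forward_multi(pi, A, emit, frames):
    """pi: (S,) initial probabilities; A: (S, S) transition matrix;
    emit(s, x): emission likelihood of feature vector x in state s;
    frames: per frame, a list of (candidate_vector, confidence) pairs."""
    def b(s, cands):
        # confidence-weighted mixture over the frame's candidates
        return sum(conf * emit(s, x) for x, conf in cands)
    S = len(pi)
    alpha = pi * np.array([b(s, frames[0]) for s in range(S)])
    for cands in frames[1:]:
        alpha = (alpha @ A) * np.array([b(s, cands) for s in range(S)])
        alpha /= alpha.sum()          # rescale to avoid numerical underflow
    return alpha
```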
Yosuke MATSUSAKA Tsuyoshi TOJO Tetsunori KOBAYASHI
We developed a conversation system that can participate in a group conversation. A group conversation is a form of conversation in which three or more participants talk to each other about a topic on an equal footing. Conventional conversation systems have been designed under the assumption that the system talks with only one person. Group conversation differs from this setting in the following points. It is necessary for the system to understand the conversational situation, such as who is speaking, to whom they are speaking, and to whom the other participants are paying attention. It is also necessary for the system itself to try to influence the situation appropriately. In this study, we realized the function of recognizing the conversational situation by combining image processing and acoustic processing, and the function of working on the conversational situation by utilizing the facial and body actions of the robot. Thus, a robot that can join in a group conversation was realized.
Ryo MOCHIZUKI Tetsunori KOBAYASHI
A low-band spectrum envelope reconstruction method was tested to see whether it could improve the sound quality of F0-modified speech produced with the PSOLA (Pitch-Synchronous OverLap-Add) method. In the conventional PSOLA method, the spectrum envelope extracted using a Hanning window of two-pitch-period length carries no reliable information in the band below the original F0. This causes degradation of the F0-modified speech when the F0 is shifted downward. In the proposed method, the low-band spectrum envelope is modified according to the F0 modification rate: the amplitudes of the harmonic components in the low band are reproduced based on the spectral tilt of the spectrum envelope. Subjective listening tests suggest that the proposed method yields better sound quality than the conventional TD-PSOLA method when the downward modification exceeds 0.4 octave.
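The tilt-based reproduction of the low-band harmonics can be sketched as a linear fit in the log-amplitude domain, extrapolated to the new harmonics that fall below the original F0 after a downward shift. The fit range `n_fit` and the linear-tilt form are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

def lowband_harmonics(log_env, freqs, f0_orig, f0_new, n_fit=4):
    """Fit a line (spectral tilt) to the first reliable harmonics above the
    original F0, then extrapolate amplitudes for new harmonics below it."""
    fit_f = f0_orig * (1 + np.arange(n_fit))        # reliable harmonic positions
    fit_a = np.interp(fit_f, freqs, log_env)        # their log amplitudes
    slope, intercept = np.polyfit(fit_f, fit_a, 1)  # spectral tilt
    new_f = np.arange(f0_new, f0_orig, f0_new)      # new low-band harmonics
    return new_f, slope * new_f + intercept         # extrapolated log amplitudes
```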
Tetsuji OGAWA Tetsunori KOBAYASHI
Discriminative modeling is applied to optimize the structure of a Partly-Hidden Markov Model (PHMM). The PHMM was proposed in our previous work to deal with complicated temporal changes of acoustic features; it can represent observation-dependent behavior in both observations and state transitions. In the previous formulation of the PHMM, we used a common structure for all models. However, the optimal structure, which gives the best performance, can be expected to differ from category to category. In this paper, we design a new structure optimization method in which the dependence of the states and observations of the PHMM is defined optimally for each model using the weighted likelihood-ratio maximization (WLRM) criterion. The WLRM criterion gives high discriminability between the correct category and incorrect categories, and therefore yields model structures with good discriminative performance. We define as optimal the combination of model structures that satisfies the WLRM criterion over all possible structure combinations, and apply a genetic algorithm as an adequate approximation of a full search. Results on continuous lecture-speech recognition show the effectiveness of the proposed structure optimization: it reduced word errors compared with an HMM and with a PHMM using a common structure for all models.
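The genetic search over per-category structure assignments might look like the following, with the WLRM criterion abstracted as a black-box fitness function; the population size, truncation selection, and mutation rate are arbitrary illustrative settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def ga_optimize(n_categories, n_structures, wlrm_score,
                pop=20, gens=50, p_mut=0.05):
    """Each chromosome assigns one structure index per category; fitness is
    the WLRM criterion evaluated on the whole assignment."""
    population = rng.integers(n_structures, size=(pop, n_categories))
    for _ in range(gens):
        fitness = np.array([wlrm_score(c) for c in population])
        parents = population[np.argsort(fitness)[-pop // 2:]]   # keep the best half
        children = []
        while len(children) < pop:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_categories)                 # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            mutate = rng.random(n_categories) < p_mut           # random resetting
            child[mutate] = rng.integers(n_structures, size=mutate.sum())
            children.append(child)
        population = np.array(children)
    return max(population, key=wlrm_score)
```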
Naohiro TAWARA Atsunori OGAWA Tomoharu IWATA Hiroto ASHIKAWA Tetsunori KOBAYASHI Tetsuji OGAWA
Most conventional multi-source domain adaptation techniques for recurrent neural network language models (RNNLMs) are domain-centric: each domain is considered independently, which makes it difficult to apply the models to completely unseen target domains that are unobservable during training. Instead, our study exploits domain attributes, which represent knowledge shared among different domains, such as dialects, types of wording, styles, and topics, to achieve domain generalization that can robustly represent unseen target domains by combining the domain attributes. To achieve attribute-based domain generalization in language modeling, we introduce domain-attribute-based experts, instead of domain-based experts, into a multi-stream RNNLM called the recurrent adaptive mixture model (RADMM). In the proposed system, a long short-term memory network is independently trained on each domain attribute as an expert model. Then, by integrating the outputs of all the experts according to context-dependent weights over the domain attributes of the current input, we predict the subsequent words in the unseen target domain while exploiting the specific knowledge of each domain attribute. To demonstrate the effectiveness of our attribute-centric language model, we experimentally compared it with a conventional domain-centric language model using texts taken from multiple domains with different writing styles, topics, dialects, and types of wording. The experimental results demonstrated that lower perplexity can be achieved using domain attributes.
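In PyTorch, an attribute-based mixture of expert LMs in the spirit of RADMM can be sketched as below: one LSTM expert per domain attribute, whose word distributions are mixed with context-dependent gate weights. The layer sizes, gating input, and all names are assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class AttributeMixtureLM(nn.Module):
    def __init__(self, vocab, emb=128, hid=256, n_attr=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        # one LSTM expert per domain attribute
        self.experts = nn.ModuleList(
            [nn.LSTM(emb, hid, batch_first=True) for _ in range(n_attr)])
        self.heads = nn.ModuleList(
            [nn.Linear(hid, vocab) for _ in range(n_attr)])
        self.gate = nn.Linear(emb, n_attr)       # context-dependent attribute weights

    def forward(self, tokens):
        x = self.embed(tokens)                   # (B, T, emb)
        w = torch.softmax(self.gate(x), dim=-1)  # (B, T, n_attr)
        logps = []
        for lstm, head in zip(self.experts, self.heads):
            h, _ = lstm(x)
            logps.append(torch.log_softmax(head(h), dim=-1))
        logps = torch.stack(logps, dim=-1)       # (B, T, vocab, n_attr)
        # mix word probabilities with the gate weights, per time step
        mix = (logps.exp() * w.unsqueeze(2)).sum(-1)
        return mix.clamp_min(1e-12).log()        # (B, T, vocab) log-probs
```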
Naoya MOCHIKI Tetsuji OGAWA Tetsunori KOBAYASHI
A new type of sound source segregation method using robot-mounted microphones, which is free from strict head-related transfer function (HRTF) estimation, has been proposed and successfully applied to the recognition of three simultaneous speech signals. The proposed segregation method exploits the sound intensity differences that arise from the particular arrangement of four directivity microphones and the presence of the robot head acting as a sound barrier. The method consists of three layers of signal processing: two-line SAFIA (binary masking based on narrow-band sound intensity comparison), two-line spectral subtraction, and their integration. We performed a 20K-vocabulary continuous speech recognition test in the presence of three simultaneous talkers and achieved more than 70% word error reduction compared with the case without any segregation processing.
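The first layer, SAFIA-style binary masking, assigns each time-frequency bin to whichever directivity microphone receives it more intensely. A minimal sketch, with the decision margin as an illustrative parameter:

```python
import numpy as np

def safia_mask(X_left, X_right, margin_db=3.0):
    """X_left, X_right: STFT matrices from two directivity microphones.
    Bins whose narrow-band intensity difference exceeds the margin are
    assigned exclusively to the stronger channel; ambiguous bins are kept
    in both channels."""
    eps = 1e-12
    ratio_db = 20 * np.log10((np.abs(X_left) + eps) / (np.abs(X_right) + eps))
    left = np.where(ratio_db > -margin_db, X_left, 0.0)
    right = np.where(ratio_db < margin_db, X_right, 0.0)
    return left, right
```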
Tetsuji OGAWA Kazuya UEKI Tetsunori KOBAYASHI
We propose a novel method of supervised feature projection called class-distance-based discriminant analysis (CDDA), which is suitable for automatic age estimation (AAE) from facial images. Most methods of supervised feature projection, e.g., Fisher discriminant analysis (FDA) and local Fisher discriminant analysis (LFDA), focus only on whether two samples belong to the same class (i.e., the same age in AAE) or not. However, even when an AAE system's estimated age differs from the correct age, i.e., the system makes an error, smaller errors are preferable. To capture this characteristic of AAE, CDDA determines between-class separability according to the class distance (i.e., the difference in ages): two samples with similar ages are constrained to be close, and those with widely separated ages to be far apart. Furthermore, we propose an extension of CDDA called local CDDA (LCDDA), which aims at handling multimodality in the samples. Experimental results revealed that CDDA and LCDDA can extract more discriminative features than FDA and LFDA.
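The idea of scaling separability by age difference can be sketched with pairwise-weighted scatter matrices: a decaying weight pulls similar-age pairs together while a weight growing with the age gap pushes distant pairs apart. The particular weighting functions below are illustrative assumptions, not the paper's definitions.

```python
import numpy as np
from scipy.linalg import eigh

def cdda_projection(X, ages, k, sigma=5.0):
    """X: (n, d) features; ages: (n,) labels. Returns a (d, k) projection."""
    n, d = X.shape
    Sw = np.zeros((d, d))                      # 'pull together' scatter
    Sb = np.zeros((d, d))                      # 'push apart' scatter
    for i in range(n):
        for j in range(i + 1, n):
            diff = np.outer(X[i] - X[j], X[i] - X[j])
            gap = abs(ages[i] - ages[j])
            Sw += np.exp(-gap / sigma) * diff  # similar ages: keep close
            Sb += gap * diff                   # distant ages: separate strongly
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(d))   # generalized eigenproblem
    return vecs[:, np.argsort(vals)[::-1][:k]]     # top-k discriminant axes
```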
Naoya MOCHIKI Tetsuji OGAWA Tetsunori KOBAYASHI
We propose a new type of direction-of-arrival estimation method for robot audition that is free from strict head-related transfer function estimation. The proposed method is based on statistical pattern recognition, employing the ratio of power spectrum amplitudes of a microphone pair as a feature vector. It does not require any explicit phase information, which is frequently used in conventional techniques, because phase information is unreliable when strong reflections and diffractions occur around the microphones; the adopted feature vectors handle these influences naturally. The effectiveness of the proposed method was shown in direction-of-arrival estimation tests over 19 directions: errors were reduced by 92.4% compared with a conventional phase-based method.
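A phase-free feature of this kind, and a simple direction classifier over it, can be sketched as follows; the per-direction Gaussian class models and the data layout are assumptions for illustration.

```python
import numpy as np

def ratio_feature(X_a, X_b):
    """Log power ratio per frequency band for one microphone pair;
    no phase information is used."""
    pa = np.mean(np.abs(X_a) ** 2, axis=1)     # average power over frames
    pb = np.mean(np.abs(X_b) ** 2, axis=1)
    return np.log(pa + 1e-12) - np.log(pb + 1e-12)

def estimate_doa(feature, class_models):
    """class_models: {direction: (mean_vector, variance_vector)} learned
    from training data; returns the direction with the highest score."""
    def gaussian_score(mean, var):
        return -np.sum(np.log(var) + (feature - mean) ** 2 / var)
    return max(class_models, key=lambda d: gaussian_score(*class_models[d]))
```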
Kenji HASHIMOTO Takemi MOCHIDA Yasuaki SATO Tetsunori KOBAYASHI Katsuhiko SHIRAI
For the production of high-quality synthetic sounds in a text-to-speech system, an excellent method for synthesizing speech signals is indispensable. In this paper, a new speech analysis-synthesis method for text-to-speech systems is proposed. The signals of voiced speech, which have a line spectrum structure at pitch intervals in the linear frequency domain, can be represented approximately by the superposition of sinusoidal waves. In our system, analysis and synthesis are performed using this harmonic structure of voiced speech. In the analysis phase, assuming an exact harmonic structure model at pitch intervals against the fine structure of the short-time power spectrum, the fundamental frequency f0 is decided so as to minimize the error of the log-power spectrum at each peak position. At the same time, according to the value of this minimized error, the degree of periodicity of the speech signal is determined. The log-power spectrum envelope is then represented by a cosine series interpolating data sampled at every pitch period. In the synthesis phase, numerical solutions of non-linear differential equations that generate sinusoidal waves are used. For voiced sounds, these equations behave as a group of mutually synchronized oscillators, and the sinusoidal waves are superposed so as to reconstruct the line spectrum structure. For voiceless sounds, the non-linear differential equations work as passive filters with input noise sources. Our system has the following characteristics. (1) Voiced and voiceless sounds can be treated in the same framework. (2) Since the phase and power of each sinusoidal wave can be easily controlled, periodic waveforms in voiced sounds can be precisely reproduced in the time domain when necessary. (3) The fundamental frequency f0 and phoneme durations can be easily changed without much degradation of the original sound quality.
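Two of the core steps admit compact sketches: choosing f0 so that its harmonic comb best aligns with the observed spectral peak positions, and resynthesizing by superposing harmonic sinusoids (the role played by the synchronized oscillators). Both are simplified stand-ins for the paper's formulation, with hypothetical interfaces.

```python
import numpy as np

def estimate_f0(peak_freqs, f0_grid):
    """Pick the f0 whose integer multiples best match the detected
    spectral peak positions (least-squares alignment error)."""
    errors = [np.sum((peak_freqs - f0 * np.round(peak_freqs / f0)) ** 2)
              for f0 in f0_grid]
    return f0_grid[int(np.argmin(errors))]

def synthesize(f0, amplitudes, phases, sr, n_samples):
    """Superpose harmonic sinusoids with the given amplitudes and phases."""
    t = np.arange(n_samples) / sr
    return sum(a * np.cos(2 * np.pi * (k + 1) * f0 * t + p)
               for k, (a, p) in enumerate(zip(amplitudes, phases)))
```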
Satoru HAYAMIZU Shuichi ITAHASHI Tetsunori KOBAYASHI Toshiyuki TAKEZAWA
This paper describes issues concerning dialogue corpora for speech and natural language research. Speech and text corpora of dialogue have recently become more important for the development and evaluation of speech and text-based dialogue systems. However, the design and construction of dialogue corpora remain research issues in themselves, and many problems have not yet been clarified. Many kinds of corpora are necessary to study the various aspects of dialogue; on the other hand, each corpus should contain a certain quantity of data for its purpose in order to be statistically meaningful. This paper presents issues related to the design and creation of dialogue corpora: the selection of a task domain, transcription conventions, situations for data collection, syntactic and semantic ill-formedness, and politeness. Future directions for the creation of dialogue corpora are also discussed.