Author Search Result

[Author] Tomoki TODA (25 hits)

Showing hits 1-20 of 25

  • A Statistical Sample-Based Approach to GMM-Based Voice Conversion Using Tied-Covariance Acoustic Models

    Shinnosuke TAKAMICHI  Tomoki TODA  Graham NEUBIG  Sakriani SAKTI  Satoshi NAKAMURA  

     
    PAPER-Voice conversion

      Publicized:
    2016/07/19
      Vol:
    E99-D No:10
      Page(s):
    2490-2498

    This paper presents a novel statistical sample-based approach for Gaussian Mixture Model (GMM)-based Voice Conversion (VC). Although GMM-based VC has the promising flexibility of model adaptation, the quality of converted speech is significantly worse than that of natural speech. This paper addresses the problem of inaccurate modeling, which is one of the main causes of this quality degradation. Recently, we have proposed statistical sample-based speech synthesis using rich context models for high-quality and flexible Hidden Markov Model (HMM)-based Text-To-Speech (TTS) synthesis. This method makes it possible not only to produce high-quality speech by introducing ideas from unit selection synthesis, but also to preserve the flexibility of the original HMM-based TTS. In this paper, we apply this idea to GMM-based VC. The rich context models are first trained for individual joint speech feature vectors, and then we gather them mixture by mixture to form a Rich context-GMM (R-GMM). In conversion, an iterative generation algorithm using R-GMMs is used to convert speech parameters, after initialization using over-trained probability distributions. Because the proposed method utilizes individual speech features, and its formulation is the same as that of conventional GMM-based VC, it makes it possible to produce high-quality speech while keeping the flexibility of the original GMM-based VC. The experimental results demonstrate that the proposed method yields significant improvements in terms of speech quality and speaker individuality in converted speech.
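
    As background, the sketch below implements the conventional GMM-based mapping that the R-GMM method builds on: a joint GMM over paired source/target features, with conversion by the mixture-weighted conditional mean. The toy data, model sizes, and variable names are illustrative assumptions, not the paper's setup.

    ```python
    # A minimal sketch of conventional GMM-based VC (not the paper's R-GMM).
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    dx = dy = 2
    # Toy parallel source/target features standing in for joint speech vectors.
    x = rng.normal(size=(500, dx))
    y = 0.8 * x + 0.1 * rng.normal(size=(500, dy))
    gmm = GaussianMixture(n_components=4, covariance_type="full").fit(np.hstack([x, y]))

    def convert(x_t):
        """MMSE conversion: weighted sum of per-mixture conditional means."""
        resp = np.zeros(gmm.n_components)
        cond = np.zeros((gmm.n_components, dy))
        for m in range(gmm.n_components):
            mu_x, mu_y = gmm.means_[m, :dx], gmm.means_[m, dx:]
            S = gmm.covariances_[m]
            Sxx, Syx = S[:dx, :dx], S[dx:, :dx]
            diff = x_t - mu_x
            # P(m | x) up to a constant, via the marginal N(x; mu_x, Sxx).
            resp[m] = gmm.weights_[m] * np.exp(
                -0.5 * diff @ np.linalg.solve(Sxx, diff)
            ) / np.sqrt(np.linalg.det(2 * np.pi * Sxx))
            cond[m] = mu_y + Syx @ np.linalg.solve(Sxx, diff)
        resp /= resp.sum()
        return resp @ cond

    print(convert(x[0]), "target was", y[0])
    ```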

  • Building an Effective Speech Corpus by Utilizing Statistical Multidimensional Scaling Method

    Goshu NAGINO  Makoto SHOZAKAI  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Corpus

      Vol:
    E91-D No:3
      Page(s):
    607-614

    This paper proposes a technique for building an effective speech corpus at lower cost by utilizing a statistical multidimensional scaling method, which visualizes multiple HMM acoustic models in two-dimensional space. First, a small number of voice samples per speaker is collected, and speaker-adapted acoustic models trained with the collected utterances are mapped into two-dimensional space using the statistical multidimensional scaling method. Next, speakers located in the periphery of the distribution in the plotted map are selected, and a speech corpus is built by collecting enough voice samples from the selected speakers. In an experiment for building an isolated-word speech corpus, the performance of an acoustic model trained with 200 selected speakers was equivalent to that of an acoustic model trained with 533 non-selected speakers, a cost reduction of more than 62%. In an experiment for building a continuous-word speech corpus, the performance of an acoustic model trained with 500 selected speakers was equivalent to that of an acoustic model trained with 1179 non-selected speakers, a cost reduction of more than 57%.
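
    The selection idea can be sketched as follows: embed per-speaker models in two-dimensional space from a precomputed inter-model distance matrix, then pick the speakers on the periphery of the map. The symmetric-KL distance between single diagonal Gaussians used here is a stand-in for the paper's inter-HMM distance, purely for illustration.

    ```python
    # A minimal sketch of peripheral-speaker selection via MDS.
    import numpy as np
    from sklearn.manifold import MDS

    rng = np.random.default_rng(1)
    n_spk, dim = 40, 13
    means = rng.normal(size=(n_spk, dim))          # per-speaker model means
    varis = rng.uniform(0.5, 2.0, size=(n_spk, dim))  # per-speaker variances

    def sym_kl(i, j):
        """Symmetric KL divergence between two diagonal Gaussians."""
        return 0.5 * np.sum((varis[i] + (means[i] - means[j]) ** 2) / varis[j]
                            + (varis[j] + (means[j] - means[i]) ** 2) / varis[i]) - dim

    D = np.array([[sym_kl(i, j) for j in range(n_spk)] for i in range(n_spk)])
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(D)
    # Peripheral speakers = farthest from the centroid of the 2-D map.
    radius = np.linalg.norm(coords - coords.mean(axis=0), axis=1)
    selected = np.argsort(radius)[-10:]
    print("peripheral speakers:", selected)
    ```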

  • Improving Rapid Unsupervised Speaker Adaptation Based on HMM-Sufficient Statistics in Noisy Environments Using Multi-Template Models

    Randy GOMEZ  Akinobu LEE  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Speech Recognition

      Vol:
    E89-D No:3
      Page(s):
    998-1005

    This paper describes a multi-template unsupervised speaker adaptation method based on HMM-Sufficient Statistics that improves adaptation performance while keeping the adaptation time within a few seconds, using just one arbitrary utterance. This adaptation scheme is composed of two processes. The first part is performed offline and involves training multiple class-dependent acoustic models and creating speakers' HMM-Sufficient Statistics based on gender and age. The second part is performed online, where adaptation begins with a single utterance from a test speaker. From this utterance, the system classifies the speaker's class and then selects the N-best neighbor speakers closest to the utterance using Gaussian Mixture Models (GMM). The classified speakers' class template model is then adopted as a base model. From this template model, the adapted model is rapidly constructed using the N-best neighbor speakers' HMM-Sufficient Statistics. Experiments are performed in office, crowd, booth, and car noise at 20 dB, 15 dB, and 10 dB SNR. The proposed multi-template method achieved an 89.5% word accuracy compared with 88.1% for the conventional single-template method, while the baseline recognition rate without adaptation is 86.4%. Comparisons with Vocal Tract Length Normalization (VTLN) and supervised Maximum Likelihood Linear Regression (MLLR) are also reported.
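
    A minimal sketch of the online N-best selection step might look like the following: score the single adaptation utterance against per-speaker GMMs and keep the top-N speakers, whose offline HMM-Sufficient Statistics would then be pooled. All sizes and names are toy assumptions.

    ```python
    # A minimal sketch of N-best neighbour speaker selection by GMM likelihood.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(2)
    dim, n_speakers, N_BEST = 13, 30, 5
    # One GMM per training speaker, fitted offline on that speaker's features.
    speaker_gmms = []
    for s in range(n_speakers):
        feats = rng.normal(loc=rng.normal(size=dim), size=(200, dim))
        speaker_gmms.append(GaussianMixture(n_components=4).fit(feats))

    test_utt = rng.normal(size=(120, dim))              # one test utterance
    scores = [g.score(test_utt) for g in speaker_gmms]  # mean log-likelihood/frame
    n_best = np.argsort(scores)[-N_BEST:][::-1]
    print("N-best neighbour speakers:", n_best)
    # The offline HMM-Sufficient Statistics of these speakers would then be
    # pooled to re-estimate the adapted model.
    ```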

  • A Hybrid Approach to Electrolaryngeal Speech Enhancement Based on Noise Reduction and Statistical Excitation Generation

    Kou TANAKA  Tomoki TODA  Graham NEUBIG  Sakriani SAKTI  Satoshi NAKAMURA  

     
    PAPER-Voice Conversion and Speech Enhancement

      Vol:
    E97-D No:6
      Page(s):
    1429-1437

    This paper presents an electrolaryngeal (EL) speech enhancement method capable of significantly improving the naturalness of EL speech while causing no degradation in its intelligibility. An electrolarynx is an external device that artificially generates excitation sounds to enable laryngectomees to produce EL speech. Although proficient laryngectomees can produce quite intelligible EL speech, it sounds very unnatural due to the mechanical excitation produced by the device. Moreover, the excitation sounds produced by the device often leak outside, adding noise to the EL speech. To address these issues, there are two main conventional approaches to EL speech enhancement: noise reduction and statistical voice conversion (VC). The former approach usually causes no degradation in intelligibility but yields only small improvements in naturalness, as the mechanical excitation sounds remain essentially unchanged. The latter approach significantly improves the naturalness of EL speech using spectral and excitation parameters of natural voices converted from acoustic parameters of EL speech, but it usually causes degradation in intelligibility owing to conversion errors. We propose a hybrid approach using a noise reduction method for enhancing spectral parameters and a statistical voice conversion method for predicting excitation parameters. Moreover, we further modify the prediction process of the excitation parameters to improve its prediction accuracy and reduce the adverse effects caused by unvoiced/voiced prediction errors. The experimental results demonstrate that the proposed method yields significant improvements in naturalness compared with EL speech while keeping intelligibility high.

  • NOCOA+: Multimodal Computer-Based Training for Social and Communication Skills

    Hiroki TANAKA  Sakriani SAKTI  Graham NEUBIG  Tomoki TODA  Satoshi NAKAMURA  

     
    PAPER-Educational Technology

      Publicized:
    2015/04/28
      Vol:
    E98-D No:8
      Page(s):
    1536-1544

    Non-verbal communication incorporating visual, audio, and contextual information is important to make sense of and navigate the social world. Individuals who have trouble with social situations often have difficulty recognizing these sorts of non-verbal social signals. In this article, we propose a training tool, NOCOA+ (Non-verbal COmmunication for Autism plus), that uses utterances in visual and audio modalities for non-verbal communication training. We describe the design of NOCOA+ and perform an experimental evaluation in which we examine its potential as a tool for computer-based training of non-verbal communication skills for people with social and communication difficulties. In a series of four experiments, we investigated 1) the effect of temporal context on the ability to recognize social signals in a testing context, 2) the effect of the presentation modality of social stimuli on the ability to recognize non-verbal information, 3) the correlation between autistic traits, as measured by the autism spectrum quotient (AQ), and non-verbal behavior recognition skills measured by NOCOA+, and 4) the effectiveness of computer-based training in improving social skills. We found that context information was helpful for recognizing non-verbal behaviors and that the effect differed by modality. The results also showed a significant relationship between the AQ communication and socialization scores and non-verbal communication skills, and that social skills were significantly improved through computer-based training.

  • Reducing Computation Time of the Rapid Unsupervised Speaker Adaptation Based on HMM-Sufficient Statistics

    Randy GOMEZ  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Speech and Hearing

      Vol:
    E90-D No:2
      Page(s):
    554-561

    In real-time speech recognition applications, there is a need for a fast and reliable adaptation algorithm. We propose a method to reduce the adaptation time of rapid unsupervised speaker adaptation based on HMM-Sufficient Statistics. We use only a single arbitrary utterance without transcriptions in selecting the N-best speakers' Sufficient Statistics, created offline, to provide data for adaptation to a target speaker. Reducing N further shortens adaptation time, but degrades recognition performance due to the insufficiency of data needed to robustly adapt the model. Linear interpolation with the global HMM-Sufficient Statistics offsets this negative effect and achieves a 50% reduction in adaptation time without compromising recognition performance. Furthermore, we compared our method with Vocal Tract Length Normalization (VTLN), Maximum A Posteriori (MAP) and Maximum Likelihood Linear Regression (MLLR), and tested in office, car, crowd, and booth noise environments at 10 dB, 15 dB, 20 dB, and 25 dB SNRs.
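
    The interpolation idea can be illustrated with the standard re-estimation of a Gaussian mean from zeroth- and first-order sufficient statistics, smoothed by global statistics; the interpolation weight tau below is an illustrative assumption, not the paper's tuned value.

    ```python
    # A minimal sketch of mean re-estimation from interpolated sufficient stats.
    import numpy as np

    def reestimate_mean(nbest_gamma, nbest_first, global_gamma, global_first, tau=0.2):
        """gamma: summed occupancies (scalars); first: gamma-weighted feature sums."""
        return (nbest_first + tau * global_first) / (nbest_gamma + tau * global_gamma)

    rng = np.random.default_rng(3)
    dim = 13
    nb_gamma, gl_gamma = 800.0, 50000.0               # N-best vs. global occupancy
    nb_first = rng.normal(size=dim) * nb_gamma        # N-best first-order stats
    gl_first = rng.normal(size=dim) * gl_gamma        # global first-order stats
    print(reestimate_mean(nb_gamma, nb_first, gl_gamma, gl_first))
    ```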

  • The Nitech-NAIST HMM-Based Speech Synthesis System for the Blizzard Challenge 2006

    Heiga ZEN  Tomoki TODA  Keiichi TOKUDA  

     
    PAPER-Speech and Hearing

      Vol:
    E91-D No:6
      Page(s):
    1764-1773

    We describe a statistical parametric speech synthesis system developed by a joint group from the Nagoya Institute of Technology (Nitech) and the Nara Institute of Science and Technology (NAIST) for the annual open evaluation of text-to-speech synthesis systems named Blizzard Challenge 2006. To improve our 2005 system (Nitech-HTS 2005), we investigated new features such as mel-generalized cepstrum-based line spectral pairs (MGC-LSPs), maximum likelihood linear transform (MLLT), and a full covariance global variance (GV) probability density function (pdf). A combination of mel-cepstral coefficients, MLLT, and full covariance GV pdf scored highest in subjective listening tests, and the 2006 system performed significantly better than the 2005 system. The Blizzard Challenge 2006 evaluations show that Nitech-NAIST-HTS 2006 is competitive even when working with relatively large speech databases.

  • Utterance-Based Selective Training for the Automatic Creation of Task-Dependent Acoustic Models

    Tobias CINCAREK  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Speech Recognition

      Vol:
    E89-D No:3
      Page(s):
    962-969

    To obtain a robust acoustic model for a certain speech recognition task, a large amount of speech data is necessary. However, the preparation of speech data, including recording and transcription, is very costly and time-consuming. Although there are attempts to build generic acoustic models which are portable among different applications, speech recognition performance is typically task-dependent. This paper introduces a method for automatically building task-dependent acoustic models based on selective training. Instead of setting up a new database, only a small amount of task-specific development data needs to be collected. Based on the likelihood of the target model parameters given this development data, utterances which are acoustically close to the development data are selected from existing speech data resources. Since there are in general too many possibilities for selecting a data subset from a larger database, a heuristic has to be employed. The proposed algorithm deletes single utterances temporarily or alternates between successive deletion and addition of multiple utterances. In order to make selective training computationally practical, model retraining and likelihood calculation need to be fast. It is shown that the model likelihood can be calculated quickly and easily from sufficient statistics, without the need for explicit reconstruction of model parameters. The algorithm is applied to obtain infant- and elderly-dependent acoustic models with only very little development data available. There is an improvement in word accuracy of up to 9% in comparison to conventional EM training without selection. Furthermore, the approach was also better than MLLR and MAP adaptation with the development data.
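
    To make the selection mechanics concrete, the sketch below performs likelihood-based utterance deletion with a single-Gaussian stand-in for the acoustic model: per-utterance sufficient statistics are pooled, so the dev-set likelihood after a tentative deletion can be scored without explicit retraining. The paper operates on HMMs; this is only an illustration of the principle.

    ```python
    # A minimal sketch of selective training via sufficient statistics.
    import numpy as np

    rng = np.random.default_rng(4)
    dim = 5
    dev = rng.normal(0.5, 1.0, size=(100, dim))       # task-specific dev data
    utts = [rng.normal(rng.uniform(-2, 2), 1.0, size=(rng.integers(20, 60), dim))
            for _ in range(30)]
    # Per-utterance sufficient statistics (count, sum, sum of squares).
    stats = [(len(u), u.sum(0), (u ** 2).sum(0)) for u in utts]

    def dev_loglik(selected):
        n = sum(stats[i][0] for i in selected)
        s1 = sum(stats[i][1] for i in selected)
        s2 = sum(stats[i][2] for i in selected)
        mu, var = s1 / n, s2 / n - (s1 / n) ** 2      # model rebuilt from stats
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (dev - mu) ** 2 / var)

    selected = set(range(len(utts)))
    improved = True
    while improved and len(selected) > 5:
        base = dev_loglik(selected)
        # Temporarily delete each utterance; keep the deletion that helps most.
        gains = {i: dev_loglik(selected - {i}) - base for i in selected}
        i_best, g_best = max(gains.items(), key=lambda kv: kv[1])
        improved = g_best > 0
        if improved:
            selected.remove(i_best)
    print(f"kept {len(selected)} of {len(utts)} utterances")
    ```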

  • Esophageal Speech Enhancement Based on Statistical Voice Conversion with Gaussian Mixture Models

    Hironori DOI  Keigo NAKAMURA  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Voice Conversion

      Vol:
    E93-D No:9
      Page(s):
    2472-2482

    This paper presents a novel method of enhancing esophageal speech using statistical voice conversion. Esophageal speech is one of the alternative speaking methods for laryngectomees. Although it does not require any external device, the generated voices usually sound unnatural compared with normal speech. To improve the intelligibility and naturalness of esophageal speech, we propose a voice conversion method from esophageal speech into normal speech. A spectral parameter and excitation parameters of target normal speech are separately estimated from a spectral parameter of the esophageal speech based on Gaussian mixture models. The experimental results demonstrate that the proposed method yields significant improvements in intelligibility and naturalness. We also apply one-to-many eigenvoice conversion to esophageal speech enhancement to make it possible to flexibly control the voice quality of the enhanced speech.

  • A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis

    Tomoki TODA  Keiichi TOKUDA  

     
    PAPER-Speech and Hearing

      Vol:
    E90-D No:5
      Page(s):
    816-824

    This paper describes a novel parameter generation algorithm for an HMM-based speech synthesis technique. The conventional algorithm generates a trajectory of static features that maximizes the likelihood of a given HMM for the parameter sequence consisting of static and dynamic features, under an explicit constraint between those two features. The generated trajectory is often excessively smoothed due to the statistical processing, and using over-smoothed speech parameters usually causes muffled sounds. In order to alleviate the over-smoothing effect, we propose a generation algorithm considering not only the HMM likelihood maximized in the conventional algorithm but also a likelihood for the global variance (GV) of the generated trajectory. The latter likelihood works as a penalty for over-smoothing, i.e., for a reduction of the GV of the generated trajectory. The result of a perceptual evaluation demonstrates that the proposed algorithm yields considerable improvements in the naturalness of synthetic speech.
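
    The trade-off can be sketched as gradient ascent on a toy one-dimensional trajectory: one gradient term pulls the trajectory toward the per-frame means (the HMM likelihood) and the other pushes its global variance toward a target value (the GV likelihood). The weights, step size, and data below are illustrative assumptions.

    ```python
    # A minimal sketch of GV-penalized parameter generation on toy data.
    import numpy as np

    T = 100
    mu = np.sin(np.linspace(0, 4 * np.pi, T))  # stand-in for per-frame HMM means
    prec = 4.0                                 # stand-in for HMM precision
    gv_target, gv_prec = 0.9, 50.0             # GV pdf mean and precision
    c = mu.copy()                              # start from the smoothed solution

    for _ in range(200):
        gv = np.var(c)
        # Gradient of the HMM log-likelihood term (quadratic around mu) ...
        grad = -prec * (c - mu)
        # ... plus the gradient of the GV log-likelihood penalty,
        # using d(var)/dc_t = 2 (c_t - mean(c)) / T.
        grad += -gv_prec * (gv - gv_target) * 2.0 / T * (c - c.mean())
        c += 1e-3 * grad

    print(f"GV before: {np.var(mu):.3f}  after: {np.var(c):.3f}  target: {gv_target}")
    ```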

  • Details of the Nitech HMM-Based Speech Synthesis System for the Blizzard Challenge 2005

    Heiga ZEN  Tomoki TODA  Masaru NAKAMURA  Keiichi TOKUDA  

     
    PAPER-Speech and Hearing

      Vol:
    E90-D No:1
      Page(s):
    325-333

    In January 2005, an open evaluation of corpus-based text-to-speech synthesis systems using common speech datasets, named Blizzard Challenge 2005, was conducted. The Nitech group participated in this challenge, entering an HMM-based speech synthesis system called Nitech-HTS 2005. This paper describes the technical details, building processes, and performance of our system. We first give an overview of the basic HMM-based speech synthesis system, and then describe new features integrated into Nitech-HTS 2005 such as STRAIGHT-based vocoding, HSMM-based acoustic modeling, and a speech parameter generation algorithm considering GV. Constructed Nitech-HTS 2005 voices can generate speech waveforms at 0.3 RT (real-time ratio) on a 1.6 GHz Pentium 4 machine, and the footprints of these voices are less than 2 Mbytes. Subjective listening tests showed that the naturalness and intelligibility of the Nitech-HTS 2005 voices were much better than expected.

  • Utilizing Human-to-Human Conversation Examples for a Multi Domain Chat-Oriented Dialog System

    Lasguido NIO  Sakriani SAKTI  Graham NEUBIG  Tomoki TODA  Satoshi NAKAMURA  

     
    PAPER-Dialog System

      Vol:
    E97-D No:6
      Page(s):
    1497-1505

    This paper describes the design and evaluation of a method for developing a chat-oriented dialog system by utilizing real human-to-human conversation examples from movie scripts and Twitter conversations. The aim of the proposed method is to build a conversational agent that can interact with users in as natural a fashion as possible, while reducing the time required for database design and collection. A number of challenging design issues we faced are described, including (1) constructing appropriate dialog corpora from raw movie scripts and Twitter data, and (2) developing a multi-domain chat-oriented dialog management system which can retrieve a proper system response based on the current user query. To build a dialog corpus, we propose a unit of conversation called a tri-turn (a trigram conversation turn), as well as extraction and semantic similarity analysis techniques to help ensure that the content extracted from raw movie/drama script files forms appropriate dialog-pair (query-response) examples. The constructed dialog corpora are then utilized in a data-driven dialog management system. Here, various approaches are investigated, including example-based dialog management (EBDM) and response generation using phrase-based statistical machine translation (SMT). In particular, we use two EBDM methods: syntactic-semantic similarity retrieval and TF-IDF based cosine similarity retrieval. Experiments are conducted to compare and contrast the EBDM and SMT approaches in building a chat-oriented dialog system, and we investigate a combined method that addresses the advantages and disadvantages of both approaches. System performance was evaluated based on objective metrics (semantic similarity and cosine similarity) and human subjective evaluation from a small user study. Experimental results show that the proposed filtering approach effectively improves performance, and that by combining the EBDM and SMT approaches, we can overcome the shortcomings of each.
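
    One of the two EBDM strategies, TF-IDF based cosine-similarity retrieval, can be sketched in a few lines: index the stored example queries, then return the response paired with the most similar one. The tiny corpus below is an illustrative placeholder, not the paper's data.

    ```python
    # A minimal sketch of TF-IDF cosine-similarity response retrieval.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    pairs = [  # (query, response) examples, e.g. extracted from tri-turns
        ("how are you doing", "pretty good, thanks for asking"),
        ("what did you think of the movie", "the ending surprised me"),
        ("do you want to grab lunch", "sure, how about noon"),
    ]
    queries = [q for q, _ in pairs]
    vec = TfidfVectorizer().fit(queries)
    index = vec.transform(queries)

    def respond(user_query):
        """Return the response paired with the most similar stored query."""
        sims = cosine_similarity(vec.transform([user_query]), index)[0]
        return pairs[sims.argmax()][1]

    print(respond("did you like that movie"))
    ```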

  • Evaluation of Extremely Small Sound Source Signals Used in Speaking-Aid System with Statistical Voice Conversion

    Keigo NAKAMURA  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Rehabilitation Engineering and Assistive Technology

      Vol:
    E93-D No:7
      Page(s):
    1909-1917

    We have previously proposed a speaking-aid system for laryngectomees using a statistical voice conversion technique. In the proposed system, artificial speech articulated with extremely small sound source signals is detected with a Non-Audible Murmur (NAM) microphone, and the detected artificial speech is then converted into a more natural voice in a probabilistic manner. Although this system basically allows laryngectomees to speak while keeping the external source signals silent, it remains an open question how much these new sound source signals affect the converted speech quality. In this paper, we investigate the impact of various sound source signals on voice conversion accuracy. Various small sound source signals are designed by changing the spectral envelope and the waveform power independently. We conduct objective and subjective evaluations. The results of these experimental evaluations demonstrate that voice conversion tolerates 1) various sound source signals with different spectral envelopes and 2) a large range of sound source signal power, unless the power of the speaking parts is almost equal to that of the silent parts. Moreover, we also investigate the effectiveness of enhancing auditory feedback while speaking with the extremely small sound source signals.

  • Daily Activity Recognition with Large-Scaled Real-Life Recording Datasets Based on Deep Neural Network Using Multi-Modal Signals

    Tomoki HAYASHI  Masafumi NISHIDA  Norihide KITAOKA  Tomoki TODA  Kazuya TAKEDA  

     
    PAPER-Engineering Acoustics

      Vol:
    E101-A No:1
      Page(s):
    199-210

    In this study, toward the development of a smartphone-based monitoring system for life logging, we collect over 1,400 hours of data recording both the outdoor and indoor daily activities of 19 subjects, under practical conditions with a smartphone and a small camera. We then construct a large human activity database consisting of environmental sound signals, triaxial acceleration signals, and manually annotated activity tags. Using our constructed database, we evaluate the activity recognition performance of deep neural networks (DNNs), which have achieved great performance in various fields, and apply DNN-based adaptation techniques to improve the performance with only a small amount of subject-specific training data. We experimentally demonstrate that: 1) the use of multi-modal signals, i.e., environmental sound and triaxial acceleration signals, with a DNN is effective for improving activity recognition performance, 2) the DNN can discriminate specified activities from a mixture of ambiguous activities, and 3) DNN-based adaptation methods are effective even if only a small amount of subject-specific training data is available.
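
    The multi-modal input idea can be sketched as follows: concatenate per-frame audio and triaxial-acceleration features and feed them to a small feed-forward network. The PyTorch model and feature dimensions below are illustrative assumptions, not the authors' architecture.

    ```python
    # A minimal sketch of a multi-modal feed-forward activity classifier.
    import torch
    import torch.nn as nn

    n_classes, d_audio, d_accel = 10, 40, 3 * 6   # e.g. log-mel + accel statistics
    model = nn.Sequential(
        nn.Linear(d_audio + d_accel, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, n_classes),
    )
    audio = torch.randn(32, d_audio)              # batch of audio features
    accel = torch.randn(32, d_accel)              # batch of acceleration features
    logits = model(torch.cat([audio, accel], dim=1))
    print(logits.shape)  # torch.Size([32, 10])
    ```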

  • Cost Reduction of Acoustic Modeling for Real-Environment Applications Using Unsupervised and Selective Training

    Tobias CINCAREK  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Acoustic Modeling

      Vol:
    E91-D No:3
      Page(s):
    499-507

    Development of an ASR application such as a speech-oriented guidance system for a real environment is expensive. Most of the costs are due to human labeling of newly collected speech data to construct the acoustic model for speech recognition. Employment of existing models or sharing models across multiple applications is often difficult, because the characteristics of speech depend on various factors such as the possible users, their speaking style, and the acoustic environment. Therefore, this paper proposes a combination of unsupervised learning and selective training to reduce development costs. The employment of unsupervised learning alone is problematic due to the task-dependency of speech recognition and because automatic transcription of speech is error-prone. A theoretically well-defined approach to automatic selection of high-quality and task-specific speech data from an unlabeled data pool is presented. Only those unlabeled data which increase the model likelihood given the labeled data are employed for unsupervised training. The effectiveness of the proposed method is investigated with a simulation experiment to construct adult and child acoustic models for a speech-oriented guidance system. A completely human-labeled database containing real-environment data collected over two years is available for the development simulation. It is shown experimentally that the employment of selective training alleviates the problems of unsupervised learning, i.e., it is possible to select speech utterances of a certain speaker group while discarding noise inputs and utterances with lower recognition accuracy. The simulation experiment is carried out for several selected combinations of data collection and human transcription periods. It is found empirically that the proposed method is especially effective if only relatively few of the collected data can be labeled and transcribed by humans.

  • Improvements of Voice Timbre Control Based on Perceived Age in Singing Voice Conversion

    Kazuhiro KOBAYASHI  Tomoki TODA  Tomoyasu NAKANO  Masataka GOTO  Satoshi NAKAMURA  

     
    PAPER-Speech and Hearing

      Publicized:
    2016/07/21
      Vol:
    E99-D No:11
      Page(s):
    2767-2777

    As one of the techniques enabling individual singers to produce varieties of voice timbre beyond their own physical constraints, a statistical voice timbre control technique based on perceived age has been developed. In this technique, the perceived age of a singing voice, which is the age of the singer as perceived by the listener, is used as one of the intuitively understandable measures to describe the voice characteristics of the singing voice. The use of statistical voice conversion (SVC) with a singer-dependent multiple-regression Gaussian mixture model (MR-GMM), which effectively models the voice timbre variations caused by a change of the perceived age, makes it possible for individual singers to manipulate the perceived ages of their own singing voices while retaining their own singer identities. However, several issues remain: 1) the controllable range of the perceived age is limited; 2) the quality of the converted singing voice is significantly degraded compared to that of a natural singing voice; and 3) each singer needs to sing the same phrase set as sung by a reference singer to develop the singer-dependent MR-GMM. To address these issues, we propose the following three methods: 1) a method using gender-dependent modeling to expand the controllable range of the perceived age; 2) a method using direct waveform modification based on spectrum differential to improve the quality of the converted singing voice; and 3) a rapid unsupervised adaptation method based on maximum a posteriori (MAP) estimation to easily develop the singer-dependent MR-GMM. The experimental results show that the proposed methods achieve a wider controllable range of the perceived age, a significant quality improvement of the converted singing voice, and the development of the singer-dependent MR-GMM using only a few arbitrary phrases as adaptation data.
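
    The MR-GMM's control mechanism can be sketched directly: each mixture mean is an affine function of a scalar perceived-age control value, so sliding that value shifts the converted timbre. The matrices below are random stand-ins for trained parameters.

    ```python
    # A minimal sketch of the multiple-regression mean structure of an MR-GMM.
    import numpy as np

    rng = np.random.default_rng(5)
    n_mix, dim = 8, 24
    bias = rng.normal(size=(n_mix, dim))     # b_m: mean at the neutral setting
    slope = rng.normal(size=(n_mix, dim))    # A_m: per-mixture regression vector

    def mixture_means(w):
        """Means of the MR-GMM for perceived-age control value w (scalar)."""
        return bias + w * slope

    for w in (-1.0, 0.0, 1.0):               # e.g. younger / neutral / older
        print(w, mixture_means(w)[0, :3])
    ```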

  • Stereophonic Music Separation Based on Non-Negative Tensor Factorization with Cepstral Distance Regularization

    Shogo SEKI  Tomoki TODA  Kazuya TAKEDA  

     
    PAPER-Engineering Acoustics

      Vol:
    E101-A No:7
      Page(s):
    1057-1064

    This paper proposes a semi-supervised source separation method for stereophonic music signals containing multiple recorded or processed signals, focusing on synthesized music. As synthesized music signals are often generated as linear combinations of many individual source signals and their respective mixing gains, phase or phase-difference information between inter-channel signals, which represents the spatial characteristics of recording environments, cannot be utilized as an acoustic clue for source separation. Non-negative Tensor Factorization (NTF) is an effective technique for this problem, decomposing the amplitude spectrograms of the stereo channel music signals into basis vectors and activations of individual music source signals, along with their corresponding mixing gains. However, it is difficult to achieve sufficient separation performance using this method alone, as the acoustic clues available for separation are limited. To address this issue, this paper proposes a Cepstral Distance Regularization (CDR) method for NTF-based stereo channel separation, which makes the cepstrum of the separated source signals follow Gaussian Mixture Models (GMMs) of the corresponding music source signals. These GMMs are trained in advance using available samples. Experimental evaluations separating three and four sound sources are conducted to investigate the effectiveness of the proposed method in both supervised and semi-supervised separation frameworks, and performance is also compared with that of a conventional NTF method. Experimental results demonstrate that the proposed method yields significant improvements within both separation frameworks, and that cepstral distance regularization provides better separation parameters.
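
    The underlying NTF model (before adding the cepstral regularizer) can be sketched with Euclidean multiplicative updates: shared spectral bases and activations, plus per-channel mixing gains. The random "spectrograms" and sizes below are illustrative placeholders.

    ```python
    # A minimal sketch of NTF: |V| ~ sum_k g[c,k] W[f,k] H[k,t], fitted by
    # Euclidean multiplicative updates on a toy stereo "spectrogram".
    import numpy as np

    rng = np.random.default_rng(6)
    C, F, T, K = 2, 64, 100, 6               # channels, freq bins, frames, bases
    V = rng.uniform(size=(C, F, T))          # stand-in amplitude spectrograms
    g = rng.uniform(size=(C, K))             # per-channel mixing gains
    W = rng.uniform(size=(F, K))             # shared spectral bases
    H = rng.uniform(size=(K, T))             # shared activations
    eps = 1e-9

    def vhat():
        return np.einsum('ck,fk,kt->cft', g, W, H)

    for _ in range(100):
        Vh = vhat()
        W *= np.einsum('cft,ck,kt->fk', V, g, H) / (np.einsum('cft,ck,kt->fk', Vh, g, H) + eps)
        Vh = vhat()
        H *= np.einsum('cft,ck,fk->kt', V, g, W) / (np.einsum('cft,ck,fk->kt', Vh, g, W) + eps)
        Vh = vhat()
        g *= np.einsum('cft,fk,kt->ck', V, W, H) / (np.einsum('cft,fk,kt->ck', Vh, W, H) + eps)

    print("reconstruction error:", np.linalg.norm(V - vhat()))
    ```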

  • Voice Timbre Control Based on Perceived Age in Singing Voice Conversion

    Kazuhiro KOBAYASHI  Tomoki TODA  Hironori DOI  Tomoyasu NAKANO  Masataka GOTO  Graham NEUBIG  Sakriani SAKTI  Satoshi NAKAMURA  

     
    PAPER-Voice Conversion and Speech Enhancement

      Vol:
    E97-D No:6
      Page(s):
    1419-1428

    The perceived age of a singing voice is the age of the singer as perceived by the listener, and is one of the notable characteristics that determine perceptions of a song. In this paper, we describe an investigation of acoustic features that have an effect on the perceived age, and a novel voice timbre control technique based on the perceived age for singing voice conversion (SVC). Singers can sing expressively by controlling prosody and voice timbre, but the varieties of voices that singers can produce are limited by physical constraints. Previous work has attempted to overcome this limitation through the use of statistical voice conversion. This technique makes it possible to convert the singing voice timbre of an arbitrary source singer into that of an arbitrary target singer. However, it is still difficult to intuitively control singing voice characteristics by manipulating parameters corresponding to specific physical traits, such as gender and age. In this paper, we first investigate the factors that play a part in the listener's perception of the singer's age. Then, we apply a multiple-regression Gaussian mixture model (MR-GMM) to SVC for the purpose of controlling voice timbre based on the perceived age, and we propose SVC based on the modified MR-GMM for manipulating the perceived age while maintaining the singer's individuality. The experimental results show that 1) the perceived age of singing voices corresponds relatively well to the actual age of the singer, 2) prosodic features have a larger effect on the perceived age than spectral features, 3) the individuality of a singer is influenced more heavily by segmental features than prosodic features, and 4) the proposed voice timbre control method makes it possible to change the singer's perceived age without adversely affecting the perceived individuality.

  • A Vibration Control Method of an Electrolarynx Based on Statistical F0 Pattern Prediction

    Kou TANAKA  Tomoki TODA  Satoshi NAKAMURA  

     
    PAPER-Rehabilitation Engineering and Assistive Technology

      Publicized:
    2017/05/23
      Vol:
    E100-D No:9
      Page(s):
    2165-2173

    This paper presents a novel speaking-aid system to help laryngectomees produce more natural-sounding electrolaryngeal (EL) speech. An electrolarynx is an external device that generates excitation signals in place of vibration of the vocal folds. Although conventional EL speech is quite intelligible, its naturalness suffers from the unnatural fundamental frequency (F0) patterns of the mechanically generated excitation signals. To improve the naturalness of EL speech, we have proposed EL speech enhancement methods using statistical F0 pattern prediction. In these methods, the original EL speech recorded by a microphone is presented from a loudspeaker after the speech enhancement is performed. These methods are effective in some situations, such as telecommunication, but they are not suitable for face-to-face conversation because the listener hears the original EL speech as well as the enhanced EL speech. In this paper, to develop an EL speech enhancement that is also effective for face-to-face conversation, we propose a method for directly controlling the F0 patterns of the excitation signals generated by the electrolarynx using statistical F0 prediction. To get an "actual feel" of the proposed system, we also implement a prototype. Using the prototype, we found latency issues caused by real-time processing. To address these latency issues, we further propose segmental continuous F0 pattern modeling and forthcoming F0 pattern modeling. Through evaluations in simulation, we demonstrate that the proposed system effectively addresses the latency issues and those of the conventional electrolarynx in terms of naturalness.

  • Non-Native Text-to-Speech Preserving Speaker Individuality Based on Partial Correction of Prosodic and Phonetic Characteristics

    Yuji OSHIMA  Shinnosuke TAKAMICHI  Tomoki TODA  Graham NEUBIG  Sakriani SAKTI  Satoshi NAKAMURA  

     
    PAPER-Speech and Hearing

      Publicized:
    2016/08/30
      Vol:
    E99-D No:12
      Page(s):
    3132-3139

    This paper presents a novel non-native speech synthesis technique that preserves the individuality of a non-native speaker. Cross-lingual speech synthesis based on voice conversion or Hidden Markov Model (HMM)-based speech synthesis is a technique to synthesize foreign-language speech using a target speaker's natural speech uttered in his/her mother tongue. Although the technique holds promise to improve a wide variety of applications, it tends to degrade the target speaker's individuality in synthetic speech compared to intra-lingual speech synthesis. This paper proposes a new approach to speech synthesis that preserves speaker individuality by using non-native speech spoken by the target speaker. Although the use of non-native speech makes it possible to preserve the speaker individuality in the synthesized target speech, naturalness is significantly degraded, as the synthesized speech waveform is directly affected by the unnatural prosody and pronunciation often caused by differences in the linguistic systems of the source and target languages. To improve naturalness while preserving speaker individuality, we propose (1) a prosody correction method based on model adaptation, and (2) a phonetic correction method based on spectrum replacement for unvoiced consonants. The experimental results using English speech uttered by native Japanese speakers demonstrate that (1) the proposed methods are capable of significantly improving naturalness while preserving the speaker individuality in synthetic speech, and (2) the proposed methods also improve intelligibility, as confirmed by a dictation test.
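
    The phonetic-correction idea of (2) can be sketched as a frame-wise replacement: frames labelled as unvoiced consonants take their spectral features from time-aligned native speech, while everything else stays from the target speaker. Labels and features below are toy placeholders, not the paper's data.

    ```python
    # A minimal sketch of spectrum replacement for unvoiced-consonant frames.
    import numpy as np

    rng = np.random.default_rng(7)
    T, dim = 50, 25
    target_spec = rng.normal(size=(T, dim))     # non-native target speaker frames
    native_spec = rng.normal(size=(T, dim))     # time-aligned native frames
    is_unvoiced = rng.random(T) < 0.2           # e.g. from phoneme alignment

    corrected = np.where(is_unvoiced[:, None], native_spec, target_spec)
    print(f"replaced {is_unvoiced.sum()} of {T} frames")
    ```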
