
Author Search Result

[Author] Graham NEUBIG (10 hits)

Results 1-10 of 10
  • Utilizing Human-to-Human Conversation Examples for a Multi Domain Chat-Oriented Dialog System

    Lasguido NIO  Sakriani SAKTI  Graham NEUBIG  Tomoki TODA  Satoshi NAKAMURA  

     
    PAPER-Dialog System

    Vol: E97-D No:6  Page(s): 1497-1505

    This paper describes the design and evaluation of a method for developing a chat-oriented dialog system by utilizing real human-to-human conversation examples from movie scripts and Twitter conversations. The aim of the proposed method is to build a conversational agent that can interact with users in as natural a fashion as possible, while reducing the time requirement for database design and collection. A number of the challenging design issues we faced are described, including (1) constructing appropriate dialog corpora from raw movie scripts and Twitter data, and (2) developing a multi-domain chat-oriented dialog management system that can retrieve a proper system response based on the current user query. To build a dialog corpus, we propose a unit of conversation called a tri-turn (a trigram conversation turn), as well as extraction and semantic similarity analysis techniques to help ensure that the content extracted from raw movie/drama script files forms appropriate dialog-pair (query-response) examples. The constructed dialog corpora are then utilized in a data-driven dialog management system. Here, various approaches are investigated, including example-based dialog management (EBDM) and response generation using phrase-based statistical machine translation (SMT). In particular, we use two EBDM methods: syntactic-semantic similarity retrieval and TF-IDF based cosine similarity retrieval. Experiments are conducted to compare and contrast the EBDM and SMT approaches in building a chat-oriented dialog system, and we investigate a combined method that addresses the advantages and disadvantages of both approaches. System performance was evaluated based on objective metrics (semantic similarity and cosine similarity) and human subjective evaluation from a small user study. Experimental results show that the proposed filtering approach effectively improves performance. Furthermore, the results also show that by combining the EBDM and SMT approaches, we can overcome the shortcomings of each.
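
    As an illustration of one of the two EBDM methods, TF-IDF based cosine similarity retrieval, the sketch below scores a user query against stored example queries and returns the paired response. The tiny corpus and the scikit-learn-based implementation are assumptions for illustration, not the paper's actual system.

```python
# Minimal sketch of TF-IDF cosine-similarity retrieval for an
# example-based dialog manager (EBDM). The corpus below is
# hypothetical; the paper builds its pairs from movie scripts
# and Twitter conversations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# (query, response) example pairs extracted from conversation data
pairs = [
    ("how are you doing today", "pretty good, thanks for asking"),
    ("what movie should i watch", "have you seen anything by kurosawa"),
    ("i am feeling really tired", "maybe you should get some rest"),
]

vectorizer = TfidfVectorizer()
query_matrix = vectorizer.fit_transform([q for q, _ in pairs])

def retrieve_response(user_utterance: str) -> str:
    # Score the user query against every stored example query and
    # return the response paired with the most similar one.
    sims = cosine_similarity(vectorizer.transform([user_utterance]),
                             query_matrix)
    return pairs[sims.argmax()][1]

print(retrieve_response("how are you"))
```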

  • Voice Timbre Control Based on Perceived Age in Singing Voice Conversion

    Kazuhiro KOBAYASHI  Tomoki TODA  Hironori DOI  Tomoyasu NAKANO  Masataka GOTO  Graham NEUBIG  Sakriani SAKTI  Satoshi NAKAMURA  

     
    PAPER-Voice Conversion and Speech Enhancement

    Vol: E97-D No:6  Page(s): 1419-1428

    The perceived age of a singing voice is the age of the singer as perceived by the listener, and is one of the notable characteristics that determine perceptions of a song. In this paper, we describe an investigation of acoustic features that have an effect on the perceived age, and a novel voice timbre control technique based on the perceived age for singing voice conversion (SVC). Singers can sing expressively by controlling prosody and voice timbre, but the variety of voices that singers can produce is limited by physical constraints. Previous work has attempted to overcome this limitation through the use of statistical voice conversion, which makes it possible to convert the singing voice timbre of an arbitrary source singer into that of an arbitrary target singer. However, it is still difficult to intuitively control singing voice characteristics by manipulating parameters corresponding to specific physical traits, such as gender and age. In this paper, we first investigate the factors that play a part in the listener's perception of the singer's age. Then, we apply a multiple-regression Gaussian mixture model (MR-GMM) to SVC for the purpose of controlling voice timbre based on the perceived age, and we propose SVC based on a modified MR-GMM for manipulating the perceived age while maintaining the singer's individuality. The experimental results show that 1) the perceived age of singing voices corresponds relatively well to the actual age of the singer, 2) prosodic features have a larger effect on the perceived age than spectral features, 3) the individuality of a singer is influenced more heavily by segmental features than prosodic features, and 4) the proposed voice timbre control method makes it possible to change the singer's perceived age while not having an adverse effect on the perceived individuality.
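
    The core idea of the MR-GMM is that each mixture component's mean is expressed as a linear function of a scalar control weight, here tied to perceived age. A minimal numpy sketch of that mean manipulation follows; the dimensions and random parameter values are placeholders, not values from the paper.

```python
# Sketch of the multiple-regression GMM (MR-GMM) mean manipulation:
# each mixture's target-feature mean is a linear function of a scalar
# control weight w (here, perceived age). Values are placeholders.
import numpy as np

num_mix, dim = 4, 3
bias = np.random.randn(num_mix, dim)        # b_m: bias mean vectors
regression = np.random.randn(num_mix, dim)  # a_m: regression vectors

def mixture_means(w: float) -> np.ndarray:
    # mu_m(w) = b_m + a_m * w, one mean vector per mixture component
    return bias + regression * w

young = mixture_means(-1.0)  # shift timbre toward a younger voice
old = mixture_means(+1.0)    # shift timbre toward an older voice
```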

  • Bayesian Learning of a Language Model from Continuous Speech

    Graham NEUBIG  Masato MIMURA  Shinsuke MORI  Tatsuya KAWAHARA  

     
    PAPER-Speech and Hearing

    Vol: E95-D No:2  Page(s): 614-625

    We propose a novel scheme to learn a language model (LM) for automatic speech recognition (ASR) directly from continuous speech. In the proposed method, we first generate phoneme lattices using an acoustic model with no linguistic constraints, then perform training over these phoneme lattices, simultaneously learning both lexical units and an LM. As a statistical framework for this learning problem, we use non-parametric Bayesian statistics, which make it possible to balance the learned model's complexity (such as the size of the learned vocabulary) and expressive power, and provide a principled learning algorithm through the use of Gibbs sampling. Implementation is performed using weighted finite state transducers (WFSTs), which allow for the simple handling of lattice input. Experimental results on natural, adult-directed speech demonstrate that LMs built using only continuous speech are able to significantly reduce ASR phoneme error rates. The proposed technique of joint Bayesian learning of lexical units and an LM over lattices is shown to significantly contribute to this improvement.
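
    As a rough illustration of the Gibbs sampling involved, the toy sketch below samples word boundaries in a plain character string under a Dirichlet process unigram model. The paper's actual method operates over phoneme lattices with WFSTs and a more sophisticated model, so this is only a conceptual analogue.

```python
# Toy Gibbs sampler for unsupervised word segmentation under a
# Dirichlet process unigram model (conceptual analogue only).
import random
from collections import Counter

ALPHA, P_CHAR, P_END = 2.0, 1.0 / 27, 0.5

def base_prob(w):
    # Crude geometric base measure over character strings.
    return (P_CHAR ** len(w)) * P_END

def pred_prob(w, counts, total):
    # Dirichlet process predictive probability of word w.
    return (counts[w] + ALPHA * base_prob(w)) / (total + ALPHA)

def words(text, bounds):
    cuts = [0] + sorted(bounds) + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]

def gibbs_segment(text, iters=500):
    bounds = set(random.sample(range(1, len(text)), len(text) // 2))
    for _ in range(iters):
        for i in range(1, len(text)):
            # Words adjacent to candidate boundary position i.
            left = max([0] + [b for b in bounds if b < i])
            right = min([len(text)] + [b for b in bounds if b > i])
            bounds.discard(i)
            counts = Counter(words(text, bounds))
            counts[text[left:right]] -= 1  # hold out the merged word
            total = sum(counts.values())
            # Compare "one word" vs "split at i" hypotheses.
            p_join = pred_prob(text[left:right], counts, total)
            p_split = pred_prob(text[left:i], counts, total)
            counts[text[left:i]] += 1
            p_split *= pred_prob(text[i:right], counts, total + 1)
            if random.random() < p_split / (p_join + p_split):
                bounds.add(i)
    return words(text, bounds)

print(gibbs_segment("thedogranthedogranthedogran"))
```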

  • Non-Native Text-to-Speech Preserving Speaker Individuality Based on Partial Correction of Prosodic and Phonetic Characteristics

    Yuji OSHIMA  Shinnosuke TAKAMICHI  Tomoki TODA  Graham NEUBIG  Sakriani SAKTI  Satoshi NAKAMURA  

     
    PAPER-Speech and Hearing

    Publicized: 2016/08/30
    Vol: E99-D No:12  Page(s): 3132-3139

    This paper presents a novel non-native speech synthesis technique that preserves the individuality of a non-native speaker. Cross-lingual speech synthesis based on voice conversion or Hidden Markov Model (HMM)-based speech synthesis is a technique to synthesize foreign-language speech using a target speaker's natural speech uttered in his/her mother tongue. Although the technique holds promise to improve a wide variety of applications, it tends to degrade the target speaker's individuality in synthetic speech compared to intra-lingual speech synthesis. This paper proposes a new approach to speech synthesis that preserves speaker individuality by using non-native speech spoken by the target speaker. Although the use of non-native speech makes it possible to preserve speaker individuality in the synthesized target speech, naturalness is significantly degraded, as the synthesized speech waveform is directly affected by the unnatural prosody and pronunciation often caused by differences in the linguistic systems of the source and target languages. To improve naturalness while preserving speaker individuality, we propose (1) a prosody correction method based on model adaptation, and (2) a phonetic correction method based on spectrum replacement for unvoiced consonants. The experimental results using English speech uttered by native Japanese speakers demonstrate that (1) the proposed methods are capable of significantly improving naturalness while preserving the speaker individuality in synthetic speech, and (2) the proposed methods also improve intelligibility, as confirmed by a dictation test.
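
    A minimal sketch of the spectrum-replacement idea behind the phonetic correction method: frames identified as unvoiced consonants in the non-native utterance have their spectral envelopes replaced with time-aligned native ones. The aligned inputs and the per-frame mask are simplifying assumptions; the paper's actual procedure is more involved.

```python
import numpy as np

def replace_unvoiced_spectra(target_sp, native_sp, unvoiced_mask):
    # target_sp, native_sp: (frames, bins) time-aligned spectral
    # envelopes; unvoiced_mask: boolean per frame, True where the
    # frame is an unvoiced consonant to be corrected.
    corrected = target_sp.copy()
    corrected[unvoiced_mask] = native_sp[unvoiced_mask]
    return corrected
```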

  • Structured Adaptive Regularization of Weight Vectors for a Robust Grapheme-to-Phoneme Conversion Model

    Keigo KUBO  Sakriani SAKTI  Graham NEUBIG  Tomoki TODA  Satoshi NAKAMURA  

     
    PAPER-Speech Synthesis and Related Topics

    Vol: E97-D No:6  Page(s): 1468-1476

    Grapheme-to-phoneme (g2p) conversion, used to estimate the pronunciations of out-of-vocabulary (OOV) words, is a highly important part of recognition systems, as well as text-to-speech systems. The current state-of-the-art approach in g2p conversion is structured learning based on the Margin Infused Relaxed Algorithm (MIRA), an online discriminative training method for multiclass classification. However, it is known that the aggressive weight update method of MIRA is prone to overfitting, even when the current example is an outlier or noisy. Adaptive Regularization of Weight Vectors (AROW) has been proposed to resolve this problem for binary classification. In addition, AROW's update rule is simpler and more efficient than that of MIRA, allowing for more efficient training. Despite these advantages, AROW has not yet been applied to g2p conversion. In this paper, we present the first application of AROW to the g2p conversion task, which is a structured learning problem. In an evaluation that employed a dataset generated from the collective knowledge on the Web, our proposed approach achieves a 6.8% error reduction rate compared to MIRA in terms of phoneme error rate. The learning time of our proposed approach was also shorter than that of MIRA on almost all datasets.
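
    For reference, the basic AROW update rule for binary classification (labels y in {-1, +1}) looks like the sketch below; the structured-learning variant used for g2p in the paper extends this, and the hyperparameter r is a placeholder.

```python
# Basic AROW update: maintain a Gaussian over weight vectors with
# mean mu and covariance sigma, updating both when hinge loss occurs.
import numpy as np

def arow_update(mu, sigma, x, y, r=1.0):
    margin = y * mu.dot(x)
    if margin >= 1.0:            # no loss: leave the model unchanged
        return mu, sigma
    v = x.dot(sigma).dot(x)      # confidence (variance along x)
    beta = 1.0 / (v + r)
    alpha = (1.0 - margin) * beta
    mu = mu + alpha * y * sigma.dot(x)  # mean update
    # covariance update: sigma - beta * sigma x x^T sigma
    sigma = sigma - beta * np.outer(sigma.dot(x), x.dot(sigma))
    return mu, sigma
```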

  • Enhancing Event-Related Potentials Based on Maximum a Posteriori Estimation with a Spatial Correlation Prior

    Hayato MAKI  Tomoki TODA  Sakriani SAKTI  Graham NEUBIG  Satoshi NAKAMURA  

     
    PAPER

    Publicized: 2016/04/01
    Vol: E99-D No:6  Page(s): 1437-1446

    This paper addresses a new method for removing noise from single-trial event-related potentials recorded with a multi-channel electroencephalogram. An observed signal is separated into multiple signals with a multi-channel Wiener filter whose coefficients are estimated based on parameter estimation of a probabilistic generative model that locally models the amplitude of each separated signal in the time-frequency domain. The effectiveness of using prior information about covariance matrices for estimating the model parameters, and of frequency-dependent covariance matrices, was shown through an experiment with a simulated event-related potential data set.
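
    The separation step uses a per-bin multi-channel Wiener filter of the standard form W = R_s (R_s + R_n)^{-1}; a minimal numpy sketch is below. The paper's contribution is estimating these covariance matrices by MAP estimation with a spatial correlation prior, which the sketch takes as given.

```python
import numpy as np

def mwf_separate(obs, R_s, R_n):
    # obs: (channels,) complex STFT observation at one time-frequency
    # bin; R_s, R_n: target and noise spatial covariance matrices.
    W = R_s @ np.linalg.inv(R_s + R_n)  # multi-channel Wiener filter
    return W @ obs                      # estimated target component
```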

  • A Statistical Sample-Based Approach to GMM-Based Voice Conversion Using Tied-Covariance Acoustic Models

    Shinnosuke TAKAMICHI  Tomoki TODA  Graham NEUBIG  Sakriani SAKTI  Satoshi NAKAMURA  

     
    PAPER-Voice conversion

    Publicized: 2016/07/19
    Vol: E99-D No:10  Page(s): 2490-2498

    This paper presents a novel statistical sample-based approach for Gaussian Mixture Model (GMM)-based Voice Conversion (VC). Although GMM-based VC has the promising flexibility of model adaptation, the quality of converted speech is significantly worse than that of natural speech. This paper addresses the problem of inaccurate modeling, which is one of the main causes of this quality degradation. Recently, we have proposed statistical sample-based speech synthesis using rich context models for high-quality and flexible Hidden Markov Model (HMM)-based Text-To-Speech (TTS) synthesis. This method makes it possible not only to produce high-quality speech by introducing ideas from unit selection synthesis, but also to preserve the flexibility of the original HMM-based TTS. In this paper, we apply this idea to GMM-based VC. The rich context models are first trained for individual joint speech feature vectors, and we then gather them mixture by mixture to form a Rich context-GMM (R-GMM). In conversion, an iterative generation algorithm using R-GMMs is used to convert speech parameters, after initialization using over-trained probability distributions. Because the proposed method utilizes individual speech features and its formulation is the same as that of conventional GMM-based VC, it can produce high-quality speech while keeping the flexibility of the original GMM-based VC. The experimental results demonstrate that the proposed method yields significant improvements in terms of speech quality and speaker individuality in converted speech.
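
    For context, the conventional GMM-based VC mapping that the proposed R-GMM method builds on converts each source frame to the posterior-weighted sum of per-mixture conditional means, as in the sketch below. Parameter shapes are placeholders, and the paper's iterative R-GMM generation algorithm is not shown.

```python
# Conventional GMM-based VC mapping: E[y|x] under a joint GMM over
# source (x) and target (y) features. Shapes: mu_x (M, dx),
# mu_y (M, dy), cov_xx (M, dx, dx), cov_yx (M, dy, dx).
import numpy as np
from scipy.stats import multivariate_normal

def convert_frame(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    # Posterior responsibility of each mixture given source frame x.
    post = np.array([w * multivariate_normal.pdf(x, m, c)
                     for w, m, c in zip(weights, mu_x, cov_xx)])
    post /= post.sum()
    # Posterior-weighted sum of per-mixture conditional means.
    y = np.zeros(mu_y.shape[1])
    for m in range(len(weights)):
        cond = mu_y[m] + cov_yx[m] @ np.linalg.solve(cov_xx[m],
                                                     x - mu_x[m])
        y += post[m] * cond
    return y
```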

  • Neural Network Approaches to Dialog Response Retrieval and Generation

    Lasguido NIO  Sakriani SAKTI  Graham NEUBIG  Koichiro YOSHINO  Satoshi NAKAMURA  

     
    PAPER-Spoken dialog system

    Publicized: 2016/07/19
    Vol: E99-D No:10  Page(s): 2508-2517

    In this work, we propose a new statistical model for building robust dialog systems using neural networks to either retrieve or generate dialog responses based on existing data sources. In the retrieval task, we propose an approach that uses paraphrase identification during the retrieval process. This is done by employing recursive autoencoders and dynamic pooling to determine whether two sentences of arbitrary length have the same meaning. For both the generation and retrieval tasks, we propose a model using long short-term memory (LSTM) neural networks that works by first using an LSTM encoder to read the user's utterance into a continuous vector-space representation, then using an LSTM decoder to generate the most probable word sequence. An evaluation based on objective and subjective metrics shows that the newly proposed approaches deal better with user inputs that are not well covered in the database than standard example-based dialog baselines.
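
    A minimal PyTorch sketch of the LSTM encoder-decoder generation model described above; the vocabulary size, dimensions, and training loop are assumptions/omitted, so this is only illustrative of the architecture, not the paper's implementation.

```python
# LSTM encoder-decoder (seq2seq) for dialog response generation:
# encode the user utterance into a state, decode the response.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Encode the utterance into a fixed-size (h, c) state, then
        # decode the response conditioned on that state.
        _, state = self.encoder(self.embed(src_ids))
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)  # per-step vocabulary logits
```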

  • A Hybrid Approach to Electrolaryngeal Speech Enhancement Based on Noise Reduction and Statistical Excitation Generation

    Kou TANAKA  Tomoki TODA  Graham NEUBIG  Sakriani SAKTI  Satoshi NAKAMURA  

     
    PAPER-Voice Conversion and Speech Enhancement

    Vol: E97-D No:6  Page(s): 1429-1437

    This paper presents an electrolaryngeal (EL) speech enhancement method capable of significantly improving the naturalness of EL speech while causing no degradation in its intelligibility. An electrolarynx is an external device that artificially generates excitation sounds to enable laryngectomees to produce EL speech. Although proficient laryngectomees can produce quite intelligible EL speech, it sounds very unnatural due to the mechanical excitation produced by the device. Moreover, the excitation sounds produced by the device often leak outside, adding to the EL speech as noise. To address these issues, there are two main conventional approaches to EL speech enhancement, based on either noise reduction or statistical voice conversion (VC). The former approach usually causes no degradation in intelligibility but yields only small improvements in naturalness, as the mechanical excitation sounds remain essentially unchanged. On the other hand, the latter approach significantly improves the naturalness of EL speech using spectral and excitation parameters of natural voices converted from the acoustic parameters of EL speech, but it usually causes degradation in intelligibility owing to conversion errors. We propose a hybrid approach that uses a noise reduction method for enhancing spectral parameters and a statistical voice conversion method for predicting excitation parameters. Moreover, we further modify the prediction process of the excitation parameters to improve its prediction accuracy and reduce the adverse effects caused by unvoiced/voiced prediction errors. The experimental results demonstrate that the proposed method yields significant improvements in naturalness compared with EL speech while keeping intelligibility high.
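
    The noise-reduction half of the hybrid approach can be illustrated with plain magnitude spectral subtraction in the STFT domain, as sketched below. The paper's actual noise reduction method and parameters may differ, and noise_profile is a hypothetical per-frequency estimate of the leaked excitation sound.

```python
# Generic magnitude spectral subtraction: subtract an estimated
# noise magnitude per frequency bin, keeping the noisy phase.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(signal, noise_profile, fs=16000, floor=0.05):
    f, t, spec = stft(signal, fs=fs)
    mag, phase = np.abs(spec), np.angle(spec)
    # Floor the result to avoid negative magnitudes (some musical
    # noise remains; this is the classic trade-off of the method).
    clean_mag = np.maximum(mag - noise_profile[:, None], floor * mag)
    _, out = istft(clean_mag * np.exp(1j * phase), fs=fs)
    return out
```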

  • NOCOA+: Multimodal Computer-Based Training for Social and Communication Skills

    Hiroki TANAKA  Sakriani SAKTI  Graham NEUBIG  Tomoki TODA  Satoshi NAKAMURA  

     
    PAPER-Educational Technology

    Publicized: 2015/04/28
    Vol: E98-D No:8  Page(s): 1536-1544

    Non-verbal communication incorporating visual, audio, and contextual information is important for making sense of and navigating the social world. Individuals who have trouble with social situations often have difficulty recognizing these sorts of non-verbal social signals. In this article, we propose a training tool, NOCOA+ (Non-verbal COmmunication for Autism plus), that uses utterances in visual and audio modalities for non-verbal communication training. We describe the design of NOCOA+ and further perform an experimental evaluation in which we examine its potential as a tool for computer-based training of non-verbal communication skills for people with social and communication difficulties. In a series of four experiments, we investigated 1) the effect of temporal context on the ability to recognize social signals in a testing context, 2) the effect of the modality in which a social stimulus is presented on the ability to recognize non-verbal information, 3) the correlation between autistic traits as measured by the autism spectrum quotient (AQ) and non-verbal behavior recognition skills measured by NOCOA+, and 4) the effectiveness of computer-based training in improving social skills. We found that context information was helpful for recognizing non-verbal behaviors, and that the effect of modality differed. The results also showed a significant relationship between the AQ communication and socialization scores and non-verbal communication skills, and that social skills were significantly improved through computer-based training.