The search functionality is under construction.

Keyword Search Results

[Keyword] voice (140 hits)

Showing 1-20 of 140 hits

  • Reservoir-Based 1D Convolution: Low-Training-Cost AI Open Access

    Yuichiro TANAKA  Hakaru TAMUKOH

    LETTER-Neural Networks and Bioengineering
    Publicized: 2023/09/11  Vol: E107-A No:6  Page(s): 941-944

    In this study, we introduce a reservoir-based one-dimensional (1D) convolutional neural network that processes time-series data at a low computational cost, and we investigate its performance and training time. Experimental results show that the proposed network requires less computation for training and outperforms conventional reservoir computing in a sound-classification task.
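
    The sketch below illustrates the general low-training-cost idea behind reservoir approaches: a fixed random recurrent reservoir expands the input time series, and only a lightweight linear readout is trained. It is a plain echo-state-style sketch with illustrative sizes and rates; the paper's specific 1D convolutional architecture is not reproduced here.

    ```python
    # Sketch: fixed random reservoir + trained linear readout (ridge regression).
    # Sizes, leak rate, and spectral radius are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)

    def run_reservoir(x, n_res=200, leak=0.3, rho=0.9):
        """Drive a fixed random reservoir with a 1D input sequence x."""
        w_in = rng.uniform(-0.5, 0.5, (n_res, 1))
        w = rng.standard_normal((n_res, n_res))
        w *= rho / np.max(np.abs(np.linalg.eigvals(w)))   # set spectral radius
        h = np.zeros(n_res)
        states = []
        for x_t in x:
            h = (1 - leak) * h + leak * np.tanh(w_in @ np.atleast_1d(x_t) + w @ h)
            states.append(h.copy())
        return np.array(states)                            # shape (T, n_res)

    def fit_readout(states, targets, lam=1e-3):
        """Train only the readout: ridge regression from states to targets."""
        s = states
        return np.linalg.solve(s.T @ s + lam * np.eye(s.shape[1]), s.T @ targets)
    ```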

  • INmfCA Algorithm for Training of Nonparallel Voice Conversion Systems Based on Non-Negative Matrix Factorization

    Hitoshi SUDA  Gaku KOTANI  Daisuke SAITO

    PAPER-Speech and Hearing
    Publicized: 2022/03/03  Vol: E105-D No:6  Page(s): 1196-1210

    In this paper, we propose a new training framework named the INmfCA algorithm for nonparallel voice conversion (VC) systems. To train conversion models, traditional VC frameworks require parallel corpora, in which source and target speakers utter the same linguistic contents. Although these frameworks have achieved high-quality VC, they are not applicable in situations where parallel corpora are unavailable. To acquire conversion models without parallel corpora, nonparallel methods have been widely studied. Although such frameworks achieve VC under nonparallel conditions, they tend to require extensive background knowledge or many training utterances. This is because of the difficulty of disentangling linguistic and speaker information without a large amount of data. In this work, we tackle this problem by exploiting non-negative matrix factorization (NMF), which can factorize acoustic features into time-variant and time-invariant components in an unsupervised manner. The method acquires an alignment between the acoustic features of a source speaker's utterances and a target dictionary and uses the obtained alignment as the NMF activation to train the source speaker's dictionary without parallel corpora. The acquisition method is based on the INCA algorithm, which obtains an alignment of nonparallel corpora. In contrast to the INCA algorithm, the alignment is not restricted to observed samples, so the proposed method can efficiently utilize small nonparallel corpora. The results of subjective experiments show that the combination of the proposed algorithm and the INCA algorithm outperformed not only an INCA-based nonparallel framework but also CycleGAN-VC, which performs nonparallel VC without any additional training data. The results also indicate that a one-shot VC framework, which does not need to train source speakers, can be constructed on the basis of the proposed method.
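
    As a rough illustration of the NMF building block this framework relies on, the sketch below estimates activations for observed spectra against a fixed dictionary with multiplicative updates; dictionary sizes are placeholders, and the INmfCA/INCA alignment procedure itself is not reproduced.

    ```python
    # Sketch: NMF activations for spectra V against a fixed dictionary W,
    # using the Euclidean (Lee-Seung) multiplicative update. Conversion-style
    # use: activations estimated on the source side can drive a target
    # dictionary, so W_target @ H yields target-like spectra.
    import numpy as np

    def nmf_activations(V, W, n_iter=100, eps=1e-9):
        """V: (freq, frames) nonnegative spectra; W: (freq, bases) dictionary."""
        rng = np.random.default_rng(0)
        H = rng.uniform(0.1, 1.0, (W.shape[1], V.shape[1]))
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ (W @ H) + eps)   # multiplicative update
        return H                                      # (bases, frames), nonneg
    ```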

  • Analysis against Security Issues of Voice over 5G

    Hyungjin CHO  Seongmin PARK  Youngkwon PARK  Bomin CHOI  Dowon KIM  Kangbin YIM

    PAPER
    Publicized: 2021/07/13  Vol: E104-D No:11  Page(s): 1850-1856

    As competition to commercialize 5G mobile communication has intensified (as of February 2021), the 5G SA network and Vo5G are expected to be commercialized soon. 5G mobile communication aims to provide a 20 Gbps transmission speed, 20 times faster than 4G, connections for at least one million devices per km², and a 1 ms transmission delay, 10 times shorter than 4G. Meeting these goals required various technological developments, and technologies such as Massive MIMO (Multiple-Input and Multiple-Output), mmWave, and small-cell networks were developed and applied in the 5G access network. In the core network, however, the components constituting the LTE (Long Term Evolution) core network are reused as they are in the NSA (Non-Standalone) architecture, and changes have occurred only in the SA (Standalone) architecture. Likewise, in the network that provides voice service, the IMS (IP Multimedia Subsystem) infrastructure is still used in the SA architecture. The issue is that while 5G mobile communication is evolving openly to provide various services, these security elements remain in their previous form and are therefore vulnerable to various cyber-attacks. In this paper, we examine what the network standard for providing 5G voice service consists of and which of its problems are vulnerable in terms of security. We suggest possible attack scenarios exploiting these security issues, and we consider whether the problems can actually occur and what the countermeasures are.

  • Real-Time Full-Band Voice Conversion with Sub-Band Modeling and Data-Driven Phase Estimation of Spectral Differentials Open Access

    Takaaki SAEKI  Yuki SAITO  Shinnosuke TAKAMICHI  Hiroshi SARUWATARI

    PAPER-Speech and Hearing
    Publicized: 2021/04/16  Vol: E104-D No:7  Page(s): 1002-1016

    This paper proposes two high-fidelity and computationally efficient neural voice conversion (VC) methods based on direct waveform modification using spectral differentials. The conventional spectral-differential VC method with a minimum-phase filter achieves high-quality conversion for narrow-band (16 kHz-sampled) VC but incurs a heavy computational cost in filtering. This is because the minimum phase obtained using a fixed lifter of the Hilbert transform often results in a long-tap filter. Furthermore, when we extend the method to full-band (48 kHz-sampled) VC, the computational cost is heavy due to the increased number of sampling points, and the converted-speech quality degrades due to large fluctuations in the high-frequency band. To construct a short-tap filter, we propose a lifter-training method for data-driven phase reconstruction that trains a lifter of the Hilbert transform while taking filter truncation into account. We also propose a frequency-band-wise modeling method based on sub-band multi-rate signal processing (sub-band modeling method) for full-band VC. It enhances computational efficiency by reducing the number of sampling points of the signals converted with filtering, and it improves converted-speech quality by modeling only the low-frequency band. We conducted several objective and subjective evaluations to investigate the effectiveness of the proposed methods through an implementation of the real-time, online, full-band VC system we developed on the basis of the proposed methods. The results indicate that 1) the proposed lifter-training method for narrow-band VC can shorten the tap length to 1/16 without degrading converted-speech quality, 2) the proposed sub-band modeling method for full-band VC can improve converted-speech quality while reducing the computational cost, and 3) our real-time, online, full-band VC system can convert 48 kHz-sampled speech in real time, attaining converted speech with a mean opinion score for naturalness of 3.6 out of 5.0.
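
    For readers unfamiliar with the fixed-lifter baseline that the trainable lifter replaces, the sketch below reconstructs a minimum-phase impulse response from a log-magnitude response via the real cepstrum (the Hilbert-transform method). The input is assumed to be a log-magnitude differential sampled on a full, even-length FFT grid.

    ```python
    # Sketch: minimum-phase impulse response from a log-magnitude response via
    # the real cepstrum and the fixed minimum-phase lifter (Hilbert-transform
    # method). Input is assumed sampled on a full even-length FFT grid.
    import numpy as np

    def minimum_phase_filter(log_mag):
        n = len(log_mag)                      # even FFT size assumed
        cep = np.fft.ifft(log_mag).real       # real cepstrum of the response
        lifter = np.zeros(n)
        lifter[0] = 1.0
        lifter[1:n // 2] = 2.0                # fold anticausal part forward
        lifter[n // 2] = 1.0
        return np.fft.ifft(np.exp(np.fft.fft(cep * lifter))).real  # causal h[n]
    ```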

  • Development and Effectiveness Evaluation of Interactive Voice HMI System

    Chiharu KATAOKA  Osamu KUKIMOTO  Yuichiro YOSHIKAWA  Kohei OGAWA  Hiroshi ISHIGURO

    PAPER-Artificial Intelligence, Data Mining
    Publicized: 2021/01/13  Vol: E104-D No:4  Page(s): 500-507

    Connected services have been under development in the automotive industry, and the volume of predictive notifications that utilize travel-related data is increasing. There are concerns that drivers cannot process such an amount of information, or that they do not accept and follow predictive instructions straightforwardly because the information provided is only a prediction. In this work, an interactive voice system using two agents is proposed to realize notifications that drivers can easily accept, enhancing the perceived reliability of the system by adding contextual information. An experiment was performed using a driving simulator to compare the following three forms of notification: (1) notification with no contextual information, (2) notification with contextual information using one agent, and (3) notification with contextual information using two agents. The notification content was limited to probable near-miss incidents. The results of the experiment indicate that drivers may decelerate more with the one- and two-agent notification methods than with the conventional notification method. The degree of deceleration depended on the number of times the notification was provided and on whether there were cars parked on the streets.

  • Real-Time Distant Sound Source Suppression Using Spectral Phase Difference

    Kazuhiro MURAKAMI  Arata KAWAMURA  Yoh-ichi FUJISAKA  Nobuhiko HIRUMA  Youji IIGUNI

    PAPER-Engineering Acoustics
    Publicized: 2020/09/24  Vol: E104-A No:3  Page(s): 604-612

    In this paper, we propose a real-time blind source separation (BSS) system with two microphones that extracts only the desired sound sources. Under the assumption that the desired sound sources are close to the microphones, the proposed BSS system suppresses distant sound sources as undesired. We previously developed a BSS system that can estimate the distance from a microphone to a sound source and suppress distant sound sources, but it was not a real-time processing system. The proposed BSS system is a real-time version of our previous system, obtained by simplifying some of its BSS procedures. Simulation results showed that the proposed system can effectively suppress distant source signals in real time and has almost the same capability as the previous system.
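
    One plausible minimal reading of two-microphone distant-source suppression is a time-frequency mask driven by the inter-channel spectral phase difference, as sketched below. The STFT settings and threshold are illustrative assumptions, not the system's actual distance-estimation procedure.

    ```python
    # Sketch: two-channel time-frequency masking driven by the inter-channel
    # spectral phase difference; bins with a large phase difference are zeroed.
    # Threshold and STFT settings are illustrative assumptions.
    import numpy as np
    from scipy.signal import stft, istft

    def suppress_by_phase_difference(x1, x2, fs, thresh=0.5):
        _, _, X1 = stft(x1, fs, nperseg=512)
        _, _, X2 = stft(x2, fs, nperseg=512)
        dphi = np.angle(X1 * np.conj(X2))          # per-bin phase difference
        mask = (np.abs(dphi) < thresh).astype(float)
        _, y = istft(X1 * mask, fs, nperseg=512)
        return y
    ```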

  • Speech Chain VC: Linking Linguistic and Acoustic Levels via Latent Distinctive Features for RBM-Based Voice Conversion

    Takuya KISHIDA  Toru NAKASHIKA

    PAPER-Speech and Hearing
    Publicized: 2020/08/06  Vol: E103-D No:11  Page(s): 2340-2350

    This paper proposes a voice conversion (VC) method based on a model that links linguistic and acoustic representations via latent phonological distinctive features. Our method, called speech chain VC, is inspired by the concept of the speech chain, in which speech communication consists of a chain of events linking the speaker's brain with the listener's brain. We assume that speaker identity information, which appears at the acoustic level, is embedded in two steps: where phonological information is encoded into articulatory movements (linguistic to physiological) and where articulatory movements generate sound waves (physiological to acoustic). Speech chain VC represents these event links by using an adaptive restricted Boltzmann machine (ARBM) that introduces phoneme labels and acoustic features as two classes of visible units, with latent phonological distinctive features associated with articulatory movements as hidden units. Subjective evaluation experiments showed that the intelligibility of the converted speech improved significantly compared with the conventional ARBM-based method. The speaker-identity conversion quality of the proposed method was comparable to that of a Gaussian mixture model (GMM)-based method. Analyses of the hidden-layer representations of the speech chain VC model support the claim that some of the hidden units actually correspond to phonological distinctive features. The final part of this paper proposes approaches to achieving one-shot VC using the speech chain VC model. Subjective evaluation experiments showed that when the target speaker is the same gender as the source speaker, the proposed methods can achieve one-shot VC based on a single utterance from each of the source and target speakers.

  • Joint Adversarial Training of Speech Recognition and Synthesis Models for Many-to-One Voice Conversion Using Phonetic Posteriorgrams

    Yuki SAITO  Kei AKUZAWA  Kentaro TACHIBANA

    PAPER-Speech and Hearing
    Publicized: 2020/06/12  Vol: E103-D No:9  Page(s): 1978-1987

    This paper presents a method for many-to-one voice conversion (VC) using phonetic posteriorgrams (PPGs) based on adversarial training of deep neural networks (DNNs). A conventional method for many-to-one VC can learn a mapping function from input acoustic features to target acoustic features through separately trained DNN-based speech recognition and synthesis models. However, 1) the differences among speakers observed in PPGs and 2) an over-smoothing effect on the generated acoustic features degrade the converted speech quality. Our method performs domain-adversarial training of the recognition model to reduce the PPG differences. In addition, it incorporates a generative adversarial network into the training of the synthesis model to alleviate the over-smoothing effect. Unlike the conventional method, ours jointly trains the recognition and synthesis models so that they are optimized for many-to-one VC. Experimental evaluation demonstrates that the proposed method significantly improves the converted speech quality compared with conventional VC methods.
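
    The domain-adversarial ingredient can be sketched with a standard gradient-reversal layer: a speaker classifier is trained on the PPG-side features, and the reversed gradient pushes the recognition model toward speaker-invariant representations. PyTorch and the names below are assumptions; the paper's exact architecture is not reproduced.

    ```python
    # Sketch: gradient-reversal layer (PyTorch assumed). Minimizing the speaker
    # classifier's loss then maximizes speaker confusion in the upstream model.
    import torch

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_out):
            return -ctx.lam * grad_out, None       # flip the gradient's sign

    def grad_reverse(x, lam=1.0):
        return GradReverse.apply(x, lam)

    # Hypothetical usage: speaker_logits = speaker_clf(grad_reverse(ppg))
    ```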

  • Deep Learning Approaches for Pathological Voice Detection Using Heterogeneous Parameters

    JiYeoun LEE  Hee-Jin CHOI

    LETTER-Speech and Hearing
    Publicized: 2020/05/14  Vol: E103-D No:8  Page(s): 1920-1923

    We propose a deep learning-based model for classifying pathological voices using a convolutional neural network and a feedforward neural network. The model uses combinations of heterogeneous parameters, including mel-frequency cepstral coefficients, linear predictive cepstral coefficients, and higher-order statistics. We validate the model's accuracy using the Massachusetts Eye and Ear Infirmary (MEEI) voice disorder database and the Saarbruecken Voice Database (SVD). Our model achieved an accuracy of 99.3% for MEEI and 75.18% for SVD; this accuracy is 7.18% higher than that of competitive models in previous studies.
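
    A minimal sketch of such a heterogeneous front end is shown below: MFCCs plus LPC coefficients for one utterance (the LPC-to-LPCC conversion and the networks themselves are omitted). librosa and the 16 kHz sampling rate are assumptions, not details from the letter.

    ```python
    # Sketch: heterogeneous features for one utterance (librosa assumed).
    # MFCCs plus raw LPC coefficients; LPCC derivation and the CNN/FFNN
    # classifiers are omitted.
    import librosa

    def extract_features(path, n_mfcc=13, lpc_order=12):
        y, sr = librosa.load(path, sr=16000)                     # assumed rate
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
        lpc = librosa.lpc(y, order=lpc_order)                    # (order + 1,)
        return mfcc, lpc
    ```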

  • Tensor Factor Analysis for Arbitrary Speaker Conversion

    Daisuke SAITO  Nobuaki MINEMATSU  Keikichi HIROSE

    PAPER-Speech and Hearing
    Publicized: 2020/03/13  Vol: E103-D No:6  Page(s): 1395-1405

    This paper describes a novel approach to flexible control of speaker characteristics using a tensor representation of multiple Gaussian mixture models (GMMs). In voice conversion studies, realizing conversion from/to an arbitrary speaker's voice is an important objective. For this purpose, eigenvoice conversion (EVC) based on an eigenvoice GMM (EV-GMM) was proposed. In EVC, a speaker space is constructed from GMM supervectors, which are high-dimensional vectors derived by concatenating the mean vectors of each speaker GMM. In the speaker space, each speaker is represented by a small number of weight parameters of eigen-supervectors. In this paper, we revisit the construction of the speaker space by introducing a tensor factor analysis of the training data set. In our approach, each speaker is represented as a matrix whose rows and columns correspond to the dimensions of the mean vector and the Gaussian components, respectively. The speaker space is derived by a tensor factor analysis of the set of these matrices. Our approach solves an inherent problem of the supervector representation and improves the performance of voice conversion. In addition, the effects of speaker adaptive training before factorization are also investigated. Experimental results of one-to-many voice conversion demonstrate the effectiveness of the proposed approach.
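
    The representational shift can be sketched as follows: stack each speaker's (mean dimension × Gaussian component) matrix into a three-way tensor and factorize it, here with a Tucker/HOSVD-style step implemented via SVDs of the mode unfoldings. This is a generic illustration, not the paper's exact factor analysis.

    ```python
    # Sketch: Tucker/HOSVD-style factorization of the speaker tensor via SVDs
    # of its mode unfoldings. T: (speakers, mean_dim, components).
    import numpy as np

    def hosvd_factors(T, ranks):
        factors = []
        for mode, r in enumerate(ranks):
            unfold = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
            u, _, _ = np.linalg.svd(unfold, full_matrices=False)
            factors.append(u[:, :r])              # leading basis for this mode
        return factors                             # one factor matrix per mode
    ```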

  • Voice Conversion for Improving Perceived Likability of Uttered Speech

    Shinya HORIIKE  Masanori MORISE

    LETTER-Speech and Hearing
    Publicized: 2020/01/23  Vol: E103-D No:5  Page(s): 1199-1202

    To improve the likability of speech, we propose a voice conversion algorithm that controls the fundamental frequency (F0) and the spectral envelope, and we carry out a subjective evaluation in which the subjects can manipulate these two speech parameters. The results show that the subjects preferred speech with parameter settings associated with higher brightness.
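
    A minimal sketch of the two controls, assuming the WORLD vocoder (via the pyworld package) as the analysis/synthesis backend: scale F0 and warp the spectral envelope along frequency. The ratios and the warping scheme are illustrative, not the letter's settings.

    ```python
    # Sketch: F0 scaling and frequency-warping of the spectral envelope with
    # the WORLD vocoder (pyworld assumed); ratios are illustrative.
    import numpy as np
    import pyworld as pw

    def convert(x, fs, f0_ratio=1.1, formant_ratio=1.05):
        x = x.astype(np.float64)
        f0, sp, ap = pw.wav2world(x, fs)          # F0, envelope, aperiodicity
        n_bins = sp.shape[1]
        src = np.clip(np.arange(n_bins) / formant_ratio, 0, n_bins - 1)
        sp_warped = np.array([np.interp(src, np.arange(n_bins), frame)
                              for frame in sp])   # stretch envelope upward
        return pw.synthesize(f0 * f0_ratio, sp_warped, ap, fs)
    ```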

  • Generative Moment Matching Network-Based Neural Double-Tracking for Synthesized and Natural Singing Voices

    Hiroki TAMARU  Yuki SAITO  Shinnosuke TAKAMICHI  Tomoki KORIYAMA  Hiroshi SARUWATARI

    PAPER-Speech and Hearing
    Publicized: 2019/12/23  Vol: E103-D No:3  Page(s): 639-647

    This paper proposes a generative moment matching network (GMMN)-based post-filtering method for providing inter-utterance pitch variation to singing voices and discusses its application to our developed mixing method called neural double-tracking (NDT). When a human singer sings and records the same song twice, there is a difference between the two recordings. The difference, which is called inter-utterance variation, enriches the performer's musical expression and the audience's experience. For example, it makes every concert special because it never recurs in exactly the same manner. Inter-utterance variation enables a mixing method called double-tracking (DT). With DT, the same phrase is recorded twice, then the two recordings are mixed to give richness to singing voices. However, in synthesized singing voices, which are commonly used to create music, there is no inter-utterance variation because the synthesis process is deterministic. There is also no inter-utterance variation when only one voice is recorded. Although there is a signal processing-based method called artificial DT (ADT) to layer singing voices, the signal processing results in unnatural sound artifacts. To solve these problems, we propose a post-filtering method for randomly modulating synthesized or natural singing voices as if the singer sang again. The post-filter built with our method models the inter-utterance pitch variation of human singing voices using a conditional GMMN. Evaluation results indicate that 1) the proposed method provides perceptible and natural inter-utterance variation to synthesized singing voices and that 2) our NDT exhibits higher double-trackedness than ADT when applied to both synthesized and natural singing voices.

  • HMM-Based Maximum Likelihood Frame Alignment for Voice Conversion from a Nonparallel Corpus

    Ki-Seung LEE

    LETTER-Speech and Hearing
    Publicized: 2017/08/23  Vol: E100-D No:12  Page(s): 3064-3067

    One of the problems associated with voice conversion from a nonparallel corpus is how to find the best match or alignment between the source and the target vector sequences without linguistic information. In a previous study, alignment was achieved by minimizing the distance between the source vector and the transformed vector. This method, however, yielded a sequence of feature vectors that were not well matched with the underlying speaker model. In this letter, the vectors were selected from the candidates by maximizing the overall likelihood of the selected vectors with respect to the target model in the HMM context. Both objective and subjective evaluations were carried out using the CMU ARCTIC database to verify the effectiveness of the proposed method.
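
    The selection idea can be sketched as follows, with a fitted GMM standing in for the HMM for brevity: for each source frame, several nearest target frames are taken as candidates, and the candidate scoring highest under the target model is kept rather than simply the nearest one. scikit-learn and all names are assumptions.

    ```python
    # Sketch: likelihood-guided frame selection with a fitted GMM standing in
    # for the target HMM (scikit-learn assumed; names are placeholders).
    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.neighbors import NearestNeighbors

    def align(source, target, target_model: GaussianMixture, k=5):
        """Pick, per source frame, the best-scoring of k nearest target frames."""
        nn = NearestNeighbors(n_neighbors=k).fit(target)
        _, idx = nn.kneighbors(source)                 # k candidates per frame
        picked = [cands[np.argmax(target_model.score_samples(target[cands]))]
                  for cands in idx]
        return np.array(picked)                        # aligned target indices
    ```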

  • A Vibration Control Method of an Electrolarynx Based on Statistical F0 Pattern Prediction

    Kou TANAKA  Tomoki TODA  Satoshi NAKAMURA

    PAPER-Rehabilitation Engineering and Assistive Technology
    Publicized: 2017/05/23  Vol: E100-D No:9  Page(s): 2165-2173

    This paper presents a novel speaking-aid system to help laryngectomees produce more natural-sounding electrolaryngeal (EL) speech. An electrolarynx is an external device that generates excitation signals in place of the vibration of the vocal folds. Although conventional EL speech is quite intelligible, its naturalness suffers from the unnatural fundamental frequency (F0) patterns of the mechanically generated excitation signals. To improve the naturalness of EL speech, we have proposed EL speech enhancement methods using statistical F0 pattern prediction. In these methods, the original EL speech recorded by a microphone is presented from a loudspeaker after the speech enhancement is performed. These methods are effective in some situations, such as telecommunication, but they are not suitable for face-to-face conversation because listeners hear not only the enhanced EL speech but also the original EL speech. In this paper, to develop EL speech enhancement that is also effective for face-to-face conversation, we propose a method for directly controlling the F0 patterns of the excitation signals generated by the electrolarynx using statistical F0 prediction. To get an "actual feel" of the proposed system, we also implement a prototype. Using the prototype system, we identified latency issues caused by real-time processing. To address these latency issues, we further propose segmental continuous F0 pattern modeling and forthcoming F0 pattern modeling. Through evaluations by simulation, we demonstrate that our proposed system effectively addresses both the latency issues and the naturalness issues of the electrolarynx.

  • Voice Conversion Using Input-to-Output Highway Networks

    Yuki SAITO  Shinnosuke TAKAMICHI  Hiroshi SARUWATARI

    LETTER-Speech and Hearing
    Publicized: 2017/04/28  Vol: E100-D No:8  Page(s): 1925-1928

    This paper proposes deep neural network (DNN)-based voice conversion (VC) using input-to-output highway networks. VC is a speech synthesis technique that converts input features into output speech parameters, and DNN-based acoustic models for VC are used to estimate the output speech parameters from the input speech parameters. Given that the input and output are often in the same domain (e.g., cepstrum) in VC, this paper proposes VC using highway networks connected from the input to the output. The acoustic models predict weighted spectral differentials between the input and output spectral parameters. The architecture not only alleviates the over-smoothing effects that degrade speech quality but also effectively represents the characteristics of the spectral parameters. The experimental results demonstrate that the proposed architecture outperforms feed-forward neural networks in terms of the speech quality and speaker individuality of the converted speech.
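
    A minimal sketch of an input-to-output highway connection of this kind, in PyTorch with placeholder dimensions: the network predicts a spectral differential, and a learned gate controls how much of it is added to the input parameters.

    ```python
    # Sketch: gated input-to-output highway for VC (PyTorch assumed;
    # dimensions are placeholders).
    import torch
    import torch.nn as nn

    class HighwayVC(nn.Module):
        def __init__(self, dim=40, hidden=256):
            super().__init__()
            self.diff = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, dim))
            self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

        def forward(self, x):                     # x: input spectral parameters
            return x + self.gate(x) * self.diff(x)   # gated differential
    ```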

  • Robust Singing Transcription System Using Local Homogeneity in the Harmonic Structure

    Hoon HEO  Kyogu LEE

    PAPER-Music Information Processing
    Publicized: 2017/02/18  Vol: E100-D No:5  Page(s): 1114-1123

    Automatic music transcription from audio has long been one of the most intriguing problems and a challenge in the field of music information retrieval, because it requires a series of low-level tasks such as onset/offset detection and F0 estimation, followed by high-level post-processing for symbolic representation. In this paper, a comprehensive transcription system for monophonic singing voice based on harmonic structure analysis is proposed. Given a precise tracking of the fundamental frequency, a novel acoustic feature is derived to signify the harmonic structure in singing voice signals, regardless of the loudness and pitch. It is then used to generate a parametric mixture model based on the von Mises-Fisher distribution, so that the model represents the intrinsic harmonic structures within a region of smoothly connected notes. To identify the note boundaries, the local homogeneity in the harmonic structure is exploited by two different methods: the self-similarity analysis and hidden Markov model. The proposed system identifies the note attributes including the onset time, duration and note pitch. Evaluations are conducted from various aspects to verify the performance improvement of the proposed system and its robustness, using the latest evaluation methodology for singing transcription. The results show that the proposed system significantly outperforms other systems including the state-of-the-art systems.

  • An Improved Perceptual MBSS Noise Reduction with an SNR-Based VAD for a Fully Operational Digital Hearing Aid

    Zhaoyang GUO  Xin'an WANG  Bo WANG  Shanshan YONG

    PAPER-Speech and Hearing
    Publicized: 2017/02/17  Vol: E100-D No:5  Page(s): 1087-1096

    This paper first reviews state-of-the-art noise reduction methods and points out their weaknesses in noise reduction performance and speech quality, especially in low signal-to-noise ratio (SNR) environments. It then presents an improved perceptual multiband spectral subtraction (MBSS) noise reduction algorithm (NRA) and a novel robust voice activity detection (VAD) method based on the amended sub-band SNR. The proposed SNR-based VAD can considerably increase the accuracy of discrimination between noise and speech frames. Simulation results show that the proposed NRA achieves better segmental SNR (segSNR) and perceptual evaluation of speech quality (PESQ) performance than other noise reduction algorithms, especially in low SNR environments. In addition, a fully operational digital hearing aid chip based on the proposed NRA is designed and fabricated in a 0.13 µm CMOS process. The final chip implementation shows that the whole chip draws 1.3 mA at 1.2 V operation. Acoustic tests show that the maximum output sound pressure level (OSPL) is 114.6 dB SPL, the equivalent input noise is 5.9 dB SPL, and the total harmonic distortion is 2.5%. The proposed digital hearing aid chip is therefore a promising candidate for high-performance hearing-aid systems.
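
    A minimal sketch of plain multiband spectral subtraction (without the paper's perceptual weighting or SNR-based VAD): the noise floor is estimated from assumed noise-only leading frames and subtracted per band with band-wise over-subtraction factors. Band edges and factors are illustrative.

    ```python
    # Sketch: plain multiband spectral subtraction. Noise is estimated from
    # assumed noise-only leading frames; band edges and over-subtraction
    # factors are illustrative.
    import numpy as np
    from scipy.signal import stft, istft

    def mbss(x, fs, noise_frames=10, floor=0.02):
        f, _, X = stft(x, fs, nperseg=256)
        mag, phase = np.abs(X), np.angle(X)
        noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)
        edges = np.linspace(0, len(f), 5, dtype=int)       # four bands
        alphas = [5.0, 4.0, 3.0, 2.0]                      # stronger at low band
        clean = np.empty_like(mag)
        for (lo, hi), a in zip(zip(edges[:-1], edges[1:]), alphas):
            clean[lo:hi] = np.maximum(mag[lo:hi] - a * noise[lo:hi],
                                      floor * mag[lo:hi])  # spectral floor
        _, y = istft(clean * np.exp(1j * phase), fs, nperseg=256)
        return y
    ```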

  • Improvements of Voice Timbre Control Based on Perceived Age in Singing Voice Conversion

    Kazuhiro KOBAYASHI  Tomoki TODA  Tomoyasu NAKANO  Masataka GOTO  Satoshi NAKAMURA

    PAPER-Speech and Hearing
    Publicized: 2016/07/21  Vol: E99-D No:11  Page(s): 2767-2777

    As one of the techniques enabling individual singers to produce varieties of voice timbre beyond their own physical constraints, a statistical voice timbre control technique based on perceived age has been developed. In this technique, the perceived age of a singing voice, i.e., the age of the singer as perceived by the listener, is used as one of the intuitively understandable measures for describing the voice characteristics of the singing voice. The use of statistical voice conversion (SVC) with a singer-dependent multiple-regression Gaussian mixture model (MR-GMM), which effectively models the voice timbre variations caused by a change in perceived age, makes it possible for individual singers to manipulate the perceived ages of their own singing voices while retaining their own singer identities. However, several issues remain: e.g., 1) the controllable range of the perceived age is limited; 2) the quality of the converted singing voice is significantly degraded compared to that of a natural singing voice; and 3) each singer needs to sing the same phrase set as sung by a reference singer to develop the singer-dependent MR-GMM. To address these issues, we propose the following three methods: 1) a method using gender-dependent modeling to expand the controllable range of the perceived age; 2) a method using direct waveform modification based on spectrum differentials to improve the quality of the converted singing voice; and 3) a rapid unsupervised adaptation method based on maximum a posteriori (MAP) estimation to easily develop the singer-dependent MR-GMM. The experimental results show that the proposed methods achieve a wider controllable range of the perceived age, a significant quality improvement of the converted singing voice, and development of the singer-dependent MR-GMM using only a few arbitrary phrases as adaptation data.

  • Harmonic-Based Robust Voice Activity Detection for Enhanced Low SNR Noisy Speech Recognition System

    Po-Yi SHIH  Po-Chuan LIN  Jhing-Fa WANG

    PAPER-Speech and Hearing
    Vol: E99-A No:11  Page(s): 1928-1936

    This paper describes a novel harmonic-based robust voice activity detection (H-RVAD) method using a harmonic spectral local peak (HSLP) feature. HSLP is extracted by spectral amplitude analysis between adjacent formants, and this characteristic can be used to accurately identify and verify audio streams containing meaningful human speech in low-SNR environments. We also propose an enhanced low-SNR noisy speech recognition framework comprising a wakeup module, a speech recognition module, and a confirmation module. Within this framework, users can accept or reject the system's feedback when a recognition result is given, preventing voiced noise from misleading the recognition result. The H-RVAD method was evaluated on the AURORA2 corpus in eight types of noise and at three SNR levels, and it increased overall average performance by 4% to 20%. In home noise, the H-RVAD method improved the average sentence recognition rate by 4% to 14%.

  • Statistical Bandwidth Extension for Speech Synthesis Based on Gaussian Mixture Model with Sub-Band Basis Spectrum Model

    Yamato OHTANI  Masatsune TAMURA  Masahiro MORITA  Masami AKAMINE

    PAPER-Voice conversion
    Publicized: 2016/07/19  Vol: E99-D No:10  Page(s): 2481-2489

    This paper describes a novel statistical bandwidth extension (BWE) technique based on a Gaussian mixture model (GMM) and a sub-band basis spectrum model (SBM), in which each dimensional component represents a specific acoustic space in the frequency domain. The proposed method can achieve BWE from speech data with an arbitrary frequency bandwidth, whereas conventional methods perform the conversion from fixed narrow-band data. In the proposed method, we train a GMM with SBM parameters extracted from full-band spectra in advance. According to the bandwidth of the input signal, the trained GMM is reconstructed into a GMM of the joint probability density between the low-band and high-band SBM components. High-band SBM components are then estimated from the low-band SBM components of the input signal based on the reconstructed GMM. Finally, BWE is achieved by adding the spectra decoded from the estimated high-band SBM components to those of the input signal. To construct the full-band signal from the narrow-band one, we apply this method to log-amplitude spectra and aperiodic components. Objective and subjective evaluation results show that the proposed method extends the bandwidth of speech data robustly for log-amplitude spectra. Experimental results also indicate that the aperiodic component extracted from the upsampled narrow-band signal achieves the same performance as the restored and full-band aperiodic components in the proposed method.
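
    The regression step at the heart of such GMM-based estimation can be sketched as the posterior-weighted conditional mean of the high-band part given the low-band part under a joint GMM, as below. Full covariances and all shapes are assumptions, and SBM extraction is not shown.

    ```python
    # Sketch: conditional-mean regression under a joint GMM over [low; high]
    # features (full covariances assumed; all shapes are placeholders).
    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_conditional_mean(weights, means, covs, x, d):
        """E[high | low = x]; means: (K, D), covs: (K, D, D), x: (d,)."""
        post = np.array([w * multivariate_normal.pdf(x, m[:d], S[:d, :d])
                         for w, m, S in zip(weights, means, covs)])
        post /= post.sum()                     # responsibilities given x
        y = np.zeros(means.shape[1] - d)
        for p, m, S in zip(post, means, covs):
            # component conditional mean: mu_h + S_hl S_ll^{-1} (x - mu_l)
            y += p * (m[d:] + S[d:, :d] @ np.linalg.solve(S[:d, :d], x - m[:d]))
        return y
    ```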
