IEICE global.ieice.org Site

Keyword Search Result

[Keyword] speech processing (15 hits)

Generation and Detection of Media Clones (Open Access)
Isao ECHIZEN Noboru BABAGUCHI Junichi YAMAGISHI Naoko NITTA Yuta NAKASHIMA Kazuaki NAKAMURA Kazuhiro KONO Fuming FANG Seiko MYOJIN Zhenzhong KUANG Huy H. NGUYEN Ngoc-Dung T. TIEU

INVITED PAPER

Publicized: 2020/10/19
Vol: E104-D No: 1
Page(s): 12-23
With the spread of high-performance sensors and social network services (SNS) and the remarkable advances in machine learning technologies, fake media such as fake videos, spoofed voices, and fake reviews that are generated using high-quality learning data and are very close to the real thing are causing serious social problems. We launched a research project, the Media Clone (MC) project, to protect receivers of replicas of real media called media clones (MCs) skillfully fabricated by means of media processing technologies. Our aim is to achieve a communication system that can defend against MC attacks and help ensure safe and reliable communication. This paper describes the results of research in two of the five themes in the MC project: 1) verification of the capability of generating various types of media clones such as audio, visual, and text derived from fake information and 2) realization of a protective shield against media clone attacks by recognizing them.
Erasable Photograph Tagging: A Mobile Application Framework Employing Owner's Voice
Zhenfei ZHAO Hao LUO Hua ZHONG Bian YANG Zhe-Ming LU

LETTER-Speech and Hearing

Vol: E97-D No: 2
Page(s): 370-372
This letter proposes a mobile application framework named erasable photograph tagging (EPT) for photograph annotation and fast retrieval. The smartphone owner's voice is employed as tags and hidden in the host photograph, so that retrieval needs no extra feature database. These digitized tags can be erased at any time with no distortion remaining in the recovered photograph.
A New Method for Low SNR Estimation of Noisy Speech Signals Using Fourth-Order Moments
Roghayeh DOOST Abolghasem SAYADIAN Hossein SHAMSI

PAPER-Speech and Hearing

Vol: E93-D No: 6
Page(s): 1599-1607
In this paper, SNR estimation is performed frame by frame during speech activity. For this purpose, the fourth-order moments of the real and imaginary parts of the frequency components are extracted separately for the speech and the noise. For each noisy frame, these fourth-order moments are also estimated. Using the proposed formulas, the signal-to-noise ratio is estimated at each frequency index of the noisy frame. These formulas also predict the overall signal-to-noise ratio in each noisy frame. What distinguishes our method from conventional approaches is that it treats the speech and the noise identically. It estimates negative SNRs almost as well as positive SNRs.
Analysis and Synthesis of Emotional Voice Based on Time-Frequency Pitch Distributions
Mamoru KOBAYASHI Shigeo WADA

PAPER

Vol: E89-A No: 8
Page(s): 2100-2106
In this paper, analysis and synthesis methods of emotional voice for a natural man-machine interface are developed. First, the emotional voice (neutral, anger, sadness, joy, dislike) is analyzed using a time-frequency representation of speech and similarity analysis. Then, based on the result of the emotional analysis, a voice with neutral emotion is transformed to synthesize a particular emotional voice using time-frequency modifications. In the simulations, five types of emotion are analyzed using 50 samples of speech signals. A high average discrimination rate is achieved in the similarity analysis. Further, the synthesized emotional voice is subjectively evaluated. It is confirmed that the emotional voice is naturally generated by the proposed time-frequency based approach.
Speech Noise Reduction System Based on Frequency Domain ALE Using Windowed Modified DFT Pair
Isao NAKANISHI Yuudai NAGATA Takenori ASAKURA Yoshio ITOH Yutaka FUKUI

PAPER

Vol: E89-A No: 4
Page(s): 950-959
A speech noise reduction system based on the frequency domain adaptive line enhancer using a windowed modified DFT (MDFT) pair is presented. The adaptive line enhancer (ALE) is effective for extracting sinusoidal signals obscured by broadband noise. In addition, it uses only one microphone. Therefore, it is suitable for realizing speech noise reduction in portable electronic devices. In the ALE, an input signal is generated by delaying a desired signal using the decorrelation parameter, which decorrelates the noise in the input signal from that in the desired one. In the present paper, we propose to set decorrelation parameters in the frequency domain and adjust them to optimal values according to the relationship between speech and noise. Such frequency domain decorrelation parameters reduce the computational complexity of the proposed system. Also, we introduce a window function into the MDFT to suppress spectral leakage. The performance of the proposed noise reduction system is examined through computer simulations.
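The delay-and-adapt structure of an ALE can be illustrated with a short time-domain sketch. This is not the frequency-domain MDFT system of the paper: it assumes a single fixed decorrelation delay, a normalized LMS (NLMS) weight update, and illustrative values for `taps` and `mu`. The function name `adaptive_line_enhancer` and the test signal are hypothetical.

```python
import math
import random

def adaptive_line_enhancer(x, delay=1, taps=16, mu=0.05, eps=1e-8):
    """Time-domain ALE sketch with an NLMS update (an assumption, not the paper's method).

    The desired signal is x[n]; the filter input is x delayed by `delay` samples.
    Returns the filter output y, which retains the predictable (sinusoidal) part.
    """
    w = [0.0] * taps
    y = [0.0] * len(x)
    for n in range(len(x)):
        # Delayed input vector: x[n-delay], x[n-delay-1], ... (zero before the start)
        u = [x[n - delay - k] if n - delay - k >= 0 else 0.0 for k in range(taps)]
        y[n] = sum(wk * uk for wk, uk in zip(w, u))
        e = x[n] - y[n]                      # prediction error (mostly broadband noise)
        norm = eps + sum(uk * uk for uk in u)
        w = [wk + mu * e * uk / norm for wk, uk in zip(w, u)]
    return y

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

# Hypothetical test signal: one sinusoid in white Gaussian noise.
random.seed(0)
N = 4000
clean = [math.sin(2 * math.pi * 0.05 * n) for n in range(N)]
noisy = [c + random.gauss(0.0, 0.5) for c in clean]
enhanced = adaptive_line_enhancer(noisy, delay=1, taps=32, mu=0.05)
```

Because white noise is uncorrelated across the delay, the filter can only predict the sinusoid, so after convergence `enhanced` should track `clean` more closely than `noisy` does.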
Noise-Robust Speech Analysis Using Running Spectrum Filtering
Qi ZHU Noriyuki OHTSUKI Yoshikazu MIYANAGA Norinobu YOSHIDA

PAPER-Speech and Hearing

Vol: E88-A No: 2
Page(s): 541-548
This paper proposes a new robust adaptive processing algorithm that is based on the extended least squares (ELS) method with running spectrum filtering (RSF). By utilizing the different characteristics of running spectra between speech signals and noise signals, RSF can retain speech characteristics while noise is effectively reduced. Then, by using ELS, autoregressive moving average (ARMA) parameters can be estimated accurately. In experiments on real speech contaminated by white Gaussian noise and factory noise, we found that the method we propose offered spectrum estimates that were robust against additive noise.
Level-Building on AdaBoost HMM Classifiers and the Application to Visual Speech Processing
Liang DONG Say-Wei FOO Yong LIAN

PAPER-Speech and Hearing

Vol: E87-D No: 11
Page(s): 2460-2471
The Hidden Markov Model (HMM) is a popular statistical framework for modeling and analyzing stochastic signals. In this paper, a novel strategy is proposed that makes use of a level-building algorithm with a chain of AdaBoost HMM classifiers to model long stochastic processes. The AdaBoost HMM classifier belongs to the class of multiple-HMM classifiers. It is specially trained to identify samples with erratic distributions. By connecting the AdaBoost HMM classifiers, processes of arbitrary length can be modeled. A probability trellis is created to store the accumulated probabilities, starting frames and indices of each reference model. By backtracking the trellis, a sequence of best-matched AdaBoost HMM classifiers can be decoded. The proposed method is applied to visual speech processing. A selected number of words and phrases are decomposed into sequences of visual speech units using both the proposed strategy and the conventional level-building on HMM method. Experimental results show that the proposed strategy is able to decompose words/phrases in visual speech more accurately than the conventional approach.
Neural Predictive Hidden Markov Model for Speech Recognition
Eiichi TSUBOKA Yoshihiro TAKADA

PAPER

Vol: E78-D No: 6
Page(s): 676-684
This paper describes new modeling methods combining a neural network and a hidden Markov model, applicable to modeling a time series such as a speech signal. The idea assumes that the sequence is nonstationary and is a nonlinear autoregressive process whose parameters are controlled by a hidden Markov chain. One is a model in which a nonlinear predictor composed of a multi-layered neural network is defined at each state; the other is a model in which a multi-layered neural network is defined so that the path from the input layer to the output layer is divided into path-groups, each of which corresponds to a state of the Markov chain. The latter is an extended model of the former. The parameter estimation methods for these models are shown, and other previously proposed models (one called the Neural Prediction Model and another called the Linear Predictive HMM) are shown to be special cases of the NPHMM proposed here. The experimental result supports the validity of these proposed models.
Unification-Failure Filter for Natural Language
Alfredo M. MAEDA Hideto TOMABECHI Jun-ichi AOE

PAPER-Software Systems

Vol: E78-D No: 1
Page(s): 19-26
Graph unification is undoubtedly the most expensive process in unification-based grammar parsing, since it takes the vast majority of the total parsing time of natural language sentences. Much of this overhead arises because, in general, no less than 60% of the graph unifications performed actually fail. Thus one way to speed up unification is to focus on an efficient, fast way to deal with such unification failures. In this paper, a process is proposed that runs prior to unification itself and is capable of filtering out a considerably high percentage of the graphs that would fail unification. This unification-filtering process consists of comparing signatures that correspond to each of the graphs to be unified. The unification-filter (hereafter UF) is capable of stopping around 87% of the non-unifiable graphs before unification itself takes place. UF takes significantly less time to detect and discard graphs that do not unify than unification itself would take to fail on the same graphs. As a result of using UF, unification is performed in around 71% of the time required by the fastest known unification algorithm.
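The idea of rejecting doomed pairs before running unification can be sketched as follows. The abstract does not say how the signatures are built, so the design here is purely hypothetical: feature structures are nested dictionaries with string atoms, and a signature is just the set of top-level features that carry atomic values. All function names are illustrative.

```python
def signature(fs):
    """Hypothetical signature: top-level features whose values are atoms (strings)."""
    return {feat: val for feat, val in fs.items() if isinstance(val, str)}

def may_unify(sig_a, sig_b):
    """Return False only when a shared feature carries two different atoms.

    A False result means unification would certainly fail; True means 'unknown'.
    """
    return all(sig_b[f] == v for f, v in sig_a.items() if f in sig_b)

def unify(a, b):
    """Plain recursive unification of nested dicts; returns None on failure."""
    if isinstance(a, str) or isinstance(b, str):
        return a if a == b else None
    out = dict(a)
    for feat, val in b.items():
        if feat in out:
            merged = unify(out[feat], val)
            if merged is None:
                return None
            out[feat] = merged
        else:
            out[feat] = val
    return out

def filtered_unify(a, b):
    """Run the cheap signature check first, then full unification if needed."""
    if not may_unify(signature(a), signature(b)):
        return None          # rejected by the filter, no traversal needed
    return unify(a, b)

# Hypothetical feature structures.
np_fs = {"cat": "NP", "agr": {"num": "sg"}}
vp_fs = {"cat": "VP"}
```

In a real parser the signatures would be computed once per graph and reused across many unification attempts, which is where a filter of this kind saves time. This sketch only catches clashes visible at the top level; deeper clashes still fall through to `unify`.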
M-LCELP Speech Coding at 4kb/s with Multi-Mode and Multi-Codebook
Kazunori OZAWA Masahiro SERIZAWA Toshiki MIYANO Toshiyuki NOMURA Masao IKEKAWA Shin-ichi TAUMI

PAPER

Vol: E77-B No: 9
Page(s): 1114-1121
This paper presents the M-LCELP (Multi-mode Learned Code Excited LPC) speech coder, which has been developed for the next generation half-rate digital cellular telephone systems. M-LCELP develops the following techniques to achieve high-quality synthetic speech at 4kb/s with practically reasonable computation and memory requirements: (1) Multi-mode and multi-codebook coding to improve coding efficiency, (2) Pitch lag differential coding with pitch tracking to reduce lag transmission rate, (3) A two-stage joint design regular-pulse codebook with common phase structure in voiced frames, to drastically reduce computation and memory requirements, (4) An efficient vector quantization for LSP parameters, (5) An adaptive MA type comb filter to suppress excitation signal inter-harmonic noise. The MOS subjective test results demonstrate that 4.075kb/s M-LCELP synthetic speech quality is mostly equivalent to that for a North American full-rate standard VSELP coder. The M-LCELP codec requires 18 MOPS of computation. The codec has been implemented using 2 floating-point DSP chips.
Analysis/Synthesis of Speech Using the Short-Time Fourier Transform and a Time-Varying ARMA Process
Andreas SPANIAS Philipos LOIZOU Gim LIM Ye CHEN Gen HU

PAPER-Speech

Vol: E76-A No: 4
Page(s): 645-652
A speech analysis/synthesis system that relies on a time-varying Auto Regressive Moving Average (ARMA) process and the Short-Time Fourier Transform (STFT) is proposed. The narrowband components in speech are represented in the frequency domain by a set of harmonic components, while the broadband random components are represented by a time-varying ARMA process. The time-varying ARMA model has a dual function: it creates a spectral envelope that accurately fits the harmonic STFT components, and it provides the spectral representation of the broadband components of speech. The proposed model essentially combines the features of waveform coders by employing the STFT and the features of traditional vocoders by incorporating an appropriately shaped noise sequence.
Text-Independent Speaker Recognition Using Neural Networks
Hiroaki HATTORI

PAPER-Speech Processing

Vol: E76-D No: 3
Page(s): 345-351
This paper describes a text-independent speaker recognition method using predictive neural networks. For text-independent speaker recognition, an ergodic model which allows transitions to any other state, including self-transitions, is adopted as the speaker model and one predictive neural network is assigned to each state. The proposed method was compared to quantization distortion based methods, HMM based methods, and a discriminative neural network based method through text-independent speaker identification experiments on 24 female speakers. The proposed method gave the highest identification rate of 100.0%, and the effectiveness of predictive neural networks for representing speaker individuality was clarified.
Automatic Evaluation of English Pronunciation Based on Speech Recognition Techniques
Hiroshi HAMADA Satoshi MIKI Ryohei NAKATSU

PAPER-Speech Processing

Vol: E76-D No: 3
Page(s): 352-359
A new method is proposed for automatically evaluating the English pronunciation quality of non-native speakers. It is assumed that pronunciation can be rated using three criteria: the static characteristics of phonetic spectra, the dynamic structure of spectrum sequences, and the prosodic characteristics of utterances. The evaluation uses speech recognition techniques to compare the English words pronounced by a non-native speaker with those pronounced by a native speaker. Three evaluation measures are proposed to rate pronunciation quality. (1) The standard deviation of the mapping vectors, which map the codebook vectors of the non-native speaker onto the vector space of the native speaker, is used to evaluate the static phonetic spectra characteristics. (2) The spectral distance between words pronounced by the non-native speaker and those pronounced by the native speaker obtained by the DTW method is used to evaluate the dynamic characteristics of spectral sequences. (3) The differences in fundamental frequency and speech power between the pronunciation of the native and non-native speaker are used as the criteria for evaluating prosodic characteristics. Evaluation experiments are carried out using 441 words spoken by 10 Japanese speakers and 10 native speakers. One half of the 441 words was used to evaluate static phonetic spectra characteristics, and the other half was used to evaluate the dynamic characteristics of spectral sequences, as well as the prosodic characteristics. Based on the experimental results, the correlation between the evaluation scores and the scores determined by human judgement is found to be 0.90.
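Measure (2) above relies on a DTW-based spectral distance between two utterances of the same word. A minimal sketch of the classic DTW recursion follows. It assumes each utterance is a list of per-frame feature tuples, a Euclidean local distance, and the basic symmetric step pattern; the abstract does not specify any of these, and `dtw_distance` is a hypothetical name.

```python
import math

def dtw_distance(a, b):
    """Accumulated DTW distance between two sequences of feature vectors.

    a, b: lists of equal-dimension tuples (one tuple per frame).
    Local cost is Euclidean; allowed steps are insertion, deletion, and match.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] = cost of the best alignment of a[:i] with b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Hypothetical one-dimensional "spectra": the second is a time-stretched copy.
native = [(0.0,), (1.0,), (2.0,)]
learner = [(0.0,), (1.0,), (1.0,), (2.0,)]
```

Here `dtw_distance(native, learner)` is zero because the warping absorbs the repeated frame. A practical scorer would typically normalize the accumulated distance by the path length before comparing words of different durations.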
Speaker Adaptation Based on Vector Field Smoothing
Hiroaki HATTORI Shigeki SAGAYAMA

PAPER-Speech Processing

Vol: E76-D No: 2
Page(s): 227-234
This paper describes a new supervised speaker adaptation method based on vector field smoothing, for small size adaptation data. This method assumes that the correspondence of feature vectors between speakers can be viewed as a kind of smooth vector field, and interpolation and smoothing of the correspondence are introduced into the adaptation process for higher adaptation performance with small size data. The proposed adaptation method was applied to discrete HMM based speech recognition and evaluated in Japanese phoneme and phrase recognition experiments. Using 10 words as the adaptation data, the proposed method produced almost the same results as the conventional codebook mapping method with 25 words. These experiments clearly confirmed the effectiveness of the proposed method.
Speech Coding and Recognition: A Review
Andreas S. SPANIAS Frank H. WU

PAPER

Vol: E75-A No: 2
Page(s): 132-148
The objective of this paper is to provide an overview of the recent developments in the area of speech processing and in particular in the fields of speech coding and speech recognition. The speech coding review covers DPCM coders, model-based vocoders, waveform coders, and hybrid coders. The hybrid coders are described in some detail since they are the subject of current research. Our treatment of speech recognition techniques concentrates on the methodologies for voice recognition and the progress made in speaker independent recognition. In addition, we describe the efforts towards commercial deployment of this technology.
