Isao ECHIZEN Noboru BABAGUCHI Junichi YAMAGISHI Naoko NITTA Yuta NAKASHIMA Kazuaki NAKAMURA Kazuhiro KONO Fuming FANG Seiko MYOJIN Zhenzhong KUANG Huy H. NGUYEN Ngoc-Dung T. TIEU
With the spread of high-performance sensors and social network services (SNS) and the remarkable advances in machine learning technologies, fake media such as fake videos, spoofed voices, and fake reviews that are generated using high-quality learning data and are very close to the real thing are causing serious social problems. We launched a research project, the Media Clone (MC) project, to protect receivers of replicas of real media called media clones (MCs) skillfully fabricated by means of media processing technologies. Our aim is to achieve a communication system that can defend against MC attacks and help ensure safe and reliable communication. This paper describes the results of research in two of the five themes in the MC project: 1) verification of the capability of generating various types of media clones such as audio, visual, and text derived from fake information and 2) realization of a protection shield against media clone attacks by recognizing them.
Zhenfei ZHAO Hao LUO Hua ZHONG Bian YANG Zhe-Ming LU
This letter proposes a mobile application framework named erasable photograph tagging (EPT) for photograph annotation and fast retrieval. The smartphone owner's voice is employed as tags and hidden in the host photograph, so that no extra feature database is needed for retrieval. These digitized tags can be erased at any time with no distortion remaining in the recovered photograph.
Roghayeh DOOST Abolghasem SAYADIAN Hossein SHAMSI
In this paper, SNR estimation is performed frame by frame during speech activity. For this purpose, the fourth-order moments of the real and imaginary parts of the frequency components are extracted separately for both speech and noise. For each noisy frame, these fourth-order moments are also estimated. Using the proposed formulas, the signal-to-noise ratio is estimated at each frequency index of the noisy frame. These formulas also predict the overall signal-to-noise ratio in each noisy frame. What distinguishes our method from conventional approaches is that it treats speech and noise identically. It estimates negative SNRs almost as well as positive SNRs.
In this paper, analysis and synthesis methods of emotional voice for a natural man-machine interface are developed. First, the emotional voice (neutral, anger, sadness, joy, dislike) is analyzed using a time-frequency representation of speech and similarity analysis. Then, based on the result of the emotional analysis, a voice with neutral emotion is transformed to synthesize a particular emotional voice using time-frequency modifications. In the simulations, five types of emotion are analyzed using 50 samples of speech signals. A high average discrimination rate is achieved in the similarity analysis. Further, the synthesized emotional voice is subjectively evaluated. It is confirmed that the emotional voice is naturally generated by the proposed time-frequency based approach.
Isao NAKANISHI Yuudai NAGATA Takenori ASAKURA Yoshio ITOH Yutaka FUKUI
A speech noise reduction system based on the frequency domain adaptive line enhancer using a windowed modified DFT (MDFT) pair is presented. The adaptive line enhancer (ALE) is effective for extracting sinusoidal signals corrupted by broadband noise. In addition, it utilizes only one microphone. Therefore, it is suitable for the realization of speech noise reduction in portable electronic devices. In the ALE, an input signal is generated by delaying a desired signal using the decorrelation parameter, which makes the noise in the input signal decorrelated with that in the desired one. In the present paper, we propose to set decorrelation parameters in the frequency domain and adjust them to optimal values according to the relationship between speech and noise. Such frequency domain decorrelation parameters enable the reduction of the computational complexity of the proposed system. Also, we introduce a window function into the MDFT to suppress spectral leakage. The performance of the proposed noise reduction system is examined through computer simulations.
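The delay-and-predict structure described in this abstract can be illustrated with a minimal sketch. It is not the paper's frequency-domain MDFT system: it is the classical time-domain ALE with a normalized LMS update, swapped in for illustration. The function name `ale_nlms` and the parameters `delay`, `taps`, `mu`, and `eps` are assumptions made for this example.

```python
def ale_nlms(x, delay=1, taps=16, mu=0.1, eps=1e-8):
    """Classical time-domain adaptive line enhancer (illustrative sketch).

    The desired signal is x[n]; the filter input is x delayed by `delay`
    samples (the decorrelation parameter). Broadband noise is roughly
    uncorrelated across the delay, so the filter can only predict the
    narrowband (sinusoidal) part, which appears at the output y.
    """
    w = [0.0] * taps          # adaptive filter weights
    y = [0.0] * len(x)        # enhanced output
    for n in range(len(x)):
        # Delayed reference vector: x[n-delay], x[n-delay-1], ...
        u = [x[n - delay - k] if n - delay - k >= 0 else 0.0
             for k in range(taps)]
        y[n] = sum(wk * uk for wk, uk in zip(w, u))
        e = x[n] - y[n]       # prediction error drives the adaptation
        norm = eps + sum(uk * uk for uk in u)
        # Normalized LMS weight update
        w = [wk + mu * e * uk / norm for wk, uk in zip(w, u)]
    return y
```

For a sinusoid in white noise, the output `y` should track the sinusoid more closely than the noisy input does once the weights have converged; how quickly that happens depends on the assumed `mu` and `taps`.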
Qi ZHU Noriyuki OHTSUKI Yoshikazu MIYANAGA Norinobu YOSHIDA
This paper proposes a new robust adaptive processing algorithm that is based on the extended least squares (ELS) method with running spectrum filtering (RSF). By utilizing the different characteristics of running spectra between speech signals and noise signals, RSF can retain speech characteristics while noise is effectively reduced. Then, by using ELS, autoregressive moving average (ARMA) parameters can be estimated accurately. In experiments on real speech contaminated by white Gaussian noise and factory noise, we found that the method we propose offered spectrum estimates that were robust against additive noise.
Liang DONG Say-Wei FOO Yong LIAN
The Hidden Markov Model (HMM) is a popular statistical framework for modeling and analyzing stochastic signals. In this paper, a novel strategy is proposed that makes use of the level-building algorithm with a chain of AdaBoost HMM classifiers to model long stochastic processes. The AdaBoost HMM classifier belongs to the class of multiple-HMM classifiers. It is specially trained to identify samples with erratic distributions. By connecting the AdaBoost HMM classifiers, processes of arbitrary length can be modeled. A probability trellis is created to store the accumulated probabilities, starting frames and indices of each reference model. By backtracking the trellis, a sequence of best-matched AdaBoost HMM classifiers can be decoded. The proposed method is applied to visual speech processing. A selected number of words and phrases are decomposed into sequences of visual speech units using both the proposed strategy and the conventional level-building on HMM method. Experimental results show that the proposed strategy is able to decompose words/phrases in visual speech more accurately than the conventional approach.
Eiichi TSUBOKA Yoshihiro TAKADA
This paper describes new modeling methods combining neural networks and hidden Markov models, applicable to modeling a time series such as a speech signal. The idea assumes that the sequence is nonstationary and is a nonlinear autoregressive process whose parameters are controlled by a hidden Markov chain. One is a model in which a nonlinear predictor composed of a multi-layered neural network is defined at each state; the other is a model in which a multi-layered neural network is defined so that the path from the input layer to the output layer is divided into path-groups, each of which corresponds to a state of the Markov chain. The latter is an extension of the former. The parameter estimation methods for these models are shown, and other previously proposed models (one called the Neural Prediction Model and another called the Linear Predictive HMM) are shown to be special cases of the NPHMM proposed here. The experimental results support the validity of the proposed models.
Alfredo M. MAEDA Hideto TOMABECHI Jun-ichi AOE
Graph unification is undoubtedly the most expensive process in unification-based grammar parsing, since it takes the vast majority of the total parsing time of natural language sentences. Much of this time is wasted: in general, no less than 60% of the graph unifications performed actually fail. Thus one way to speed up unification is to focus on an efficient, fast way to deal with such unification failures. In this paper, a process that runs prior to unification itself and is capable of filtering out a considerably high percentage of the graphs that would fail unification is proposed. This unification-filtering process consists of comparing signatures that correspond to each of the graphs to be unified. The unification-filter (hereafter UF) is capable of stopping around 87% of the non-unifiable graphs before unification itself takes place. UF takes significantly less time to detect and discard graphs that do not unify than unification itself would take to fail on the same graphs. As a result of using UF, unification is performed in around 71% of the time required by the fastest known unification algorithm.
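The idea of comparing signatures before attempting unification can be sketched as follows. The abstract does not say how the signatures are built, so everything here is an assumption made for illustration: feature structures are nested Python dicts, atomic values are strings, and a signature is simply the set of top-level features with atomic values. The names `signature` and `may_unify` are hypothetical.

```python
def signature(fs):
    """Hypothetical signature: top-level features whose values are atomic.

    Assumes a feature structure is a dict and atomic values are strings.
    """
    return {feat: val for feat, val in fs.items() if isinstance(val, str)}


def may_unify(sig_a, sig_b):
    """Cheap pre-check run before full unification.

    Returns False only when a shared feature carries two different atomic
    values, which proves that unification would fail. A True result is not
    a guarantee: deeper clashes are left for the real unifier to find.
    """
    return all(sig_b[feat] == val
               for feat, val in sig_a.items() if feat in sig_b)


# Usage: compute signatures once per graph, then filter candidate pairs.
np_sg = {"cat": "np", "agr": {"num": "sg"}}
vp = {"cat": "vp"}
print(may_unify(signature(np_sg), signature(vp)))  # False: "cat" clashes
```

A filter of this kind is sound but not complete: it never rejects a unifiable pair, and it lets some non-unifiable pairs through to the full unifier.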
Kazunori OZAWA Masahiro SERIZAWA Toshiki MIYANO Toshiyuki NOMURA Masao IKEKAWA Shin-ichi TAUMI
This paper presents the M-LCELP (Multi-mode Learned Code Excited LPC) speech coder, which has been developed for the next generation half-rate digital cellular telephone systems. M-LCELP develops the following techniques to achieve high-quality synthetic speech at 4kb/s with practically reasonable computation and memory requirements: (1) Multi-mode and multi-codebook coding to improve coding efficiency, (2) Pitch lag differential coding with pitch tracking to reduce lag transmission rate, (3) A two-stage joint design regular-pulse codebook with common phase structure in voiced frames, to drastically reduce computation and memory requirements, (4) An efficient vector quantization for LSP parameters, (5) An adaptive MA type comb filter to suppress excitation signal inter-harmonic noise. The MOS subjective test results demonstrate that 4.075kb/s M-LCELP synthetic speech quality is nearly equivalent to that of a North American full-rate standard VSELP coder. The M-LCELP codec requires 18 MOPS of computation. The codec has been implemented using 2 floating-point DSP chips.
Andreas SPANIAS Philipos LOIZOU Gim LIM Ye CHEN Gen HU
A speech analysis/synthesis system that relies on a time-varying Auto Regressive Moving Average (ARMA) process and the Short-Time Fourier Transform (STFT) is proposed. The narrowband components in speech are represented in the frequency domain by a set of harmonic components, while the broadband random components are represented by a time-varying ARMA process. The time-varying ARMA model has a dual function: it creates a spectral envelope that accurately fits the harmonic STFT components, and it provides the spectral representation of the broadband components of speech. The proposed model essentially combines the features of waveform coders by employing the STFT and the features of traditional vocoders by incorporating an appropriately shaped noise sequence.
This paper describes a text-independent speaker recognition method using predictive neural networks. For text-independent speaker recognition, an ergodic model which allows transitions to any other state, including self-transitions, is adopted as the speaker model and one predictive neural network is assigned to each state. The proposed method was compared to quantization distortion based methods, HMM based methods, and a discriminative neural network based method through text-independent speaker identification experiments on 24 female speakers. The proposed method gave the highest identification rate of 100.0%, and the effectiveness of predictive neural networks for representing speaker individuality was clarified.
Hiroshi HAMADA Satoshi MIKI Ryohei NAKATSU
A new method is proposed for automatically evaluating the English pronunciation quality of non-native speakers. It is assumed that pronunciation can be rated using three criteria: the static characteristics of phonetic spectra, the dynamic structure of spectrum sequences, and the prosodic characteristics of utterances. The evaluation uses speech recognition techniques to compare the English words pronounced by a non-native speaker with those pronounced by a native speaker. Three evaluation measures are proposed to rate pronunciation quality. (1) The standard deviation of the mapping vectors, which map the codebook vectors of the non-native speaker onto the vector space of the native speaker, is used to evaluate the static phonetic spectra characteristics. (2) The spectral distance between words pronounced by the non-native speaker and those pronounced by the native speaker obtained by the DTW method is used to evaluate the dynamic characteristics of spectral sequences. (3) The differences in fundamental frequency and speech power between the pronunciation of the native and non-native speaker are used as the criteria for evaluating prosodic characteristics. Evaluation experiments are carried out using 441 words spoken by 10 Japanese speakers and 10 native speakers. One half of the 441 words was used to evaluate static phonetic spectra characteristics, and the other half was used to evaluate the dynamic characteristics of spectral sequences, as well as the prosodic characteristics. Based on the experimental results, the correlation between the evaluation scores and the scores determined by human judgement is found to be 0.90.
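Measure (2) in the abstract above relies on dynamic time warping (DTW) to compare two utterances of different lengths. The following is a minimal sketch of the textbook DTW recurrence, not the authors' implementation. It assumes each utterance is a sequence of equal-dimension feature vectors, uses Euclidean distance as the local cost, and applies no slope constraints or path-length normalization; the name `dtw_distance` is an assumption.

```python
import math


def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors.

    a, b: sequences of equal-dimension numeric tuples (e.g. spectral frames).
    Local cost is the Euclidean distance; allowed steps are match,
    insertion, and deletion (the basic symmetric recurrence).
    """
    n, m = len(a), len(b)
    # D[i][j] = best accumulated cost aligning a[:i] with b[:j]
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # a advances alone
                                 D[i][j - 1],      # b advances alone
                                 D[i - 1][j - 1])  # both advance
    return D[n][m]
```

With this recurrence, a sequence and a time-stretched copy of it (one frame repeated) have distance zero, which is the property that lets utterances spoken at different speeds be compared frame by frame.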
Hiroaki HATTORI Shigeki SAGAYAMA
This paper describes a new supervised speaker adaptation method based on vector field smoothing, for small size adaptation data. This method assumes that the correspondence of feature vectors between speakers can be viewed as a kind of smooth vector field, and interpolation and smoothing of the correspondence are introduced into the adaptation process for higher adaptation performance with small size data. The proposed adaptation method was applied to discrete HMM based speech recognition and evaluated in Japanese phoneme and phrase recognition experiments. Using 10 words as the adaptation data, the proposed method produced almost the same results as the conventional codebook mapping method with 25 words. These experiments clearly confirmed the effectiveness of the proposed method.
Andreas S. SPANIAS Frank H. WU
The objective of this paper is to provide an overview of the recent developments in the area of speech processing and in particular in the fields of speech coding and speech recognition. The speech coding review covers DPCM coders, model-based vocoders, waveform coders, and hybrid coders. The hybrid coders are described in some detail since they are the subject of current research. Our treatment of speech recognition techniques concentrates on the methodologies for voice recognition and the progress made in speaker independent recognition. In addition, we describe the efforts towards commercial deployment of this technology.