
Keyword Search Result

[Keyword] SPE (2504 hits)

Showing 1681-1700 of 2504 hits

  • Intra-Channel Nonlinearities and Dispersion-Management in Highly Dispersed Transmission

    Sang-Gyu PARK  Je-Myung JEONG  

     
    PAPER-Fiber-Optic Transmission

    Vol: E86-B No:4  Page(s): 1205-1211

    This study presents a detailed numerical investigation of the relations between the performance of RZ-format single-channel transmission and the chromatic dispersion of the transmission fiber and the pre-compensation ratio. We observed the transition from the SPM-dominant low-dispersion region to the high-dispersion region dominated by intra-channel nonlinearities, and found that the EOP is very sensitive to the pre-compensation ratio when the dispersion assumes an intermediate value. Furthermore, by analyzing the optical-power dependence of the EOP and other nonlinear impairments, we found that the amplitude fluctuation resulting from IFWM is dominant in determining the EOP in transmission systems employing highly dispersed pulses.
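
    As a rough, self-contained illustration of the pre-compensation ratio studied here (not the authors' simulation, which would require a full nonlinear split-step propagation model), the sketch below applies a fraction r of the span's accumulated dispersion as a quadratic spectral phase ahead of the line; the pulse shape and fiber parameters are assumed values.

        import numpy as np

        def precompensate(pulse, dt, beta2, length, r):
            # apply -r * (span GVD) as a quadratic spectral phase (pre-compensation)
            n = len(pulse)
            omega = 2 * np.pi * np.fft.fftfreq(n, d=dt)   # angular frequency grid
            phase = 0.5 * beta2 * (-r * length) * omega ** 2
            return np.fft.ifft(np.fft.fft(pulse) * np.exp(1j * phase))

        t = np.arange(-512, 512) * 0.1                    # time grid, ps
        pulse = np.exp(-t ** 2 / (2 * 5.0 ** 2))          # Gaussian RZ-like pulse
        pre = precompensate(pulse, dt=0.1, beta2=-21.7, length=80.0, r=0.5)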

  • Robust Model for Speaker Verification against Session-Dependent Utterance Variation

    Tomoko MATSUI  Kiyoaki AIKAWA  

     
    PAPER-Speech and Hearing

    Vol: E86-D No:4  Page(s): 712-718

    This paper investigates a new method for creating robust speaker models to cope with inter-session variation of a speaker in a continuous HMM-based speaker verification system. The new method estimates session-independent parameters by decomposing inter-session variation into two distinct parts: session-dependent and session-independent. The parameters of the speaker models are estimated using the speaker adaptive training algorithm in conjunction with the equalization of session-dependent variation. The resultant models capture session-independent speaker characteristics more reliably than conventional models, and their discriminative power improves accordingly. Moreover, we have made our models more invariant to handset variation in a public switched telephone network (PSTN) by treating session-dependent variation and handset-dependent distortion separately. Text-independent speech data recorded by 20 speakers in seven sessions over 16 months was used to evaluate the new approach. The proposed method reduces the error rate by a relative 15%. Compared with the popular cepstral mean normalization, the error rate is reduced by a relative 24% when the speaker models are recreated using speech data recorded in four or more sessions.
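
    For reference, the cepstral mean normalization baseline mentioned above fits in a few lines: subtracting the per-utterance cepstral mean removes stationary channel and handset effects, the same session-dependent component that the proposed method models more finely. The feature dimensions below are illustrative.

        import numpy as np

        def cmn(cepstra):
            # cepstra: (frames, dims) cepstral features for one utterance/session
            return cepstra - cepstra.mean(axis=0, keepdims=True)

        utterance = np.random.randn(200, 13)   # hypothetical 13-dim cepstral features
        normalized = cmn(utterance)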

  • Cancellation of Narrowband Interference in GPS Receivers Using NDEKF-Based Recurrent Neural Network Predictors

    Wei-Lung MAO  Hen-Wai TSAO  Fan-Ren CHANG  

     
    LETTER-Spread Spectrum Technologies and Applications

    Vol: E86-A No:4  Page(s): 954-960

    GPS receivers are susceptible to jamming by interference. This paper proposes a recurrent neural network (RNN) predictor for a new application in GPS anti-jamming systems. Five types of narrowband jammers, i.e., an AR process, continuous wave interference (CWI), multi-tone CWI, swept CWI, and pulsed CWI, are considered in order to emulate realistic conditions. As the observation noise of the received signals is highly non-Gaussian, an RNN estimator with a nonlinear structure is employed to accurately predict the narrowband signals based on a real-time learning method. The node-decoupled extended Kalman filter (NDEKF) algorithm is adopted to achieve better performance in terms of convergence rate and quality of solution while requiring less computation time and memory. We analyze the computational complexity and memory requirements of the NDEKF approach and compare them to those of the global extended Kalman filter (GEKF) training paradigm. Simulation results show that our proposed scheme achieves superior performance over conventional linear/nonlinear predictors in terms of SNR improvement and mean squared prediction error (MSPE) while providing inherent protection against a broad class of interference environments.
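
    The conventional linear predictor that the RNN is compared against can be sketched as a one-step LMS predictor: a narrowband jammer is predictable from its recent past, while the wideband GPS spreading signal is not, so the prediction residual suppresses the jammer. This is a hedged stand-in for the paper's NDEKF-trained RNN; the filter order and step size are assumed.

        import numpy as np

        def lms_excision(x, order=8, mu=0.01):
            w = np.zeros(order)
            out = np.zeros_like(x)
            for n in range(order, len(x)):
                past = x[n - order:n][::-1]   # most recent samples first
                pred = w @ past               # predicted narrowband component
                err = x[n] - pred             # residual ~ GPS signal + noise
                w += mu * err * past          # LMS weight update
                out[n] = err
            return out

        t = np.arange(4000)
        rx = 0.1 * np.random.randn(4000) + np.sin(0.2 * np.pi * t)  # signal + CWI
        cleaned = lms_excision(rx)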

  • Signal Integrity Design and Analysis for a 400 MHz RISC Microcontroller

    Akira YAMADA  Yasuhiro NUNOMURA  Hiroaki SUZUKI  Hisakazu SATO  Niichi ITOH  Tetsuya KAGEMOTO  Hironobu ITO  Takashi KURAFUJI  Nobuharu YOSHIOKA  Jingo NAKANISHI  Hiromi NOTANI  Rei AKIYAMA  Atsushi IWABU  Tadao YAMANAKA  Hidehiro TAKATA  Takeshi SHIBAGAKI  Takahiko ARAKAWA  Hiroshi MAKINO  Osamu TOMISAWA  Shuhei IWADE  

     
    PAPER-Design Methods and Implementation

    Vol: E86-C No:4  Page(s): 635-642

    A high-speed 32-bit RISC microcontroller has been developed. In order to realize high-speed operation with minimal hardware resources, we have developed new design and analysis methods for clock distribution, bus-line layout, and IR drop analysis. As a result, high-speed operation at 400 MHz has been achieved with a power dissipation of 0.96 W at 1.8 V.

  • Speaker Tracking for Hands-Free Continuous Speech Recognition in Noise Based on a Spectrum-Entropy Beamforming Method

    George NOKAS  Evangelos DERMATAS  

     
    LETTER-Speech and Hearing

    Vol: E86-D No:4  Page(s): 755-758

    In this paper, we present a novel beamformer capable of tracking a rapidly moving speaker in a very noisy environment. The localization algorithm extracts a set of candidate directions of arrival (DOAs) for the signal sources using array signal processing methods in the frequency domain. A minimum variance (MV) beamformer identifies the speech-signal DOA as the direction in which the signal's spectral entropy is minimized. A fine-tuning process then detects the MV direction closest to the initial estimate using a smaller analysis window. Extensive experiments, carried out in the range of 20-0 dB SNR, show significant improvement in the recognition rate for a moving speaker, especially at very low SNRs (from 11.11% to 43.79% at 0 dB SNR in an anechoic environment, and from 9.9% to 30.51% in a reverberant environment).
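
    The entropy criterion admits a compact sketch: a speech-bearing direction yields a peaky, low-entropy spectrum, while diffuse noise yields a flatter one. The per-DOA beamformer outputs below are hypothetical inputs; the MV beamforming itself is omitted.

        import numpy as np

        def spectral_entropy(frame):
            spectrum = np.abs(np.fft.rfft(frame)) ** 2
            p = spectrum / spectrum.sum()           # normalize to a distribution
            return -np.sum(p * np.log(p + 1e-12))   # Shannon entropy of spectrum

        def pick_speech_doa(beam_outputs, doas):
            entropies = [spectral_entropy(y) for y in beam_outputs]
            return doas[int(np.argmin(entropies))]  # speech DOA = minimum entropy

        frames = [np.sin(0.1 * np.arange(256)), np.random.randn(256)]
        print(pick_speech_doa(frames, doas=[30.0, 60.0]))  # tonal frame wins: 30.0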

  • Automatic LSI Package Lead Inspection System with CCD Camera for Backside Lead Specification

    Wataru TAMAMURA  Koji NAKAMAE  Hiromu FUJIOKA  

     
    PAPER-Integrated Electronics

    Vol: E86-C No:4  Page(s): 661-667

    An automatic LSI package lead inspection system for backside lead specification is proposed. The proposed system inspects not only lead backside contamination but also mechanical lead specifications such as lead pitch, lead offset, and lead overhangs (variations in lead lengths). The total inspection time for a UQFP package with a lead count of 256 is less than the required 1 second. Our proposed method is superior to the commonly used threshold method, especially for defects between leads.

  • Audio-Visual Speech Recognition Based on Optimized Product HMMs and GMM Based-MCE-GPD Stream Weight Estimation

    Kenichi KUMATANI  Satoshi NAKAMURA  

     
    PAPER-Speech and Speaker Recognition

    Vol: E86-D No:3  Page(s): 454-463

    In this paper, we describe an adaptive integration method for an audio-visual speech recognition system that uses not only the speaker's audio speech signal but also visual speech signals such as lip images. Human beings communicate with each other by integrating multiple types of sensory information, such as hearing and vision, and such integration can be applied to automatic speech recognition as well. In integrating audio and visual speech features for speech recognition, there are two important issues: (1) a model that represents the synchronous and asynchronous characteristics between audio and visual features, and makes the best use of a whole database that includes uni-modal (audio-only or visual-only) data as well as audio-visual data, and (2) the adaptive estimation of reliability weights for the audio and visual information. This paper investigates these two issues and proposes a novel method to effectively integrate audio and visual information in an audio-visual Automatic Speech Recognition (ASR) system. First, as the model that integrates audio-visual speech information, we apply a product of hidden Markov models (product HMM), the product of an audio HMM and a visual HMM. We newly propose a method that re-estimates the product HMM using audio-visual synchronous speech data so as to train the synchronicity of the audio-visual information, whereas the original product HMM assumes independence between the audio and visual features. Second, for optimal estimation of the audio-visual information reliability weights, we propose a Gaussian mixture model (GMM) based MCE-GPD (minimum classification error and generalized probabilistic descent) algorithm, which reduces the amount of adaptation data and computation required for the GMM estimation. Evaluation experiments show that the proposed audio-visual speech recognition system improves recognition accuracy over conventional ones even when the audio signals are clean.
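
    The reliability weighting at the heart of issue (2) reduces to a weighted sum of per-stream log-likelihoods, sketched below; the weight values are placeholders for what the GMM-based MCE-GPD procedure would estimate.

        def av_log_likelihood(logp_audio, logp_visual, lam_a, lam_v):
            # lam_a + lam_v is commonly constrained to 1
            return lam_a * logp_audio + lam_v * logp_visual

        score = av_log_likelihood(-42.3, -55.1, lam_a=0.7, lam_v=0.3)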

  • A Study on Acoustic Modeling of Pauses for Recognizing Noisy Conversational Speech

    Jin-Song ZHANG  Konstantin MARKOV  Tomoko MATSUI  Satoshi NAKAMURA  

     
    PAPER-Robust Speech Recognition and Enhancement

    Vol: E86-D No:3  Page(s): 489-496

    This paper presents a study on modeling inter-word pauses to improve the robustness of acoustic models for recognizing noisy conversational speech. When precise contextual modeling is used for pauses, the frequent occurrence and varying acoustics of pauses in noisy conversational speech make it difficult to automatically generate an accurate phonetic transcription of the training data for developing robust acoustic models. This paper proposes exploiting the reliable phonetic heuristics of pauses in speech to aid the detection of varying pauses. Based on this, a stepwise approach to optimizing pause HMMs was applied to the data of the DARPA SPINE2 project, and a more accurate phonetic transcription was achieved. The cross-word triphone HMMs developed using this method achieved an absolute 9.2% word error reduction compared to the conventional method with only context-free modeling of pauses. For the same pause modeling method, the use of the optimized phonetic segmentation brought an additional absolute 5.2% improvement.

  • A Context Clustering Technique for Average Voice Models

    Junichi YAMAGISHI  Masatsune TAMURA  Takashi MASUKO  Keiichi TOKUDA  Takao KOBAYASHI  

     
    PAPER-Speech Synthesis and Prosody

    Vol: E86-D No:3  Page(s): 534-542

    This paper describes a new context clustering technique for average voice models, i.e., sets of speaker-independent speech synthesis units. In this technique, we first train speaker-dependent models using a multi-speaker speech database, and then construct a decision tree common to these speaker-dependent models for context clustering. When a node of the decision tree is split, only context-related questions that are applicable to all speaker-dependent models are adopted. As a result, every node of the decision tree always has training data from all speakers. After construction of the decision tree, all speaker-dependent models are clustered using the common decision tree, and a speaker-independent model, i.e., an average voice model, is obtained by combining the speaker-dependent models. Results of subjective tests show that average voice models trained using the proposed technique can generate more natural-sounding speech than conventional average voice models.
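
    The key constraint of the shared decision tree can be sketched as a filter over candidate questions: a split is admitted only if every speaker-dependent model retains data on both sides. The toy contexts and questions below are hypothetical.

        def admissible_questions(questions, speaker_data, node_contexts):
            # speaker_data: speaker -> set of contexts seen in that speaker's data
            kept = []
            for name, q in questions:
                yes = {c for c in node_contexts if q(c)}
                no = node_contexts - yes
                # every speaker must have contexts on both sides of the split
                if all(spk & yes and spk & no for spk in speaker_data.values()):
                    kept.append(name)
            return kept

        questions = [("is_vowel", lambda c: c in set("aiueo")),
                     ("is_nasal", lambda c: c in {"n", "m"})]
        speaker_data = {"spk1": {"a", "k", "n"}, "spk2": {"i", "s", "m"}}
        print(admissible_questions(questions, speaker_data,
                                   {"a", "i", "k", "s", "n", "m"}))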

  • Automatic Estimation of Accentual Attribute Values of Words for Accent Sandhi Rules of Japanese Text-to-Speech Conversion

    Nobuaki MINEMATSU  Ryuji KITA  Keikichi HIROSE  

     
    PAPER-Speech Synthesis and Prosody

    Vol: E86-D No:3  Page(s): 550-557

    Accurate estimation of the accentual attribute values of words, which is required to apply rules of Japanese word accent sandhi to prosody generation, is an important factor in realizing high-quality text-to-speech (TTS) conversion. The rules were formulated by Sagisaka et al. and are widely used in Japanese TTS conversion systems. Applying these rules, however, requires the values of a few accentual attributes for each constituent word of the input text. These attribute values cannot be found in any public database or accent dictionary of Japanese. Further, they are difficult even for native speakers of Japanese to estimate through introspection about their mother tongue alone. In this paper, an algorithm is proposed in which these values are automatically estimated from a large amount of data on the accent types of accentual phrases, collected through a long series of listening experiments. The proposed algorithm takes good account of inter-speaker differences in knowledge of accent sandhi. To improve the coverage of the estimated values over the obtained data, the rules were tentatively modified. Evaluation experiments using two-mora accentual phrases showed the high validity of the estimated values and the modified rules, and also revealed some defects caused by the variety of linguistic expressions in Japanese.

  • Filter Bank Subtraction for Robust Speech Recognition

    Kazuo ONOE  Hiroyuki SEGI  Takeshi KOBAYAKAWA  Shoei SATO  Shinichi HOMMA  Toru IMAI  Akio ANDO  

     
    PAPER-Robust Speech Recognition and Enhancement

    Vol: E86-D No:3  Page(s): 483-488

    In this paper, we propose a new technique of filter bank subtraction for robust speech recognition under various acoustic conditions. Spectral subtraction is a simple and useful technique for reducing the influence of additive noise. Conventional spectral subtraction assumes accurate estimation of the noise spectrum and no correlation between speech and noise. These assumptions, however, are rarely satisfied in reality, leading to degradation of speech recognition accuracy. Moreover, the recognition improvement attained by conventional methods is slight when the input SNR changes sharply. We propose a new method in which the output values of filter banks are used for noise estimation and subtraction. By estimating noise at each filter bank, instead of at each frequency point, the method alleviates the need for precise noise estimation. We also take into consideration the expected phase differences between the spectra of speech and noise in the subtraction, and control a subtraction coefficient theoretically. Recognition experiments on test sets at several SNRs showed that the filter bank subtraction technique improved word accuracy significantly and achieved better results than conventional spectral subtraction on all the test sets. In further experiments on recognizing speech from TV news field reports with environmental noise, the proposed subtraction method also yielded better results than the conventional method.
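
    A minimal sketch of subtraction at the filter-bank level, assuming noise-only frames are available for the estimate; the subtraction coefficient and spectral floor are illustrative, whereas the paper derives its coefficient theoretically from the expected phase differences.

        import numpy as np

        def filterbank_subtract(fbank_frames, noise_frames, alpha=0.9, floor=0.01):
            noise = noise_frames.mean(axis=0)                 # per-band noise estimate
            cleaned = fbank_frames - alpha * noise            # subtract in each band
            return np.maximum(cleaned, floor * fbank_frames)  # flooring

        speech = np.abs(np.random.randn(100, 24)) + 1.0  # hypothetical 24-band outputs
        noise_only = np.abs(np.random.randn(20, 24))
        enhanced = filterbank_subtract(speech, noise_only)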

  • Speaker Recognition Using Adaptively Boosted Classifiers

    Say-Wei FOO  Eng-Guan LIM  

     
    PAPER-Speech and Speaker Recognition

    Vol: E86-D No:3  Page(s): 474-482

    In this paper, a novel approach to speaker recognition is proposed. The approach makes use of adaptive boosting (AdaBoost) and classifiers such as Multilayer Perceptrons (MLP) and C4.5 decision trees for closed-set, text-dependent speaker recognition. The performance of the systems is assessed using a subset of utterances drawn from the YOHO speaker verification corpus. Experiments show that significant improvement in accuracy can be achieved with the application of adaptive boosting techniques. Results also reveal that an accuracy of 98.8% for speaker identification may be achieved using the adaptively boosted C4.5 system.
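
    The boosting loop applied to the MLP and C4.5 classifiers follows the standard AdaBoost.M1 recipe, sketched below with a decision stump as a stand-in weak learner (labels in {-1, +1}).

        import numpy as np

        def adaboost(X, y, train_weak, rounds=10):
            w = np.full(len(y), 1.0 / len(y))        # example weights
            ensemble = []
            for _ in range(rounds):
                h = train_weak(X, y, w)
                miss = h(X) != y
                eps = w[miss].sum()                  # weighted error
                if eps >= 0.5:
                    break
                alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))
                ensemble.append((alpha, h))
                if eps == 0:
                    break                            # perfect learner: keep and stop
                w *= np.exp(alpha * np.where(miss, 1.0, -1.0))
                w /= w.sum()                         # re-normalize
            return ensemble

        def train_stump(X, y, w):
            # toy stand-in for an MLP or C4.5 tree: best threshold on feature 0
            best_err, best_th = np.inf, 0.0
            for th in np.unique(X[:, 0]):
                err = w[np.where(X[:, 0] > th, 1, -1) != y].sum()
                if err < best_err:
                    best_err, best_th = err, th
            return lambda Z, th=best_th: np.where(Z[:, 0] > th, 1, -1)

        X = np.array([[0.1], [0.3], [0.8], [0.9]])
        y = np.array([-1, -1, 1, 1])
        model = adaboost(X, y, train_stump)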

  • Face-to-Talk: Audio-Visual Speech Detection for Robust Speech Recognition in Noisy Environment

    Kazumasa MURAI  Satoshi NAKAMURA  

     
    PAPER-Robust Speech Recognition and Enhancement

    Vol: E86-D No:3  Page(s): 505-513

    This paper discusses "face-to-talk" audio-visual speech detection for robust speech recognition in noisy environments, which consists of a facial-orientation-based switch and audio-visual speech section detection. Most of today's speech recognition systems must actually be turned on and off by a switch, e.g., "push-to-talk," to indicate which utterance should be recognized, and a specific speech section must be detected prior to any further analysis. To improve usability and performance, we have researched how to extract useful information from the visual modality. We implemented a facial-orientation-based switch, which activates speech recognition while the speaker is facing the camera. The speech section is then detected by analyzing the image of the face. Visual speech detection is robust to audio noise, but because articulation starts before the speech and lasts longer than it, the detected section tends to be too long and gives rise to insertion errors. Therefore, we fuse the sections detected by the audio and visual modalities. Our experiment confirms that the proposed audio-visual speech detection method improves recognition performance in noisy environments.

  • A Hybrid HMM/BN Acoustic Model for Automatic Speech Recognition

    Konstantin MARKOV  Satoshi NAKAMURA  

     
    PAPER-Speech and Speaker Recognition

    Vol: E86-D No:3  Page(s): 438-445

    In current HMM-based speech recognition systems, it is difficult to supplement acoustic spectrum features with additional information such as pitch, gender, articulator positions, etc. On the other hand, Bayesian Networks (BN) allow easy combination of different continuous as well as discrete features by exploiting conditional dependencies between them. However, the lack of efficient algorithms has limited their application in continuous speech recognition. In this paper we propose a new acoustic model in which HMMs are used for modeling temporal speech characteristics and the state probability model is represented by a BN. In our experimental system based on the HMM/BN model, in addition to the speech observation variable, the state BN has two more (hidden) variables representing the noise type and the SNR value. Evaluation results on the AURORA2 database showed a 36.4% word error rate reduction on the closed noise test, which is comparable with other, much more complex systems utilizing effective adaptation and noise-robust methods.
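
    The HMM/BN state output probability can be sketched as a marginalization over the hidden noise-type and SNR variables, p(x|q) = sum over (n, s) of p(x|q, n, s) P(n) P(s); the per-condition diagonal Gaussians below are hypothetical placeholders for the trained BN.

        import numpy as np

        def state_likelihood(x, cond_gaussians, p_noise, p_snr):
            # cond_gaussians[n][s] = (mean vector, variance vector) for this state
            total = 0.0
            for n, pn in enumerate(p_noise):
                for s, ps in enumerate(p_snr):
                    mu, var = cond_gaussians[n][s]
                    g = np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
                    total += np.prod(g) * pn * ps   # diagonal-covariance Gaussian
            return total

        x = np.zeros(3)
        cg = [[(np.zeros(3), np.ones(3))]]          # one noise type, one SNR level
        print(state_likelihood(x, cg, p_noise=[1.0], p_snr=[1.0]))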

  • Continuous Speech Recognition Using an On-Line Speaker Adaptation Method Based on Automatic Speaker Clustering

    Wei ZHANG  Seiichi NAKAGAWA  

     
    PAPER-Speech and Speaker Recognition

    Vol: E86-D No:3  Page(s): 464-473

    This paper evaluates an on-line incremental speaker adaptation method for co-channel conversation involving multiple speakers, under the assumption that the speaker is unknown and changes frequently. After speaker clustering based on Vector Quantization (VQ) distortion is performed for every utterance, the acoustic models for each cluster are adapted by Maximum Likelihood Linear Regression (MLLR) or Maximum A Posteriori (MAP) estimation, improving the performance of continuous speech recognition. To demonstrate the efficiency of the speaker clustering method in improving continuous speech recognition, experiments with both supervised and unsupervised cluster adaptation were conducted. Finally, evaluation experiments on separately prepared test data were performed for continuous syllable recognition and large vocabulary continuous speech recognition (LVCSR). The experimental results strongly support the efficiency of the speaker adaptation and clustering methods presented in this paper.
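
    The clustering step can be sketched as a VQ-distortion nearest-codebook assignment with a new-cluster threshold; the threshold value and codebook shapes below are assumptions, not the paper's settings.

        import numpy as np

        def assign_cluster(features, codebooks, threshold=5.0):
            # features: (frames, dims); codebooks: list of (codewords, dims) arrays
            def distortion(cb):
                d = np.linalg.norm(features[:, None, :] - cb[None, :, :], axis=2)
                return d.min(axis=1).mean()          # mean nearest-codeword distance
            if not codebooks:
                return None                          # first utterance: new cluster
            dists = [distortion(cb) for cb in codebooks]
            best = int(np.argmin(dists))
            return best if dists[best] < threshold else None  # None -> new cluster

        utterance = np.random.randn(50, 2)
        print(assign_cluster(utterance, [np.zeros((4, 2))]))   # joins cluster 0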

  • Language Modeling Using Patterns Extracted from Parse Trees for Speech Recognition

    Takatoshi JITSUHIRO  Hirofumi YAMAMOTO  Setsuo YAMADA  Genichiro KIKUI  Yoshinori SAGISAKA  

     
    PAPER-Speech and Speaker Recognition

    Vol: E86-D No:3  Page(s): 446-453

    We propose new language models that represent phrasal structures using patterns extracted from parse trees. First, modified word trigram models are proposed. They are extracted from sentences analyzed by the knowledge-based preprocessing of a parser. Since sentences are analyzed into sub-trees of a few words, these trigram models can represent relations among a few neighboring words more strongly than conventional word trigram models. Second, word pattern models are used on top of these modified word trigram models. The word patterns are extracted from parse trees and can represent phrasal structures and much longer word dependencies than trigram models. Experimental results show that the modified trigram models are more effective than traditional trigram models, and that the pattern models attain slight improvements over the modified trigram models. Furthermore, additional experiments show that the pattern models are more effective for long sentences.
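
    For orientation, the conventional word trigram baseline that the proposed models modify can be sketched with a simple backoff scheme; the toy counts and backoff constant are illustrative, not the authors' smoothing.

        def trigram_prob(w1, w2, w3, tri, bi, uni, total, backoff=0.4):
            # back off from trigram to bigram to unigram relative frequencies
            if tri.get((w1, w2, w3), 0) > 0:
                return tri[(w1, w2, w3)] / bi[(w1, w2)]
            if bi.get((w2, w3), 0) > 0:
                return backoff * bi[(w2, w3)] / uni[w2]
            return backoff * backoff * uni.get(w3, 0) / total

        uni = {"the": 3, "cat": 2, "sat": 1}
        bi = {("the", "cat"): 2, ("cat", "sat"): 1}
        tri = {("the", "cat", "sat"): 1}
        print(trigram_prob("the", "cat", "sat", tri, bi, uni, total=6))  # 0.5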

  • Modified Restricted Temporal Decomposition and Its Application to Low Rate Speech Coding

    Phu Chien NGUYEN  Takao OCHI  Masato AKAGI  

     
    PAPER-Speech and Audio Coding

    Vol: E86-D No:3  Page(s): 397-405

    This paper presents a method of temporal decomposition (TD) for line spectral frequency (LSF) parameters, called "Modified Restricted Temporal Decomposition" (MRTD), and its application to low rate speech coding. LSF parameters have not previously been used for TD because of stability problems in the linear predictive coding (LPC) model. To overcome this deficiency, the proposed TD method applies a refinement process to the event vectors to preserve their LSF ordering property. Meanwhile, the restricted second-order TD model, in which only two adjacent event functions can overlap and the event functions at any time sum to one, is utilized to reduce the computational cost of TD. In addition, based on the geometric interpretation of TD, the MRTD method enforces a new property on the event functions, named the "well-shapedness" property, to model the temporal structure of speech more effectively. This paper also proposes a method for speech coding at rates around 1.2 kbps based on STRAIGHT, a high-quality speech analysis-synthesis method, using MRTD. In this speech coding method, MRTD-based vector quantization is used to encode the spectral information of speech. Subjective test results indicate that the speech quality of the proposed method is close to that of the 4.8 kbps FS-1016 CELP coder.
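
    The restricted second-order TD model admits a compact sketch: each LSF frame is reconstructed as a convex combination of at most two adjacent event vectors, with event functions that sum to one at every frame. The hat-shaped event functions below are purely illustrative.

        import numpy as np

        def reconstruct_lsf(events, phi):
            # events: (K, dims) event vectors; phi: (K, frames), columns sum to 1
            return events.T @ phi                    # (dims, frames) trajectories

        K, D, T = 4, 10, 40
        events = np.sort(np.random.rand(K, D), axis=1)   # ordered like LSFs
        t = np.linspace(0, K - 1, T)
        # hat functions: only two adjacent events overlap at any frame
        phi = np.maximum(0, 1 - np.abs(t[None, :] - np.arange(K)[:, None]))
        phi /= phi.sum(axis=0, keepdims=True)            # enforce sum-to-one
        recon = reconstruct_lsf(events, phi)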

  • A New Multistage Search of Algebraic CELP Codebooks Based on Trellis Coding

    Mohammed HALIMI  Abdellah KADDAI  Messaoud BENGHERABI  

     
    PAPER-Speech and Audio Coding

    Vol: E86-D No:3  Page(s): 406-411

    This paper proposes a new multistage search technique for algebraic codebooks in CELP coders, called Trellis Search, inspired by Trellis Coded Quantization (TCQ). This search technique is implemented in the fixed codebook of the G.729 standard for objective evaluation on a large corpus of test speech. Simulation results show that, in terms of computer execution time, the proposed search scheme reduces the codebook search time by approximately 23% compared to the focused search used in the G.729 standard. This yields a reduction of about 8% in the execution time of the coder, at the cost of a slight but perceptually unnoticeable degradation of speech quality. Moreover, the new technique achieves better speech quality than G.729A at the expense of higher complexity.

  • Grey Filtering and Its Application to Speech Enhancement

    Cheng-Hsiung HSIEH  

     
    PAPER-Robust Speech Recognition and Enhancement

    Vol: E86-D No:3  Page(s): 522-533

    In this paper, a grey filtering approach based on the GM(1,1) model is proposed and applied to speech enhancement. The fundamental idea of the proposed grey filtering is to relate the estimation error of the GM(1,1) model to the additive noise. Simulation results indicate that the additive noise can be estimated accurately by the proposed grey filtering approach with an appropriate scaling factor. Since the spectral subtraction approach to speech enhancement depends heavily on the accuracy of the additive-noise statistics, and grey filtering is able to estimate the additive noise appropriately, a magnitude spectral subtraction (MSS) approach for speech enhancement is proposed in which no mechanism for discriminating non-speech from speech portions is required. Two examples are provided to justify the proposed MSS approach based on grey filtering, and the simulation results show that the objective of speech enhancement is achieved. The proposed MSS approach is also compared with the HFR-based approach in [4] and the ZP approach in [5]. Simulation results indicate that in most cases the HFR-based and ZP approaches outperform the proposed MSS approach in SNRimp; however, the proposed MSS approach offers better subjective listening quality.
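
    The GM(1,1) model at the core of the grey filter can be sketched directly: the accumulated series is fitted to dx1/dt + a*x1 = b by least squares, and the residual between the data and the model response plays the role of the noise estimate (the paper's scaling factor is omitted; the test series is synthetic).

        import numpy as np

        def gm11_fit(x0):
            x1 = np.cumsum(x0)                       # accumulated generating operation
            z = 0.5 * (x1[1:] + x1[:-1])             # background values
            B = np.column_stack([-z, np.ones(len(z))])
            a, b = np.linalg.lstsq(B, x0[1:], rcond=None)[0]
            k = np.arange(len(x0))
            x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a
            x0_hat = np.concatenate([[x0[0]], np.diff(x1_hat)])
            return x0_hat, x0 - x0_hat               # model response, residual

        series = np.exp(0.05 * np.arange(20)) + 0.1 * np.random.randn(20)
        fitted, residual = gm11_fit(series)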

  • Crosstalk Equalization for High-Speed Digital Transmission Systems

    Hui-Chul WON  Gi-Hong IM  

     
    PAPER-Wireless Communication Technology

    Vol: E86-B No:3  Page(s): 1063-1072

    In this paper, we discuss a crosstalk equalization technique for high-speed digital transmission systems. The technique makes use of the cyclostationarity of the crosstalk interferer. We first analyze the eigenstructure of the equalizer in the presence of cyclostationary crosstalk interference. It is shown that the eigenvalues of the equalizer depend on the folded power spectra of the signal and the interferer, and on the cross power spectrum between the signal and the interferer. Expressions for the minimum mean square error (MMSE) and the excess MSE are then obtained using the equalizer's eigenstructure. Analysis and simulation results indicate that this peculiar eigenstructure in the presence of cyclostationary interference results in significantly different initial convergence and steady-state behaviors compared with the stationary noise case. We also show that the performance of the equalizer varies with the relative phase of the symbol clocks used by the signal and the crosstalk interferer.
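
    A small sketch of the quantity the analysis turns on: the eigenvalue spread of the equalizer's tap-input correlation matrix, which cyclostationary crosstalk reshapes and which governs the equalizer's initial convergence. The synthetic signals below are assumptions for illustration only.

        import numpy as np

        def eigen_spread(x, taps=11):
            # Toeplitz tap-input correlation matrix from an autocorrelation estimate
            r = [np.dot(x[:len(x) - k], x[k:]) / (len(x) - k) for k in range(taps)]
            R = np.array([[r[abs(i - j)] for j in range(taps)] for i in range(taps)])
            eig = np.linalg.eigvalsh(R)
            return eig.max() / eig.min()

        n = 20000
        sig = np.sign(np.random.randn(n))            # binary data signal
        xtalk = 0.5 * np.sign(np.random.randn(n)) * np.cos(0.3 * np.pi * np.arange(n))
        print(eigen_spread(sig + xtalk))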
