Osamu ICHIKAWA Tetsuya TAKIGUCHI Masafumi NISHIMURA
It is believed that distant-talking speech recognition in a noisy environment requires a large-scale microphone array. However, this cannot fit into small consumer devices. Our objective is to improve the performance with a limited number of microphones (preferably only left and right). In this paper, we focused on a profile that is the shape of the power distribution according to the beamforming direction. An observed profile can be decomposed into known profiles for directional sound sources and a non-directional background sound source. Evaluations confirmed this method reduced the CER (Character Error Ratio) for the dictation task by more than 20% compared to a conventional 2-channel Adaptive Spectral Subtraction beamformer in a non-reverberant environment.
In this paper, a grey filtering approach based on the GM(1,1) model is proposed and applied to speech enhancement. The fundamental idea of the proposed grey filtering is to relate the estimation error of the GM(1,1) model to additive noise. Simulation results indicate that the additive noise can be estimated accurately by the proposed grey filtering approach with an appropriate scaling factor. Note that the spectral subtraction approach to speech enhancement depends heavily on the accuracy of the additive noise statistics, and that grey filtering is able to estimate additive noise appropriately. A magnitude spectral subtraction (MSS) approach to speech enhancement is proposed in which a mechanism to determine the non-speech and speech portions is not required. Two examples are provided to justify the proposed MSS approach based on grey filtering. The simulation results show that the objective of speech enhancement is achieved by the proposed MSS approach. The proposed MSS approach is also compared with the HFR-based approach in [4] and the ZP approach in [5]. Simulation results indicate that in most cases the HFR-based and ZP approaches outperform the proposed MSS approach in SNRimp. However, the proposed MSS approach has better subjective listening quality than the HFR-based and ZP approaches.
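The idea of relating the GM(1,1) estimation error to additive noise can be illustrated with a minimal sketch of the textbook GM(1,1) fit. The function names (`gm11_fit`, `gm11_predict`, `gm11_residual`) are hypothetical, the input is assumed to be a positive sequence, and the scaling factor mentioned in the abstract is not modeled here; this is an illustration under those assumptions, not the authors' implementation.

```python
import math

def gm11_fit(x0):
    """Least-squares fit of the GM(1,1) grey model x0[k] = -a*z[k] + b.

    x0 is assumed to be a positive sequence with at least 3 samples.
    Returns the development coefficient a and the grey input b.
    """
    n = len(x0)
    if n < 3:
        raise ValueError("GM(1,1) needs at least 3 samples")
    # Accumulated generating operation (running sum).
    x1, total = [], 0.0
    for v in x0:
        total += v
        x1.append(total)
    # Background values: mean of consecutive accumulated samples.
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, n)]
    y = x0[1:]
    m = n - 1
    sz, sy = sum(z), sum(y)
    szz = sum(v * v for v in z)
    szy = sum(zi * yi for zi, yi in zip(z, y))
    slope = (m * szy - sz * sy) / (m * szz - sz * sz)  # slope equals -a
    b = (sy - slope * sz) / m
    return -slope, b

def gm11_predict(x0, a, b):
    """Reconstruct the sequence from the fitted model."""
    n = len(x0)
    if abs(a) < 1e-12:
        # Degenerate case: the model reduces to a constant level b.
        return [x0[0]] + [b] * (n - 1)
    c = x0[0] - b / a
    x1_hat = [c * math.exp(-a * k) + b / a for k in range(n)]
    # Inverse accumulation recovers the original-domain estimate.
    return [x0[0]] + [x1_hat[k] - x1_hat[k - 1] for k in range(1, n)]

def gm11_residual(x0):
    """Model estimation error, usable as a raw noise estimate in this sketch."""
    a, b = gm11_fit(x0)
    return [x - xh for x, xh in zip(x0, gm11_predict(x0, a, b))]
```

On a smooth, noise-free sequence the residual stays close to zero, so in this sketch whatever the model cannot track is attributed to noise.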
In this paper, we discuss a crosstalk equalization technique for high-speed digital transmission systems. This equalization technique makes use of the cyclostationarity of the crosstalk interferer. We first analyze the eigenstructure of the equalizer in the presence of cyclostationary crosstalk interference. It is shown that the eigenvalues of the equalizer depend upon the folded signal and interferer power spectra, and the cross power spectrum between the signal and the interferer. The expressions for the minimum mean square error (MMSE) and the excess MSE are then obtained by using the equalizer's eigenstructure. Analysis and simulation results indicate that this peculiar eigenstructure of the equalizer in the presence of cyclostationary interference results in significantly different initial convergence and steady-state behaviors compared with the stationary noise case. We also show that the performance of the equalizer varies depending on the relative phase of the symbol clocks used by the signal and the crosstalk interferer.
A transmit adaptive array requires the forward link channel state to evaluate the optimum transmit weights, and a feedback channel transports this channel state to the base station. Since the feedback information limits the transmission rate of the reverse link traffic, it is necessary to keep the number of feedback bits to a minimum. This paper presents a system in which the N transmit antennas are extended to 2N transmit antennas while the feedback channel remains limited to that of an N-transmit-antenna system. The additional antennas can give additional diversity gain but require a higher rate of feedback bits. The limited feedback channel increases the quantization error of the feedback information, since the number of feedback bits assigned to each antenna is reduced. To overcome the problem of the limited feedback channel rate, this paper proposes transmit antenna selection schemes that use the limited feedback bits effectively, reduce the computational complexity at the mobile station, and eventually achieve diversity gain. System performance is investigated for the case of N=4 for the various antenna selection schemes on both flat fading and multi-path fading channels.
Sung-Hyun YANG Younggap YOU Kyoung-Rok CHO
A dual-modulus (divide-by-128/129) prescaler has been designed based on 0.25-µm CMOS technology employing new D-flip-flops. The new D-flip-flops are free from glitch problems due to internal charge sharing. A transistor merging technique has been employed to reduce the number of transistors and to secure reliable high-speed operation. At a supply voltage of 2.5 V, the prescaler using the proposed dynamic D-flip-flops can operate at frequencies up to 2.95 GHz, and consumes about 10% and about 27% less power than Yuan/Svensson's and Huang's circuits, respectively.
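The divide-by-128/129 behavior can be sketched as a purely behavioral model, independent of the flip-flop circuit described above. The function `dual_modulus_divide` and its `select_129` callback are hypothetical names introduced only for illustration; the sketch assumes the modulus is chosen once per output cycle.

```python
def dual_modulus_divide(input_cycles, select_129):
    """Return the input-cycle indices at which each output cycle completes.

    select_129(k) is a hypothetical control callback: it returns True when
    the k-th output cycle should divide by 129, and False for 128.
    """
    edges = []
    count = 0
    out_index = 0
    for i in range(1, input_cycles + 1):
        count += 1
        modulus = 129 if select_129(out_index) else 128
        if count == modulus:
            edges.append(i)   # one output cycle finished
            count = 0
            out_index += 1
    return edges

# Always divide by 128: outputs complete every 128 input cycles.
print(dual_modulus_divide(384, lambda k: False))
# Divide by 129 for the first output cycle only, then by 128.
print(dual_modulus_divide(300, lambda k: k == 0))
```

Selecting 129 for some output cycles and 128 for the rest is what allows a frequency synthesizer built around such a prescaler to realize division ratios between multiples of 128.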
Tomoko OHSUGA Yasuo HORIUCHI Akira ICHIKAWA
In this study, we introduce a method for estimating the syntactic structure of Japanese speech from F0 contour and pause duration. We defined a prosodic unit (PU) which is divided by the local minimal point of an F0 contour or pause. Combining PUs repeatedly (a pair of PUs is combined into one PU), a tree structure is gradually generated. Which pair of PUs in a sequence of three PUs should be combined is decided by a discriminant function based on the discriminant analysis of a corpus of speech data. We applied the method to the ATR Phonetically Balanced Sentences read by four Japanese speakers. We found that with this method, the correct rate of judgement for each sequence of three PUs is 79% and the estimation accuracy of the entire syntactic structure for each sentence is 26%. We consider this result to demonstrate a good degree of accuracy for the difficult task of estimating syntactic structure only from prosody.
In this paper, a novel approach to speaker recognition is proposed. The approach makes use of adaptive boosting (AdaBoost) and classifiers such as Multilayer Perceptrons (MLP) and C4.5 Decision Trees for closed set, text-dependent speaker recognition. The performance of the systems is assessed using a subset of utterances drawn from the YOHO speaker verification corpus. Experiments show that significant improvement in accuracy can be achieved with the application of adaptive boosting techniques. Results also reveal that an accuracy of 98.8% for speaker identification may be achieved using the adaptively boosted C4.5 system.
Jian-Qing LI Hong-Shik PARK Hyeong-Ho LEE
In wavelength division multiplexed networks, shared path protection provides the same level of protection against a single fiber-link failure as dedicated path protection, with potentially higher network utilization. However, shared path protection is more complex to provision and maintain. In this paper, we introduce a parameter, the degree of sharing, which refers to the number of protection paths that a wavelength can be assigned to on a link. We propose methods for calculating the maximum degree of sharing. We consider on-line routing and wavelength assignment (RWA) of protection paths that are established for incremental traffic using the maximum degree of sharing. Establishing protection paths using the maximum degree of sharing can simplify the algorithm. We compare the results on the reduction in calculation time together with the accepted connection requests for a given number of wavelengths, assuming that wavelengths are assigned according to the First-Fit policy for working paths and the Last-Fit policy for protection paths. The more wavelengths are used, the more the calculation time can be reduced. When the load increases, the rate of reduction in calculation time also increases.
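The First-Fit and Last-Fit policies mentioned above can be sketched as follows. This is a minimal illustration, assuming wavelength continuity along the path and a simple per-link set of occupied wavelengths; the names `usage`, `first_fit`, and `last_fit` are hypothetical, and the degree-of-sharing check for protection paths is deliberately left out.

```python
def is_free(usage, path, w):
    """True if wavelength w is unused on every link of the path."""
    return all(w not in usage.get(link, set()) for link in path)

def first_fit(usage, path, num_wavelengths):
    """Lowest-indexed wavelength free on the whole path, or None."""
    for w in range(num_wavelengths):
        if is_free(usage, path, w):
            return w
    return None

def last_fit(usage, path, num_wavelengths):
    """Highest-indexed wavelength free on the whole path, or None."""
    for w in range(num_wavelengths - 1, -1, -1):
        if is_free(usage, path, w):
            return w
    return None

# Occupied wavelengths per link (illustrative data only).
usage = {"A-B": {0}, "B-C": {1}}
path = ["A-B", "B-C"]
print(first_fit(usage, path, 4))  # working path takes the lowest free index
print(last_fit(usage, path, 4))   # protection path takes the highest free index
```

Assigning working paths from the low end and protection paths from the high end tends to keep the two kinds of wavelengths apart in the index space.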
Mohammed HALIMI Abdellah KADDAI Messaoud BENGHERABI
This paper proposes a new multistage search technique for the algebraic codebook in CELP coders, called Trellis Search, inspired by Trellis Coded Quantization (TCQ). This search technique is implemented in the fixed codebook of the standard G.729 for objective evaluation on a large test speech database. Simulation results show that, in terms of computer execution time, the proposed search scheme reduces the codebook search by approximately 23% compared to the focused search used in the standard G.729. This yields a reduction of about 8% in the computer execution time of the coder, at the cost of a slight degradation of speech quality that is not perceptually noticeable. Moreover, this new technique shows better speech quality than G.729A at the expense of higher complexity.
Kuniharu KISHIDA Hidekazu FUKAI Takashi HARA Kazuhiro SHINOSAKI
A new blind identification method of transfer functions between variables in feedback systems is introduced for single-sweep MEG data. The method is based on the viewpoint of stochastic/statistical inverse problems. The model is required to be a stationary, linear Gaussian process. Raw MEG data of brain activities are heavily contaminated with several noises and artifacts. Their elimination is a crucial problem, especially for this method. Usually, these noises and artifacts are removed by notch and high-pass filters which are preset automatically. In the present paper, we try two types of more careful preprocessing procedures for the identification method to obtain impulse functions. One is careful notch filtering and the other is a blind source separation method based on temporal structure. As a result, the identifiability of transfer functions and their impulse responses is improved in both cases. Transfer functions and impulse responses between MEG sensors are identified by using the method in Appendix A, with the eyes closed in the resting state. Some advantages of the blind source separation method are discussed.
Yoshinori ODA Yasuyuki OHKURA Kaina SUZUKI Sanae ITO Hirotaka AMAKAWA Kenji NISHI
A new analysis method for random-dopant-induced threshold voltage fluctuations using Monte Carlo ion implantation is presented. The method was applied to investigate Vt fluctuations due to statistical variation of the pocket dopant profile in 0.1 µm MOSFETs using a 3D process-device simulation system. This method is very useful for efficiently analyzing statistical fluctuations in sub-100 nm MOSFETs.
Kenichi KUMATANI Satoshi NAKAMURA
In this paper, we describe an adaptive integration method for an audio-visual speech recognition system that uses not only the speaker's audio speech signal but also visual speech signals such as lip images. Human beings communicate with each other by integrating multiple types of sensory information such as hearing and vision. Such integration can be applied to automatic speech recognition, too. In the integration of audio and visual speech features for speech recognition, there are two important issues, i.e., (1) a model that represents the synchronous and asynchronous characteristics between audio and visual features, and makes the best use of a whole database that includes uni-modal, audio only, or visual only data as well as audio-visual data, and (2) the adaptive estimation of reliability weights for the audio and visual information. This paper mainly investigates these two issues and proposes a novel method to effectively integrate audio and visual information in an audio-visual Automatic Speech Recognition (ASR) system. First, as the model that integrates audio-visual speech information, we apply a product of hidden Markov models (product HMM), the product of an audio HMM and a visual HMM. We newly propose a method that re-estimates the product HMM using audio-visual synchronous speech data so as to train the synchronicity of the audio-visual information, while the original product HMM assumes independence between the audio and visual features. Second, for the optimal audio-visual information reliability weight estimation, we propose a Gaussian mixture model (GMM) based MCE-GPD (minimum classification error and generalized probabilistic descent) algorithm, which enables reductions in the amount of adaptation data and the amount of computation required for the GMM estimation. Evaluation experiments show that the proposed audio-visual speech recognition system improves the recognition accuracy over conventional ones even if the audio signals are clean.
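Two ingredients named above, the product state space and the reliability weights, can be illustrated with a minimal sketch. It assumes the common convention that the audio and visual stream weights sum to one and are applied to log-likelihoods; the function names are hypothetical, and the re-estimation and MCE-GPD weight training described in the abstract are not shown.

```python
def product_states(audio_states, visual_states):
    """State space of a product HMM: every (audio, visual) state pair."""
    return [(a, v) for a in audio_states for v in visual_states]

def combined_log_likelihood(log_b_audio, log_b_visual, audio_weight):
    """Weighted sum of per-stream log-likelihoods.

    audio_weight is the assumed audio reliability weight in [0, 1];
    the visual stream receives the remaining 1 - audio_weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * log_b_audio + (1.0 - audio_weight) * log_b_visual

# Three audio states and three visual states give nine product states.
states = product_states(["a1", "a2", "a3"], ["v1", "v2", "v3"])
print(len(states))

# A high audio weight lets the audio stream dominate the combined score.
print(combined_log_likelihood(-2.0, -6.0, 0.75))
```

In noisy audio conditions one would expect a lower audio weight to be preferable, which is why the weight is treated as a quantity to be estimated adaptively.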
Jin-Song ZHANG Konstantin MARKOV Tomoko MATSUI Satoshi NAKAMURA
This paper presents a study on modeling inter-word pauses to improve the robustness of acoustic models for recognizing noisy conversational speech. When precise contextual modeling is used for pauses, the frequent appearances and varying acoustics of pauses in noisy conversational speech make it difficult to automatically generate an accurate phonetic transcription of the training data for developing robust acoustic models. This paper presents a proposal to exploit the reliable phonetic heuristics of pauses in speech to aid the detection of varying pauses. Based on this proposal, a stepwise approach to optimizing pause HMMs was applied to the data of the DARPA SPINE2 project, and a more accurate phonetic transcription was achieved. The cross-word triphone HMMs developed using this method achieved an absolute 9.2% word error reduction compared to the conventional method with only context-free modeling of pauses. For the same pause modeling method, the use of the optimized phonetic segmentation brought about an absolute 5.2% improvement.
Junichi YAMAGISHI Masatsune TAMURA Takashi MASUKO Keiichi TOKUDA Takao KOBAYASHI
This paper describes a new context clustering technique for an average voice model, which is a set of speaker independent speech synthesis units. In the technique, we first train speaker dependent models using a multi-speaker speech database, and then construct a decision tree common to these speaker dependent models for context clustering. When a node of the decision tree is split, only the context related questions which are applicable to all speaker dependent models are adopted. As a result, every node of the decision tree always has training data of all speakers. After construction of the decision tree, all speaker dependent models are clustered using the common decision tree, and a speaker independent model, i.e., an average voice model, is obtained by combining the speaker dependent models. From the results of subjective tests, we show that the average voice models trained using the proposed technique can generate more natural sounding speech than the conventional average voice models.
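The constraint on question selection can be sketched as a simple admissibility check. This is one reading of the abstract, assuming a question is adopted only if both child nodes keep data from every speaker; the function name and the data layout (a list of `(speaker, context)` pairs) are hypothetical, and the split criterion used to rank admissible questions is not shown.

```python
def split_keeps_all_speakers(node_items, question, speakers):
    """Check whether a candidate question is admissible at a node.

    node_items: list of (speaker, context) pairs held by the node.
    question:   function mapping a context to True (yes) or False (no).
    speakers:   collection of all speaker identifiers.
    """
    all_speakers = set(speakers)
    yes_speakers = {s for s, c in node_items if question(c)}
    no_speakers = {s for s, c in node_items if not question(c)}
    # Both children must still contain data from every speaker.
    return yes_speakers == all_speakers and no_speakers == all_speakers

# Illustrative data: contexts are plain phoneme labels here.
items = [("spk1", "a"), ("spk1", "k"), ("spk2", "a"), ("spk2", "k")]
is_vowel = lambda c: c in {"a", "i", "u", "e", "o"}
is_k = lambda c: c == "k"

print(split_keeps_all_speakers(items, is_vowel, ["spk1", "spk2"]))
print(split_keeps_all_speakers(items + [("spk3", "a")], is_k,
                               ["spk1", "spk2", "spk3"]))
```

In the second call the speaker `spk3` has no data answering "yes" to the question, so under this reading the question would be rejected at that node.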
Nobuaki MINEMATSU Ryuji KITA Keikichi HIROSE
Accurate estimation of accentual attribute values of words, which is required to apply rules of Japanese word accent sandhi to prosody generation, is an important factor to realize high-quality text-to-speech (TTS) conversion. The rules were already formulated by Sagisaka et al. and are widely used in Japanese TTS conversion systems. Application of these rules, however, requires values of a few accentual attributes of each constituent word of input text. The attribute values cannot be found in any public database or any accent dictionaries of Japanese. Further, these values are difficult even for native speakers of Japanese to estimate only with their introspective consideration of properties of their mother tongue. In this paper, an algorithm was proposed, where these values were automatically estimated from a large amount of data of accent types of accentual phrases, which were collected through a long series of listening experiments. In the proposed algorithm, inter-speaker differences of knowledge of accent sandhi were well considered. To improve the coverage of the estimated values over the obtained data, the rules were tentatively modified. Evaluation experiments using two-mora accentual phrases showed the high validity of the estimated values and the modified rules and also some defects caused by varieties of linguistic expressions of Japanese.
Matsuto OGAWA Hideaki TSUCHIYA Tanroku MIYOSHI
We describe the progress we have achieved in the development of our quantum transport modeling for nano-scale devices. Our simulation is based upon either the non-equilibrium Green's function method (NEGF) or the quantum correction (QC) associated with the density gradient method (DG) and/or the effective potential method (EP). We show the results of our modeling methods applied to several devices and discuss issues with regard to computational time, open boundary conditions, and their relationship to the self-consistent solution of the Poisson-NEGF equations. We also discuss the corresponding issues for efficiently tailored QC Monte Carlo techniques.
Hongwei KONG Ning GE Fang RUAN Chongxi FENG Pingyi FAN
In this paper, we propose a nonlinear control model to characterize the AQM algorithm GREEN. Based on this model, we analyze its performance and prove that a stable oscillation exists at equilibrium. Furthermore, we investigate the effects of factors such as bandwidth, round trip time, and load level on the amplitude and frequency of the oscillation. Theoretical analysis and simulation results indicate that the GREEN algorithm is insensitive to network conditions when the link rate and the round trip time are relatively small, and becomes more sensitive to changes in network conditions when the bandwidth delay product is relatively high. For GREEN, adaptability to a wide range of network conditions comes at the expense of efficiency.
Takao OGURA Junji SUZUKI Akira CHUGO Masafumi KATOH Tomonori AOYAMA
As use of the Internet continues to spread rapidly, Traffic Engineering (TE) is needed to optimize IP network resource utilization. In particular, load balancing with TE can prevent traffic concentration on a single path between ingress and egress routers. To apply TE, we have constructed an MPLS (Multi-Protocol Label Switching) network with TE capability in the JGN (Japan Gigabit Network), and evaluated dynamic load balancing behavior in it from the viewpoint of control stability. We confirmed that with this method, setting appropriate control parameter values enables traffic to be equally distributed over two or more routes in an actual large-scale network. In addition, we verified the method's effectiveness by using a digital cinema application as input traffic.
Chung-Jr LIAN Zhong-Lan YANG Hao-Chieh CHANG Liang-Gee CHEN
This paper presents a hardware-efficient architecture of tree-depth scan (TDS) and multiple quantization (MQ) scheme for zerotree coding in MPEG-4 still texture coder. The proposed TDS architecture can achieve its maximal throughput to area ratio and minimize the external memory access with only one wavelet-tree size on-chip buffer. The MQ scheme adopts the power-of-two (POT) quantization to realize a cost-effective hardware implementation. The prototyping chip has been implemented in TSMC 0.35 µm CMOS 1P4M technology. This architecture can handle 30 4-CIF (704×576) frames per second with five spatial scalability and five SNR scalability layers at 100 MHz working frequency.
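Power-of-two quantization is attractive in hardware because dividing by a step of 2^s reduces to a bit shift. The sketch below illustrates that idea together with a rough cycle budget; the function names are hypothetical, truncation toward zero is an assumption rather than the chip's documented rounding rule, and the budget counts only the 704×576 luma samples of a 4-CIF frame.

```python
def pot_quantize(coeff, shift):
    """Quantize an integer coefficient with step 2**shift using a shift.

    Sign and magnitude are handled separately so that negative values
    truncate toward zero (an assumption made for this sketch).
    """
    sign = -1 if coeff < 0 else 1
    return sign * (abs(coeff) >> shift)

def pot_dequantize(level, shift):
    """Reconstruct the coefficient by shifting the magnitude back."""
    sign = -1 if level < 0 else 1
    return sign * (abs(level) << shift)

def cycles_per_pixel(clock_hz, width, height, fps):
    """Clock cycles available per pixel at the given frame rate."""
    return clock_hz / (width * height * fps)

print(pot_quantize(37, 3), pot_quantize(-37, 3))   # step size 8
print(pot_dequantize(4, 3))
# Rough budget for 30 frames/s of 704x576 samples at 100 MHz.
print(round(cycles_per_pixel(100e6, 704, 576, 30), 2))
```

The budget of roughly eight clock cycles per luma sample gives a feel for why a scan order with minimal external memory access matters at this frame rate.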
Tung-Chou CHEN Che-Ho WEI Shyue-Win WEI
Based on a modified step-by-step decoding procedure, a high-speed pipelined Reed-Solomon decoder is presented. The decoder requires only the delay time of three 2-input XOR gates for decoding each coded symbol. The decoder can operate at bit rates on the order of Gbits/sec and is thus suitable for very high speed data transmission systems.