Conventional confidence measures for assessing the reliability of ASR (automatic speech recognition) output are typically derived from "low-level" information obtained during speech recognition decoding. In contrast to these approaches, we propose a novel utterance verification framework which incorporates "high-level" knowledge sources. Specifically, we investigate two application-independent measures: in-domain confidence, the degree of match between the input utterance and the application domain of the back-end system, and discourse coherence, the consistency between consecutive utterances in a dialogue session. A joint confidence score is generated by combining these two measures with an orthodox measure based on GPP (generalized posterior probability). The proposed framework was evaluated on an utterance verification task for spontaneous dialogue performed via an English/Japanese speech-to-speech translation system. Incorporating the two proposed measures significantly improved utterance verification accuracy compared to using GPP alone, realizing reductions in CER (confidence error-rate) of 11.4% and 8.1% for the English and Japanese sides, respectively. When negligible ASR errors (those that do not affect translation) were ignored, further improvement was achieved for the English side, realizing a reduction in CER of up to 14.6% compared to the GPP case.
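As a rough illustration of how a joint confidence score and the CER might be computed, the following minimal sketch combines three per-utterance measures and counts wrong accept/reject decisions. The weighted linear combination, the weights, the threshold, and the function names `joint_confidence` and `confidence_error_rate` are assumptions made for illustration; the abstract does not state how the measures are actually combined.

```python
from typing import Sequence


def joint_confidence(gpp: float, in_domain: float, coherence: float,
                     weights: Sequence[float] = (0.5, 0.25, 0.25)) -> float:
    """Combine three confidence measures (each assumed to lie in [0, 1]).

    The linear form and the default weights are illustrative assumptions.
    """
    w_gpp, w_dom, w_coh = weights
    return w_gpp * gpp + w_dom * in_domain + w_coh * coherence


def confidence_error_rate(scores: Sequence[float],
                          is_correct: Sequence[bool],
                          threshold: float) -> float:
    """Fraction of utterances whose accept/reject decision is wrong.

    An utterance is accepted when its score reaches the threshold. An error
    is either accepting an incorrect utterance or rejecting a correct one.
    """
    if len(scores) != len(is_correct) or len(scores) == 0:
        raise ValueError("scores and is_correct must be non-empty and equal in length")
    errors = sum((s >= threshold) != ok for s, ok in zip(scores, is_correct))
    return errors / len(scores)
```

In practice the combination weights and the threshold would be tuned on held-out data rather than fixed by hand.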
Weifeng LI Katsunobu ITOU Kazuya TAKEDA Fumitada ITAKURA
We address issues for improving hands-free speech enhancement and speech recognition performance in different car environments using a single distant microphone. This paper describes a new single-channel in-car speech enhancement method that estimates the log spectra of speech at a close-talking microphone by nonlinear regression on the log spectra of the noisy signal captured by a distant microphone and of the estimated noise. In our subjective evaluation, the regression-enhanced speech showed significant overall quality improvements, and the proposed method performed best on most objective measures. In isolated word recognition experiments conducted in 15 real car environments, the proposed adaptive nonlinear regression approach achieved average relative word error rate (WER) reductions of 50.8% and 13.1% compared to the original noisy speech and the ETSI advanced front-end (ETSI ES 202 050), respectively.
Scaling of CMOS integrated circuits is becoming difficult, mainly due to a rapid increase in power dissipation. How will semiconductor technology and the industry develop? This paper discusses challenges and opportunities in system LSI from three perspectives: the transistor level (physics), the IC level (electronics), and the business level (economics).
In this letter, a hybrid selection combining (SC) and maximal ratio receive combining (MRRC) technique is proposed for orthogonal frequency-division multiplexing (OFDM) systems with multiple receive antennas. The proposed technique still uses multiple receive antennas, but it requires only a single RF front-end and a single baseband demodulator. Compared with an OFDM system with no diversity, it achieves superior gain irrespective of bandwidth efficiency. Compared with MRRC OFDM, it achieves better gain under a bandwidth efficiency of 3 bps/Hz at a bit error rate of 10^-6.
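For readers unfamiliar with the two building blocks, the sketch below shows textbook selection combining and maximal ratio combining for one subcarrier, assuming perfectly known flat channel gains per branch. It is not the hybrid scheme of the letter, whose details are not given in the abstract. The function names are illustrative.

```python
from typing import Sequence


def selection_combine(received: Sequence[complex],
                      channels: Sequence[complex]) -> complex:
    """Use only the branch with the largest channel magnitude, then equalize it."""
    best = max(range(len(channels)), key=lambda i: abs(channels[i]))
    return received[best] / channels[best]


def maximal_ratio_combine(received: Sequence[complex],
                          channels: Sequence[complex]) -> complex:
    """Weight every branch by the conjugate of its channel gain and normalize."""
    numerator = sum(h.conjugate() * r for h, r in zip(channels, received))
    denominator = sum(abs(h) ** 2 for h in channels)
    return numerator / denominator
```

Selection combining needs only one active receive chain at a time, whereas maximal ratio combining uses all branches, which is the trade-off a hybrid of the two would balance.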
Minimum Bayes risk estimation and decoding strategies based on lattice segmentation techniques can be used to refine large vocabulary continuous speech recognition systems through the estimation of the parameters of the underlying hidden Markov models and through the identification of smaller recognition tasks which provides the opportunity to incorporate novel modeling and decoding procedures in LVCSR. These techniques are discussed in the context of going 'beyond HMMs', showing in particular that this process of subproblem identification makes it possible to train and apply small-domain binary pattern classifiers, such as Support Vector Machines, to large vocabulary continuous speech recognition.
Junni ZOU Hongkai XIONG Rujian LIN
To simultaneously support guaranteed real-time services and best-effort service, a Priority-based Scheduling Architecture (PSA) designed for high-speed switches is proposed. PSA divides packet scheduling into a high-priority phase and a low-priority phase. In the high-priority phase, an improved sorted-priority algorithm is presented. It introduces a new constraint into the scheduling discipline to overcome bandwidth preemption. In addition, a virtual time function with a control factor α is employed. Both computer simulation results and theoretical analysis show that the PSA mechanism performs well in terms of implementation complexity, fairness, and delay properties.
A novel approach is proposed for the segmentation of grayscale images that originally depicted color scenes. Many algorithms have been developed for grayscale image segmentation. All of these approaches operate in a luminance space, because grayscale images have been considered to carry no color information. However, a luminance value does carry color information, in the form of the set of colors that correspond to it. In this paper, luminance values are inversely mapped to the CIELAB color space, and grayscale images are segmented based on a distance in that color space. The proposed scheme is applied to region growing segmentation and its performance is verified.
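The idea that one luminance value corresponds to a set of colors can be illustrated with a small sketch. It assumes the Rec. 601 luma weights (0.299, 0.587, 0.114) and a coarse RGB grid. Neither choice is stated in the abstract, and the subsequent conversion to CIELAB is omitted. The function names and the `step` and `tolerance` parameters are hypothetical.

```python
from typing import List, Tuple


def luma(r: float, g: float, b: float) -> float:
    """Luminance of an RGB color using Rec. 601 weights (an assumption here)."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def colors_for_luminance(target: float, step: int = 51,
                         tolerance: float = 10.0) -> List[Tuple[int, int, int]]:
    """Enumerate grid colors whose luminance lies within `tolerance` of `target`.

    The RGB cube is sampled every `step` levels to keep the candidate set small.
    """
    levels = range(0, 256, step)
    return [(r, g, b)
            for r in levels for g in levels for b in levels
            if abs(luma(r, g, b) - target) <= tolerance]
```

For example, `colors_for_luminance(102)` returns the neutral gray (102, 102, 102) together with several chromatic colors of similar luminance.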
Sakriani SAKTI Satoshi NAKAMURA Konstantin MARKOV
Over the last decade, the Bayesian approach has increased in popularity in many application areas. It uses a probabilistic framework which encodes our beliefs or actions in situations of uncertainty. Information from several models can also be combined based on the Bayesian framework to achieve better inference and to better account for modeling uncertainty. The approach we adopt here utilizes the benefits of the Bayesian framework to improve acoustic model precision in speech recognition systems by modeling a wider-than-triphone context, approximating it with several less context-dependent models. This composition was developed to avoid the crucial problem of limited training data and to reduce model complexity. To improve model reliability in the presence of unseen contexts and limited training data, flooring and smoothing techniques are applied. Experimental results show that the proposed Bayesian pentaphone model improves word accuracy in comparison with the standard triphone model.
Jehyuk RYU Sungho YUN Kyungjin SONG Jundong CHO Jongmoo CHOI Sukhan LEE
This paper introduces a hardware platform for structured light processing based on depth imaging, used to perform 3D modeling of cluttered workspaces for home service robots. We have found that the degradation of precision and robustness comes mainly from the overlapping of multiple codes in the signal received at a camera pixel. Because separating the overlapped codes is critical to precision and robustness, we propose a novel signal separation code for depth imaging, referred to here as "Hierarchically Orthogonal Code (HOC)." The proposed HOC algorithm was implemented on a hardware platform that uses a Xilinx XC2V6000 FPGA to perform real-time 3D modeling and invisible IR (infrared) pattern light to avoid any inconvenience in the home environment. The experimental results show that the proposed HOC algorithm significantly enhances the robustness and precision of depth imaging compared to the best known conventional approaches. Furthermore, the HOC algorithm implemented on our hardware platform required 34 ms to generate one 3D image. This is about 24 times faster than a software implementation of the same HOC algorithm, so real-time processing is realized.
Yukihito OOWAKI Shinichiro SHIRATAKE Toshihide FUJIYOSHI Mototsugu HAMADA Fumitoshi HATORI Masami MURAKATA Masafumi TAKAHASHI
The module-wise dynamic voltage and frequency scaling (MDVFS) scheme is applied to a single-chip H.264/MPEG-4 audio/visual codec LSI. The power consumption of the target module with controlled supply voltage and frequency is reduced by 40% in comparison with operation without voltage or frequency scaling. The chip consumes 63 mW when decoding QVGA H.264 video at 15 fps and MPEG-4 AAC LC audio simultaneously. This LSI keeps operating continuously even during the voltage transition of the target module by introducing the newly developed dynamic de-skewing system (DDS), which watches and controls the clock edge of the target module.
The objective of this paper is to present a decision support system which uses a computer-based procedure to detect tumor blocks or lesions in digitized medical images. The authors developed a simple method with low computational effort to detect tumors on T2-weighted Magnetic Resonance Imaging (MRI) brain images, focusing on the connection between the spatial pixel value and tumor properties from four different perspectives: 1) cases having minuscule differences between two images using a fixed block-based method, 2) tumor shape and size using the edge and binary images, 3) tumor properties based on texture values using spatial pixel intensity distribution controlled by a global discriminate value, and 4) the occurrence of content-specific tumor pixels for threshold images. Measurements were performed on the following medical datasets: 1) images taken at different time intervals, and 2) images of different brain diseases on single and multiple slice images. Experimental results revealed that the proposed technique incurred a smaller overall error than other proposed methods. In particular, the proposed method reduced false alarm and missed alarm errors, which demonstrates its effectiveness. In this paper, we also present a prototype system, known as PCB, to evaluate the performance of the proposed methods by actual experiments, comparing the detection accuracy and system performance.
Huiqing ZHAI Qiang CHEN Qiaowei YUAN Kunio SAWAYA Changhong LIANG
This paper presents a method for the fast and accurate analysis of large-scale periodic array antennas by the conjugate-gradient fast Fourier transform (CG-FFT) combined with an equivalent sub-array preconditioner. The method of moments (MoM) is used to discretize the electric field integral equation (EFIE) and form the impedance matrix equation. By properly dividing a large array into equivalent sub-blocks level by level, the impedance matrix takes the structure of three-level block Toeplitz matrices. The three-level block Toeplitz matrices are further transformed into circulant matrices, whose multiplication with a vector can be rapidly implemented by the one-dimensional (1-D) fast Fourier transform (FFT). Thus, CG-FFT is successfully applied to the analysis of a large-scale periodic dipole array by speeding up the matrix-vector multiplication in the iterative solver. Furthermore, an equivalent sub-array preconditioner is proposed and combined with the CG-FFT analysis to reduce the number of iterative steps and the total CPU time of the iteration. Some numerical results are given to illustrate the high efficiency and accuracy of the present method.
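The Toeplitz-to-circulant step can be sketched for the simplest case. The code below multiplies an ordinary (one-level) Toeplitz matrix by a vector by embedding it in a circulant matrix and applying a 1-D FFT. The three-level block structure and the conjugate-gradient solver of the paper are not reproduced. The radix-2 `fft` helper and the name `toeplitz_matvec` are illustrative, written in pure Python to stay self-contained.

```python
import cmath
from typing import List, Sequence


def fft(x: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Recursive radix-2 FFT; the inverse transform is left unnormalized."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def toeplitz_matvec(first_col: Sequence[complex],
                    first_row: Sequence[complex],
                    x: Sequence[complex]) -> List[complex]:
    """Compute T @ x for the Toeplitz matrix T given by its first column and row."""
    n = len(x)
    if n == 0 or len(first_col) != n or len(first_row) != n:
        raise ValueError("first_col, first_row and x must share a non-zero length")
    m = 1
    while m < 2 * n:  # power-of-two size of the embedding circulant matrix
        m *= 2
    # First column of the circulant: Toeplitz column, zero padding, reversed row tail.
    c = list(first_col) + [0.0] * (m - 2 * n + 1) + list(first_row[:0:-1])
    x_padded = list(x) + [0.0] * (m - n)
    spectrum = [a * b for a, b in zip(fft(c), fft(x_padded))]
    y = fft(spectrum, inverse=True)
    return [v / m for v in y[:n]]  # complex values; first n entries equal T @ x
```

Each product costs O(m log m) operations instead of O(n^2), which is what makes the matrix-vector product inside an iterative solver cheap.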
In this paper, we introduce a new method for depth perception from a 2D natural scene using the scale variation of patterns. As a surface in a 2D scene gets farther away from the viewer, its texture appears finer and smoother. Texture gradient is one of the monocular depth cues and can be represented by gradual scale variations of textured patterns. To extract feature vectors from textured patterns, higher order local autocorrelation functions are utilized at each scale step. Hierarchical linear discriminant analysis is employed to classify the scale rate of the feature vector, which can be divided into subspaces by recursively grouping the overlapped classes. In the experiment, relative depth perception of 2D natural scenes is performed with the proposed method, which is expected to play an important role in natural scene analysis.
Shoei SATO Kazuo ONOE Akio KOBAYASHI Toru IMAI
This paper proposes a new method for compensating acoustic scores in the Viterbi search for robust speech recognition. The method introduces noise models to represent a wide variety of noises and realizes robust decoding together with conventional subtraction and adaptation techniques. It uses the likelihoods of the noise models in two ways. One is to calculate a confidence factor for each input frame by comparing the likelihoods of speech models and noise models. The weight of the acoustic score for a noisy frame is then reduced according to the value of the confidence factor. The other is to use the likelihood of a noise model as an alternative to that of a silence model when the input is noisy. Since a lower confidence factor compresses acoustic scores, the decoder relies more on language scores and keeps more hypotheses within a fixed search depth for a noisy frame. An experiment using commentary transcriptions of a broadcast sports program (MLB: Major League Baseball) showed that the proposed method obtained a 6.7% relative word error reduction. The method also reduced the relative error rate of key words by 17.9%, which is expected to lead to an improvement in metadata extraction accuracy.
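The effect of the confidence factor on the search can be illustrated with a minimal sketch. The sigmoid mapping from the log-likelihood difference to a factor in (0, 1), the `slope` parameter, and the function names are assumptions made for illustration; the abstract does not give the exact formula.

```python
import math


def confidence_factor(speech_loglik: float, noise_loglik: float,
                      slope: float = 1.0) -> float:
    """Map the speech-vs-noise log-likelihood difference of a frame to (0, 1).

    The sigmoid form is an illustrative assumption. The argument is clamped
    so that math.exp cannot overflow.
    """
    z = slope * (speech_loglik - noise_loglik)
    z = max(-50.0, min(50.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def frame_score(acoustic_logprob: float, language_logprob: float,
                confidence: float, lm_weight: float = 1.0) -> float:
    """Scale the acoustic log score by the confidence factor, then add the language score."""
    return confidence * acoustic_logprob + lm_weight * language_logprob
```

With a confidence of 0.25, two hypotheses whose acoustic log scores differ by 20 end up only 5 apart, so more of them survive a fixed pruning beam and the language score has more influence on the ranking.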
Yoshihiro YAMAGAMI Yoshifumi NISHIO Akio USHIDA
We consider oscillators consisting of a reactance circuit and a negative resistor. They may exhibit multi-mode oscillations around the anti-resonant frequencies of the reactance circuit. This kind of oscillator can be easily synthesized by setting the resonant and anti-resonant frequencies of the reactance circuit. However, it is not easy to analyze the oscillation phenomena, because the circuits have multiple oscillations that depend on the initial guesses. In this paper, we propose a Spice-oriented solution algorithm combining the harmonic balance method with the Newton homotopy method, which can find the multiple solutions on the homotopy paths. In our analysis, the determining equations from the harmonic balance method are given by modified equivalent circuit models of "DC," "Cosine" and "Sine" circuits. The modified circuits can be solved by a simulator, STC (solution curve tracing circuit), where the multiple oscillations are found by the transient analysis of Spice. Thus, we need not derive the troublesome circuit equations or the mathematical transformations to obtain the determining equations. This makes the solution algorithms much simpler.
Narumi UMEDA Lan CHEN Hidetoshi KAYAMA
Supporting diversified rates for real-time communications will become possible and essential with the rapidly increasing transmission rates provided by the 4th generation (4G) mobile communication systems. In this paper, a novel wireless Quality of Service (QoS) scheme suitable for broadband CDMA packet cellular systems with adaptive modulation coding is proposed and its characteristics are described. The proposed QoS scheme comprises several control factors laid on the MAC and RRC layers, and can be harmonized with IP-QoS. Two important control factors are proposed: radio-condition-aware admission control and resource allocation reflected multistage scheduling. Computer simulations and testbed experiments indicate that by using the radio-condition-aware admission control, stable and guaranteed service can be provided to real-time users regardless of the interference and the variation in the location of the mobile station. Moreover, resource allocation reflected multistage scheduling maintains guaranteed rates for real-time users and provides high resource utilization efficiency for best-effort users. Consequently, by using the proposed wireless QoS scheme, it is possible to provide users with high quality and diversified real-time services, on a packet based radio network for enhanced 3G and beyond.
Landscapes have been the main theme in Chinese painting for over one thousand years. Chinese ink painting is a form of non-photorealistic rendering. Terrain is the major subject in Chinese landscape painting, and surface wrinkles are important in conveying the orientation of mountains and contributing to the atmosphere. Over the centuries, masters of Chinese landscape painting have developed various kinds of wrinkles. This work develops a set of novel methods for rendering wrinkles in Chinese landscape painting. A three-dimensional terrain is drawn as an outline and wrinkles, using information on the shape, shade and orientation of the terrain's polygonal surface. The major contribution of this work lies in the modeling and implementation of six major types of wrinkles on the surface of terrain, using traditional Chinese brush techniques. Users can select a style of wrinkle and input parameters to control the desired effect. The proposed method then completes the painting process automatically.
This paper presents a personal identification method based on BMME and LDA for images acquired at the anterior and posterior occlusion expressions of teeth. The method consists of teeth region extraction, BMME, and pattern recognition for the images acquired at the anterior and posterior occlusion states of teeth. The two occlusions provide a consistent teeth appearance in images, and BMME reduces matching error in pattern recognition. Using teeth images can be beneficial in recognition because teeth, being rigid objects, cannot be deformed at the moment of image acquisition. In the experiments, the algorithm was successful in teeth recognition for personal identification for 20 people, which suggests that the method can contribute to multi-modal authentication systems.
When the frame size is downscaled for video transcoding, a new motion vector (MV) must be computed. This paper presents an algorithm that uses an activity measurement, based on the DC value and the number of non-zero quantized DCT coefficients in the residual macroblock, to compose the motion vector. It reduces the complexity of motion estimation and improves the performance of the spatial domain video transcoder.
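One common way to compose a motion vector for a downscaled macroblock is an activity-weighted average of the original vectors, sketched below. This assumes a downscaling by two in each dimension (hence `scale=0.5`) and takes one precomputed activity value per original macroblock. How the DC value and the non-zero coefficient count are combined into that activity is not specified in the abstract, so it is left to the caller. The function name is illustrative.

```python
from typing import Sequence, Tuple


def compose_motion_vector(mvs: Sequence[Tuple[float, float]],
                          activities: Sequence[float],
                          scale: float = 0.5) -> Tuple[float, float]:
    """Activity-weighted average of the original MVs, scaled to the new frame size.

    Falls back to a plain average when all activities are zero.
    """
    if len(mvs) != len(activities) or len(mvs) == 0:
        raise ValueError("mvs and activities must be non-empty and equal in length")
    if any(a < 0 for a in activities):
        raise ValueError("activities must be non-negative")
    total = sum(activities)
    weights = list(activities) if total > 0 else [1.0] * len(mvs)
    total = sum(weights)
    x = sum(w * mv[0] for w, mv in zip(weights, mvs)) / total
    y = sum(w * mv[1] for w, mv in zip(weights, mvs)) / total
    return (scale * x, scale * y)
```

Because the new vector is derived from vectors already present in the input stream, no fresh motion search is needed for the downscaled frame.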
Noriko Y. YAMASAKI Yoh TAKEI Kensuke MASUI Kazuhisa MITSUDA Toshimitsu MOROOKA Satoshi NAKAYAMA
In frequency-domain multiplexing (FDM) for TES signals, a magnetic field summation method utilizing a multi-input SQUID has the fundamental merit of small degradation of the signal-to-noise ratio. We formulated the shifts of the operating point due to a common impedance and crosstalk currents. These effects are evaluated for several FDM methods, and the requirements for the bandwidth and filters are summarized. The design parameters of multi-input SQUIDs and flux-locked loop driving circuits are also presented.