Appropriate language modeling is one of the major issues for automatic transcription of spontaneous speech. We propose an adaptation method for statistical language models based on both topic and speaker characteristics. This approach is applied to automatic transcription of meetings and panel discussions, in which multiple participants speak on a given topic in their own speaking style. The baseline language model is a mixture of two models, which are trained with different corpora covering various topics and speakers, respectively. Then, probabilistic latent semantic analysis (PLSA) is performed on the same respective corpora and the initial ASR result to provide two sets of unigram probabilities conditioned on the input speech, one for topics and one for speaker characteristics. Finally, the baseline model is adapted by scaling N-gram probabilities with these unigram probabilities. For speaker adaptation, we make use of a portion of the Corpus of Spontaneous Japanese (CSJ) in which a large number of speakers gave talks on given topics. Experimental evaluation with real discussions showed that both topic and speaker adaptation reduced test-set perplexity, and in total, an average reduction rate of 8.5% was obtained. Furthermore, the proposed adaptation method also improved word accuracy.
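The scaling step described above can be illustrated with a minimal sketch. It assumes the common unigram-rescaling form, in which each N-gram probability is multiplied by the ratio of the adapted unigram to a background unigram and then renormalized; the abstract does not give the exact formula, so the function name `unigram_rescale`, the exponent `beta`, and all numbers below are hypothetical.

```python
def unigram_rescale(ngram_probs, adapted_unigram, background_unigram, beta=1.0):
    """Rescale one N-gram distribution P(w | history) and renormalize.

    Assumed form (not taken from the paper):
        P'(w | h) is proportional to P(w | h) * (P_adapted(w) / P_background(w)) ** beta
    """
    scaled = {
        w: p * (adapted_unigram[w] / background_unigram[w]) ** beta
        for w, p in ngram_probs.items()
    }
    total = sum(scaled.values())
    return {w: s / total for w, s in scaled.items()}


# Toy distributions over a three-word vocabulary (invented for illustration).
base = {"speech": 0.5, "model": 0.3, "budget": 0.2}        # P(w | history)
background = {"speech": 0.4, "model": 0.4, "budget": 0.2}  # corpus-wide unigram
adapted = {"speech": 0.6, "model": 0.3, "budget": 0.1}     # e.g. a PLSA unigram

result = unigram_rescale(base, adapted, background)
```

Words that the adapted unigram favors relative to the background gain probability mass, and the others lose it. Applying the function twice, once with a topic unigram and once with a speaker unigram, would be one way to combine the two adaptations; that ordering is also an assumption.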
This paper proposes a novel multi-layer approach to fundamental frequency modeling for concatenative speech synthesis based on a statistical learning technique called additive models. We define an additive F0 contour model consisting of a long-term, intonational phrase-level component and a short-term, accentual phrase-level component, along with a least-squares error criterion that includes a regularization term. A backfitting algorithm derived from this error criterion estimates both components simultaneously by iteratively applying cubic spline smoothers. When this method is applied to a 7,000-utterance Japanese speech corpus, it achieves F0 RMS errors of 28.9 and 29.8 Hz on the training and test data, respectively, with corresponding correlation coefficients of 0.806 and 0.777. The automatically determined intonational and accentual phrase components turn out to behave smoothly, systematically, and intuitively under a variety of prosodic conditions.
Masahiko MATSUSHITA Hiromitsu NISHIZAKI Takehito UTSURO Seiichi NAKAGAWA
This paper presents speech-driven Web retrieval models which accept spoken search topics (queries) in the NTCIR-3 Web retrieval task. The major focus of this paper is on improving the speech recognition accuracy of spoken queries and thereby improving retrieval accuracy in speech-driven Web retrieval. We experimentally evaluated techniques for combining the outputs of multiple LVCSR models in the recognition of spoken queries. As model combination techniques, we compared the SVM learning technique with conventional voting schemes such as ROVER. In addition, to investigate the effect of the language model's vocabulary size on retrieval performance, we prepared two language models: one with a vocabulary size of 20,000 and the other with a vocabulary size of 60,000. We then evaluated the differences in the recognition rates of the spoken queries and in the retrieval performance. We showed that multiple LVCSR model combination improved both speech recognition and retrieval accuracy in speech-driven text retrieval. Comparing the retrieval accuracies obtained when an LM with a vocabulary size of 20,000 or 60,000 is used in an LVCSR system, we found that the larger vocabulary gave the better retrieval accuracy.
Kazuya TAKEDA Hiroshi FUJIMURA Katsunobu ITOU Nobuo KAWAGUCHI Shigeki MATSUBARA Fumitada ITAKURA
In this paper, we discuss the construction of a large in-car spoken dialogue corpus and the result of its analysis. We have developed a system specially built into a Data Collection Vehicle (DCV) which supports the synchronous recording of multichannel audio data from 16 microphones that can be placed in flexible positions, multichannel video data from 3 cameras, and vehicle related data. Multimedia data has been collected for three sessions of spoken dialogue with different modes of navigation, during approximately a 60 minute drive by each of 800 subjects. We have characterized the collected dialogues across the three sessions. Some characteristics such as sentence complexity and SNR are found to differ significantly among the sessions. Linear regression analysis results also clarify the relative importance of various corpus characteristics.
Shou-Kuo SHAO Meng-Guang TSAI Hen-Wai TSAO Paruvelli SREEDEVI Malla REDDY PERATI Jingshown WU
In this paper, we investigate packet loss and system dimensioning of feedback (FB) type wavelength division multiplexing (WDM) optical routers under asynchronous and variable packet length self-similar traffic. We first study the packet loss performance for two different types of WDM optical routers under asynchronous and variable packet length self-similar traffic. Based on simulation results, we demonstrate that a 16×16 FB type WDM optical router employing more than 4 re-circulated ports without using a void filling (VF) algorithm has better performance. We then present the system dimensioning issues of FB type WDM optical routers by showing their performance as a function of the number of re-circulated ports, buffer depth, re-circulation limit, basic delay unit in the fiber delay line optical buffers, and traffic characteristics. The sensitivity of packet loss to the mutual effects of these parameters is investigated in detail. Based on our results, we conclude that FB type WDM optical routers must be dimensioned with the appropriate number of re-circulated ports, re-circulation limits, buffer depth, and optimal basic delay unit in the fiber delay line optical buffers under relevant traffic characteristics to achieve high switching performance.
This paper overviews recent progress in the development of corpus-based spontaneous speech recognition technology. Although speech is in almost any situation spontaneous, recognition of spontaneous speech is an area which has only recently emerged in the field of automatic speech recognition. Broadening the application of speech recognition depends crucially on raising recognition performance for spontaneous speech. For this purpose, it is necessary to build large spontaneous speech corpora for constructing acoustic and language models. This paper focuses on various achievements of a Japanese 5-year national project "Spontaneous Speech: Corpus and Processing Technology" that has recently been completed. Because of various spontaneous-speech specific phenomena, such as filled pauses, repairs, hesitations, repetitions and disfluencies, recognition of spontaneous speech requires various new techniques. These new techniques include flexible acoustic modeling, sentence boundary detection, pronunciation modeling, acoustic as well as language model adaptation, and automatic summarization. Particularly automatic summarization including indexing, a process which extracts important and reliable parts of the automatic transcription, is expected to play an important role in building various speech archives, speech-based information retrieval systems, and human-computer dialogue systems.
Seiichi NAKAGAWA Tomohiro WATANABE Hiromitsu NISHIZAKI Takehito UTSURO
This paper describes an accurate unsupervised speaker adaptation method for lecture-style spontaneous speech recognition using multiple LVCSR systems. In an unsupervised speaker adaptation framework, the improvement in recognition performance obtained by adapting acoustic models depends heavily on the accuracy of labels such as phonemes and syllables. Therefore, extracting the adaptation data under the guidance of a confidence measure is effective for unsupervised adaptation. In this paper, we identified high-confidence portions based on the agreement between two LVCSR systems, adapted the acoustic models using these portions with their highly accurate labels, and thereby improved the recognition accuracy. We applied our method to the Corpus of Spontaneous Japanese (CSJ), and the method improved the recognition rate by about 2.1% in comparison with a traditional method.
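The idea of selecting adaptation data where two recognizers agree can be sketched as follows. This is only an illustration under stated assumptions: it aligns two word-level hypotheses with Python's `difflib.SequenceMatcher` and keeps runs of at least `min_len` matching words. The function name `agreed_segments`, the `min_len` threshold, and the example hypotheses are hypothetical, and the paper's actual alignment unit and confidence criterion are not specified in the abstract.

```python
from difflib import SequenceMatcher


def agreed_segments(hyp_a, hyp_b, min_len=2):
    """Return runs of consecutive words on which both hypotheses agree.

    hyp_a, hyp_b: lists of words output by two recognizers for one utterance.
    min_len: shortest run of agreeing words kept as adaptation data (assumed).
    """
    matcher = SequenceMatcher(a=hyp_a, b=hyp_b, autojunk=False)
    segments = []
    for block in matcher.get_matching_blocks():
        # The final block is a zero-length sentinel and is filtered out here.
        if block.size >= min_len:
            segments.append(hyp_a[block.a:block.a + block.size])
    return segments


# Invented example hypotheses from two systems.
system_a = ["we", "propose", "a", "new", "method", "for", "speech"]
system_b = ["we", "propose", "the", "new", "method", "of", "speech"]

segments = agreed_segments(system_a, system_b)
# segments == [["we", "propose"], ["new", "method"]]
```

Raising `min_len` trades adaptation data quantity for label reliability, since isolated one-word agreements are more likely to be coincidental.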
Junichi YAMAGISHI Koji ONISHI Takashi MASUKO Takao KOBAYASHI
This paper describes the modeling of various emotional expressions and speaking styles in synthetic speech using HMM-based speech synthesis. We show two methods for modeling speaking styles and emotional expressions. In the first method called style-dependent modeling, each speaking style and emotional expression is modeled individually. In the second one called style-mixed modeling, each speaking style and emotional expression is treated as one of contexts as well as phonetic, prosodic, and linguistic features, and all speaking styles and emotional expressions are modeled simultaneously by using a single acoustic model. We chose four styles of read speech -- neutral, rough, joyful, and sad -- and compared the above two modeling methods using these styles. The results of subjective evaluation tests show that both modeling methods have almost the same accuracy, and that it is possible to synthesize speech with the speaking style and emotional expression similar to those of the target speech. In a test of classification of styles in synthesized speech, more than 80% of speech samples generated using both the models were judged to be similar to the target styles. We also show that the style-mixed modeling method gives fewer output and duration distributions than the style-dependent modeling method.
Tomohiro OHNO Shigeki MATSUBARA Nobuo KAWAGUCHI Yasuyoshi INAGAKI
Spontaneously spoken Japanese includes a lot of grammatically ill-formed linguistic phenomena such as fillers, hesitations, inversions, and so on, which do not appear in written language. This paper proposes a novel method of robust dependency parsing using a large-scale spoken language corpus, and evaluates the availability and robustness of the method using spontaneously spoken dialogue sentences. By utilizing stochastic information about the appearance of ill-formed phenomena, the method can robustly parse spoken Japanese including fillers, inversions, or dependencies over utterance units. Experimental results reveal that the parsing accuracy reached 87.0%, and we confirmed that it is effective to utilize the location information of a bunsetsu, and the distance information between bunsetsus as stochastic information.
Kazuki ADACHI Tomoki TODA Hiromichi KAWANAMI Hiroshi SARUWATARI Kiyohiro SHIKANO
This research aims to construct a high-quality Japanese TTS (Text-to-Speech) system that has high flexibility in treating prosody. Many TTS systems have implemented a prosody control system but such systems have been fundamentally designed to output speech with a standard pitch and speech rate. In this study, we employ a unit selection-concatenation method and also introduce an analysis-synthesis process to provide precisely controlled prosody in output speech. Speech quality degrades in proportion to the amount of prosody modification, therefore a target cost for prosody is set to evaluate prosodic difference between target prosody and speech candidates in such a unit selection system. However, the conventional cost ignores the original prosody of speech segments, although it is assumed that the quality deterioration tendency varies in relation to the pitch or speech rate of original speech. In this paper, we propose a novel cost function design based on the prosody of speech segments. First, we recorded nine databases of Japanese speech with different prosodic characteristics. Then with respect to the speech databases, we investigated the relationships between the amount of prosody modification and the perceptual degradation. The results indicate that the tendency of perceptual degradation differs according to the prosodic features of the original speech. On the basis of these results, we propose a new cost function design, which changes a cost function according to the prosody of a speech database. Results of preference testing of synthetic speech show that the proposed cost functions generate speech of higher quality than the conventional method.
In this paper, an efficient architecture for an adaptive Reed-Solomon decoder is presented, where the block length n and the message length k can be varied from their minimum allowable values up to their selected values. This eliminates the need to insert zeros before decoding shortened RS codes. In addition, the error-correcting capability t can be changed adaptively to the channel state at every codeword block. The decoder allows efficient decoding in both burst mode and continuous mode, and it permits 3-step pipelined processing based on the modified Euclid's algorithm. Each step in decoding is designed to be clocked by a separate clock. Thus, each step can be efficiently pipelined without multiplexing. This also makes it possible to avoid an additional buffer even when the decoder input and output clocks are different. The adaptive RS decoder over GF(2^8) with an error-correcting capability of up to 10 has been designed in VHDL and successfully synthesized in an FPGA chip. It can be used in a wide range of applications because of its versatility.
Koichi ISHIHARA Kazuaki TAKEDA Fumiyuki ADACHI
In this paper, we propose pilot-assisted decision feedback channel estimation (PA-DFCE) for space-time coded transmit diversity (STTD) in orthogonal frequency division multiplexing (OFDM). Two transmit channels are simultaneously estimated by transmitting the STTD encoded pilot. To improve the tracking ability of the channel estimation against fast fading, decision feedback is used in addition to the pilot. To reduce noise and prevent error propagation, windowing of the estimated channel impulse response in the time-delay domain is applied. The average bit error rate (BER) performance of OFDM with STTD is evaluated by computer simulation. It is found that with PA-DFCE the degradation in the required Eb/N0 from ideal CE is as small as 0.6 dB for an average BER = 10^-3, and that PA-DFCE requires about 2.4 dB less Eb/N0 than differential STTD, which requires no CE.
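How two channels can be estimated at once from an STTD encoded pilot can be sketched for a single subcarrier. The sketch assumes the usual Alamouti-style encoding (symbols s1, s2 in the first interval, then -conj(s2), conj(s1) in the second), a channel that stays constant over the two intervals, and no noise; these are illustrative assumptions rather than details from the abstract, and the decision feedback and time-domain windowing steps are not shown. The names `sttd_encode` and `estimate_channels` are hypothetical.

```python
def sttd_encode(s1, s2):
    """Return the (antenna 1, antenna 2) symbols for two consecutive intervals."""
    return (s1, s2), (-s2.conjugate(), s1.conjugate())


def estimate_channels(r1, r2, p1, p2):
    """Recover both channel gains from the two received pilot samples.

    Assumed received model (noise-free, channel constant over both intervals):
        r1 =  h1 * p1       + h2 * p2
        r2 = -h1 * conj(p2) + h2 * conj(p1)
    """
    energy = abs(p1) ** 2 + abs(p2) ** 2
    h1 = (p1.conjugate() * r1 - p2 * r2) / energy
    h2 = (p2.conjugate() * r1 + p1 * r2) / energy
    return h1, h2


# Invented channel gains and pilot symbols for one subcarrier.
true_h1, true_h2 = 0.8 + 0.3j, -0.2 + 0.5j
pilot1, pilot2 = 1 + 1j, 1 - 1j

(t1_ant1, t1_ant2), (t2_ant1, t2_ant2) = sttd_encode(pilot1, pilot2)
rx1 = true_h1 * t1_ant1 + true_h2 * t1_ant2
rx2 = true_h1 * t2_ant1 + true_h2 * t2_ant2

est_h1, est_h2 = estimate_channels(rx1, rx2, pilot1, pilot2)
```

Because the encoded pilot pair is orthogonal across the two intervals, the cross terms cancel and each gain is recovered without interference from the other antenna. With noise present, the estimates would be perturbed, which is the motivation for the noise reduction step mentioned above.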
Process-centered software engineering environments (PSEEs) facilitate controlling software processes. Many issues related to PSEEs, such as process evolution support, have been addressed. We identify an unsolved issue: preventing information leakage while the process is being enacted. This paper proposes a model called PsACL to prevent such leakage. PsACL offers the following features: (a) controlling both read and write access of software products, (b) preventing indirect information leakage, (c) managing role associations, (d) managing role hierarchies, (e) enforcing static and simple dynamic separation-of-duty constraints, (f) allowing declassification of products, and (g) allowing access control information exchange among software processes.
A combining method for receiver diversity, followed by a Bayesian decision feedback equalizer, is proposed. This eigenvector based combining maximizes the desired part energy of combined channel, on which the equalizer performance mainly depends. The validity of the proposed method is demonstrated by simulations.
Takafumi FUJIMOTO Kazumasa TANAKA Mitsuo TAGUCHI
The electric currents on the upper, lower and side surfaces of the patch conductor in a circular microstrip antenna are calculated by using the integral equation method and the characteristic between the electric currents on the upper and lower surfaces is compared. The integral equation is derived from the boundary condition that the tangential component of the total electric field due to the electric currents on the upper, lower and side surfaces of the patch conductor vanishes on the upper, lower and side surfaces of the patch conductor. The electric fields are derived by using Green's functions in a layered medium due to a horizontal and a vertical electric dipole on those surfaces. The result of numerical calculation shows that the electric current on the lower surface is much bigger than that on the upper surface and the input impedance of microstrip antenna depends on the electric current on the lower surface.
Qi ZHU Noriyuki OHTSUKI Yoshikazu MIYANAGA Norinobu YOSHIDA
This paper proposes a new robust adaptive processing algorithm that is based on the extended least squares (ELS) method with running spectrum filtering (RSF). By utilizing the different characteristics of running spectra between speech signals and noise signals, RSF can retain speech characteristics while noise is effectively reduced. Then, by using ELS, autoregressive moving average (ARMA) parameters can be estimated accurately. In experiments on real speech contaminated by white Gaussian noise and factory noise, we found that the method we propose offered spectrum estimates that were robust against additive noise.
Seung-Kyun RYU Hong-Goo KANG Sung-Kyo JUNG Dae-Hee YOUN
This paper proposes an algorithm to improve the performance of noise power spectrum estimation using minimum statistics (MS). The minimum statistics noise estimator (MSNE), which is among the most efficient estimators for speech enhancement, often underestimates noise power when the signal characteristics change abruptly. The proposed algorithm improves the accuracy of noise estimation by removing harmonic components of the speech signal. Simulation results verify that the proposed algorithm performs better than the conventional algorithm in terms of the segmental SNR (SegSNR) and the spectral distance (SD).
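The core idea of minimum statistics can be sketched for a single frequency bin: smooth the short-time power recursively, then take the minimum over a sliding window of recent frames as the noise estimate. This is a simplified illustration of the conventional estimator only. It omits the bias compensation and adaptive smoothing used in practice, it does not include the harmonic removal proposed in the paper, and the name `minimum_statistics` and the values of `alpha` and `window` are hypothetical.

```python
from collections import deque


def minimum_statistics(power_frames, alpha=0.5, window=5):
    """Track the noise power of one frequency bin.

    power_frames: short-time power values, one per frame.
    alpha: recursive smoothing factor (assumed value).
    window: number of recent frames searched for the minimum (assumed value).
    """
    smoothed = None
    recent = deque(maxlen=window)  # holds the last `window` smoothed values
    estimates = []
    for power in power_frames:
        if smoothed is None:
            smoothed = power
        else:
            smoothed = alpha * smoothed + (1 - alpha) * power
        recent.append(smoothed)
        estimates.append(min(recent))
    return estimates


# Invented input: a noise floor of 1.0 with a three-frame speech burst.
frames = [1.0] * 5 + [10.0] * 3 + [1.0] * 5
estimates = minimum_statistics(frames)
```

In this toy input the burst is shorter than the window, so the estimate stays at the noise floor while speech is present. The estimate only follows a rise in noise power after a delay of about one window length, which is one way the underestimation noted above can arise.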
M.M. Hafizur RAHMAN Yasushi INOGUCHI Susumu HORIGUCHI
Three-dimensional (3D) wafer stacked implementation (WSI) has been proposed as a promising technology for massively parallel computers. A hierarchical 3D-torus (H3DT) network, which is a 3D-torus network of multiple basic modules in which the basic modules are 3D-mesh networks, has been proposed for efficient 3D-WSI. However, the restricted use of physical links between basic modules in the higher level networks reduces the dynamic communication performance of this network. A torus network has better dynamic communication performance than a mesh network. Therefore, we have modified the H3DT network by replacing the 3D-mesh modules by 3D-tori, calling it a Modified H3DT (MH3DT) network. This paper addresses the architectural details of the MH3DT network and explores aspects such as degree, diameter, cost, average distance, arc connectivity, bisection width, and wiring complexity. We also present a deadlock-free routing algorithm for the MH3DT network using two virtual channels and evaluate the network's dynamic communication performance under the uniform traffic pattern, using the proposed routing algorithm. It is shown that the MH3DT network possesses several attractive features including small diameter, small cost, small average distance, better bisection width, and better dynamic communication performance.
In this letter, we provide a solution to the stabilization problem of a class of Lipschitz nonlinear systems by output feedback. Via the newly proposed nonlinearity characterization function (NCF) concept, we propose an effective method in designing an output feedback controller. Under the suggested sufficient condition which is derived by using the NCF, the proposed control scheme achieves the global exponential stabilization.
The complete subtree (CS) method is widely accepted for broadcast encryption. A new method for assigning keys in the CS method is proposed in this paper. The essential idea behind the proposed method is to use two trapdoor permutations. Using the trapdoor information, the key management center computes and assigns a key to each terminal so that the terminal can derive all information necessary in the CS method. A terminal has to keep just one key, whereas log_2 N + 1 keys are needed in the original CS method, where N is the number of terminals. The permutations to be used need to satisfy a certain property which is similar to but slightly different from the claw-free property. The needed property, named the strongly semi-claw-free property, is formalized in terms of probabilistic polynomial-time algorithms, and its relation to the claw-free property is discussed. It is also shown that if the permutations used fulfill the strongly semi-claw-free property, then the proposed method is secure against attacks by malicious users.
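The key count for the original CS method can be checked with a short sketch. It assumes a complete binary tree whose N leaves are the terminals (N a power of two) and whose nodes are numbered heap-style, with the root as node 1 and terminal i at leaf N + i. Each terminal stores the keys of all nodes on its leaf-to-root path. The numbering scheme and the function name `path_nodes` are illustrative choices, and the sketch covers the original key assignment only, not the trapdoor-permutation method proposed in the paper.

```python
def path_nodes(terminal, num_terminals):
    """Return the tree nodes whose keys a terminal stores in the original CS method.

    terminal: index in range(num_terminals).
    num_terminals: N, assumed to be a power of two.
    """
    node = num_terminals + terminal  # leaf position in heap numbering
    path = []
    while node >= 1:
        path.append(node)
        node //= 2  # move to the parent node
    return path


# For N = 8 terminals, terminal 5 sits at leaf 13.
nodes = path_nodes(5, 8)
# nodes == [13, 6, 3, 1], i.e. log2(8) + 1 = 4 keys
```

The path always contains one node per tree level, which gives the log2 N + 1 keys per terminal that the proposed method reduces to a single key.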