The search functionality is under construction.

Keyword Search Result

[Keyword] SPE(2504hit)

1581-1600hit(2504hit)

  • Compensation of Speech Coding Distortion for Wireless Speech Recognition

    Hong Kook KIM  

     
    LETTER-Speech and Hearing

      Vol:
    E87-D No:6
      Page(s):
    1596-1600

    In this paper, we perform some experiments to show that the quantization noise caused by low-bit-rate speech coding can be characterized as a white noise process. Then, the signal-to-quantization noise ratio of the decoded speech for a given bit-rate is estimated by observing the perceptual speech quality equivalent to the artificially generated noisy speech obtained by adding a white Gaussian noise source. This information is incorporated into the parameter tuning of a noise-robust compensation algorithm for speech recognition so that the compensation algorithm performs better over the range of estimated SNRs. Finally, we apply the compensation algorithm to a connected digit string recognition system that utilizes speech signals decoded by the GSM adaptive multi-rate (AMR) speech coder. It is shown that the noise-robust compensation algorithm reduces word error rates by 15% or more at low bit-rate modes of the AMR speech coder.
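    As an illustration of the "artificially generated noisy speech" step, the following is a minimal sketch of adding white Gaussian noise to a signal at a chosen SNR. The function names, the sine-wave test signal, and the fixed seed are assumptions made for the example; they are not taken from the paper.

```python
import math
import random


def add_white_gaussian_noise(signal, snr_db, seed=0):
    """Return the signal plus white Gaussian noise scaled to the requested SNR (dB)."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Noise power needed so that 10*log10(signal_power / noise_power) == snr_db.
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB, treating (noisy - clean) as the noise component."""
    signal_power = sum(s * s for s in clean)
    noise_power = sum((y - s) ** 2 for s, y in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)


# Hypothetical stand-in for one second of speech: a 200 Hz tone sampled at 8 kHz.
clean = [math.sin(2.0 * math.pi * 200.0 * t / 8000.0) for t in range(8000)]
noisy = add_white_gaussian_noise(clean, snr_db=15.0)
```

    Sweeping `snr_db` and comparing each noisy version against decoded speech is one way such an equivalence experiment could be set up; the perceptual comparison itself is outside this sketch.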

  • A Bipolar ECL Comparator for a 4 GS/s and 6-Bit Flash A-to-D Converter

    Shinya KAWADA  Yasuhiro SUGIMOTO  

     
    LETTER

      Vol:
    E87-C No:6
      Page(s):
    1022-1024

    A high-speed bipolar ECL comparator circuit with a latch is described. The spike noise generated by charging the base-to-emitter diffusion capacitor on the transition of differential transistors' switching in a sample-and-latch circuit is reduced by inserting emitter degeneration resistors so that neither transistor becomes completely cut off. The frequency bandwidth of the pre-amplifier is increased by using coupled inductors as differential loads. As a result, the -3 dB frequency bandwidth of the pre-amplifier becomes 10 GHz, and 4 GS/s operation with 6-bit equivalent precision from a 3.3 V power supply is confirmed by circuit simulation using device parameters from a 25 GHz silicon bipolar process.

  • Methods of Improving the Accuracy and Reproducibility of Objective Quality Assessment of VoIP Speech

    Akira TAKAHASHI  Masataka MASUDA  Atsuko KURASHIMA  

     
    PAPER-Multimedia Systems

      Vol:
    E87-B No:6
      Page(s):
    1660-1669

    VoIP is one of the key technologies for recent telecommunication services. The quality of its services should be discussed in subjective terms. Since subjective quality assessment is time-consuming and expensive, however, objective quality assessment which estimates subjective quality without carrying out subjective quality experiments is desirable. This paper discusses the performance of the objective quality measure that was standardized as ITU-T Recommendation P.862 and clarifies the quality factors that can be evaluated with satisfactory accuracy based on it. We found that P.862 can be applied to the evaluation of coding distortion, tandeming of codecs, transmission bit-errors, packet loss, and silence compression in a codec, at least for clean Japanese speech. In addition, we propose a method of estimating the subjective quality evaluation value from objective measurement results and show the validity of this method. We also evaluate the uniqueness of objective quality assessment based on P.862 from the viewpoints of the effect of measurement noise and the variation of test speech samples, and propose how to improve the reproducibility of objective quality assessment.

  • A High-Speed and Area-Efficient Dual-Rail PLA Using Divided and Interdigitated Column Circuits

    Hiroaki YAMAOKA  Makoto IKEDA  Kunihiro ASADA  

     
    PAPER-Integrated Electronics

      Vol:
    E87-C No:6
      Page(s):
    1069-1077

    This paper presents a new high-speed and area-efficient dual-rail PLA. The proposed circuit includes three schemes: 1) a divided column scheme (DCS), 2) a programmable sense-amplifier activation scheme (PSAS), and 3) an interdigitated column scheme (ICS). In the DCS, a column circuit of a PLA is divided and each circuit operates in parallel. This enhances the performance of the PLA, and this scheme becomes more effective as input data bandwidth increases. The PSAS is used to generate an activation pulse for sense amplifiers in the PLA. In this scheme, the proposed delay generators make it possible to minimize the timing margin, which depends on process variations and operating conditions. The ICS is used to enhance the area-efficiency of the PLA, where a method of physical compaction is employed. This scheme is effective for circuits with regular logic functions, such as arithmetic circuits. As applications of the proposed PLA, a comparator, a priority encoder, and an incrementor for 128-bit data processing were designed. The proposed circuit design schemes achieved a 22.2% delay reduction and a 37.5% area reduction on average over the conventional high-speed and low-power PLA in a 0.13-µm CMOS technology with a supply voltage of 1.2 V.

  • TAJODA: Proposed Tactile and Jog Dial Interface for the Blind

    Chieko ASAKAWA  Hironobu TAKAGI  Shuichi INO  Tohru IFUKUBE  

     
    PAPER

      Vol:
    E87-D No:6
      Page(s):
    1405-1414

    There is a critical difference in obtaining information between sighted people and the blind. Screen reading technology assists blind people in accessing digital documents by themselves, helping to bridge this gap. However, these days documents are becoming much more visual, using various types of visual effects so that sighted people can explore the information intuitively at a glance. It is very hard to convey visual effects non-visually and intuitively while retaining the original effects. In addition, it takes a long time to explore the information, since blind people use the keyboard for exploration, while sighted people use eye movement. This research aims at improving the non-visual exploration interface and improving the quality of non-visual information. Therefore, TAJODA (tactile jog dial interface) was proposed to solve these problems. It presents verbal information (text information) in the form of speech, while nonverbal information (visual effects) is represented in the form of tactile sensations. It uses a jog dial as an exploration device, which makes it possible to explore forward or backward intuitively in the speech information by spinning the jog dial clockwise or counterclockwise. It also integrates a tactile device to represent visual effects non-visually. Both speech and tactile information can be synchronized with the dial movements. The speed of spinning the dial affects the speech rate. The main part of this paper describes an experimental evaluation of the effectiveness of the proposed TAJODA interface. The experimental system used a preprocessed recorded human voice as test data. The training sessions showed that it was easy to learn how to use TAJODA. The comparison test session clearly showed that the subjects could perform the comparison task using TAJODA significantly faster (2.4 times faster) than with the comparison method that is closest to the existing screen reading function. Through this experiment, our results showed that TAJODA can drastically improve the non-visual exploration interface.

  • A Study of Aspect Ratio of the Aperture and the Effect on Antenna Efficiency in Oversized Rectangular Slotted Waveguide Arrays

    Hisahiro KAI  Jiro HIROKAWA  Makoto ANDO  

     
    PAPER-Antenna and Propagation

      Vol:
    E87-B No:6
      Page(s):
    1623-1630

    A post-wall waveguide-fed parallel plate slotted array is an attractive candidate for high-efficiency, mass-producible planar array antennas for millimeter wave applications. For the slot design of this large sized array, a periodic boundary wall model based on the assumption of infinite array size and a parallel waveguide is used. In fact, the aperture is large but still finite (10-40 wavelengths) and the TEM-like wave is perturbed due to the narrow walls at the periphery of the aperture as well as the slot coupling; antenna efficiency is affected by the size and the aspect ratio of the aperture. All these observations imply the unique defects of oversized waveguide arrays. In this paper, the aperture efficiency of post-wall waveguide arrays is assessed as a function of size and aspect ratio of the aperture for the first time, both in theory and measurement. An effective field analysis for an electrically large oversized waveguide array, developed by the author, is utilized for determining the slot excitation coefficients and aperture illumination. It is predicted that the oversized waveguide array has a potential efficiency of 80-90% if the aperture is larger than 18 wavelengths on a side and the gain is more than 30 dBi. A transversely wide aperture generally provides higher efficiency than a longitudinally long aperture, provided that a perfectly uniform TEM wave is launched from the feed waveguide.

  • Emerging Market for Mobile Remote Physiological Monitoring Services

    Timothy BOLT  Sadahiko KANO  Akihisa KODATE  

     
    PAPER

      Vol:
    E87-D No:6
      Page(s):
    1446-1453

    This paper offers an initial analysis of economic and market issues in the development and deployment of mobile remote physiological monitoring services for medical patients through wireless wearable sensors and actuators. Examining the characteristics of the service technologies and related industries, this study focuses on the structure, participants and roles of standardisation of the layers within the emerging mobile remote physiological monitoring industry. The study concludes that the structure of the emerging mobile remote physiological monitoring industry will be oriented about service provision, be integrated with other personal / patient data storage services and be heavily influenced by the interplay of technological developments, the health market structure, existing players and regulation. Additionally, the key players are likely to be the system integrators and service providers concentrating on large institutional customers. A focus of the paper is analysing both the causes and implications of a modular, horizontally layered industry structure likely to result from the mix of technologies, suppliers and customers as this market develops. The paper discusses why, although horizontal specialisation is the most likely outcome, there is little risk of key layers becoming commoditised. The paper also discusses the appropriate types and levels of standardisation and equipment certification activities that should be encouraged, along with the groups and industries from which the pressure for these will come.

  • Synchronized Multicast Media Streaming Employing Server-Client Coordinated Adaptive Playout and Error Control

    Jinyong JO  JongWon KIM  

     
    PAPER-Multimedia Systems

      Vol:
    E87-B No:6
      Page(s):
    1670-1680

    A new inter-client synchronization framework employing a server-client coordinated adaptive playout and error control for one-to-many (i.e., multicast) media streaming is discussed in this paper. The proposed adaptive playout mechanism controls the playout speed of audio and video by adopting the time-scale modification of audio. Based on the overall synchronization status as well as the buffer occupancy level, the playout speed of each client is manipulated within a perceptually tolerable range. By coordinating the playout speed of each client, the inter-client synchronization with respect to the target presentation time is smoothly achieved. Furthermore, RTCP-compatible signaling between the server and group-clients is performed to achieve the inter-client synchronization and error recovery, where the exchange of control messages is restricted. Simulation results show the performance of the proposed multicast media streaming framework.
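    The idea of steering each client's playout speed from its synchronization status and buffer occupancy, while staying inside a tolerable range, can be sketched as below. This is only an illustrative linear controller: the function name, the gains `k_sync` and `k_buf`, the target buffer level, and the 25% speed limit are all assumed values, not parameters reported in the paper.

```python
def playout_speed(sync_offset_s, buffer_level, target_level=0.5,
                  k_sync=0.5, k_buf=0.2, max_deviation=0.25):
    """Return a playout speed factor (1.0 means normal speed).

    sync_offset_s: seconds the client lags the target presentation time
                   (positive means behind, so playout should speed up).
    buffer_level:  buffer occupancy as a fraction between 0.0 and 1.0.
    All gains and limits here are hypothetical illustration values.
    """
    raw = 1.0 + k_sync * sync_offset_s + k_buf * (buffer_level - target_level)
    # Clamp so the speed change stays within the assumed tolerable range.
    low, high = 1.0 - max_deviation, 1.0 + max_deviation
    return max(low, min(high, raw))


on_time = playout_speed(0.0, 0.5)         # in sync, buffer at target
far_behind = playout_speed(2.0, 0.5)      # large lag, clamped at the upper limit
far_ahead = playout_speed(-2.0, 0.5)      # large lead, clamped at the lower limit
```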

  • Exploring Human Speech Production Mechanisms by MRI

    Kiyoshi HONDA  Hironori TAKEMOTO  Tatsuya KITAMURA  Satoru FUJITA  Sayoko TAKANO  

     
    INVITED PAPER

      Vol:
    E87-D No:5
      Page(s):
    1050-1058

    Recent investigations using magnetic resonance imaging (MRI) of human speech organs have opened up new avenues of research. Visualization of the speech production system provides abundant information on the physiological and acoustic realization of human speech. This article summarizes the current status of MRI applications with respect to speech research as well as our own experience of discovery and re-evaluation of acoustic events emanating from the vocal tract and physiological mechanisms.

  • What are the Essential Cues for Understanding Spoken Language?

    Steven GREENBERG  Takayuki ARAI  

     
    INVITED PAPER

      Vol:
    E87-D No:5
      Page(s):
    1059-1070

    Classical models of speech recognition assume that a detailed, short-term analysis of the acoustic signal is essential for accurately decoding the speech signal and that this decoding process is rooted in the phonetic segment. This paper presents an alternative view, one in which the time scales required to accurately describe and model spoken language are both shorter and longer than the phonetic segment, and are inherently wedded to the syllable. The syllable reflects a singular property of the acoustic signal -- the modulation spectrum -- which provides a principled, quantitative framework to describe the process by which the listener proceeds from sound to meaning. The ability to understand spoken language (i.e., intelligibility) vitally depends on the integrity of the modulation spectrum within the core range of the syllable (3-10 Hz) and reflects the variation in syllable emphasis associated with the concept of prosodic prominence ("accent"). A model of spoken language is described in which the prosodic properties of the speech signal are embedded in the temporal dynamics associated with the syllable, a unit serving as the organizational interface among the various tiers of linguistic representation.

  • Speaker Adaptation Method for Acoustic-to-Articulatory Inversion using an HMM-Based Speech Production Model

    Sadao HIROYA  Masaaki HONDA  

     
    PAPER

      Vol:
    E87-D No:5
      Page(s):
    1071-1078

    We present a speaker adaptation method that makes it possible to determine articulatory parameters from an unknown speaker's speech spectrum using an HMM (Hidden Markov Model)-based speech production model. The model consists of HMMs of articulatory parameters for each phoneme and an articulatory-to-acoustic mapping that transforms the articulatory parameters into a speech spectrum for each HMM state. The model is statistically constructed by using actual articulatory-acoustic data. In the adaptation method, geometrical differences in the vocal tract as well as the articulatory behavior in the reference model are statistically adjusted to an unknown speaker. First, the articulatory parameters are estimated from an unknown speaker's speech spectrum using the reference model. Secondly, the articulatory-to-acoustic mapping is adjusted by maximizing the output probability of the acoustic parameters for the estimated articulatory parameters of the unknown speaker. With the adaptation method, the RMS error between the estimated articulatory parameters and the observed ones is 1.65 mm. The improvement rate over the speaker independent model is 56.1 %.

  • Orthogonalized Distinctive Phonetic Feature Extraction for Noise-Robust Automatic Speech Recognition

    Takashi FUKUDA  Tsuneo NITTA  

     
    PAPER

      Vol:
    E87-D No:5
      Page(s):
    1110-1118

    In this paper, we propose a noise-robust automatic speech recognition system that uses orthogonalized distinctive phonetic features (DPFs) as input to an HMM with diagonal covariance. In an orthogonalized DPF extraction stage, first, a speech signal is converted to acoustic features composed of local features (LFs) and ΔP, then a multilayer neural network (MLN) with 153 output units composed of context-dependent DPFs of a preceding context DPF vector, a current DPF vector, and a following context DPF vector maps the LFs to DPFs. The Karhunen-Loeve transform (KLT) is then applied to orthogonalize each DPF vector in the context-dependent DPFs, using orthogonal bases calculated from a DPF vector that represents 38 Japanese phonemes. Finally, the orthogonalized DPF vectors are decorrelated from one another by using the Gram-Schmidt orthogonalization procedure. In experiments, after evaluating the parameters of the MLN input and output units in the DPF extractor, the orthogonalized DPFs are compared with original DPFs. The orthogonalized DPFs are then evaluated in comparison with a standard parameter set of MFCCs and dynamic features. Next, noise robustness is tested using four types of additive noise. The experimental results show that the use of the proposed orthogonalized DPFs can significantly reduce the error rate in an isolated spoken-word recognition task both with clean speech and with speech contaminated by additive noise. Furthermore, we achieved significant improvements when combining the orthogonalized DPFs with conventional static MFCCs and ΔP.
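    For readers unfamiliar with the Gram-Schmidt procedure mentioned above, here is a generic sketch on plain lists of numbers. It shows the textbook (modified) Gram-Schmidt step only; how the paper applies it to DPF vectors is not specified here, and the function name and the small example vectors are assumptions.

```python
def gram_schmidt(vectors, eps=1e-12):
    """Return an orthonormal basis spanning the given vectors (modified Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            # Remove the component of w that lies along the basis vector b.
            coeff = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = sum(wi * wi for wi in w) ** 0.5
        if norm > eps:  # skip vectors that are (numerically) linearly dependent
            basis.append([wi / norm for wi in w])
    return basis


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


basis = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
```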

  • A Study on Acoustic Modeling for Speech Recognition of Predominantly Monosyllabic Languages

    Ekkarit MANEENOI  Visarut AHKUPUTRA  Sudaporn LUKSANEEYANAWIN  Somchai JITAPUNKUL  

     
    PAPER

      Vol:
    E87-D No:5
      Page(s):
    1146-1163

    This paper presents a study on acoustic modeling for speech recognition of predominantly monosyllabic languages. Various speech units used in speech recognition systems have been investigated. To evaluate the effectiveness of these acoustic models, the Thai language is selected, since it is a predominantly monosyllabic language and has a complex vowel system. Several experiments have been carried out to find the proper speech unit that can accurately create an acoustic model and give a higher recognition rate. Results of recognition rates under different acoustic models are given and compared. In addition, this paper proposes a new speech unit for speech recognition, namely the onset-rhyme unit. Two models are proposed: the Phonotactic Onset-Rhyme Model (PORM) and the Contextual Onset-Rhyme Model (CORM). The models comprise a pair of onset and rhyme units, which makes up a syllable. An onset comprises an initial consonant and its transition towards the following vowel. Together with the onset, the rhyme consists of a steady vowel segment and a final consonant. Experimental results show that the onset-rhyme model improves on the efficiency of other speech units. The onset-rhyme model improves on the accuracy of the inter-syllable triphone model by nearly 9.3% and of the context-dependent Initial-Final model by nearly 4.7% for the speaker-dependent systems using only an acoustic model, and by 5.6% and 4.5%, respectively, for the speaker-dependent systems using both acoustic and language models. The results show that the onset-rhyme models attain a high recognition rate. Moreover, they are also more efficient in terms of system complexity.

  • One-Pass Semi-Dynamic Network Decoding Using a Subnetwork Caching Model for Large Vocabulary Continuous Speech Recognition

    Dong-Hoon AHN  Minhwa CHUNG  

     
    PAPER

      Vol:
    E87-D No:5
      Page(s):
    1164-1174

    This paper presents a new decoding framework for large vocabulary continuous speech recognition that can handle a static search network dynamically. Generally, a static network decoder can use a search space that is globally optimized in advance, and therefore it can run at high speed during decoding. However, its large memory requirement due to the large network size or the spatial complexity of the optimization algorithm often makes it impractical. Our new one-pass semi-dynamic network decoding scheme aims at incorporating such an optimized search network with memory efficiency, but without losing speed. In this framework, a complete search network is organized on the basis of self-structuring subnetworks and is nearly minimized using a modified tail-sharing algorithm. While the decoder runs, it caches subnetworks needed for decoding in memory, whereas static network decoders keep the complete network in memory. The subnetwork caching model is controlled by two levels of caches: local cache obtained by subnetwork caching operations and global cache obtained by subnetwork preloading operations. The model can also be controlled adaptively by using subnetwork profiling operations. Furthermore, it is made simple and fast with compactly designed self-structuring subnetworks. Experimental results on a 25 k-word Korean broadcast news transcription task show that the semi-dynamic decoder can run almost as fast as an equivalent static network decoder under various memory configurations by using the subnetwork caching model.
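    The two-level caching idea (a global cache filled by preloading and a local cache filled on demand) can be sketched roughly as follows. The class name, the least-recently-used eviction policy, and the `loader` callback are assumptions for illustration; the abstract does not state which replacement policy the decoder uses.

```python
from collections import OrderedDict


class SubnetworkCache:
    """Two-level cache: pinned preloaded entries plus an on-demand LRU cache."""

    def __init__(self, loader, local_capacity, preload_ids=()):
        self._loader = loader                  # hypothetical: fetches a subnetwork by id
        self._local_capacity = local_capacity
        self._local = OrderedDict()            # on-demand entries, oldest first
        # Global cache: loaded once up front and never evicted.
        self._global = {sid: loader(sid) for sid in preload_ids}

    def get(self, sid):
        if sid in self._global:
            return self._global[sid]
        if sid in self._local:
            self._local.move_to_end(sid)       # mark as most recently used
            return self._local[sid]
        subnetwork = self._loader(sid)         # cache miss: load from storage
        self._local[sid] = subnetwork
        if len(self._local) > self._local_capacity:
            self._local.popitem(last=False)    # evict the least recently used entry
        return subnetwork


loaded = []                                    # records every load, for inspection


def fake_loader(sid):
    loaded.append(sid)
    return {"id": sid}


cache = SubnetworkCache(fake_loader, local_capacity=2, preload_ids=["silence"])
for sid in ["a", "b", "a", "c", "b", "silence"]:
    cache.get(sid)
```

    In this run, "b" is loaded twice because it was evicted when "c" arrived, while the preloaded "silence" entry is never reloaded.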

  • Noise Robust Speech Recognition Using F0 Contour Information

    Koji IWANO  Takahiro SEKI  Sadaoki FURUI  

     
    PAPER

      Vol:
    E87-D No:5
      Page(s):
    1102-1109

    This paper proposes a noise robust speech recognition method using prosodic information. In Japanese, the fundamental frequency (F0) contour represents phrase intonation and word accent information. Consequently, it conveys information about prosodic phrases and word boundaries. This paper first describes a noise robust F0 extraction method using the Hough transform, which achieves high extraction rates under various noise environments. Then it proposes a robust speech recognition method using multi-stream HMMs which model both segmental spectral and F0 contour information. Speaker-independent experiments are conducted using connected digits uttered by 11 male speakers in various kinds of noise and SNR conditions. The recognition error rate is reduced in all noise conditions, and the best absolute improvement of digit accuracy is about 4.5%. This improvement is achieved by robust digit boundary detection using the prosodic information.

  • Robust Speaker Identification System Based on Multilayer Eigen-Codebook Vector Quantization

    Ching-Tang HSIEH  Eugene LAI  Wan-Chen CHEN  

     
    PAPER

      Vol:
    E87-D No:5
      Page(s):
    1185-1193

    This paper presents some effective methods for improving the performance of a speaker identification system. Based on the multiresolution property of the wavelet transform, the input speech signal is decomposed into various frequency subbands in order not to spread noise distortions over the entire feature space. For capturing the characteristics of the vocal tract, the linear predictive cepstral coefficients (LPCC) of the lower frequency subband for each decomposition process are calculated. In addition, a hard threshold technique for the lower frequency subband in each decomposition process is also applied to eliminate the effect of noise interference. Furthermore, cepstral domain feature vector normalization is applied to all computed features in order to provide similar parameter statistics in all acoustic environments. In order to effectively utilize all these multiband speech features, we propose a modified vector quantization as the identifier. This model uses the multilayer concept to eliminate the interference among the multiband speech features and then uses the principal component analysis (PCA) method to evaluate the codebooks for capturing a more detailed distribution of the speaker's phoneme characteristics. The proposed method is evaluated using the KING speech database for text-independent speaker identification. Experimental results show that the recognition performance of the proposed method is better than those of the vector quantization (VQ) and the Gaussian mixture model (GMM) using full-band LPCC and mel-frequency cepstral coefficients (MFCC) features in both clean and noisy environments. Also, a satisfactory performance can be achieved in low SNR environments.
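    A rough sketch of repeated subband decomposition with hard thresholding of the lower-frequency band is given below. The abstract does not name the wavelet, so the Haar transform is used purely as an assumption, as are the function names, the threshold value, and the choice to feed the thresholded band into the next level.

```python
import math


def haar_step(x):
    """One level of the orthonormal Haar transform; len(x) must be even."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]  # lower band
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]  # upper band
    return approx, detail


def hard_threshold(coeffs, threshold):
    """Zero every coefficient whose magnitude does not exceed the threshold."""
    return [c if abs(c) > threshold else 0.0 for c in coeffs]


def lower_bands(x, levels, threshold):
    """Return the thresholded lower-frequency band produced at each level."""
    bands = []
    low = list(x)
    for _ in range(levels):
        low, _detail = haar_step(low)
        low = hard_threshold(low, threshold)
        bands.append(low)
    return bands


bands = lower_bands([4.0, 4.0, 2.0, 2.0, 0.1, -0.1, 0.0, 0.0], levels=2, threshold=0.5)
```

    In the paper's pipeline each such band would then go to cepstral feature extraction, which this sketch leaves out.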

  • Phoneme-Balanced and Digit-Sequence-Preserving Connected Digit Patterns for Text-Prompted Speaker Verification

    Tsuneo KATO  Tohru SHIMIZU  

     
    PAPER

      Vol:
    E87-D No:5
      Page(s):
    1194-1199

    This paper presents a novel design of connected digit patterns to achieve high accuracy text-prompted speaker verification over a cellular phone network. To reduce the error rate, a phoneme-balanced connected digit pattern for enrollment, and digit-sequence-preserving connected digit patterns for verification (i.e. patterns preserving partial digit sequences of the enrollment pattern) are proposed. In addition to these, a decision procedure using multiple patterns has been designed to overcome the low quality of cellular phone speech. Experimental results on cellular phone speech showed that the phoneme-balanced patterns for enrollment and digit-sequence-preserving patterns for verification reduced the equal error rate by more than 50% compared to the conventional method using randomly-selected and randomly-reordered digit patterns. The decision procedure reduced the error rate by 60%. In addition, this paper shows that verification patterns depending on the pattern of a preceding utterance reduced the error rate by 10%. Overall, the error rate obtained by the proposed method was 1% for 99% of clients and 95% of impostors.
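    The equal error rate quoted above is the operating point where the false rejection rate for clients equals the false acceptance rate for impostors. A minimal way to estimate it from two lists of verification scores is sketched below; the function name, the threshold sweep over observed scores, and the example scores are made up for illustration and are not data from the paper.

```python
def equal_error_rate(client_scores, impostor_scores):
    """Estimate the EER by sweeping thresholds over all observed scores.

    A trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the point where FRR and FAR are closest.
    """
    best = None
    for threshold in sorted(set(client_scores) | set(impostor_scores)):
        # False rejection: a true client scores below the threshold.
        frr = sum(s < threshold for s in client_scores) / len(client_scores)
        # False acceptance: an impostor scores at or above the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2.0, threshold)
    return best[1], best[2]


# Hypothetical similarity scores (higher means more client-like).
clients = [0.9, 0.8, 0.7, 0.6, 0.3]
impostors = [0.1, 0.2, 0.4, 0.5, 0.65]
eer, threshold = equal_error_rate(clients, impostors)
```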

  • VLSI Layout of Trees into Grids of Minimum Width

    Akira MATSUBAYASHI  

     
    PAPER

      Vol:
    E87-A No:5
      Page(s):
    1059-1069

    In this paper we consider the VLSI layout (i.e., Manhattan layout) of graphs into grids with minimum width (i.e., the length of the shorter side of a grid) as well as with minimum area. The layouts into minimum area and minimum width are equivalent to those with the largest possible aspect ratio of a minimum area layout. Thus such a layout has the merit that, by "folding" the layout, layouts of all possible aspect ratios can be obtained with an increase of area within a small constant factor. We show that an N-vertex tree with layout-width k (i.e., the minimum width of a grid into which the tree can be laid out is k) can be laid out into a grid of area O(N) and width O(k). For binary tree layouts, we give a detailed trade-off between area and width: an N-vertex binary tree with layout-width k can be laid out into area and width k + α, where α is an arbitrary integer with 0 α , and the area is existentially optimal for any k ≥ 1 and α ≥ 0. This implies that α = Ω(k) is essential for a layout of a graph into optimal area. The layouts proposed here can be constructed in polynomial time. We also show that the problem of laying out a given graph G into given area and width, or equivalently, into given length and width is NP-hard even if G is restricted to a binary tree.

  • Improved HMM Separation for Distant-Talking Speech Recognition

    Tetsuya TAKIGUCHI  Masafumi NISHIMURA  

     
    PAPER

      Vol:
    E87-D No:5
      Page(s):
    1127-1137

    In distant-talking speech recognition, the recognition accuracy is seriously degraded by reverberation and environmental noise. A robust speech recognition technique for such environments, HMM separation and composition, has been described previously. HMM separation estimates the model parameters of the acoustic transfer function using adaptation data uttered from an unknown position in noisy and reverberant environments, and HMM composition builds an HMM of noisy and reverberant speech, using the acoustic transfer function estimated by HMM separation. Previously, HMM separation has been applied to the acoustic transfer function based on a single Gaussian distribution. However, the improvement was smaller than expected for the impulse response with long reverberations. This is because the variance of the acoustic transfer function in each frame increases, since the length of the impulse response of the room reverberation is longer than that of the spectral analysis window. In this paper, HMM separation is extended to estimate the acoustic transfer function based on the Gaussian mixture components in order to compensate for the greater variability of the acoustic transfer function, and the re-estimation formulae are derived. In addition, this paper introduces a technique to adapt the noise weight for each mel-spaced frequency in order to improve the performance of the HMM separation in the linear-spectral domain, since the use of the HMM separation in the linear-spectral domain sometimes causes a negative mean output due to the subtraction operation. The extended HMM separation is evaluated on distant-talking speech recognition tasks. The results of the experiments clarify the effectiveness of the proposed method.

  • Applied Multi-Wavelet Feature to Text Independent Speaker Identification

    Shung-Yung LUNG  

     
    LETTER-Speech and Hearing

      Vol:
    E87-A No:4
      Page(s):
    944-945

    A new speaker feature extracted from multi-wavelet decomposition for speaker recognition is described. The multi-wavelet decomposition is a multi-scale representation of the covariance matrix. We have combined wavelet transform and the multi-resolution singular value algorithm to decompose eigenvector for speaker feature extraction not at the square matrix. Our results have shown that this multi-wavelet feature gives better performance than the cepstrum and Δ-cepstrum in terms of recognition rate.
