
Keyword Search Result

[Keyword] ASR (14 hits)

1-14 of 14 hits
  • Enhancing Speech Quality in Air Traffic Control Communication Using DIUnet_V-Based Speech Enhancement Techniques Open Access

    Haijun LIANG  Yukun LI  Jianguo KONG  Qicong HAN  Chengyu YU  

     
    PAPER-Speech and Hearing

      Publicized:
    2023/12/11
      Vol:
    E107-D No:4
      Page(s):
    551-558

    Air Traffic Control (ATC) communication suffers from issues such as high electromagnetic interference, fast speech rate, and low intelligibility, which pose challenges for downstream tasks like Automatic Speech Recognition (ASR). This article investigates how to enhance the audio quality and intelligibility of civil aviation speech through speech enhancement, thereby improving speech recognition accuracy and supporting the digitalization of civil aviation. We propose a speech enhancement model called DIUnet_V (DenseNet & Inception & U-Net & Volume) that combines time-frequency and time-domain methods to effectively handle the specific characteristics of civil aviation speech, such as predominant electromagnetic interference and fast speech rate. For model evaluation, we assess the denoising and enhancement effects using three metrics: Signal-to-Noise Ratio (SNR), Mean Opinion Score (MOS), and speech recognition error rate. On a simulated ATC training recording dataset, DIUnet_Volume10 achieved an SNR of 7.3861, an improvement of 4.5663 over the original U-Net model. Because clean reference speech is unavailable in the ATC working environment, which makes SNR difficult to compute accurately, we propose evaluating the denoising effect indirectly through the recognition performance of an ATC speech recognition system. On a real ATC speech dataset, the average word error rate decreased by 1.79% absolute and the average sentence error rate by 3% absolute for DIUnet_V-processed speech compared to unprocessed speech in the built speech recognition system.
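
    As a rough illustration of the SNR metric reported above, the following minimal Python sketch computes a global SNR from paired clean and processed signals (which, as the abstract notes, real ATC recordings lack; the function name is our own):

        import numpy as np

        def snr_db(clean, processed):
            """Global SNR: clean-signal power over residual-noise power, in dB."""
            noise = processed - clean
            return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

        # Toy usage with a synthetic signal (illustrative only).
        rng = np.random.default_rng(0)
        clean = np.sin(2 * np.pi * 200 * np.linspace(0, 1, 16000))
        noisy = clean + 0.1 * rng.standard_normal(clean.shape)
        print(f"SNR: {snr_db(clean, noisy):.4f} dB")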

  • Code-Switching ASR and TTS Using Semisupervised Learning with Machine Speech Chain

    Sahoko NAKAYAMA  Andros TJANDRA  Sakriani SAKTI  Satoshi NAKAMURA  

     
    PAPER-Speech and Hearing

      Publicized:
    2021/07/08
      Vol:
    E104-D No:10
      Page(s):
    1661-1677

    The phenomenon in which a speaker mixes two or more languages within the same conversation is called code-switching (CS). Handling CS is challenging for automatic speech recognition (ASR) and text-to-speech (TTS) because it requires coping with multilingual input. Although CS text or speech can be found on social media, datasets of CS speech with corresponding CS transcriptions are hard to obtain, even though they are required for supervised training. This work adopts a deep learning-based machine speech chain to train CS ASR and CS TTS with each other through semisupervised learning. After supervised learning with monolingual data, the machine speech chain is carried out with unsupervised learning on either CS text or CS speech. The results show that the machine speech chain trains ASR and TTS together and improves performance without requiring paired CS speech and corresponding CS text. We also integrate language embedding and language identification into the CS machine speech chain to handle CS better by providing language information. We demonstrate that our proposed approach improves performance on both a single CS language pair and multiple CS language pairs, including unknown CS pairs excluded from the training data.
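
    To make the speech-chain loop concrete, here is a hypothetical Python sketch of one semisupervised step, with stubs standing in for the trained ASR and TTS networks (all names and losses are placeholders, not the paper's implementation):

        import numpy as np

        # Hypothetical stubs for the two trained networks.
        def asr(speech):            # speech -> text hypothesis
            return "halo everyone"  # placeholder output

        def tts(text):              # text -> synthesized waveform
            return np.zeros(16000)  # placeholder output

        def asr_loss(speech, text): return 0.0  # e.g., a CTC/attention loss
        def tts_loss(speech, ref):  return 0.0  # e.g., an L1 spectrogram loss

        def speech_chain_step(unpaired_text=None, unpaired_speech=None):
            """Each model generates the missing modality so that the
            other model can compute a reconstruction loss."""
            losses = {}
            if unpaired_text is not None:    # text-only CS data
                synth = tts(unpaired_text)   # TTS creates pseudo speech
                losses["asr"] = asr_loss(synth, unpaired_text)
            if unpaired_speech is not None:  # speech-only CS data
                hyp = asr(unpaired_speech)   # ASR creates pseudo text
                losses["tts"] = tts_loss(tts(hyp), unpaired_speech)
            return losses

        print(speech_chain_step(unpaired_text="halo everyone"))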

  • Effects of Automated Transcripts on Non-Native Speakers' Listening Comprehension

    Xun CAO  Naomi YAMASHITA  Toru ISHIDA  

     
    PAPER-Human-computer Interaction

      Publicized:
    2017/11/24
      Vol:
    E101-D No:3
      Page(s):
    730-739

    Previous research has shown that transcripts generated by automatic speech recognition (ASR) technologies can improve the listening comprehension of non-native speakers (NNSs). However, we still lack a detailed understanding of how ASR transcripts affect NNSs' listening comprehension. To explore this issue, we conducted two studies. The first study examined how the current presentation of ASR transcripts impacted NNSs' listening comprehension. 20 NNSs engaged in two listening tasks under different conditions: C1) audio only and C2) audio + ASR transcripts. The participants pressed a button whenever they encountered a comprehension problem and explained each problem in subsequent interviews. From our data analysis, we found that NNSs adopted different strategies when using the ASR transcripts; some followed the transcripts throughout the listening, while others checked them only when necessary. NNSs also appeared to have difficulty following imperfect and slightly delayed transcripts while listening to speech; many reported difficulties concentrating on listening/reading or shifting between the two. The second study explored how different display methods of ASR transcripts affected NNSs' listening experiences. We focused on two display methods: 1) an accuracy-oriented display, which shows transcripts only after the completion of speech input analysis, and 2) a speed-oriented display, which shows the interim analysis results of speech input. We conducted a laboratory experiment with 22 NNSs who engaged in two listening tasks with ASR transcripts presented via the two display methods. We found that the more the NNSs paid attention to listening to the audio, the more they tended to prefer the speed-oriented transcripts, and vice versa. Mismatched transcripts were found to have negative effects on NNSs' listening comprehension. Our findings have implications for improving the presentation of ASR transcripts to support NNSs more effectively.
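
    A minimal sketch of the two display policies, assuming a hypothetical stream of (hypothesis, is_final) events from an ASR engine:

        # Hypothetical interim/final hypothesis stream.
        events = [("previous", False), ("previous research", False),
                  ("previous research has shown", True)]

        def speed_oriented(events):
            """Show every interim hypothesis as soon as it arrives."""
            for hyp, _ in events:
                print("\r" + hyp, end="")
            print()

        def accuracy_oriented(events):
            """Show text only after the engine finalizes its analysis."""
            for hyp, is_final in events:
                if is_final:
                    print(hyp)

        speed_oriented(events)
        accuracy_oriented(events)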

  • Improved End-to-End Speech Recognition Using Adaptive Per-Dimensional Learning Rate Methods

    Xuyang WANG  Pengyuan ZHANG  Qingwei ZHAO  Jielin PAN  Yonghong YAN  

     
    LETTER-Acoustic modeling

      Publicized:
    2016/07/19
      Vol:
    E99-D No:10
      Page(s):
    2550-2553

    The introduction of deep neural networks (DNNs) has led to a significant improvement in automatic speech recognition (ASR) performance. However, the whole ASR system remains sophisticated due to its dependence on the hidden Markov model (HMM). Recently, a new end-to-end ASR framework was proposed that uses recurrent neural networks (RNNs) to directly model context-independent targets with the connectionist temporal classification (CTC) objective function, achieving results comparable to the hybrid HMM/DNN system. In this paper, we investigate per-dimensional learning rate methods, including ADAGRAD and ADADELTA, to improve the recognition of the end-to-end system, motivated by the fact that the blank symbol used in the CTC technique dominates the output and these methods give frequent features small learning rates. Experimental results show that a relative word error rate (WER) reduction of more than 4%, as well as a 5% absolute improvement in label accuracy on the training set, is achieved when using ADADELTA, and fewer training epochs are needed.
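
    The ADADELTA update itself is standard and can be sketched in a few lines of NumPy (the paper applies it to RNN/CTC training, not to this toy example):

        import numpy as np

        def adadelta_update(grad, state, rho=0.95, eps=1e-6):
            """Per-dimensional learning rates emerge from running RMS
            estimates of gradients and past updates, so frequently active
            dimensions (e.g., the CTC blank) receive smaller steps."""
            state["Eg2"] = rho * state["Eg2"] + (1 - rho) * grad ** 2
            delta = -np.sqrt(state["Edx2"] + eps) / np.sqrt(state["Eg2"] + eps) * grad
            state["Edx2"] = rho * state["Edx2"] + (1 - rho) * delta ** 2
            return delta

        state = {"Eg2": np.zeros(4), "Edx2": np.zeros(4)}
        w = np.zeros(4)
        grad = np.array([10.0, 0.1, 0.1, 0.1])  # one dominant dimension
        w += adadelta_update(grad, state)
        print(w)  # the dominant dimension is not given a proportionally huge step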

  • Robust ASR Based on ETSI Advanced Front-End Using Complex Speech Analysis

    Keita HIGA  Keiichi FUNAKI  

     
    PAPER

      Vol:
    E98-A No:11
      Page(s):
    2211-2219

    The advanced front-end (AFE) for automatic speech recognition (ASR) was standardized by the European Telecommunications Standards Institute (ETSI). The AFE provides speech enhancement realized by an iterative Wiener filter (IWF), in which an FFT spectrum smoothed over adjacent frames is used to design the filter. We have previously proposed robust time-varying complex Auto-Regressive (TV-CAR) speech analysis for analytic signals and evaluated its performance on speech processing tasks such as F0 estimation and speech enhancement. TV-CAR analysis can estimate a more accurate spectrum than the FFT, especially at low frequencies, because of the nature of the analytic signal, and it estimates the speech spectrum more accurately under additive noise. In this paper, a time-invariant version of wide-band TV-CAR analysis is introduced into the IWF of the AFE and is evaluated using the CENSREC-2 database and its baseline script.
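
    For orientation, one Wiener iteration with frame-smoothed spectra can be sketched as follows (an illustrative NumPy fragment, not the ETSI AFE reference code; the noise estimate is assumed):

        import numpy as np

        def wiener_gain(noisy_psd, noise_psd, floor=1e-3):
            """Per-bin Wiener gain: SNR / (1 + SNR)."""
            snr = np.maximum(noisy_psd / np.maximum(noise_psd, 1e-12) - 1.0, 0.0)
            return np.maximum(snr / (1.0 + snr), floor)

        # Toy |FFT|^2 frames, smoothed over adjacent frames as in the IWF.
        frames_psd = np.abs(np.random.default_rng(1).standard_normal((5, 257))) ** 2
        smoothed = frames_psd.mean(axis=0)   # smoothing over adjacent frames
        noise_est = 0.5 * np.ones(257)       # assumed noise PSD estimate
        enhanced = wiener_gain(smoothed, noise_est) * smoothed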

  • Improving the Readability of ASR Results for Lectures Using Multiple Hypotheses and Sentence-Level Knowledge

    Yasuhisa FUJII  Kazumasa YAMAMOTO  Seiichi NAKAGAWA  

     
    PAPER-Speech and Hearing

      Vol:
    E95-D No:4
      Page(s):
    1101-1111

    This paper presents a novel method for improving the readability of automatic speech recognition (ASR) results for classroom lectures. Because speech in a classroom is spontaneous and contains many ill-formed utterances with various disfluencies, the ASR result should be edited to improve readability before presenting it to users, by applying operations such as removing disfluencies, determining sentence boundaries, inserting punctuation marks, and repairing dropped words. Owing to the presence of many kinds of domain-dependent words and casual styles, even state-of-the-art recognizers can only achieve a 30-50% word error rate for speech in classroom lectures. Therefore, a method for improving the readability of ASR results must be robust to recognition errors. We can use multiple hypotheses instead of the single-best hypothesis to achieve such robustness. However, if the multiple hypotheses are represented by a lattice (or a confusion network), it is difficult to utilize sentence-level knowledge, such as chunking and dependency parsing, which is imperative for determining the discourse structure and therefore for improving readability. In this paper, we propose a novel algorithm that infers clean, readable transcripts from multiple hypotheses of spontaneous speech represented by a confusion network, while integrating sentence-level knowledge. Automatic and manual evaluations showed that using multiple hypotheses and sentence-level knowledge is effective in improving the readability of ASR results while preserving understandability.
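
    For intuition, decoding a confusion network without sentence-level knowledge reduces to picking the highest-posterior word per slot, as in this toy Python sketch (data and names are illustrative):

        # A confusion network as a list of slots; each slot holds
        # (word, posterior) pairs, and "<eps>" means "no word here".
        confusion_net = [
            [("so", 0.6), ("<eps>", 0.4)],
            [("the", 0.5), ("a", 0.3), ("<eps>", 0.2)],
            [("lecture", 0.9), ("leisure", 0.1)],
        ]

        def best_path(cn):
            """Baseline decoding: highest-posterior word per slot. The
            paper's algorithm instead rescores hypotheses with chunking
            and dependency knowledge before emitting the transcript."""
            words = (max(slot, key=lambda wp: wp[1])[0] for slot in cn)
            return " ".join(w for w in words if w != "<eps>")

        print(best_path(confusion_net))  # -> "so the lecture"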

  • Adaptation to Pronunciation Variations in Indonesian Spoken Query-Based Information Retrieval

    Dessi Puji LESTARI  Sadaoki FURUI  

     
    PAPER-Adaptation

      Vol:
    E93-D No:9
      Page(s):
    2388-2396

    Recognition errors on proper nouns and foreign words significantly decrease the performance of ASR-based speech applications such as voice dialing systems, speech summarization, spoken document retrieval, and spoken query-based information retrieval (IR). The reason is that proper nouns and words from other languages are usually the most important keywords, so their loss through misrecognition leads to a loss of significant information from the speech source. This paper focuses on improving the performance of Indonesian ASR by alleviating the problem of pronunciation variation in proper nouns and foreign words (English words in particular). To improve proper noun recognition accuracy, proper-noun-specific acoustic models are created by supervised adaptation using maximum likelihood linear regression (MLLR). To improve English word recognition, the pronunciations of English words in the lexicon are corrected using rule-based English-to-Indonesian phoneme mapping. The effectiveness of the proposed method was confirmed through spoken-query-based Indonesian IR. We used Inference Network-based (IN-based) IR and compared its results with those of the classical Vector Space Model (VSM) IR, both using a tf-idf weighting scheme. Experimental results show that IN-based IR outperforms VSM IR.
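
    A toy sketch of rule-based phoneme mapping (the rules shown are hypothetical illustrations, not the paper's actual rule set):

        # Hypothetical English-to-Indonesian phoneme substitutions.
        E2I = {"TH": "t", "DH": "d", "V": "f", "Z": "s", "SH": "sy"}

        def map_pronunciation(english_phonemes):
            """Rewrite an English phoneme string with Indonesian
            equivalents so the lexicon matches local pronunciations."""
            return [E2I.get(p, p.lower()) for p in english_phonemes]

        print(map_pronunciation(["TH", "IH", "NG", "K"]))  # think -> t ih ng k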

  • Speech Enhancement Using Improved Adaptive Null-Forming in Frequency Domain with Postfilter

    Heng ZHANG  Qiang FU  Yonghong YAN  

     
    LETTER-Speech and Hearing

      Vol:
    E91-A No:12
      Page(s):
    3812-3816

    In this letter, a two-channel frequency-domain speech enhancement algorithm is proposed. The algorithm is designed to achieve better overall performance with a relatively small array size. An improved version of adaptive null-forming is used, in which noise cancellation is implemented in auditory subbands, and an OM-LSA-based postfiltering stage further refines the output. The algorithm also features interaction between the array processing and the postfilter to make the filter adaptation more robust. This approach achieves considerable improvement in the signal-to-noise ratio (SNR) and subjective quality of the desired speech. Experiments confirm the effectiveness of the proposed system.
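
    A fixed-delay, two-channel null-forming step in the frequency domain can be sketched as follows (a simplified NumPy illustration; the paper's algorithm adapts the null per auditory subband and adds an OM-LSA postfilter):

        import numpy as np

        def null_forming(x1_stft, x2_stft, tau, fs, nfft):
            """Delay channel 2 by tau seconds per bin and subtract,
            placing a spatial null in the interferer's direction."""
            freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)  # bin frequencies
            steer = np.exp(-2j * np.pi * freqs * tau)  # per-bin delay
            return x1_stft - steer[None, :] * x2_stft  # frames x bins

        # Toy usage: two random STFTs of shape (frames, bins).
        rng = np.random.default_rng(2)
        X1 = rng.standard_normal((10, 257)) + 1j * rng.standard_normal((10, 257))
        X2 = rng.standard_normal((10, 257)) + 1j * rng.standard_normal((10, 257))
        Y = null_forming(X1, X2, tau=2e-4, fs=16000, nfft=512)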

  • Adaptive Search Range Algorithms for Variable Block Size Motion Estimation in H.264/AVC

    Zhenxing CHEN  Yang SONG  Takeshi IKENAGA  Satoshi GOTO  

     
    PAPER

      Vol:
    E91-A No:4
      Page(s):
    1015-1022

    Compared with search-pattern motion estimation (ME) algorithms, adaptive search range (ASR) algorithms are more fundamental, regular, and flexible. In variable block size motion estimation (VBSME), ASR algorithms can be applied to a whole frame (frame level), to an entire macroblock comprising up to forty-one blocks (macroblock level), or to a single block (block level). On the other hand, in H.264/AVC, not the motion vectors (MVs) but the motion vector differences (MVDs) are coded, and the median motion vector predictors (median-MVPs) are used to place the search centers. In this sense, the search windows (SWs) can be thought of as centered at the positions indicated by the median-MVPs, while the search ranges (SRs) serve to limit the MVDs. It is therefore reasonable to use MVDs to predict SRs. In this paper, three MVD-based SR prediction algorithms are proposed: one at the macroblock level and two at the block level. VBSME-based experiments are carried out to assess the proposed algorithms, and they are compared with the previously proposed algorithm in [8] in terms of encoding quality and computational complexity.
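
    An illustrative (hypothetical) form of MVD-based SR prediction in Python; the scaling constant and clamping bounds are our own, not the paper's:

        # Since median-MVPs center the search window, the SR only needs
        # to cover the likely MVD magnitude of neighboring blocks.
        def predict_search_range(neighbor_mvds, alpha=2, sr_min=4, sr_max=16):
            """Scale the largest recent MVD component and clamp it."""
            peak = max((max(abs(dx), abs(dy)) for dx, dy in neighbor_mvds),
                       default=sr_max)
            return min(max(alpha * peak, sr_min), sr_max)

        print(predict_search_range([(1, -2), (0, 3), (2, 1)]))  # -> 6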

  • Cost Reduction of Acoustic Modeling for Real-Environment Applications Using Unsupervised and Selective Training

    Tobias CINCAREK  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Acoustic Modeling

      Vol:
    E91-D No:3
      Page(s):
    499-507

    Development of an ASR application such as a speech-oriented guidance system for a real environment is expensive. Most of the cost is due to human labeling of newly collected speech data to construct the acoustic model for speech recognition. Employing existing models or sharing models across multiple applications is often difficult, because the characteristics of speech depend on various factors such as the possible users, their speaking style, and the acoustic environment. Therefore, this paper proposes a combination of unsupervised learning and selective training to reduce development costs. Unsupervised learning alone is problematic due to the task dependency of speech recognition and because automatic transcription of speech is error-prone. A theoretically well-defined approach to the automatic selection of high-quality, task-specific speech data from an unlabeled data pool is presented: only those unlabeled data which increase the model likelihood given the labeled data are employed for unsupervised training. The effectiveness of the proposed method is investigated in a simulation experiment constructing adult and child acoustic models for a speech-oriented guidance system. A completely human-labeled database containing real-environment data collected over two years is available for the development simulation. It is shown experimentally that selective training alleviates the problems of unsupervised learning, i.e., it is possible to select speech utterances of a certain speaker group while discarding noise inputs and utterances with lower recognition accuracy. The simulation experiment is carried out for several selected combinations of data collection and human transcription periods. The proposed method is found to be especially effective if only relatively few of the collected data can be labeled and transcribed by humans.
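
    A minimal sketch of likelihood-based selection, with a single diagonal Gaussian standing in (hypothetically) for the acoustic model:

        import numpy as np

        def select_utterances(labeled_feats, unlabeled, keep_ratio=0.5):
            """Score each automatically transcribed utterance under a
            model of the labeled data and keep the best-matching ones."""
            mu = labeled_feats.mean(axis=0)
            var = labeled_feats.var(axis=0) + 1e-6
            def loglik(utt):
                return -0.5 * np.mean((utt - mu) ** 2 / var + np.log(2 * np.pi * var))
            scored = sorted(unlabeled, key=loglik, reverse=True)
            return scored[: int(len(scored) * keep_ratio)]

        rng = np.random.default_rng(3)
        labeled = rng.standard_normal((100, 13))              # e.g., MFCC frames
        pool = [rng.standard_normal((50, 13)) for _ in range(10)]
        selected = select_utterances(labeled, pool)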

  • Pitch-Synchronous Peak-Amplitude (PS-PA)-Based Feature Extraction Method for Noise-Robust ASR

    Muhammad GHULAM  Kouichi KATSURADA  Junsei HORIKAWA  Tsuneo NITTA  

     
    PAPER-Speech and Hearing

      Vol:
    E89-D No:11
      Page(s):
    2766-2774

    A novel pitch-synchronous auditory-based feature extraction method for robust automatic speech recognition (ASR) is proposed. A pitch-synchronous zero-crossing peak-amplitude (PS-ZCPA)-based feature extraction method was proposed previously and showed improved performance except when modulation enhancement was integrated with Wiener filter (WF)-based noise reduction and auditory masking. However, since zero-crossing is not an auditory event, we propose a new pitch-synchronous peak-amplitude (PS-PA)-based method to render the feature extractor of ASR more auditory-like. We also examine the effects of WF-based noise reduction, modulation enhancement, and auditory masking in the proposed PS-PA method using the Aurora-2J database. The experimental results show the superiority of the proposed method over PS-ZCPA and other conventional methods. Furthermore, the problem caused by reconstructing zero-crossings from a modulated envelope is eliminated. The results also show the superiority of PS over PA in terms of ASR robustness, though PS and PA lead to significant improvement when applied together.
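
    A toy Python sketch of the peak-amplitude idea, assuming the pitch is already known (the paper operates inside an auditory front end, which this fragment does not reproduce):

        import numpy as np

        def pitch_synchronous_peaks(x, fs, f0):
            """Segment the waveform into pitch periods and keep one
            peak amplitude per period as the feature."""
            period = int(fs / f0)
            n_periods = len(x) // period
            frames = x[: n_periods * period].reshape(n_periods, period)
            return np.abs(frames).max(axis=1)  # peak amplitude per period

        fs, f0 = 16000, 200
        t = np.arange(fs) / fs
        x = np.sin(2 * np.pi * f0 * t)
        print(pitch_synchronous_peaks(x, fs, f0)[:5])  # ~1.0 per period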

  • Distorted Speech Rejection for Automatic Speech Recognition in Wireless Communication

    Joon-Hyuk CHANG  Nam Soo KIM  

     
    LETTER-Speech and Hearing

      Vol:
    E87-D No:7
      Page(s):
    1978-1981

    This letter introduces a pre-rejection technique for speech distorted by a wireless channel, with application to automatic speech recognition (ASR). Based on an analysis of speech signals distorted over a wireless communication channel, we propose a method to reject channel-distorted speech with a small computational load. Simulation results show that the pre-rejection algorithm enhances the robustness of speech recognition.
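
    A hypothetical illustration of a low-cost pre-rejection rule (the paper's actual decision statistic is not reproduced here):

        import numpy as np

        def prereject(frame_energies, dropout_thresh=1e-4, max_dropout_ratio=0.3):
            """Channel fades often appear as runs of near-zero frames;
            reject an utterance whose dropout ratio is too high instead
            of passing it to the recognizer."""
            ratio = np.mean(np.asarray(frame_energies) < dropout_thresh)
            return ratio > max_dropout_ratio  # True -> reject

        print(prereject([0.2, 0.0, 0.0, 0.0, 0.1]))  # -> True (3/5 frames dropped)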

  • Orthogonalized Distinctive Phonetic Feature Extraction for Noise-Robust Automatic Speech Recognition

    Takashi FUKUDA  Tsuneo NITTA  

     
    PAPER

      Vol:
    E87-D No:5
      Page(s):
    1110-1118

    In this paper, we propose a noise-robust automatic speech recognition system that uses orthogonalized distinctive phonetic features (DPFs) as the input of an HMM with diagonal covariance. In the orthogonalized DPF extraction stage, a speech signal is first converted to acoustic features composed of local features (LFs) and ΔP; then a multilayer neural network (MLN) with 153 output units, composed of context-dependent DPFs of a preceding context DPF vector, a current DPF vector, and a following context DPF vector, maps the LFs to DPFs. The Karhunen-Loève transform (KLT) is then applied to orthogonalize each DPF vector in the context-dependent DPFs, using orthogonal bases calculated from a DPF vector representing 38 Japanese phonemes. Finally, the orthogonalized DPF vectors are decorrelated from one another using the Gram-Schmidt orthogonalization procedure. In experiments, after evaluating the parameters of the MLN input and output units in the DPF extractor, the orthogonalized DPFs are compared with the original DPFs. The orthogonalized DPFs are then evaluated against a standard parameter set of MFCCs and dynamic features. Next, noise robustness is tested using four types of additive noise. The experimental results show that the proposed orthogonalized DPFs significantly reduce the error rate in an isolated spoken-word recognition task, both with clean speech and with speech contaminated by additive noise. Furthermore, we achieved significant improvements when combining the orthogonalized DPFs with conventional static MFCCs and ΔP.
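
    The Gram-Schmidt step is standard; a minimal NumPy sketch:

        import numpy as np

        def gram_schmidt(vectors):
            """Make each vector orthogonal to all previously processed
            ones, as used to decorrelate the context-dependent DPF vectors."""
            basis = []
            for v in np.asarray(vectors, dtype=float):
                for b in basis:
                    v = v - np.dot(v, b) * b
                norm = np.linalg.norm(v)
                if norm > 1e-10:
                    basis.append(v / norm)
            return np.array(basis)

        Q = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
        print(np.round(Q @ Q.T, 6))  # identity: the vectors are orthonormal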

  • A 256 mA 0.72 V Ground Bounce Output Driver

    Pang-Cheng YU  Hun-Hsien CHANG  Jiin-Chuan WU  

     
    PAPER-Integrated Electronics

      Vol:
    E83-C No:5
      Page(s):
    767-776

    A new output driver, called the modified asymmetrical slew-rate (MASR) output driver, is proposed to reduce simultaneous switching noise without sacrificing switching speed in high-speed, heavy-loading applications. The driver was designed to sink/source 64 mA at VOL/VOH = 0.4 V/4.6 V with 66 pF and 50 Ω loading. When four drivers switch simultaneously, the ground bounce was designed to be less than 0.8 V. The performance of the conventional, controlled slew-rate (CSR), and MASR output drivers was analyzed by computer simulation, and the three types of drivers were implemented in a 0.8 µm CMOS process. The measured ground bounce of the conventional driver is 1.22 V, while that of the MASR driver is reduced to 0.72 V. The propagation delays of the conventional and MASR drivers are the same. The MASR driver outperforms the CSR driver in all aspects.
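
    For scale, ground bounce is commonly estimated as V = N * L * di/dt; a back-of-the-envelope Python check with assumed, illustrative values (the paper's 0.72 V figure is measured, not derived this way):

        N = 4        # simultaneously switching drivers
        L = 5e-9     # assumed bond-wire plus lead inductance [H]
        di = 64e-3   # current step per driver [A]
        dt = 1e-9    # assumed current transition time [s]
        print(f"V_bounce ~= {N * L * di / dt:.2f} V")  # -> ~1.28 V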