
Keyword Search Result

[Keyword] SPE (2504 hits)

Results 881-900 of 2504 hits

  • Implementation of OFDMA-Based Cognitive Radio via Accessible Interference Temperature

    Bin DA  Chi-Chung KO  

     
    LETTER-Wireless Communication Technologies

    Vol: E93-B No:10  Page(s): 2830-2832

    In a conventional downlink OFDMA system, an underlay secondary network is co-located with the primary network to form a new implementation of OFDMA-based cognitive radio (OCR), in which spectrum sharing is enabled between primary users and secondary users. With the introduced concept of accessible interference temperature, this model can be implemented easily and may contribute to the future realization of OCR systems.
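
    As a rough illustration of the spectrum-sharing constraint described above, the following sketch (hypothetical function and variable names, not taken from the letter) caps each secondary-user subcarrier power so that the interference produced at the primary receiver stays within an accessible interference temperature budget.

```python
import numpy as np

def cap_secondary_power(p_desired, channel_gain_to_primary, interference_budget):
    """Limit per-subcarrier secondary-user power so that the interference
    produced at the primary receiver does not exceed the accessible
    interference temperature budget on each subcarrier.

    p_desired              : desired transmit power per subcarrier (W)
    channel_gain_to_primary: power gain from secondary TX to primary RX
    interference_budget    : allowed interference power per subcarrier (W)
    """
    p_max = interference_budget / np.maximum(channel_gain_to_primary, 1e-12)
    return np.minimum(p_desired, p_max)

# toy example with 4 subcarriers: power is reduced only where the budget binds
p = cap_secondary_power(np.array([0.1, 0.1, 0.1, 0.1]),
                        np.array([0.01, 0.50, 0.05, 0.20]),
                        interference_budget=0.02)
print(p)
```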

  • Learning Speech Variability in Discriminative Acoustic Model Adaptation

    Shoei SATO  Takahiro OKU  Shinichi HOMMA  Akio KOBAYASHI  Toru IMAI  

     
    PAPER-Adaptation

    Vol: E93-D No:9  Page(s): 2370-2378

    We present a new discriminative method of acoustic model adaptation that deals with task-dependent speech variability. We focus on differences in expressions or speaking styles between tasks, and the objective of this method is to improve the recognition accuracy of indistinctly pronounced phrases that depend on a speaking style. The adaptation appends subword models for frequently observed variants of subwords in the task. To find the task-dependent variants, low-confidence words are statistically selected from the higher-frequency words in the task's adaptation data by using their word lattices. HMM parameters of the subword models dependent on these words are discriminatively trained using linear transforms with a minimum phoneme error (MPE) criterion. For the MPE training, a subword accuracy measure that discriminates between the variants and the originals is also investigated. In speech recognition experiments, the proposed adaptation with the subword variants reduced the word error rate by 12.0% relative in a Japanese conversational broadcast task.

  • Intra-Cell Partial Spectrum Reuse Scheme for Cellular OFDM-Relay Networks

    Tong WU  Ying WANG  Yushan PEI  Gen LI  Ping ZHANG  

     
    LETTER-Wireless Communication Technologies

    Vol: E93-B No:9  Page(s): 2462-2464

    This letter proposes an intra-cell partial spectrum reuse (PSR) scheme for cellular OFDM-relay networks. The proposed method aims to increase the system throughput, while the SINR of cell-edge users is also improved by the PSR scheme. The novel pre-allocation factor γ not only indicates the flexibility of PSR, but also decreases the complexity of the reuse mechanism. Through simulations, the proposed scheme is shown to offer superior performance in terms of system throughput and the SINR of the bottom 5% of users.

  • Acoustic Model Adaptation for Speech Recognition

    Koichi SHINODA  

     
    INVITED PAPER

    Vol: E93-D No:9  Page(s): 2348-2362

    Statistical speech recognition using continuous-density hidden Markov models (CDHMMs) has yielded many practical applications. However, in general, mismatches between the training data and input data significantly degrade recognition accuracy. Various acoustic model adaptation techniques using a few input utterances have been employed to overcome this problem. In this article, we survey these adaptation techniques, including maximum a posteriori (MAP) estimation, maximum likelihood linear regression (MLLR), and eigenvoice. We also present a schematic view called the adaptation pyramid to illustrate how these methods relate to each other.
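
    As a concrete example of one surveyed technique, the sketch below (hypothetical shapes and names, not from the article) applies an MLLR-style affine transform to Gaussian mean vectors; in practice the transform W is estimated by maximizing the likelihood of the adaptation utterances.

```python
import numpy as np

def mllr_adapt_means(means, W):
    """Apply an MLLR affine transform to Gaussian mean vectors.

    means : (num_gaussians, dim) original mean vectors
    W     : (dim, dim + 1) transform [A | b] estimated from adaptation data
    Returns adapted means  mu' = A @ mu + b.
    """
    A, b = W[:, :-1], W[:, -1]
    return means @ A.T + b

# toy example: 3 Gaussians in a 2-D feature space, identity transform plus a shift
means = np.array([[0.0, 1.0], [2.0, -1.0], [0.5, 0.5]])
W = np.hstack([np.eye(2), np.array([[0.3], [-0.1]])])
print(mllr_adapt_means(means, W))
```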

  • Distant Speech Recognition Using a Microphone Array Network

    Alberto Yoshihiro NAKANO  Seiichi NAKAGAWA  Kazumasa YAMAMOTO  

     
    PAPER-Microphone Array

    Vol: E93-D No:9  Page(s): 2451-2462

    In this work, spatial information consisting of the position and orientation angle of an acoustic source is estimated by an artificial neural network (ANN). The estimated position of a speaker in an enclosed space is used to refine the estimated time delays for a delay-and-sum beamformer, thus enhancing the output signal. The orientation angle, in turn, is used to restrict the lexicon used in the recognition phase, assuming that the speaker faces a particular direction while speaking. To compensate for the effect of the transmission channel inside a short frame analysis window, a new cepstral mean normalization (CMN) method based on a Gaussian mixture model (GMM) is investigated; it shows better performance than conventional CMN for short utterances. The performance of the proposed method is evaluated through Japanese digit/command recognition experiments.
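
    A minimal sketch of the delay-and-sum step, assuming the source position has already been estimated (e.g., by the ANN); the names and the integer-sample delay approximation are illustrative, not taken from the paper.

```python
import numpy as np

def delay_and_sum(signals, mic_positions, source_position, fs, c=343.0):
    """Simple delay-and-sum beamformer (integer-sample delays only).

    signals        : (num_mics, num_samples) microphone signals
    mic_positions  : (num_mics, 3) microphone coordinates in metres
    source_position: (3,) estimated source coordinates (e.g. from an ANN)
    fs             : sampling rate in Hz
    """
    dists = np.linalg.norm(mic_positions - source_position, axis=1)
    delays = (dists - dists.min()) / c                 # relative propagation delays (s)
    shifts = np.round(delays * fs).astype(int)         # convert to whole samples
    out = np.zeros(signals.shape[1])
    for sig, shift in zip(signals, shifts):
        out[:signals.shape[1] - shift] += sig[shift:]  # advance each channel, then sum
    return out / len(signals)
```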

  • Intentional Voice Command Detection for Trigger-Free Speech Interface

    Yasunari OBUCHI  Takashi SUMIYOSHI  

     
    PAPER-Robust Speech Recognition

    Vol: E93-D No:9  Page(s): 2440-2450

    In this paper we introduce a new framework of audio processing, which is essential to achieve a trigger-free speech interface for home appliances. If the speech interface works continually in real environments, it must extract occasional voice commands and reject everything else. It is extremely important to reduce the number of false alarms because the number of irrelevant inputs is much larger than the number of voice commands even for heavy users of appliances. The framework, called Intentional Voice Command Detection, is based on voice activity detection, but enhanced by various speech/audio processing techniques such as emotion recognition. The effectiveness of the proposed framework is evaluated using a newly-collected large-scale corpus. The advantages of combining various features were tested and confirmed, and the simple LDA-based classifier demonstrated acceptable performance. The effectiveness of various methods of user adaptation is also discussed.
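
    For illustration only, the snippet below shows how an LDA-based classifier of the kind mentioned above could separate intentional voice commands from other audio; the feature vectors and labels here are random placeholders, not the corpus used in the paper.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical training data: each row is a feature vector for one utterance
# segment (e.g. prosodic, spectral and emotion-related scores); the label is
# 1 for an intentional voice command and 0 for everything else.
X_train = np.random.randn(200, 8)
y_train = np.random.randint(0, 2, size=200)

clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_train)

X_new = np.random.randn(5, 8)
print(clf.predict(X_new))          # hard decisions
print(clf.predict_proba(X_new))    # scores usable for threshold tuning
```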

  • Efficient Speech Reinforcement Based on Low-Bit-Rate Speech Coding Parameters

    Jae-Hun CHOI  Joon-Hyuk CHANG  Seong-Ro LEE  

     
    LETTER-Speech and Hearing

    Vol: E93-A No:9  Page(s): 1684-1687

    In this paper, a novel approach to speech reinforcement in a low-bit-rate speech coder under ambient noise environments is proposed. The excitation vector of the ambient noise is efficiently obtained at the near end and then combined with the excitation signal of the far end to derive a suitable reinforcement gain within the G.729 CS-ACELP Annex B framework. For this reason, the present approach clearly differs from previous ones in that it does not require an additional arithmetic step such as the discrete Fourier transform (DFT). Experimental results indicate that the proposed method shows performance better than, or at least comparable to, that of conventional approaches with a lower computational burden.

  • Empirical Dispersion Formula of the Conductor-Backed Coplanar Waveguide with Via Holes

    Jung Han CHOI  

     
    LETTER-Electromagnetic Theory

    Vol: E93-C No:9  Page(s): 1478-1480

    An empirical dispersion formula is proposed and experimentally verified, taking into account higher-order modes of the conductor-backed coplanar waveguide with via holes. For this purpose, an effective dielectric constant is extracted up to 100 GHz from measured S-parameters. By fitting the extracted data, an empirical equation is derived. Simulation of Gaussian pulse transmission and comparison with the modeling data validate the reported expression.
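
    A small sketch of the general fitting step, with made-up data: extract the effective dielectric constant over frequency, then fit a low-order polynomial as an empirical dispersion formula. The form of the actual reported equation is not reproduced here.

```python
import numpy as np

# Hypothetical extracted data: effective dielectric constant vs. frequency (GHz)
freq_ghz = np.array([1, 10, 20, 40, 60, 80, 100], dtype=float)
eps_eff  = np.array([6.8, 6.9, 7.0, 7.3, 7.7, 8.2, 8.8])

# Fit a low-order polynomial as an empirical dispersion formula
coeffs = np.polyfit(freq_ghz, eps_eff, deg=2)
dispersion = np.poly1d(coeffs)

print(dispersion(50.0))   # evaluate the fitted formula at 50 GHz
```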

  • HMM-Based Voice Conversion Using Quantized F0 Context

    Takashi NOSE  Yuhei OTA  Takao KOBAYASHI  

     
    PAPER-Voice Conversion

    Vol: E93-D No:9  Page(s): 2483-2490

    We propose a segment-based voice conversion technique using hidden Markov model (HMM)-based speech synthesis with nonparallel training data. In the proposed technique, the phoneme information with durations and a quantized F0 contour are extracted from the input speech of a source speaker, and are transmitted to a synthesis part. In the synthesis part, the quantized F0 symbols are used as prosodic context. A phonetically and prosodically context-dependent label sequence is generated from the transmitted phoneme and the F0 symbols. Then, converted speech is generated from the label sequence with durations using the target speaker's pre-trained context-dependent HMMs. In the model training, the models of the source and target speakers can be trained separately, hence there is no need to prepare parallel speech data of the source and target speakers. Objective and subjective experimental results show that the segment-based voice conversion with phonetic and prosodic contexts works effectively even if the parallel speech data is not available.
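
    A minimal sketch, with a hypothetical function name, of how a log-F0 contour might be quantized into a small set of prosodic symbols used as context; the actual quantization used in the paper may differ.

```python
import numpy as np

def quantize_f0(f0_hz, num_levels=8):
    """Quantize a voiced log-F0 contour into discrete prosodic symbols.

    Unvoiced frames (f0 == 0) get the special symbol -1; voiced frames are
    mapped to integer levels between 0 and num_levels - 1.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    symbols = np.full(f0_hz.shape, -1, dtype=int)
    voiced = f0_hz > 0
    if voiced.any():
        logf0 = np.log(f0_hz[voiced])
        lo, hi = logf0.min(), logf0.max()
        step = (hi - lo) / num_levels or 1.0
        symbols[voiced] = np.clip(((logf0 - lo) / step).astype(int), 0, num_levels - 1)
    return symbols

print(quantize_f0([0, 110, 120, 150, 200, 0]))
```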

  • Acoustic Feature Optimization Based on F-Ratio for Robust Speech Recognition

    Yanqing SUN  Yu ZHOU  Qingwei ZHAO  Yonghong YAN  

     
    PAPER-Robust Speech Recognition

    Vol: E93-D No:9  Page(s): 2417-2430

    This paper focuses on the problem of performance degradation in mismatched speech recognition. The F-Ratio analysis method is utilized to analyze the significance of different frequency bands for speech unit classification, and we find that frequencies around 1 kHz and 3 kHz, which are the upper bounds of the first and second formants for most vowels, should be emphasized more than they are in the Mel-frequency cepstral coefficients (MFCC). The analysis result is further observed to be stable in several typical mismatched situations. Similar to the Mel-frequency scale, another frequency scale called the F-Ratio scale is thus proposed to optimize the filter-bank design for the MFCC features so that each subband carries equal significance for speech unit classification. Under comparable conditions, the modified features yield a relative 43.20% decrease in sentence error rate compared with MFCC for emotion-affected speech recognition, relative decreases of 35.54% and 23.03% for noisy speech recognition at 15 dB and 0 dB SNR (signal-to-noise ratio) respectively, and 64.50% for three years of 863 test data. The application of the F-Ratio analysis on the clean training set of the Aurora2 database demonstrates its robustness over languages, texts and sampling rates.
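
    The following sketch (hypothetical arrays, not the paper's data) computes a per-band F-ratio, the between-class to within-class variance ratio that motivates the proposed filter-bank redesign.

```python
import numpy as np

def f_ratio_per_band(features_by_class):
    """Compute the F-ratio of each frequency band.

    features_by_class: list of (num_frames_c, num_bands) arrays, one per
    speech-unit class, holding e.g. log filter-bank energies.
    Returns an array of length num_bands; larger values indicate bands that
    separate the classes better.
    """
    class_means = np.stack([c.mean(axis=0) for c in features_by_class])
    global_mean = class_means.mean(axis=0)
    between = ((class_means - global_mean) ** 2).mean(axis=0)
    within = np.stack([c.var(axis=0) for c in features_by_class]).mean(axis=0)
    return between / np.maximum(within, 1e-12)

# toy example with two classes and 4 bands; bands 2 and 4 separate them best
a = np.random.randn(100, 4) + np.array([0.0, 1.0, 0.0, 2.0])
b = np.random.randn(100, 4)
print(f_ratio_per_band([a, b]))
```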

  • A Hardware-Efficient Pattern Matching Architecture Using Process Element Tree for Deep Packet Inspection

    Seongyong AHN  Hyejeong HONG  HyunJin KIM  Jin-Ho AHN  Dongmyong BAEK  Sungho KANG  

     
    LETTER-Network Management/Operation

    Vol: E93-B No:9  Page(s): 2440-2442

    This paper proposes a new pattern matching architecture with multi-character processing for deep packet inspection. The proposed architecture detects the start point of pattern matching from multi-character input using input text alignment. By eliminating duplicate hardware components with a process element tree, the hardware cost of the proposed pattern matching architecture is greatly reduced.

  • Speaker Recognition by Combining MFCC and Phase Information in Noisy Conditions

    Longbiao WANG  Kazue MINAMI  Kazumasa YAMAMOTO  Seiichi NAKAGAWA  

     
    PAPER-Speaker Recognition

    Vol: E93-D No:9  Page(s): 2397-2406

    In this paper, we investigate the effectiveness of phase information for speaker recognition in noisy conditions and combine the phase information with mel-frequency cepstral coefficients (MFCCs). To date, almost all speaker recognition methods have been based on MFCCs, even in noisy conditions. MFCCs dominantly capture vocal tract information: only the magnitude of the Fourier transform of time-domain speech frames is used, and the phase information is ignored. The phase information is expected to be highly complementary to MFCCs because it includes rich voice source information. Furthermore, some studies have reported that phase-based features are robust to noise. In our previous study, we proposed a phase information extraction method that normalizes the variation in the phase depending on the clipping position of the input speech, and the performance of the combination of the phase information and MFCCs was remarkably better than that of MFCCs alone. In this paper, we evaluate the robustness of the proposed phase information for speaker identification in noisy conditions. Spectral subtraction, a method that skips frames with low energy/signal-to-noise ratio (SNR), and noisy-speech training models are used to analyze the effect of the phase information and MFCCs in noisy conditions. The NTT database and the JNAS (Japanese Newspaper Article Sentences) database with added stationary/non-stationary noise were used to evaluate the proposed method. MFCCs outperformed the phase information for clean speech. On the other hand, the degradation of the phase information was significantly smaller than that of MFCCs for noisy speech. The phase information alone even outperformed MFCCs in many cases with clean-speech training models. By deleting unreliable frames (frames having low energy/SNR), the speaker identification performance was improved significantly. By integrating the phase information with MFCCs, the speaker identification error reduction rate was about 30%-60% compared with the standard MFCC-based method.
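
    As a simple illustration of combining the two streams, the sketch below fuses per-speaker scores from an MFCC-based model and a phase-based model with a weighted sum; the weight and interface are assumptions, not the paper's exact integration method.

```python
import numpy as np

def fuse_scores(mfcc_loglik, phase_loglik, alpha=0.5):
    """Combine per-speaker log-likelihoods from an MFCC-based model and a
    phase-based model by a weighted sum and return the identified speaker.

    mfcc_loglik, phase_loglik: (num_speakers,) scores for one test utterance
    alpha: weight given to the MFCC stream (hypothetical value)
    """
    combined = alpha * np.asarray(mfcc_loglik) + (1.0 - alpha) * np.asarray(phase_loglik)
    return int(np.argmax(combined)), combined

speaker, scores = fuse_scores([-102.3, -98.7, -101.0], [-55.1, -54.9, -53.2])
print(speaker, scores)
```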

  • Unsupervised Speaker Adaptation Using Speaker-Class Models for Lecture Speech Recognition

    Tetsuo KOSAKA  Yuui TAKEDA  Takashi ITO  Masaharu KATO  Masaki KOHDA  

     
    PAPER-Adaptation

    Vol: E93-D No:9  Page(s): 2363-2369

    In this paper, we propose a new speaker-class modeling and adaptation method for an LVCSR system and evaluate the method on the Corpus of Spontaneous Japanese (CSJ). In this method, speakers close to each evaluation speaker are selected from the training speakers, and the acoustic models are trained using their utterances. One of the major issues of the speaker-class model is determining the selection range of speakers. In order to solve this problem, several models covering different speaker ranges are prepared for each evaluation speaker in advance, and the most appropriate model is selected on a likelihood basis in the recognition step. In addition, we improved the recognition performance using unsupervised speaker adaptation with the speaker-class models. In the recognition experiments, a significant improvement was obtained by the proposed speaker adaptation based on speaker-class models compared with the conventional adaptation method.
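
    A minimal sketch of the likelihood-based selection step, using a toy diagonal-Gaussian stand-in for the speaker-class acoustic models; names and interfaces are illustrative assumptions.

```python
import numpy as np

class DiagonalGaussianModel:
    """Toy stand-in for a speaker-class acoustic model."""
    def __init__(self, mean, var):
        self.mean, self.var = np.asarray(mean), np.asarray(var)
    def log_likelihood(self, feats):
        feats = np.asarray(feats)
        ll = -0.5 * (np.log(2 * np.pi * self.var) + (feats - self.mean) ** 2 / self.var)
        return ll.sum()

def select_speaker_class_model(models, utterance_features):
    """Pick the speaker-class model with the highest log-likelihood."""
    scores = [m.log_likelihood(utterance_features) for m in models]
    return int(np.argmax(scores)), scores

# toy example: two speaker-class models, a few 2-D feature frames
models = [DiagonalGaussianModel([0.0, 0.0], [1.0, 1.0]),
          DiagonalGaussianModel([1.0, -1.0], [0.5, 0.5])]
feats = np.array([[0.9, -1.1], [1.1, -0.8]])
print(select_speaker_class_model(models, feats))
```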

  • Esophageal Speech Enhancement Based on Statistical Voice Conversion with Gaussian Mixture Models

    Hironori DOI  Keigo NAKAMURA  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Voice Conversion

    Vol: E93-D No:9  Page(s): 2472-2482

    This paper presents a novel method of enhancing esophageal speech using statistical voice conversion. Esophageal speech is one of the alternative speaking methods for laryngectomees. Although it does not require any external devices, the generated voices usually sound unnatural compared with normal speech. To improve the intelligibility and naturalness of esophageal speech, we propose a voice conversion method from esophageal speech into normal speech. A spectral parameter and excitation parameters of the target normal speech are separately estimated from a spectral parameter of the esophageal speech based on Gaussian mixture models. The experimental results demonstrate that the proposed method yields significant improvements in intelligibility and naturalness. We also apply one-to-many eigenvoice conversion to esophageal speech enhancement to make it possible to flexibly control the voice quality of the enhanced speech.
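
    For readers unfamiliar with GMM-based conversion, the sketch below implements the standard minimum mean-square-error mapping under a joint GMM (conditional mean of the target given the source); the parameter names are assumptions and the paper's excitation estimation is not covered.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Convert one source spectral-parameter vector x with a joint GMM.

    Returns the minimum mean-square-error estimate
      y_hat = sum_m P(m|x) * ( mu_y[m] + cov_yx[m] @ inv(cov_xx[m]) @ (x - mu_x[m]) )
    weights : (M,)        mixture weights
    mu_x    : (M, Dx)     source means       mu_y  : (M, Dy)     target means
    cov_xx  : (M, Dx, Dx) source covariances cov_yx: (M, Dy, Dx) cross covariances
    """
    M = len(weights)
    # posterior P(m | x) under the source marginal of the joint GMM
    log_post = np.empty(M)
    for m in range(M):
        diff = x - mu_x[m]
        inv = np.linalg.inv(cov_xx[m])
        log_det = np.linalg.slogdet(cov_xx[m])[1]
        log_post[m] = np.log(weights[m]) - 0.5 * (diff @ inv @ diff + log_det)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # mixture of conditional means
    y = np.zeros(mu_y.shape[1])
    for m in range(M):
        y += post[m] * (mu_y[m] + cov_yx[m] @ np.linalg.inv(cov_xx[m]) @ (x - mu_x[m]))
    return y
```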

  • Improvements of the One-to-Many Eigenvoice Conversion System

    Yamato OHTANI  Tomoki TODA  Hiroshi SARUWATARI  Kiyohiro SHIKANO  

     
    PAPER-Voice Conversion

    Vol: E93-D No:9  Page(s): 2491-2499

    We have developed a one-to-many eigenvoice conversion (EVC) system that allows us to convert a single source speaker's voice into an arbitrary target speaker's voice using an eigenvoice Gaussian mixture model (EV-GMM). This system is capable of effectively building a conversion model for an arbitrary target speaker by adapting the EV-GMM using only a small amount of speech data uttered by the target speaker in a text-independent manner. However, the conversion performance is still insufficient for the following reasons: 1) the excitation signal is not precisely modeled; 2) the oversmoothing of the converted spectrum causes muffled sounds in converted speech; and 3) the conversion model is affected by redundant acoustic variations among a lot of pre-stored target speakers used for building the EV-GMM. In order to address these problems, we apply the following promising techniques to one-to-many EVC: 1) mixed excitation; 2) a conversion algorithm considering global variance; and 3) adaptive training of the EV-GMM. The experimental results demonstrate that the conversion performance of one-to-many EVC is significantly improved by integrating all of these techniques into the one-to-many EVC system.

  • Effect of Holder Heat Capacity on Bridge Shape at Low Speed Breaking Contact

    Kazuaki MIYANAGA  Yoshiki KAYANO  Tasuku TAKAGI  Hiroshi INOUE  

     
    BRIEF PAPER

    Vol: E93-C No:9  Page(s): 1456-1459

    In order to clarify the physics of contact lifetime, this paper discusses the relationship between the heat capacity of the holder and the shape of the bridge (length and diameter). The AgPd60 alloy is chosen as the electrode material. Two holders with different heat capacities, consisting of a copper plate and a copper cylinder, are used. The shape of the bridge at low-speed breaking contact is observed using a high-speed digital camera. It is demonstrated that the shape of the bridge changes with the response and distribution of the temperature.

  • Enhancing the Robustness of the Posterior-Based Confidence Measures Using Entropy Information for Speech Recognition

    Yanqing SUN  Yu ZHOU  Qingwei ZHAO  Pengyuan ZHANG  Fuping PAN  Yonghong YAN  

     
    PAPER-Robust Speech Recognition

    Vol: E93-D No:9  Page(s): 2431-2439

    In this paper, the robustness of the posterior-based confidence measures is improved by utilizing entropy information, which is calculated for speech-unit-level posteriors using only the best recognition result, without requiring a larger computational load than conventional methods. Using different normalization methods, two posterior-based entropy confidence measures are proposed. Practical details are discussed for two typical levels of hidden Markov model (HMM)-based posterior confidence measures, and both levels are compared in terms of their performances. Experiments show that the entropy information results in significant improvements in the posterior-based confidence measures. The absolute improvements of the out-of-vocabulary (OOV) rejection rate are more than 20% for both the phoneme-level confidence measures and the state-level confidence measures for our embedded test sets, without a significant decline of the in-vocabulary accuracy.
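
    A minimal sketch, under assumed inputs, of an entropy-normalized confidence score computed from frame-level posteriors along the best recognition result; the paper's exact normalization may differ.

```python
import numpy as np

def entropy_confidence(posteriors):
    """Confidence of a recognized unit from its frame-level posterior vectors.

    posteriors: (num_frames, num_units) rows sum to 1 (e.g. phoneme or state
    posteriors along the best path).  The entropy of each frame's distribution
    is normalized by log(num_units) so the measure lies in [0, 1]; a high
    entropy (uncertain distribution) maps to a low confidence.
    """
    p = np.clip(np.asarray(posteriors, dtype=float), 1e-12, 1.0)
    frame_entropy = -(p * np.log(p)).sum(axis=1) / np.log(p.shape[1])
    return 1.0 - frame_entropy.mean()

print(entropy_confidence([[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]))
```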

  • Characteristics of Break Arcs Driven by Transverse Magnetic Field in a DC High-Voltage Resistive Circuit

    Tomohiro ATSUMI  Junya SEKIKAWA  Takayoshi KUBONO  

     
    PAPER

    Vol: E93-C No:9  Page(s): 1393-1398

    Break arcs are generated between pure silver electrical contacts in a DC high-voltage resistive circuit. The break arc is driven by the external magnetic field of a permanent magnet applied in the horizontal direction of the contacts. The electrical contacts are separated at a constant opening speed of 75 mm/s. The maximum supply voltage is 300 V. The maximum circuit current when the electrical contacts are closed is 20 A. The maximum output power of the supply is limited to 6.0 kW. The gap between the contacts and the magnet is defined as x and is varied from 2.5 mm to 10.0 mm to change the magnetic flux density that affects the break arc. The break arc is observed with a high-speed camera, and the effect of the magnetic field on the arc duration is examined. As a result, break arcs are successfully extinguished by the transverse magnetic field when the gap x is 2.5 mm. Then the arc length L just before the break arc begins to lengthen and the Lorentz force F acting on the break arc are examined. The length L is almost constant for each gap x and independent of the circuit current I and the Lorentz force F. The break arc is driven by the magnetic field when the arc length reaches a certain length that is determined by the strength of the magnetic flux density.

  • Speech Recognition under Multiple Noise Environment Based on Multi-Mixture HMM and Weight Optimization by the Aspect Model

    Seong-Jun HAHM  Yuichi OHKAWA  Masashi ITO  Motoyuki SUZUKI  Akinori ITO  Shozo MAKINO  

     
    PAPER-Robust Speech Recognition

    Vol: E93-D No:9  Page(s): 2407-2416

    In this paper, we propose an acoustic model that is robust to multiple noise environments, as well as a method for adapting the acoustic model to an environment to improve it. The model, called "the multi-mixture model," is based on a mixture of different HMMs, each of which is trained using speech under different noise conditions. Speech recognition experiments showed that the proposed model performs better than the conventional multi-condition model. The method for adaptation is based on the aspect model, which is a "mixture-of-mixture" model. To realize adaptation using an extremely small amount of adaptation data (i.e., a few seconds), we train a small number of mixture models, which can be interpreted as models for "clusters" of noise environments. Then, the models are mixed using weights that are determined according to the adaptation data. The experimental results showed that the adaptation based on the aspect model improved the word accuracy in a heavy noise environment and showed no performance deterioration under any noise condition, while the conventional methods either did not improve the performance or showed both improvement and degradation of recognition performance depending on the noise condition.
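
    The following sketch (hypothetical inputs) estimates aspect-model mixing weights over pre-trained noise-cluster models by EM from a few seconds of adaptation data, which is the general idea of the weight determination described above.

```python
import numpy as np

def estimate_aspect_weights(frame_logliks, num_iters=20):
    """Estimate aspect-model mixing weights from adaptation data by EM.

    frame_logliks: (num_frames, num_clusters) log-likelihood of each adaptation
    frame under each pre-trained noise-cluster model.
    Returns weights w with sum(w) == 1 maximizing
      sum_t log( sum_k w_k * exp(frame_logliks[t, k]) ).
    """
    L = np.asarray(frame_logliks, dtype=float)
    T, K = L.shape
    w = np.full(K, 1.0 / K)
    for _ in range(num_iters):
        # E-step: responsibility of each cluster for each frame
        log_joint = L + np.log(w)
        log_joint -= log_joint.max(axis=1, keepdims=True)
        resp = np.exp(log_joint)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weights are the average responsibilities
        w = resp.mean(axis=0)
    return w

# toy example: 3 clusters, the second fits the adaptation frames best
logliks = np.random.randn(50, 3) + np.array([0.0, 2.0, -1.0])
print(estimate_aspect_weights(logliks))
```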

  • A Hybrid Acoustic and Pronunciation Model Adaptation Approach for Non-native Speech Recognition

    Yoo Rhee OH  Hong Kook KIM  

     
    PAPER-Adaptation

    Vol: E93-D No:9  Page(s): 2379-2387

    In this paper, we propose a hybrid model adaptation approach in which pronunciation and acoustic models are adapted by incorporating the pronunciation and acoustic variabilities of non-native speech, in order to improve the performance of non-native automatic speech recognition (ASR). Specifically, the proposed hybrid model adaptation can be performed at either the state-tying or the triphone-modeling level, depending on the level at which acoustic model adaptation is performed. In both methods, we first analyze the pronunciation variant rules of non-native speakers and then classify each rule as either a pronunciation variant or an acoustic variant. The state-tying level hybrid method then adapts the pronunciation models by accommodating the pronunciation variants in the pronunciation dictionary, and adapts the acoustic models by clustering the states of the triphone acoustic models using the acoustic variants. On the other hand, the triphone-modeling level hybrid method initially adapts the pronunciation models in the same way as the state-tying level hybrid method; for the acoustic model adaptation, however, the triphone acoustic models are re-estimated based on the adapted pronunciation models, and the states of the re-estimated triphone acoustic models are clustered using the acoustic variants. Korean-spoken English speech recognition experiments show that ASR systems employing the state-tying and triphone-modeling level adaptation methods reduce the average word error rate (WER) for non-native speech by 17.1% and 22.1% relative, respectively, compared to a baseline ASR system.
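
    As a toy illustration of accommodating pronunciation variants in the dictionary, the sketch below expands a pronunciation lexicon with variant rules classified as pronunciation variants; the word, phone set, and rule are hypothetical.

```python
def expand_dictionary(lexicon, variant_rules):
    """Add non-native pronunciation variants to a pronunciation dictionary.

    lexicon      : dict  word -> list of pronunciations (each a list of phones)
    variant_rules: dict  canonical phone -> list of variant phones
                   (only rules classified as *pronunciation* variants;
                   acoustic variants would instead drive state clustering)
    """
    expanded = {}
    for word, prons in lexicon.items():
        new_prons = [list(p) for p in prons]
        for pron in prons:
            for i, phone in enumerate(pron):
                for variant in variant_rules.get(phone, []):
                    alt = list(pron)
                    alt[i] = variant
                    if alt not in new_prons:
                        new_prons.append(alt)
        expanded[word] = new_prons
    return expanded

# toy example: a hypothetical l/r substitution rule
lexicon = {"rice": [["r", "ay", "s"]]}
rules = {"r": ["l"]}
print(expand_dictionary(lexicon, rules))
```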
