IEICE global.ieice.org Site

Author Search Result

[Author] Nam Soo KIM(20hit)

Estimation of Phone Mismatch Penalty Matrices for Two-Stage Keyword Spotting
Chang Woo HAN Shin Jae KANG Nam Soo KIM

LETTER-Speech and Hearing

Vol: E93-D No:8
Page(s): 2331-2335
In this letter, we propose a novel approach to estimate three different kinds of phone mismatch penalty matrices for two-stage keyword spotting. When the output of a phone recognizer is given, detection of a specific keyword is carried out through text matching with the phone sequences provided by the specified keyword using the proposed phone mismatch penalty matrices. The penalty matrices associated with substitution, insertion and deletion errors are estimated from the training data through deliberate error generation. The proposed approach has shown a significant improvement in a Korean continuous speech recognition task.
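The text-matching stage described above can be illustrated with a weighted edit distance. This is only a minimal sketch, not the authors' method: the function name `keyword_match_cost`, the toy phone set, and all penalty values are hypothetical, and insertion and deletion penalties are simplified to one value per phone rather than full matrices estimated from training data.

```python
def keyword_match_cost(keyword, recognized, sub_pen, ins_pen, del_pen):
    """Minimum total penalty for aligning keyword phones with recognized phones."""
    n, m = len(keyword), len(recognized)
    # cost[i][j]: best penalty for aligning keyword[:i] with recognized[:j]
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + del_pen[keyword[i - 1]]
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + ins_pen[recognized[j - 1]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            k, r = keyword[i - 1], recognized[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub_pen[k][r],  # substitution (or match)
                cost[i - 1][j] + del_pen[k],         # keyword phone deleted
                cost[i][j - 1] + ins_pen[r],         # extra phone inserted
            )
    return cost[n][m]


# Hypothetical toy penalties: matches cost 0, most errors cost 1,
# and the confusable pair t -> d is assumed to be cheap.
phones = ["k", "a", "t", "d"]
sub_pen = {p: {q: (0.0 if p == q else 1.0) for q in phones} for p in phones}
sub_pen["t"]["d"] = 0.3
ins_pen = {p: 1.0 for p in phones}
del_pen = {p: 1.0 for p in phones}

exact_cost = keyword_match_cost(["k", "a", "t"], ["k", "a", "t"], sub_pen, ins_pen, del_pen)
confused_cost = keyword_match_cost(["k", "a", "t"], ["k", "a", "d"], sub_pen, ins_pen, del_pen)
deleted_cost = keyword_match_cost(["k", "a", "t"], ["k", "a"], sub_pen, ins_pen, del_pen)
```

A keyword would then be declared detected when the cost falls below some threshold. The abstract does not specify that decision rule, so it is left out here.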
Outlier Detection and Removal for HMM-Based Speech Synthesis with an Insufficient Speech Database
Doo Hwa HONG June Sig SUNG Kyung Hwan OH Nam Soo KIM

LETTER-Speech and Hearing

Vol: E95-D No:9
Page(s): 2351-2354
Decision tree-based clustering and parameter estimation are essential steps in the training part of an HMM-based speech synthesis system. These two steps are usually performed based on the maximum likelihood (ML) criterion. However, one of the drawbacks of the ML criterion is that it is sensitive to outliers which usually result in quality degradation of the synthesized speech. In this letter, we propose an approach to detect and remove outliers for HMM-based speech synthesis. Experimental results show that the proposed approach can improve the synthetic speech, particularly when the available training speech database is insufficient.
On Detecting Target Acoustic Signals Based on Non-negative Matrix Factorization
Yu Gwang JIN Nam Soo KIM

LETTER-Pattern Recognition

Vol: E93-D No:4
Page(s): 922-925
In this paper, we propose a novel target acoustic signal detection approach which is based on non-negative matrix factorization (NMF). Target basis vectors are trained from the target signal database through NMF, and input vectors are projected onto the subspace spanned by these target basis vectors. By analyzing the distribution of time-varying normalized projection error, the optimal threshold can be calculated to detect the target signal intervals during the entire input signal. Experimental results show that the proposed algorithm can detect the target signal successfully under various signal environments.
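The projection step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published algorithm: the basis vectors are taken as already trained, the activations are found with the standard multiplicative update for Euclidean NMF with a fixed basis, and the function name `projection_error` and the fixed threshold of 0.5 are hypothetical (the letter derives its threshold from the distribution of the error).

```python
import math


def projection_error(v, basis, iters=200, eps=1e-12):
    """Normalized error of approximating v by a nonnegative mix of basis vectors."""
    dim, k = len(v), len(basis)
    h = [1.0] * k  # nonnegative activations, one per basis vector
    # Precompute W^T v and W^T W, where the columns of W are the basis vectors.
    wtv = [sum(b[d] * v[d] for d in range(dim)) for b in basis]
    wtw = [[sum(bi[d] * bj[d] for d in range(dim)) for bj in basis] for bi in basis]
    for _ in range(iters):
        # Multiplicative update keeps every activation nonnegative.
        h = [
            h[i] * wtv[i] / (sum(wtw[i][j] * h[j] for j in range(k)) + eps)
            for i in range(k)
        ]
    approx = [sum(h[i] * basis[i][d] for i in range(k)) for d in range(dim)]
    err = math.sqrt(sum((v[d] - approx[d]) ** 2 for d in range(dim)))
    return err / (math.sqrt(sum(x * x for x in v)) + eps)


# Hypothetical target basis spanning the first two dimensions only.
target_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
THRESHOLD = 0.5  # assumed fixed value, for illustration only

err_target = projection_error([2.0, 3.0, 0.0], target_basis)
err_other = projection_error([0.0, 0.0, 5.0], target_basis)
is_target = err_target < THRESHOLD
is_other_target = err_other < THRESHOLD
```

Frames that the target basis reconstructs well give an error near 0, while frames outside the target subspace give an error near 1, which is what makes a threshold on this quantity usable for detection.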
Improved Frame Mode Selection for AMR-WB+ Based on Decision Tree
Jong Kyu KIM Nam Soo KIM

LETTER-Speech and Hearing

Vol: E91-D No:6
Page(s): 1830-1833
In this letter, we propose a coding mode selection method for the AMR-WB+ audio coder based on a decision tree. In order to reduce computation while maintaining good performance, a decision tree classifier is adopted with the closed-loop mode selection results as the target classification labels. The size of the decision tree is controlled by pruning, so the proposed method does not increase the memory requirement significantly. Through an evaluation test on a database covering both speech and music materials, the proposed method is found to achieve much better mode selection accuracy than the open-loop mode selection module in AMR-WB+.
Improved Global Soft Decision Using Smoothed Global Likelihood Ratio for Speech Enhancement
Joon-Hyuk CHANG Dong Seok JEONG Nam Soo KIM Sangki KANG

LETTER-Multimedia Systems for Communications

Vol: E90-B No:8
Page(s): 2186-2189
In this letter, we propose an improved global soft decision for noisy speech enhancement. From an investigation of statistical model-based speech enhancement, it is discovered that a global soft decision has a fundamental drawback at the tail regions of speech signals. For that reason, we propose a new solution based on a smoothed likelihood ratio for the global soft decision. The performance of the proposed method is evaluated by subjective tests under various environments and shows better results compared with our previous work.
Supervised Denoising Pre-Training for Robust ASR with DNN-HMM
Shin Jae KANG Kang Hyun LEE Nam Soo KIM

LETTER-Speech and Hearing

Publicized: 2015/09/07
Vol: E98-D No:12
Page(s): 2345-2348
In this letter, we propose a novel supervised pre-training technique for deep neural network (DNN)-hidden Markov model systems to achieve robust speech recognition in adverse environments. In the proposed approach, our aim is to initialize the DNN parameters such that they yield abstract features robust to acoustic environment variations. In order to achieve this, we first derive the abstract features from an early fine-tuned DNN model which is trained based on a clean speech database. By using the derived abstract features as the target values, the standard error back-propagation algorithm with the stochastic gradient descent method is performed to estimate the initial parameters of the DNN. The performance of the proposed algorithm was evaluated on Aurora-4 DB, and better results were observed compared to a number of conventional pre-training methods.
Feature Compensation with Model-Based Estimation for Noise Masking
Young Joon KIM Nam Soo KIM

LETTER-Speech and Hearing

Vol: E90-D No:2
Page(s): 603-605
In this letter, we propose a new approach to estimate the degree of noise masking based on a sophisticated model for clean speech distribution. This measure, named the noise masking probability (NMP), is incorporated into the feature compensation technique to achieve robust speech recognition in noisy environments. Experimental results show that the proposed approach improves the performance of the baseline recognition system in the presence of various background noises.
Frame Splitting Scheme for Error-Robust Audio Streaming over Packet-Switching Networks
Jong Kyu KIM Jung Su KIM Hwan Sik YUN Joon-Hyuk CHANG Nam Soo KIM

LETTER-Multimedia Systems for Communications

Vol: E91-B No:2
Page(s): 677-680
This letter presents a novel frame splitting scheme for error-robust audio streaming over packet-switching networks. In our approach to perceptual audio coding, an audio frame is split into several subframes based on the network configuration such that each packet can be decoded independently at the receiver. Through a subjective comparison category rating (CCR) test, it is found that our approach enhances the quality of the decoded audio signal in lossy packet-switching network environments.
Distorted Speech Rejection for Automatic Speech Recognition in Wireless Communication
Joon-Hyuk CHANG Nam Soo KIM

LETTER-Speech and Hearing

Vol: E87-D No:7
Page(s): 1978-1981
This letter introduces a pre-rejection technique for wireless channel distorted speech with application to automatic speech recognition (ASR). Based on analysis of distorted speech signals over a wireless communication channel, we propose a method to reject the channel distorted speech with a small computational load. A number of simulation results show that the pre-rejection algorithm enhances the robustness of speech recognition.
Speech Enhancement: New Approaches to Soft Decision
Joon-Hyuk CHANG Nam Soo KIM

PAPER-Speech and Hearing

Vol: E84-D No:9
Page(s): 1231-1240
In this paper, we propose new approaches to speech enhancement based on soft decision. In order to enhance the statistical reliability in estimating speech activity, we introduce the concept of a global speech absence probability (GSAP). First, we compute the conventional speech absence probability (SAP) and then modify it according to the newly proposed GSAP. The modification is made in such a way that the SAP takes the same value as the GSAP in the case of speech absence while it is kept at its original value when speech is present. Moreover, to improve the performance of the SAPs at voice tails (transition periods from speech to silence), we revise the SAPs using a hang-over scheme based on the hidden Markov model (HMM). In addition, we suggest a robust noise update algorithm in which the noise power is estimated not only in the periods of speech absence but also during speech activity based on soft decision. Also, to improve the SAP determination and noise update routines, we present a new signal-to-noise ratio (SNR) concept which is called the predicted SNR in this paper. Furthermore, we demonstrate that the discrete cosine transform (DCT) enhances the accuracy of the SAP estimation. A number of tests show that the proposed method, which is called the speech enhancement based on soft decision (SESD) algorithm, yields better performance than the conventional approaches.
Target Source Separation Based on Discriminative Nonnegative Matrix Factorization Incorporating Cross-Reconstruction Error
Kisoo KWON Jong Won SHIN Nam Soo KIM

LETTER-Speech and Hearing

Publicized: 2015/08/19
Vol: E98-D No:11
Page(s): 2017-2020
Nonnegative matrix factorization (NMF) is an unsupervised technique to represent nonnegative data as linear combinations of nonnegative bases, which has shown impressive performance for source separation. However, its source separation performance degrades when one signal can also be described well with the bases for the interfering source signals. In this paper, we propose a discriminative NMF (DNMF) algorithm which exploits the reconstruction error for the interfering signals as well as the target signal based on target bases. The objective function for training the bases is constructed so as to yield high reconstruction error for the interfering source signals while guaranteeing low reconstruction error for the target source signals. Experiments show that the proposed method outperformed the standard NMF and another DNMF method in terms of both the perceptual evaluation of speech quality score and signal-to-distortion ratio in various noisy environments.
Speech Enhancement Based on Data-Driven Residual Gain Estimation
Yu Gwang JIN Nam Soo KIM Joon-Hyuk CHANG

LETTER-Speech and Hearing

Vol: E94-D No:12
Page(s): 2537-2540
In this letter, we propose a novel speech enhancement algorithm based on data-driven residual gain estimation. The entire system consists of two stages. At the first stage, a conventional speech enhancement algorithm enhances the input signal while estimating several signal-to-noise ratio (SNR)-related parameters. The residual gain, which is estimated by a data-driven method, is applied to further enhance the signal at the second stage. A number of experimental results show that the proposed speech enhancement algorithm outperforms the conventional speech enhancement technique based on soft decision and the data-driven approach using an SNR grid look-up table.
Study of Prominence Detection Based on Various Phone-Specific Features
Sung Soo KIM Chang Woo HAN Nam Soo KIM

LETTER-Speech and Hearing

Vol: E93-D No:8
Page(s): 2327-2330
In this letter, we present useful features accounting for pronunciation prominence and propose a classification technique for prominence detection. A set of phone-specific features is extracted based on a forced alignment of the test pronunciation provided by a speech recognition system. These features are then applied to traditional classifiers such as the support vector machine (SVM), artificial neural network (ANN) and adaptive boosting (AdaBoost) for detecting the place of prominence.
Implementation of HMM-Based Human Activity Recognition Using Single Triaxial Accelerometer
Chang Woo HAN Shin Jae KANG Nam Soo KIM

LETTER-Digital Signal Processing

Vol: E93-A No:7
Page(s): 1379-1383
In this letter, we propose a novel approach to human activity recognition. We present a class of features that are robust to the tilt of the attached sensor module and a state transition model suitable for HMM-based activity recognition. In addition, postprocessing techniques are applied to stabilize the recognition results. The proposed approach shows significant improvements in recognition experiments over a variety of human activity databases.
Speech Enhancement Based on Perceptually Comfortable Residual Noise
Jong Won SHIN Joon-Hyuk CHANG Nam Soo KIM

LETTER-Multimedia Systems for Communications

Vol: E90-B No:11
Page(s): 3323-3326
In this letter, we propose a novel approach to speech enhancement, which incorporates a new criterion based on residual noise shaping. In the proposed approach, our goal is to make the residual noise perceptually comfortable instead of making it less audible. A predetermined 'comfort noise' is provided as a target for the spectral shaping. Based on some assumptions, the resulting spectral gain function turns out to be a slight modification of the Wiener filter while requiring very low computational complexity. A subjective listening test shows that the proposed algorithm outperforms the conventional spectral enhancement technique based on soft decision and the noise suppression implemented in the IS-893 Selectable Mode Vocoder.
Spectral Magnitude Adjustment for MCLT-Based Acoustic Data Transmission
Hwan Sik YUN Kiho CHO Nam Soo KIM

LETTER-Information Network

Vol: E95-D No:5
Page(s): 1523-1526
Acoustic data transmission is a technique which embeds data in a sound wave imperceptibly and detects it at a receiver. The data are embedded in an original audio signal and transmitted through the air by playing back the data-embedded audio using a loudspeaker. At the receiver, the data are extracted from the received audio signal captured by a microphone. In our previous work, we proposed an acoustic data transmission system designed based on phase modification of the modulated complex lapped transform (MCLT) coefficients. In this paper, we propose the spectral magnitude adjustment (SMA) technique which not only enhances the quality of the data-embedded audio signal but also improves the transmission performance of the system.
A Statistical Model-Based V/UV Decision under Background Noise Environments
Joon-Hyuk CHANG Nam Soo KIM Sanjit K. MITRA

LETTER-Speech and Hearing

Vol: E87-D No:12
Page(s): 2885-2887
In this letter, we propose an approach to incorporate a statistical model for the voiced/unvoiced (V/UV) speech decision under background noise environments. Our approach consists of splitting the input noisy speech into two separate bands and applying a statistical model for each band. We compute and compare the likelihood ratio (LR) for each band based on the statistical model and estimated noise statistics for the V/UV decision. According to the simulation test, the proposed V/UV decision shows a better performance compared with the selectable mode vocoder (SMV) V/UV decision algorithm, particularly in clean and white noise environments.
DNN-Based Voice Activity Detection with Multi-Task Learning
Tae Gyoon KANG Nam Soo KIM

LETTER-Speech and Hearing

Publicized: 2015/10/26
Vol: E99-D No:2
Page(s): 550-553
Recently, notable improvements in the voice activity detection (VAD) problem have been achieved by adopting several machine learning techniques. Among them, the deep neural network (DNN), which learns the mapping between the noisy speech features and the corresponding voice activity status with its deep hidden structure, has been one of the most popular techniques. In this letter, we propose a novel approach which enhances the robustness of the DNN in mismatched noise conditions with a multi-task learning (MTL) framework. In the proposed algorithm, a feature enhancement task for speech features is jointly trained with the conventional VAD task. The experimental results show that the DNN with the proposed framework outperforms the conventional DNN-based VAD algorithm.
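Joint training of a VAD task and a feature enhancement task is commonly expressed as a weighted sum of two losses. The sketch below shows one such per-frame objective. It is an assumption-laden illustration only: the abstract does not state the loss functions or the weighting, so the binary cross-entropy term, the mean squared error term, the `weight` parameter, and the function name `multitask_loss` are all hypothetical choices.

```python
import math


def multitask_loss(vad_prob, vad_label, enhanced, clean, weight=0.5, eps=1e-12):
    """Per-frame loss: VAD cross-entropy plus weighted feature-enhancement MSE."""
    # Binary cross-entropy between predicted speech probability and the 0/1 label.
    bce = -(
        vad_label * math.log(vad_prob + eps)
        + (1 - vad_label) * math.log(1.0 - vad_prob + eps)
    )
    # Mean squared error between enhanced features and clean reference features.
    mse = sum((e - c) ** 2 for e, c in zip(enhanced, clean)) / len(clean)
    return bce + weight * mse


# Toy frame: the VAD output is confident and correct in both cases;
# only the enhancement output differs.
loss_perfect_enh = multitask_loss(0.9, 1, [1.0, 2.0], [1.0, 2.0])
loss_poor_enh = multitask_loss(0.9, 1, [1.0, 2.0], [1.0, 4.0])
```

In an actual network both terms would be computed from separate output layers on top of shared hidden layers, so that the enhancement gradient also shapes the representation used for VAD.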
Statistical Approaches to Excitation Modeling in HMM-Based Speech Synthesis
June Sig SUNG Doo Hwa HONG Hyun Woo KOO Nam Soo KIM

LETTER-Speech and Hearing

Vol: E96-D No:2
Page(s): 379-382
In our previous study, we proposed the waveform interpolation (WI) approach to model the excitation signals for hidden Markov model (HMM)-based speech synthesis. This letter presents several techniques to improve excitation modeling within the WI framework. We propose both the time domain and frequency domain zero padding techniques to reduce the spectral distortion inherent in the synthesized excitation signal. Furthermore, we apply non-negative matrix factorization (NMF) to obtain a low-dimensional representation of the excitation signals. From a number of experiments, including a subjective listening test, the proposed method has been found to enhance the performance of the conventional excitation modeling techniques.
Computationally Efficient Cepstral Domain Feature Compensation
Woohyung LIM Chang Woo HAN Nam Soo KIM

LETTER-Speech and Hearing

Vol: E92-D No:1
Page(s): 86-89
In this letter, we propose a novel approach to feature compensation performed in the cepstral domain. Processing in the cepstral domain has the advantage that the spectral correlation among different frequencies is taken into consideration. By introducing a linear approximation with a diagonal covariance assumption, we modify the conventional log-spectral domain feature compensation technique to fit the cepstral domain. The proposed approach shows significant improvements in the AURORA2 speech recognition task.
