Hiroyuki EHARA Kazutoshi YASUNAGA Koji YOSHIDA Yusuke HIWASAKI Kazunori MANO Takao KANEKO
This paper presents a newly developed noise post-processing (NPP) algorithm and the results of several tests demonstrating its subjective performance. The NPP algorithm is designed to improve the subjective performance of low bit-rate code excited linear prediction (CELP) decoding under background noise conditions. It is based on a stationary noise generator and improves the subjective quality of noisy input signals. A backward adaptive detector identifies noisy input signal frames from decoded line spectral frequency (LSF), energy, and pitch parameters. The noise generator estimates and produces stationary noise signals using past LSF and energy parameters. The stationary noise generator has a frame erasure concealment (FEC) scheme designed for stationary noise signals and therefore improves the speech decoder's robustness against frame erasure under background noise conditions. The algorithm has been applied to the following CELP decoders: 1) a candidate algorithm of the ITU-T 4-kbit/s speech coding standard and 2) existing ITU-T standards, the G.729 and G.723.1 series. In both cases, NPP improved the subjective performance of the baseline decoders. Our subjective tests demonstrated improvements of approximately 0.25 CMOS (CCR MOS: comparison category rating mean opinion score) when NPP was applied to the 4-kbit/s decoder and around 0.2-0.8 DMOS (DCR MOS: degradation category rating mean opinion score) when it was applied to the G.729/G.723.1 decoders. Other test results show that NPP improves the subjective performance of a G.729 decoder by around 0.45 DMOS under both error-free and frame-erasure conditions, and that the FEC scheme in the noise generator achieves a further improvement of around 0.2 DMOS.
Yusuke HIWASAKI Kazunori MANO Kazutoshi YASUNAGA Toshiyuki MORII Hiroyuki EHARA Takao KANEKO
This paper presents an efficient LSP quantizer implementation for low bit-rate coders. The major feature of the quantizer is that it uses a truncated cepstral distance criterion for the code selection procedure. This approach has generally been considered too computationally costly. We utilized the quantizer with a moving-average predictor, two-stage-split vector quantizer and delayed decision. We have investigated the optimal parameter settings in this case and incorporated the quantizer thus obtained into an ITU-T 4-kbit/s speech coding candidate algorithm with a bit budget of 21 bits. The objective performance is better than that with a conventional weighted mean-square criterion, while the complexity is still kept to a reasonable level. The paper also describes the codebook design and techniques that were employed to achieve robustness in noisy channel conditions.
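As a rough illustration of code selection under a truncated cepstral distance, the sketch below converts linear prediction coefficients to cepstral coefficients with the standard recursion and picks the candidate whose first `n_ceps` cepstral coefficients lie closest to the target. It is a minimal sketch under stated assumptions, not the quantizer described in the abstract: the function names, the unweighted squared distance, and the use of LPC vectors in place of LSP codebook entries are all hypothetical, and the LSP-to-LPC conversion, the moving-average predictor, the two-stage-split structure, and delayed decision are omitted.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c_1..c_n_ceps of 1/A(z), where A(z) = 1 + sum_k a[k-1] * z^-k."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] is unused (gain term)
    for n in range(1, n_ceps + 1):
        acc = 0.0
        # Only terms with 1 <= n - k <= p contribute to the recursion.
        for k in range(max(1, n - p), n):
            acc += k * c[k] * a[n - k - 1]
        a_n = a[n - 1] if n <= p else 0.0
        c[n] = -a_n - acc / n
    return c[1:]


def truncated_cepstral_distance(a_ref, a_cand, n_ceps):
    """Squared distance over the first n_ceps cepstral coefficients (assumed unweighted)."""
    c_ref = lpc_to_cepstrum(a_ref, n_ceps)
    c_cand = lpc_to_cepstrum(a_cand, n_ceps)
    return sum((x - y) ** 2 for x, y in zip(c_ref, c_cand))


def select_code(a_ref, candidates, n_ceps):
    """Index of the candidate LPC vector closest to a_ref in truncated cepstral distance."""
    distances = [truncated_cepstral_distance(a_ref, cand, n_ceps) for cand in candidates]
    return min(range(len(candidates)), key=lambda i: distances[i])


# Illustrative first-order example (synthetic values).
best = select_code([-0.5], [[-0.1], [-0.45], [0.3]], n_ceps=8)
```

Truncating the cepstrum bounds the cost of each distance evaluation, which is the part of the search that the abstract describes as generally considered too costly.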
Shoichi KOYAMA Ken'ichi FURUYA Hisashi UEMATSU Yusuke HIWASAKI Yoichi HANEDA
A new real-time sound field transmission system is presented. To construct this system, a large listening area needs to be reproduced at not less than a constant height. Additionally, the driving signals of the loudspeakers should be obtained only from received signals of microphones. Wave field reconstruction (WFR) filtering for linear arrays of microphones and loudspeakers is considered to be suitable for this kind of system. An experimental system was developed to show the feasibility of real-time sound field transmission using the WFR filter. Experiments to measure the reproduced sound field and a subjective listening test of sound localization were conducted to evaluate the proposed system. Although the reproduced sound field included several artifacts such as spatial aliasing and faster amplitude decay, the experimental results indicated that the proposed system was able to provide sound localization accuracy for virtual sound sources comparable to that for real sound sources in a large listening area.
Yusuke HIWASAKI Toru MORINAGA Jotaro IKEDO Akitoshi KATAOKA
This paper presents a way of using a linear regression model to produce a single-valued criterion that indicates the perceived importance of each block in a stream of speech blocks. This method is superior to the conventional approach, voice activity detection (VAD), in that it provides a dynamically changing priority value for speech segments with finer granularity. The approach can be used in conjunction with scalable speech coding techniques in the context of IP QoS services to achieve a flexible form of quality control for speech transmission. A simple linear regression model is used to estimate a mean opinion score (MOS) of the various cases of missing speech segments. The estimated MOS is a continuous value that can be mapped to priority levels with arbitrary granularity. Through subjective evaluation, we show the validity of the calculated priority values.
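The idea of estimating a MOS by linear regression and mapping it to priority levels can be sketched as follows. This is only an illustration under assumptions: the single predictor (a per-block energy value in dB), the training numbers, and all function names are hypothetical and synthetic, and the abstract does not state which features the actual model uses.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def estimate_mos(model, x):
    """Predicted MOS when the block is missing, clamped to the 1..5 MOS scale."""
    a, b = model
    return min(5.0, max(1.0, a + b * x))


def mos_to_priority(mos, levels):
    """Map the MOS-if-lost to a priority in 0..levels-1.

    A lower MOS means that losing the block hurts more, so the block
    receives a higher priority value.
    """
    frac = (5.0 - mos) / 4.0
    return min(levels - 1, int(frac * levels))


# Synthetic training data: block energy in dB (hypothetical feature)
# against the MOS observed when that block is dropped.
energies_db = [-60.0, -40.0, -20.0, 0.0]
mos_if_lost = [4.5, 4.0, 3.0, 2.5]
model = fit_line(energies_db, mos_if_lost)

priority = mos_to_priority(estimate_mos(model, -10.0), levels=4)
```

Because the estimate is continuous, the number of priority levels is a free parameter of the mapping, unlike the two-state decision of a VAD.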
Masahiro FUKUI Shigeaki SASAKI Yusuke HIWASAKI Kimitaka TSUTSUMI Sachiko KURIHARA Hitoshi OHMURO Yoichi HANEDA
We propose a new adaptive spectral masking method of algebraic vector quantization (AVQ) for non-sparse signals in the modified discrete cosine transform (MDCT) domain. This paper also proposes switching the adaptive spectral masking on and off depending on whether or not the target signal is non-sparse. The switching decision is based on the results of MDCT-domain sparseness analysis. When the target signal is categorized as non-sparse, the masking level of the target MDCT coefficients is adaptively controlled using spectral envelope information. The performance of the proposed method, as a part of ITU-T G.711.1 Annex D, is evaluated in comparison with conventional AVQ. Subjective listening test results showed that the proposed method improves sound quality by more than 0.1 points on a five-point scale on average for speech, music, and mixed content, which indicates significant improvement.
Yusuke HIWASAKI Hitoshi OHMURO Takeshi MORI Sachiko KURIHARA Akitoshi KATAOKA
This paper proposes a wideband speech coder in which a G.711 bitstream is embedded. This coder has an advantage over conventional coders in that it has high interoperability with existing terminals, so costly transcoding involving decoding and re-encoding can be avoided. We also propose a partial mixing method that effectively reduces the mixing complexity in multiple-point remote conferences. To reduce the complexity, we take advantage of the scalable structure of the bitstream and mix only the lower band of the signal. For the higher band, the main speaker location is selected among remote locations and is redistributed with the mixed lower-band signal. Through subjective evaluations, we show that the speech quality can be maintained even when the speech signals are partially mixed.
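A minimal sketch of the partial mixing idea follows: lower-band frames from all locations are summed, while the higher band is taken unchanged from a single main-speaker location. Several points are assumptions made for illustration only: the function name is hypothetical, the frames are plain lists of samples rather than layers of a scalable bitstream, and the main speaker is chosen by lower-band frame energy, a criterion the abstract does not specify.

```python
def partial_mix(low_bands, high_bands):
    """Mix one frame from several locations.

    low_bands, high_bands: one list of samples per location, all of equal length.
    Returns (mixed_low, selected_high, main_index).
    """
    n = len(low_bands[0])
    # Full mixing is performed on the lower band only.
    mixed_low = [sum(sig[i] for sig in low_bands) for i in range(n)]
    # Assumed criterion: the location with the most lower-band energy
    # in this frame is treated as the main speaker.
    energies = [sum(x * x for x in sig) for sig in low_bands]
    main_index = max(range(len(low_bands)), key=lambda j: energies[j])
    # The higher band is forwarded from the main speaker without mixing.
    selected_high = list(high_bands[main_index])
    return mixed_low, selected_high, main_index


# Two locations, two samples per frame (synthetic values).
mixed_low, selected_high, main_index = partial_mix(
    [[1, 1], [3, -3]],
    [[10, 10], [20, 20]],
)
```

Forwarding the higher band untouched means that only the lower band has to be processed for mixing, which is where the complexity reduction described above comes from.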