This study investigates a band extension technique for speech data encoded with G.711, the most common codec for digital speech communications systems such as VoIP. The proposed technique employs steganography to transmit the side information required for band extension. Because the side information is hidden in the speech data itself, the proposed technique can enhance speech quality without increasing the amount of transmitted data. The results of a subjective experiment indicate that the proposed technique may potentially improve speech quality compared with the conventional technique.
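The abstract does not say how the side information is hidden. As a minimal sketch, assuming simple least-significant-bit (LSB) substitution in the 8-bit G.711 codewords, the following Python shows how a few side-information bits (here a hypothetical 4-bit high-band gain index) can travel inside a frame without enlarging it. The function names and the gain index are illustrative only, not the authors' method.

```python
from typing import List

def embed_side_info(codewords: bytes, bits: List[int]) -> bytes:
    """Hide `bits` in the LSBs of the first len(bits) G.711 codewords (assumed scheme)."""
    if len(bits) > len(codewords):
        raise ValueError("frame too short to carry the side information")
    out = bytearray(codewords)
    for i, bit in enumerate(bits):
        # Only bit 0 (the mantissa LSB) is touched, so the decoded sample
        # moves by at most one quantization step of its segment.
        out[i] = (out[i] & 0xFE) | (bit & 1)
    return bytes(out)

def extract_side_info(codewords: bytes, n_bits: int) -> List[int]:
    """Read back the first n_bits hidden bits."""
    return [c & 1 for c in codewords[:n_bits]]

def to_bits(value: int, width: int) -> List[int]:
    """Big-endian bit list, e.g. to_bits(11, 4) -> [1, 0, 1, 1]."""
    return [(value >> (width - 1 - k)) & 1 for k in range(width)]

def from_bits(bits: List[int]) -> int:
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

# Usage: a 20 ms frame at 8 kHz is 160 one-byte codewords.
frame = bytes(range(160))
gain_index = 11  # hypothetical 4-bit quantized high-band gain
stego = embed_side_info(frame, to_bits(gain_index, 4))
```

The packet length is unchanged, which is the property that lets such a scheme remain compatible with ordinary G.711 receivers; a receiver unaware of the hidden bits simply decodes slightly perturbed samples.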
A high-quality speech synthesis technique based on wavelet subband analysis of speech signals was devised to enhance the naturalness of synthesized voiced consonants. The technique reproduces a characteristic of voiced consonant speech, which exhibits prominent unvoiced features in the high-frequency subbands. To mix the unvoiced features appropriately into voiced speech, a noise inclusion procedure employing the discrete wavelet transform was proposed. This paper also describes a speech synthesizer that employs several random fractal techniques, which were used mainly to enhance the naturalness of synthesized purely voiced speech. Three types of fluctuation were treated in the synthesizer: (1) pitch period fluctuation, (2) amplitude fluctuation, and (3) waveform fluctuation. In addition, a triangular pulse was used instead of a normal impulse train as a simple model of the glottal excitation pulse. Because the spectrum of the triangular pulse falls off more steeply than the -6 dB/oct characteristic required for the glottal excitation pulse, the random fractal interpolation technique was applied to compensate for this degraded frequency characteristic. Psychoacoustic experiments were carried out to evaluate the developed speech synthesis system, focusing especially on how effectively the mixed excitation scheme contributed to the naturalness of voiced consonant speech. Although the proposed techniques are only minor modifications of the conventional LPC (linear predictive coding) speech synthesizer, the subjective evaluation suggested that the system could effectively improve the naturalness of synthesized speech, which tends to degrade in the conventional LPC speech synthesis scheme.
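The noise inclusion procedure is described only at a high level. A minimal sketch, assuming a Haar wavelet, a fixed number of decomposition levels, and Gaussian noise scaled to each subband's RMS (all hypothetical choices), illustrates the idea of adding an unvoiced component only to the high-frequency subbands of a voiced signal while leaving the low band untouched.

```python
import math
import random
from typing import List, Tuple

SQRT2 = math.sqrt(2.0)

def haar_step(x: List[float]) -> Tuple[List[float], List[float]]:
    """One level of the orthonormal Haar DWT: (approximation, detail)."""
    if len(x) % 2:
        raise ValueError("length must be even at every decomposition level")
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail

def inverse_haar_step(approx: List[float], detail: List[float]) -> List[float]:
    x: List[float] = []
    for a, d in zip(approx, detail):
        x.extend(((a + d) / SQRT2, (a - d) / SQRT2))
    return x

def include_noise(voiced: List[float], mix: float, levels: int,
                  rng: random.Random) -> List[float]:
    """Add Gaussian noise only to the `levels` highest-frequency detail subbands.

    Noise in each subband is scaled by that subband's RMS, so `mix` acts as a
    relative noise-to-signal amplitude ratio (a hypothetical mixing rule).
    """
    approx, details = list(voiced), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)  # details[0] is the highest band
    noisy = []
    for detail in details:
        rms = math.sqrt(sum(c * c for c in detail) / len(detail))
        noisy.append([c + mix * rms * rng.gauss(0.0, 1.0) for c in detail])
    x = approx  # the lowest band is left untouched
    for detail in reversed(noisy):  # reconstruct from the coarsest level up
        x = inverse_haar_step(x, detail)
    return x

# Usage: a stand-in "voiced" waveform; its length must be divisible by 2**levels.
voiced = [math.sin(2 * math.pi * k / 32) for k in range(256)]
mixed = include_noise(voiced, 0.5, levels=2, rng=random.Random(0))
```

With `mix = 0` the transform pair reconstructs the input exactly (up to rounding), which makes the effect of the noise term easy to isolate.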
Yuto MATSUNAGA Tetsuya KOJIMA Naofumi AOKI Yoshinori DOBASHI Tsuyoshi YAMAMOTO
We have proposed a novel concept of a digital watermarking technique for music data that focuses on the use of sound synthesis and sound effect techniques. This paper describes the details of our proposed technique, which employs the distortion effect, one of the most common sound effects, used especially for guitar and bass. It also reports experimental results evaluating the resistance of the proposed technique against several basic malicious attacks: MP3 coding, tempo alteration, pitch alteration, and high-pass filtering. The results demonstrate that the proposed technique potentially has appropriate resistance against these attacks, except for the high-pass filtering attack. A technique for increasing resistance against the high-pass filtering attack is also discussed.
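To illustrate the kind of effect involved, the following is a minimal sketch of a common soft-clipping distortion curve, y = tanh(drive·x)/tanh(drive). The tanh curve and the `drive` parameter are assumptions; the abstract does not specify the distortion model or how watermark information is embedded, and the sketch does not attempt to show the latter. Because the curve is odd-symmetric, it adds only odd harmonics to a pure tone.

```python
import math
from typing import List

def distortion(x: List[float], drive: float) -> List[float]:
    """Soft-clipping distortion y = tanh(drive * x) / tanh(drive).

    Dividing by tanh(drive) maps a full-scale input of +/-1 back to +/-1,
    so only the waveform shape (and hence the harmonic content) changes.
    Larger `drive` values clip harder.
    """
    if drive <= 0:
        raise ValueError("drive must be positive")
    norm = math.tanh(drive)
    return [math.tanh(drive * s) / norm for s in x]

# Usage: distort one 64-sample frame containing four cycles of a sine tone.
tone = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
distorted = distortion(tone, 3.0)
```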
This study investigates a band extension technique for narrow-band telephony speech. The proposed technique employs full-wave rectification, which nonlinearly generates high-band overtones from the low band. To improve on the conventional technique, this study investigates frame-by-frame gain control based on a gain parameter estimated from the narrow-band telephony speech. A subjective evaluation indicates that the proposed technique outperforms the conventional technique.
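Why rectification produces overtones can be checked numerically: |sin ωt| is periodic at twice the input frequency, so its spectrum contains 2ω, 4ω, and so on, but no energy at ω itself. The sketch below, assuming an 8 kHz sampling rate and a 256-sample frame, rectifies a 250 Hz tone and measures individual DFT bins. A practical band extender would also high-pass filter the rectified signal and apply the frame-by-frame gain described above; both steps are omitted here, and the function names are hypothetical.

```python
import cmath
import math
from typing import List

def full_wave_rectify(x: List[float]) -> List[float]:
    """|x|, with the DC offset that rectification creates removed."""
    r = [abs(s) for s in x]
    mean = sum(r) / len(r)
    return [v - mean for v in r]

def dft_bin_magnitude(x: List[float], k: int) -> float:
    """Magnitude of DFT bin k, normalized by the frame length."""
    n = len(x)
    return abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                   for t, s in enumerate(x))) / n

# Usage: at 8 kHz, a 256-sample frame has 31.25 Hz bins, so 250 Hz is bin 8.
n, k0 = 256, 8
tone = [math.sin(2 * math.pi * k0 * t / n) for t in range(n)]
rect = full_wave_rectify(tone)
# The tone has energy only in bin k0; the rectified signal has none there
# but a strong component in bin 2 * k0 (500 Hz).
```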
Hideaki TAMORI Naofumi AOKI Tsuyoshi YAMAMOTO
This paper suggests that a watermarking technique based on the number theoretic transform (NTT) may be employed effectively for detecting alterations to lossless digital master images. Owing to its fragility, the NTT-based technique is more sensitive to alterations than a technique based on the discrete Fourier transform (DFT).
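One way to see the fragility: the NTT works in exact modular integer arithmetic, so changing one pixel by even one gray level changes every transform coefficient exactly, with no rounding tolerance as in a floating-point DFT. The following minimal sketch uses a direct O(n²) NTT modulo the prime 998244353 with primitive root 3; the modulus, the row of pixel values, and the function name are assumptions for illustration, not the parameters of the proposed technique.

```python
from typing import List

P = 998244353  # prime; P - 1 = 119 * 2**23, so any length dividing P - 1 works
G = 3          # a primitive root modulo P

def ntt(a: List[int], inverse: bool = False) -> List[int]:
    """Direct O(n^2) number theoretic transform over the integers mod P."""
    n = len(a)
    if (P - 1) % n:
        raise ValueError("transform length must divide P - 1")
    w = pow(G, (P - 1) // n, P)  # primitive n-th root of unity mod P
    if inverse:
        w = pow(w, P - 2, P)     # modular inverse via Fermat's little theorem
    out = [sum(a[j] * pow(w, j * k, P) for j in range(n)) % P for k in range(n)]
    if inverse:
        n_inv = pow(n, P - 2, P)
        out = [v * n_inv % P for v in out]
    return out

row = [12, 200, 37, 90, 255, 0, 64, 128]  # one row of 8-bit pixels
coeffs = ntt(row)
tampered = row.copy()
tampered[3] += 1                            # a one-level change to one pixel
changed = sum(c != d for c, d in zip(coeffs, ntt(tampered)))
```

The inverse transform recovers the row exactly, and the one-level change alters all 8 coefficients, because each coefficient shifts by a nonzero power of the root of unity.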
Kosei OZEKI Naofumi AOKI Saki ANAZAWA Yoshinori DOBASHI Kenichi IKEDA Hiroshi YASUDA
This study has developed a system that performs data communications using the high-frequency bands of sound signals. Unlike radio communication systems that use advanced wireless devices, it requires only legacy devices such as the microphones and speakers employed in ordinary telephony systems. In this study, we have investigated whether a machine learning approach can improve the accuracy of identifying binary symbols exchanged through sound media. This paper describes experimental results evaluating the performance of the proposed technique, which employs a neural network as the classifier of binary symbols. The experimental results indicate that the proposed technique may be appropriate for designing an optimal classifier for the symbol identification task.
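The abstract does not give the modulation scheme. A minimal sketch, assuming on-off keying of an 18 kHz carrier at 44.1 kHz sampling with 10 ms symbols, shows how binary symbols can be carried in the high-frequency band and how a per-symbol feature (the Goertzel power at the carrier bin) can be extracted. The fixed threshold in `demodulate` is the kind of hand-tuned decision rule that a trained classifier such as the proposed neural network would replace; all names and parameters are hypothetical.

```python
import math
import random
from typing import List

FS = 44100       # sampling rate in Hz (assumed)
CARRIER = 18000  # high-frequency carrier in Hz (assumed)
N = 441          # samples per symbol, i.e. 10 ms (assumed)

def modulate(bits: List[int], amplitude: float = 0.1) -> List[float]:
    """On-off keying: a carrier burst for 1, silence for 0."""
    out: List[float] = []
    for b in bits:
        out.extend(amplitude * b * math.sin(2 * math.pi * CARRIER * t / FS)
                   for t in range(N))
    return out

def goertzel_power(frame: List[float], freq: float, fs: float) -> float:
    """Squared magnitude of the DFT bin nearest `freq`, via the Goertzel recursion."""
    n = len(frame)
    k = round(n * freq / fs)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s1 = s2 = 0.0
    for x in frame:
        s1, s2 = x + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def demodulate(signal: List[float], threshold: float) -> List[int]:
    """Fixed-threshold decision per symbol (the step a learned classifier would replace)."""
    return [int(goertzel_power(signal[i:i + N], CARRIER, FS) > threshold)
            for i in range(0, len(signal) - N + 1, N)]

# Usage: send eight bits through a channel with weak additive Gaussian noise.
bits = [1, 0, 1, 1, 0, 0, 1, 0]
rng = random.Random(0)
received = [s + rng.gauss(0.0, 0.01) for s in modulate(bits)]
threshold = 0.25 * (0.1 * N / 2) ** 2  # a quarter of the expected on-bin tone power
decoded = demodulate(received, threshold)
```

With these settings 18 kHz falls exactly on bin 180 of a 441-point frame, so an ideal tone burst yields a power of (A·N/2)², far above the noise floor. Over real speakers and microphones the feature distribution is much less clean, which is where a learned classifier can help.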
This study proposes a lossless steganography technique for G.711, the most common codec for digital speech communications systems such as VoIP. The proposed technique exploits the characteristics of G.711 to embed steganogram information without degrading the speech. This paper reports the embedding capacity of the proposed technique.
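The abstract does not state which characteristics of G.711 are exploited. One property of μ-law G.711 that permits lossless embedding is that two codewords, 0xFF and 0x7F, both decode to a linear value of 0, so the choice between them can carry one bit per zero-valued sample without changing the decoded speech. The minimal sketch below assumes this mechanism and a hypothetical bit mapping. Under this assumption, the capacity equals the number of zero-valued codewords, so it depends on the signal (for example, on how much near-silence a frame contains).

```python
from typing import List

def ulaw_decode(code: int) -> int:
    """Standard G.711 mu-law expansion of one 8-bit codeword to a linear sample."""
    u = ~code & 0xFF
    sign, exponent, mantissa = u & 0x80, (u >> 4) & 0x07, u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

POS_ZERO, NEG_ZERO = 0xFF, 0x7F  # both codewords decode to a linear value of 0

def capacity(codes: bytes) -> int:
    """Number of bits that can be hidden: one per zero-valued codeword."""
    return sum(c in (POS_ZERO, NEG_ZERO) for c in codes)

def embed(codes: bytes, bits: List[int]) -> bytes:
    """Re-select the zero codeword to encode each bit (hypothetical mapping)."""
    if len(bits) > capacity(codes):
        raise ValueError("not enough zero-valued samples")
    out, k = bytearray(codes), 0
    for i, c in enumerate(codes):
        if k == len(bits):
            break
        if c in (POS_ZERO, NEG_ZERO):
            out[i] = NEG_ZERO if bits[k] else POS_ZERO
            k += 1
    return bytes(out)

def extract(codes: bytes, n_bits: int) -> List[int]:
    return [int(c == NEG_ZERO) for c in codes if c in (POS_ZERO, NEG_ZERO)][:n_bits]

# Usage: a toy frame with four zero-valued samples.
codes = bytes([0xFF, 0x80, 0xFF, 0x12, 0x7F, 0xFF])
stego = embed(codes, [1, 0, 1])
```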
The naturalness of normal sustained vowels is considered to be attributable to the fluctuations observed in the steady part, where the speech signal is seemingly almost periodic. Two kinds of involuntary fluctuation always exist in the steady part of sustained vowels, even when they are phonated as steadily as possible: pitch period fluctuation and waveform fluctuation. In this study, frequency analyses of these fluctuations were conducted to investigate their general characteristics. The results suggested that the frequency characteristics of the fluctuations can be approximated as 1/f^β-like, which is regarded as a specific feature of random fractals. A procedure based on random fractal generation methods was therefore proposed to produce these fluctuations and thereby improve the voice quality of synthesized sustained vowels. A series of psychoacoustic experiments was also conducted to evaluate the proposed technique. The experimental results indicated that the proposed technique was effective in making synthesized sustained vowels sound human-like. Unlike sustained vowels synthesized with neither pitch period fluctuation nor waveform fluctuation, the synthesized sustained vowels that contained the fluctuations were not perceived as buzzer-like, which is the major problem with the voice quality of synthesized sustained vowels. However, it was also found that the mere presence of the fluctuations was not always an acoustic cue for the naturalness of normal sustained vowels: synthesized sustained vowels containing fluctuations whose frequency characteristics were the same as those of white noise were perceived as noise-like, which is not at all the voice quality of normal sustained vowels. The results of the psychoacoustic experiments indicated that the frequency characteristics of the fluctuations, which can be modeled as 1/f^β-like, were significant factors for the naturalness of normal sustained vowels.
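Random fractal (1/f^β) sequences can be generated in several ways; the sketch below uses spectral synthesis, which may differ from the generation method used in the study. Assuming power-law amplitudes k^(-β/2), uniformly random phases, and a direct inverse DFT, it returns a zero-mean, unit-RMS sequence that could modulate, for example, successive pitch periods. Setting β = 0 yields a flat (white) spectrum, which corresponds to the white-noise condition described above. The sequence length, β, and modulation depth in the usage line are hypothetical.

```python
import cmath
import math
import random
from typing import List

def fractal_fluctuation(n: int, beta: float, rng: random.Random) -> List[float]:
    """n samples (n even) with power spectrum ~ 1/f**beta, zero mean and unit RMS.

    Spectral synthesis: amplitude k**(-beta/2) and a random phase for each bin,
    with Hermitian symmetry so that the inverse DFT is real.
    """
    if n % 2 or n < 4:
        raise ValueError("n must be an even number >= 4")
    spectrum = [0j] * n  # DC and Nyquist bins stay 0, so the mean is 0
    for k in range(1, n // 2):
        value = k ** (-beta / 2.0) * cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi))
        spectrum[k] = value
        spectrum[n - k] = value.conjugate()
    x = [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real
         for t in range(n)]
    rms = math.sqrt(sum(v * v for v in x) / n)
    return [v / rms for v in x]

# Usage: jitter a 125-sample pitch period (about 8 ms at 16 kHz) by roughly 1 % RMS.
periods = [125 * (1 + 0.01 * v) for v in fractal_fluctuation(64, 1.0, random.Random(1))]
```

The direct inverse DFT keeps the sketch dependency-free; for long sequences an FFT would be the natural replacement.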
Noriko KOMAKI Naofumi AOKI Tsuyoshi YAMAMOTO
The speech quality of VoIP (Voice over Internet Protocol) may be degraded by transmission errors such as packet loss and delay, which are essentially unavoidable in best-effort communications. This study proposes an error concealment technique for such degradation that combines sender-based and receiver-based techniques. In the proposed technique, the sender-based side information required by the receiver-based technique is transmitted by using steganography, so that the datagrams remain completely compatible with the conventional VoIP format. The results of an objective evaluation indicate that the proposed technique may potentially improve speech quality compared with the conventional technique.
The speech quality of VoIP (Voice over Internet Protocol) may be degraded by transmission errors such as packet loss and delay, which are essentially unavoidable in best-effort communications. This study investigates an error concealment technique for such degradation based on a receiver-based technique called pitch waveform replication. To enhance the conventional technique, this study proposes a waveform reconstruction technique that also takes into account the pitch variation between the frames before and after the gap. The results of an objective evaluation indicate that the proposed technique may potentially improve speech quality compared with the conventional technique.
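As a point of reference for the conventional receiver-based technique, the following minimal sketch of pitch waveform replication estimates the pitch period from the samples before the gap by normalized autocorrelation and fills the gap by repeating the last pitch cycle. The default lag range of 20 to 160 samples (50 to 400 Hz at an assumed 8 kHz sampling rate) and the function names are hypothetical. The sketch omits both the overlap-add smoothing at frame boundaries and the proposed use of the frames after the gap to account for pitch variation.

```python
import math
from typing import List

def estimate_pitch(history: List[float], min_lag: int, max_lag: int) -> int:
    """Lag with the highest normalized autocorrelation within [min_lag, max_lag]."""
    best_lag, best_r = min_lag, -math.inf
    for lag in range(min_lag, min(max_lag, len(history) - 1) + 1):
        a, b = history[lag:], history[:-lag]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        r = num / den if den > 0 else 0.0
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag

def replicate_pitch_waveform(history: List[float], gap_len: int,
                             min_lag: int = 20, max_lag: int = 160) -> List[float]:
    """Fill a gap by repeating the last pitch cycle before it (backward frames only)."""
    period = estimate_pitch(history, min_lag, max_lag)
    cycle = history[-period:]
    return [cycle[i % period] for i in range(gap_len)]

# Usage: a synthetic periodic signal (period 57 samples) loses 160 samples.
period = 57
x = [math.sin(2 * math.pi * t / period) + 0.5 * math.sin(4 * math.pi * t / period)
     for t in range(560)]
history = x[:400]
filled = replicate_pitch_waveform(history, 160, min_lag=20, max_lag=100)
```

The sketch does not guard against octave errors, where a multiple of the true period scores as high as the period itself; the usage line sidesteps this by keeping the lag range below twice the period.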