This paper proposes a new method of phoneme power control for speech synthesis by rule. The innovation of this method lies in its use of the phoneme environment and the relationship between speech power and pitch frequency. First, the permissible threshold (PT) for power modification is measured by subjective experiments using power manipulated speech material. As a result, it is concluded that the PT of power modification is 4.1 dB. This experimental result is significant when discussing power control and gives a criterion for power control accuracy. Next, the relationship between speech power and pitch frequency is analyzed using a very large speech data base. The results show that the relationship between phoneme power and pitch frequency is affected by the kind of phoneme, the adjoining phonemes, rising or falling pitch, and initial or final position in the sentence. Finally, we propose that the phoneme power should be controlled by pitch frequency and phoneme environment. This proposal is implemented in a waveform concatenation type text-to-speech synthesizer. This new method yields an averaged root mean square error between real and estimated speech power of 2.17 dB. This value indicates that 94% of the estimated power values are within the permissible threshold of human perception.
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Copy
Kenzo ITOH, Tomohisa HIROKAWA, Hirokazu SATO, "Phoneme Power Control for Speech Synthesis" in IEICE TRANSACTIONS on Fundamentals,
vol. E76-A, no. 11, pp. 1911-1918, November 1993, doi: .
Abstract: This paper proposes a new method of phoneme power control for speech synthesis by rule. The innovation of this method lies in its use of the phoneme environment and the relationship between speech power and pitch frequency. First, the permissible threshold (PT) for power modification is measured by subjective experiments using power manipulated speech material. As a result, it is concluded that the PT of power modification is 4.1 dB. This experimental result is significant when discussing power control and gives a criterion for power control accuracy. Next, the relationship between speech power and pitch frequency is analyzed using a very large speech data base. The results show that the relationship between phoneme power and pitch frequency is affected by the kind of phoneme, the adjoining phonemes, rising or falling pitch, and initial or final position in the sentence. Finally, we propose that the phoneme power should be controlled by pitch frequency and phoneme environment. This proposal is implemented in a waveform concatenation type text-to-speech synthesizer. This new method yields an averaged root mean square error between real and estimated speech power of 2.17 dB. This value indicates that 94% of the estimated power values are within the permissible threshold of human perception.
URL: https://global.ieice.org/en_transactions/fundamentals/10.1587/e76-a_11_1911/_p
Copy
@ARTICLE{e76-a_11_1911,
author={Kenzo ITOH, Tomohisa HIROKAWA, Hirokazu SATO, },
journal={IEICE TRANSACTIONS on Fundamentals},
title={Phoneme Power Control for Speech Synthesis},
year={1993},
volume={E76-A},
number={11},
pages={1911-1918},
abstract={This paper proposes a new method of phoneme power control for speech synthesis by rule. The innovation of this method lies in its use of the phoneme environment and the relationship between speech power and pitch frequency. First, the permissible threshold (PT) for power modification is measured by subjective experiments using power manipulated speech material. As a result, it is concluded that the PT of power modification is 4.1 dB. This experimental result is significant when discussing power control and gives a criterion for power control accuracy. Next, the relationship between speech power and pitch frequency is analyzed using a very large speech data base. The results show that the relationship between phoneme power and pitch frequency is affected by the kind of phoneme, the adjoining phonemes, rising or falling pitch, and initial or final position in the sentence. Finally, we propose that the phoneme power should be controlled by pitch frequency and phoneme environment. This proposal is implemented in a waveform concatenation type text-to-speech synthesizer. This new method yields an averaged root mean square error between real and estimated speech power of 2.17 dB. This value indicates that 94% of the estimated power values are within the permissible threshold of human perception.},
keywords={},
doi={},
ISSN={},
month={November},}
Copy
TY - JOUR
TI - Phoneme Power Control for Speech Synthesis
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 1911
EP - 1918
AU - Kenzo ITOH
AU - Tomohisa HIROKAWA
AU - Hirokazu SATO
PY - 1993
DO -
JO - IEICE TRANSACTIONS on Fundamentals
SN -
VL - E76-A
IS - 11
JA - IEICE TRANSACTIONS on Fundamentals
Y1 - November 1993
AB - This paper proposes a new method of phoneme power control for speech synthesis by rule. The innovation of this method lies in its use of the phoneme environment and the relationship between speech power and pitch frequency. First, the permissible threshold (PT) for power modification is measured by subjective experiments using power manipulated speech material. As a result, it is concluded that the PT of power modification is 4.1 dB. This experimental result is significant when discussing power control and gives a criterion for power control accuracy. Next, the relationship between speech power and pitch frequency is analyzed using a very large speech data base. The results show that the relationship between phoneme power and pitch frequency is affected by the kind of phoneme, the adjoining phonemes, rising or falling pitch, and initial or final position in the sentence. Finally, we propose that the phoneme power should be controlled by pitch frequency and phoneme environment. This proposal is implemented in a waveform concatenation type text-to-speech synthesizer. This new method yields an averaged root mean square error between real and estimated speech power of 2.17 dB. This value indicates that 94% of the estimated power values are within the permissible threshold of human perception.
ER -