This paper focuses on two areas in our effort to synthesize speech from neuromotor input using neural network models that effect transforms between cognitive intentions to speak, their physiological effects on vocal tract structures, and subsequent realization as acoustic signals. The first area concerns the biomechanical transform between motor commands to muscles and the ensuing articulator behavior. Using physiological data of muscle EMG (electromyography) and articulator movements during natural English speech utterances, three articulator-specific neural networks learn the forward dynamics that relate motor commands to the muscles and the resulting motion of the tongue, jaw, and lips. Compared to a fully connected network that maps muscle EMG and motion for all three sets of articulators at once, this modular approach has improved performance by reducing network complexity and has eliminated some of the confounding influence of functional coupling among articulators. Network independence has also allowed us to identify and assess the effects of technical and empirical limitations on an articulator-by-articulator basis. This is particularly important for modeling the tongue, whose complex structure is very difficult to examine empirically. The second area of progress concerns the transform between articulator motion and the speech acoustics. From the articulatory movement trajectories, a second neural network generates PARCOR (partial correlation) coefficients, which are then used to synthesize the speech acoustics. In the current implementation, articulator velocities have been added as inputs to the network. As a result, the model now follows the fast changes of the coefficients for consonants generated by relatively slow articulatory movements during natural English utterances. Although much work still needs to be done, progress in these areas brings us closer to our goal of emulating speech production processes computationally.
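The synthesis stage described above drives an all-pole lattice filter with the PARCOR (reflection) coefficients produced by the second network. The paper's own implementation is not reproduced here; the following is a minimal sketch of that lattice synthesis step, with an illustrative function name and a fixed coefficient set (a real system would update the coefficients frame by frame from the network output):

```python
import numpy as np

def parcor_synthesize(k, excitation):
    """Pass an excitation signal through an all-pole lattice filter
    defined by PARCOR (reflection) coefficients k[0..p-1].
    The filter is stable as long as every |k[i]| < 1."""
    p = len(k)
    b = np.zeros(p + 1)               # backward residuals b_0..b_p (one-sample memory)
    out = np.empty(len(excitation))
    for n, e in enumerate(excitation):
        f = e                         # forward residual at the top stage, f_p[n]
        for i in range(p - 1, -1, -1):
            f = f - k[i] * b[i]           # f_i[n]     = f_{i+1}[n] - k_{i+1} * b_i[n-1]
            b[i + 1] = b[i] + k[i] * f    # b_{i+1}[n] = b_i[n-1]   + k_{i+1} * f_i[n]
        b[0] = f                      # b_0[n] equals the output sample
        out[n] = f
    return out

# A unit impulse through a single-stage lattice with k = [0.5] yields the
# geometric impulse response of 1 / (1 + 0.5 z^-1):
y = parcor_synthesize(np.array([0.5]), np.array([1.0, 0.0, 0.0, 0.0]))
# y -> [1.0, -0.5, 0.25, -0.125]
```

The velocity inputs mentioned in the abstract could likewise be approximated from sampled articulator trajectories by finite differences, e.g. `np.gradient(traj, dt)`, before being concatenated with the position channels.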
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Makoto HIRAYAMA, Eric Vatikiotis-BATESON, Mitsuo KAWATO, "Physiologically-Based Speech Synthesis Using Neural Networks," IEICE TRANSACTIONS on Fundamentals, vol. E76-A, no. 11, pp. 1898-1910, November 1993.
URL: https://global.ieice.org/en_transactions/fundamentals/10.1587/e76-a_11_1898/_p
@ARTICLE{e76-a_11_1898,
author={Makoto HIRAYAMA and Eric Vatikiotis-BATESON and Mitsuo KAWATO},
journal={IEICE TRANSACTIONS on Fundamentals},
title={Physiologically-Based Speech Synthesis Using Neural Networks},
year={1993},
volume={E76-A},
number={11},
pages={1898-1910},
month={November},
}
TY - JOUR
TI - Physiologically-Based Speech Synthesis Using Neural Networks
T2 - IEICE TRANSACTIONS on Fundamentals
SP - 1898
EP - 1910
AU - Makoto HIRAYAMA
AU - Eric Vatikiotis-BATESON
AU - Mitsuo KAWATO
PY - 1993
VL - E76-A
IS - 11
Y1 - November 1993
ER -