A Silent Speech Interface (SSI) is a sensor-based, Artificial Intelligence (AI)-enabled system in which articulation is performed without the use of the vocal cords, resulting in a voice interface that preserves the ambient audio environment, protects private data, and functions in noisy environments. Though portable SSIs based on ultrasound imaging of the tongue have achieved Word Error Rates rivaling those of acoustic speech recognition, SSIs remain relegated to the laboratory due to stability issues. Indeed, reliable extraction of acoustic features from ultrasound tongue images in real-life situations has proven elusive. Recently, Representation Learning has shown considerable success in learning the underlying structure of noisy, high-dimensional raw data. In its unsupervised form, Representation Learning can reveal structure in unlabeled data, thus greatly simplifying the data preparation task. In the present article, a 3D Convolutional Neural Network (3D CNN) architecture is applied to unlabeled ultrasound images and is shown to reliably predict future tongue configurations. By comparing the 3D CNN to a simple previous-frame predictor, it is possible to recognize tongue trajectories comprising transitions between regions of stability that correlate with formant trajectories in a spectrogram of the signal. Prospects for using the underlying structural representation to provide features for subsequent speech processing tasks are presented.
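To make the approach concrete, the following is a minimal sketch, not the authors' implementation, of the setup the abstract describes: a 3D CNN that takes a short window of past ultrasound tongue frames and predicts the next frame, evaluated against a naive previous-frame predictor. The frame geometry, window length, layer sizes, and the use of PyTorch with synthetic tensors standing in for real ultrasound data are all illustrative assumptions.

    # Hypothetical sketch of next-frame prediction on ultrasound image windows.
    # Shapes and architecture are assumptions for illustration, not from the paper.
    import torch
    import torch.nn as nn

    class FramePredictor3D(nn.Module):
        """3D CNN: input (batch, 1, T, H, W) -> predicted next frame (batch, 1, H, W)."""
        def __init__(self, window=5):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                # collapse the time axis so the output is a single frame
                nn.Conv3d(32, 1, kernel_size=(window, 3, 3), padding=(0, 1, 1)),
            )

        def forward(self, clips):
            return self.encoder(clips).squeeze(2)  # drop the length-1 time axis

    window, height, width = 5, 64, 128                 # hypothetical frame geometry
    clips = torch.rand(8, 1, window, height, width)    # synthetic stand-in for image windows
    targets = torch.rand(8, 1, height, width)          # synthetic "next frame" targets

    model = FramePredictor3D(window)
    loss_fn = nn.MSELoss()
    cnn_loss = loss_fn(model(clips), targets)

    # Previous-frame baseline: predict that the next frame equals the last observed one.
    baseline_loss = loss_fn(clips[:, :, -1], targets)
    print(f"3D CNN loss: {cnn_loss.item():.4f}  previous-frame loss: {baseline_loss.item():.4f}")

On real data, frames where the 3D CNN outperforms the previous-frame baseline would mark tongue motion (transitions), while frames where the two are comparable would mark regions of stability, which is the comparison the abstract exploits.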
Hongcui WANG
Zhejiang University of Water Resources and Electric Power, Sorbonne Université
Pierre ROUSSEL
Sorbonne Université
Bruce DENBY
Sorbonne Université
Hongcui WANG, Pierre ROUSSEL, Bruce DENBY, "Representation Learning of Tongue Dynamics for a Silent Speech Interface," IEICE TRANSACTIONS on Information and Systems, vol. E104-D, no. 12, pp. 2209-2217, December 2021, doi: 10.1587/transinf.2021EDP7090.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2021EDP7090/_p
@ARTICLE{e104-d_12_2209,
author={Hongcui WANG and Pierre ROUSSEL and Bruce DENBY},
journal={IEICE TRANSACTIONS on Information and Systems},
title={Representation Learning of Tongue Dynamics for a Silent Speech Interface},
year={2021},
volume={E104-D},
number={12},
pages={2209-2217},
doi={10.1587/transinf.2021EDP7090},
ISSN={1745-1361},
month={December},
}
TY - JOUR
TI - Representation Learning of Tongue Dynamics for a Silent Speech Interface
T2 - IEICE TRANSACTIONS on Information and Systems
SP - 2209
EP - 2217
AU - Hongcui WANG
AU - Pierre ROUSSEL
AU - Bruce DENBY
PY - 2021
DO - 10.1587/transinf.2021EDP7090
JO - IEICE TRANSACTIONS on Information and Systems
SN - 1745-1361
VL - E104-D
IS - 12
JA - IEICE TRANSACTIONS on Information and Systems
Y1 - 2021/12//
ER -