Keyword Search Result

[Keyword] speech interface (4 hits)

1-4 hits
  • Representation Learning of Tongue Dynamics for a Silent Speech Interface

    Hongcui WANG  Pierre ROUSSEL  Bruce DENBY  

     
    PAPER-Speech and Hearing

      Publicized:
    2021/08/24
      Vol:
    E104-D No:12
      Page(s):
    2209-2217

    A Silent Speech Interface (SSI) is a sensor-based, Artificial Intelligence (AI)-enabled system in which articulation is performed without the use of the vocal cords, resulting in a voice interface that conserves the ambient audio environment, protects private data, and also functions in noisy environments. Though portable SSIs based on ultrasound imaging of the tongue have obtained Word Error Rates rivaling those of acoustic speech recognition, SSIs remain relegated to the laboratory due to stability issues. Indeed, reliable extraction of acoustic features from ultrasound tongue images in real-life situations has proven elusive. Recently, Representation Learning has shown considerable success in learning underlying structure in noisy, high-dimensional raw data. In its unsupervised form, Representation Learning is able to reveal structure in unlabeled data, thus greatly simplifying the data preparation task. In the present article, a 3D Convolutional Neural Network architecture is applied to unlabeled ultrasound images and is shown to reliably predict future tongue configurations. By comparing the 3DCNN to a simple previous-frame predictor, it is possible to recognize tongue trajectories comprising transitions between regions of stability that correlate with formant trajectories in a spectrogram of the signal. Prospects for using the underlying structural representation to provide features for subsequent speech processing tasks are presented.
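
    A minimal sketch of the next-frame prediction idea is given below, using a generic 3D convolutional network in PyTorch; the history length, image size, and layer sizes are illustrative assumptions, not the architecture reported in the paper.

    # Illustrative sketch only: a generic 3D CNN that predicts the next
    # ultrasound frame from the K previous frames, compared against a
    # simple previous-frame baseline. Shapes and layer sizes are assumptions.
    import torch
    import torch.nn as nn

    K, H, W = 8, 64, 64                           # assumed history length and image size

    class NextFramePredictor(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv3d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
                # collapse the temporal axis to emit a single predicted frame
                nn.Conv3d(16, 1, kernel_size=(K, 3, 3), padding=(0, 1, 1)),
            )

        def forward(self, clips):                 # clips: (B, 1, K, H, W)
            return self.net(clips).squeeze(2)     # -> (B, 1, H, W)

    clips = torch.rand(4, 1, K, H, W)             # toy batch of frame sequences
    target = torch.rand(4, 1, H, W)               # the true next frames

    model = NextFramePredictor()
    cnn_err = nn.functional.mse_loss(model(clips), target)
    baseline_err = nn.functional.mse_loss(clips[:, :, -1], target)  # previous-frame predictor
    print(float(cnn_err), float(baseline_err))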

  • Silent Speech Interface Using Ultrasonic Doppler Sonar

    Ki-Seung LEE  

     
    PAPER-Speech and Hearing

      Publicized:
    2020/05/20
      Vol:
    E103-D No:8
      Page(s):
    1875-1887

    Some non-acoustic modalities can reveal certain speech attributes that can be used to synthesize speech signals without acoustic signals. This study validated the use of ultrasonic Doppler frequency shifts caused by facial movements to implement a silent speech interface system. A 40 kHz ultrasonic beam is directed at a speaker's mouth region. Features derived from the demodulated received signals were used to estimate the speech parameters. A nonlinear regression approach was employed in this estimation, where the relationship between ultrasonic features and the corresponding speech is represented by deep neural networks (DNNs). In this study, we investigated the discrepancies between the ultrasonic signals of audible and silent speech to validate the possibility of totally silent communication. Since reference speech signals are not available for silently mouthed ultrasonic signals, a nearest-neighbor search and alignment method was proposed, wherein alignment was achieved by determining the optimal pairs of ultrasonic and audible features under a minimum mean square error criterion. The experimental results showed that the performance of the ultrasonic Doppler-based method was superior to that of EMG-based speech estimation and comparable to that of an image-based method.
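
    The minimum-MSE nearest-neighbor alignment step can be sketched as follows; the feature dimensions and random arrays are placeholder assumptions, not the paper's actual feature set or data.

    # Illustrative sketch: align silently mouthed ultrasonic features to
    # audible-speech ultrasonic features by a minimum mean-square-error
    # nearest-neighbor search, then reuse the paired speech parameters
    # as reference targets. All dimensions and arrays are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    silent_feats  = rng.standard_normal((50, 24))    # silent utterance, 24-dim ultrasonic features
    audible_feats = rng.standard_normal((200, 24))   # ultrasonic features recorded with audible speech
    speech_params = rng.standard_normal((200, 13))   # speech parameters paired with audible_feats

    # mean squared error between every silent frame and every audible frame
    mse = ((silent_feats[:, None, :] - audible_feats[None, :, :]) ** 2).mean(axis=2)
    nearest = mse.argmin(axis=1)                     # index of the minimum-MSE audible frame

    aligned_targets = speech_params[nearest]         # reference targets for the silent frames
    print(aligned_targets.shape)                     # (50, 13)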

  • TAJODA: Proposed Tactile and Jog Dial Interface for the Blind

    Chieko ASAKAWA  Hironobu TAKAGI  Shuichi INO  Tohru IFUKUBE  

     
    PAPER

      Vol:
    E87-D No:6
      Page(s):
    1405-1414

    There is a fundamental difference in how sighted people and blind people obtain information. Screen reading technology helps bridge this gap by assisting blind people in accessing digital documents by themselves. However, digital documents these days are becoming much more visual, using various types of visual effects that let sighted people explore information intuitively at a glance. It is very hard to convey such visual effects non-visually and intuitively while retaining their original effect. In addition, exploration takes a long time, since blind people use the keyboard for exploration, while sighted people use eye movement. This research aims to improve the non-visual exploration interface and the quality of non-visual information. TAJODA (a tactile and jog dial interface) was therefore proposed to solve these problems. It presents verbal information (text information) in the form of speech, while nonverbal information (visual effects) is represented in the form of tactile sensations. It uses a jog dial as an exploration device, which makes it possible to explore forward or backward intuitively through the speech information by spinning the jog dial clockwise or counterclockwise. It also integrates a tactile device to represent visual effects non-visually. Both speech and tactile information can be synchronized with the dial movements, and the speed of spinning the dial controls the speech rate. The main part of this paper describes an experimental evaluation of the effectiveness of the proposed TAJODA interface. The experimental system used a preprocessed recorded human voice as test data. The training sessions showed that it was easy to learn how to use TAJODA. The comparison test session clearly showed that the subjects could perform the comparison task 2.4 times faster with TAJODA than with the comparison method closest to the existing screen reading function. These results show that TAJODA can drastically improve the non-visual exploration interface.
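
    As a rough illustration of the dial-to-speech mapping described above, the sketch below maps spin direction to forward or backward playback and spin speed to the speech rate; the event fields and rate limits are assumptions, not TAJODA's actual implementation.

    # Illustrative sketch: derive playback direction and speech rate from
    # jog-dial rotation events. Event fields and rate limits are assumptions.
    from dataclasses import dataclass

    @dataclass
    class DialEvent:
        degrees: float       # rotation since the last event; positive = clockwise
        seconds: float       # time elapsed since the last event

    def playback_command(ev: DialEvent,
                         base_rate: float = 1.0,
                         max_rate: float = 3.0) -> tuple:
        """Return (direction, speech_rate) for one dial event."""
        direction = "forward" if ev.degrees >= 0 else "backward"
        angular_speed = abs(ev.degrees) / max(ev.seconds, 1e-3)     # degrees per second
        rate = min(max_rate, base_rate * (1.0 + angular_speed / 360.0))
        return direction, rate

    print(playback_command(DialEvent(degrees=90.0, seconds=0.1)))   # fast clockwise spin
    print(playback_command(DialEvent(degrees=-30.0, seconds=0.5)))  # slow counterclockwise spin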

  • MASCOTS II: A Dialog Manager in General Interface for Speech Input and Output

    Yoichi YAMASHITA  Hideaki YOSHIDA  Takashi HIRAMATSU  Yasuo NOMURA  Riichiro MIZOGUCHI  

     
    PAPER

      Vol:
    E76-D No:1
      Page(s):
    74-83

    This paper describes a general interface system for speech input and output and a dialog management system, MASCOTS, which is a component of the interface system. The authors designed this interface system with attention to its generality; that is, it does not depend on the problem-solving system it is connected to. The previous version of MASCOTS dealt with dialog processing only for speech input, based on SR-plans. We extend MASCOTS to cover speech output to the user. The revised version of MASCOTS, named MASCOTS II, makes use of topic information given by the topic packet network (TPN), which models topic transitions in dialogs. Input and output messages are described with a concept representation based on the case structure. For speech input, prediction of the user's utterances is the focus and is enhanced by using the TPN. The TPN compensates for the shortcomings of the SR-plan and improves the accuracy of prediction for the user's stimulus utterances. For dialog processing of speech output, MASCOTS II extracts emphatic words and restores missing words in the output message when necessary, e.g., in order to report the results of speech recognition. The basic mechanisms of the SR-plan and the TPN are shared between the speech input and output processes in MASCOTS II.
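
    The way topic information can constrain utterance prediction is illustrated by the toy topic-transition table below; the topics, probabilities, and candidate utterances are invented placeholders, not the TPN or SR-plan data described in the paper.

    # Illustrative sketch: a toy topic-transition table used to rank which
    # user utterances to predict next. Topics, probabilities, and candidate
    # utterances are invented placeholders, not the actual TPN or SR-plans.
    topic_transitions = {
        "greeting":     {"request": 0.7, "greeting": 0.2, "closing": 0.1},
        "request":      {"confirmation": 0.6, "request": 0.3, "closing": 0.1},
        "confirmation": {"closing": 0.5, "request": 0.5},
    }

    candidate_utterances = {
        "greeting":     ["Hello"],
        "request":      ["I'd like to book a room", "Show me the schedule"],
        "confirmation": ["Yes, that's right", "No, change the date"],
        "closing":      ["Thank you", "Goodbye"],
    }

    def predict_utterances(current_topic, top_n=2):
        """Rank candidate user utterances by the likelihood of their topic."""
        ranked = sorted(topic_transitions.get(current_topic, {}).items(),
                        key=lambda kv: kv[1], reverse=True)
        predictions = []
        for topic, _prob in ranked[:top_n]:
            predictions.extend(candidate_utterances.get(topic, []))
        return predictions

    print(predict_utterances("greeting"))    # utterances from the most likely next topics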