IEICE TRANSACTIONS on Information

Impact Factor

0.59
Eigenfactor

0.002
article influence

0.1
Cite Score

1.4

To the Advance publication
To the Archives

Advance publication (published online immediately after acceptance)

Lightweight Neural Data Sequence Modeling by Scale Causal Blocks
Hiroaki AKUTSU Ko ARAI

Pubricized:
2024/11/08
PAPER
- Summary
- Free PDF (1.6MB)
Recaptured Image Detection Based on Multi-Scale Residual Features of Discriminative Regions
Lanxi LIU Pengpeng YANG Suwen DU Sani M. ABDULLAHI

Pubricized:
2024/11/08
PAPER
- Summary
- Free PDF (5.9MB)
Learn Discriminative Features for Small Object Detection through Multi-scale Image Degradation with Contrastive Learning
Xiaoguang TU Zhi HE Gui FU Jianhua LIU Mian ZHONG Chao ZHOU Xia LEI Juhang YIN Yi HUANG Yu WANG

Pubricized:
2024/11/05
PAPER
- Summary
- Free PDF (1.4MB)
Joint Distribution-Aligned Dual-Sparse Linear Regression for Cross-Stimulus Speech-Based Depression Detection
Yingying LU Cheng LU Yuan ZONG Feng ZHOU Chuangao TANG

Pubricized:
2024/11/01
LETTER
- Summary
- Free PDF (229.4KB)
Multi-grained Guaranteeable Requirement Analysis for Iterative Adaptation
Jialong LI Takuto YAMAUCHI Takanori HIRANO Jinyu CAI Kenji TEI

Pubricized:
2024/10/31
PAPER
- Summary
- Free PDF (567.6KB)
A fully digital transmitting-receiving platform for MIMO radar waveform diversity experiment
Wei LEI Yue ZHANG Hanfeng XIE Zebin CHEN Zengping CHEN Weixing LI

Pubricized:
2024/10/30
PAPER
- Summary
- Free PDF (7MB)
Leveraging Different Boolean Function Decompositions to Reduce T-Count in LUT-based Quantum Circuit Synthesis
David CLARINO Naoya ASADA Atsushi MATSUO Shigeru YAMASHITA

Pubricized:
2024/10/30
PAPER
- Summary
- Free PDF (1.3MB)
Criticality and Tolerance in Injection Timing in Cup-Stacking Method for Collective Communication
Takashi YOKOTA Kanemitsu OOTSU

Pubricized:
2024/10/28
PAPER
- Summary
- Free PDF (1.5MB)
An anchor-free Siamese tracker with multi-attention and corner detection mechanism
Xiaokang Jin Benben Huang Hao Sheng Yao Wu

Pubricized:
2024/10/28
PAPER
- Summary
- Free PDF (3MB)
Effect of Politeness on Trust in Re-enter Requests to User by Smart Speaker -Pilot Study-
Tomoki MIYAMOTO

Pubricized:
2024/10/23
LETTER
- Summary
- Free PDF (2.2MB)
Fine-tuning Models for Final Disagreement Anticipation in Negotiation Mid-Dialogues
Ken WATANABE Katsuhide FUJITA

Pubricized:
2024/10/10
PAPER
- Summary
- Free PDF (3.8MB)
Deepfake speech detection: approaches from acoustic features related to auditory perception to deep neural networks
Masashi UNOKI Kai LI Anuwat CHAIWONGYEN Quoc-Huy NGUYEN Khalid ZAMAN

Pubricized:
2024/10/07
INVITED PAPER
- Summary
- Free PDF (965KB)
Video Watermarking Method Based on 3D U-Net Robust Against Re-shooting
Takaharu TSUBOYAMA Ryota TAKAHASHI Motoi IWATA Koichi KISE

Pubricized:
2024/10/07
PAPER
- Summary
- Free PDF (4MB)
UTStyleCap4K: Generating Image Captions with Sentimental Styles
Chi ZHANG Li TAO Toshihiko YAMASAKI

Pubricized:
2024/10/02
PAPER
- Summary
- Free PDF (2.5MB)
FP-GNN: A Graph Neural Network for Hardware Trojan Detection in Gate-Level Netlist
Ann Jelyn TIEMPO Yong-Jin JEONG

Pubricized:
2024/10/01
LETTER
- Summary
- Free PDF (532.3KB)
Adaptive Merge Candidate Selection based on Geometric Partitioning Mode beyond Versatile Video Coding
Haruhisa KATO Yoshitaka KIDANI Kei KAWAMURA

Pubricized:
2024/09/24
PAPER
- Summary
- Free PDF (4MB)
A Multi-Agent Deep Reinforcement Learning Algorithm for Task offloading in future 6G V2X Network
Jiakun LI Jiajian LI Yanjun SHI Hui LIAN Haifan WU

Pubricized:
2024/09/24
PAPER
- Summary
- Free PDF (1.2MB)
Dalio: In-Kernel Centralized Replication for Key-Value Stores
Gyuyeong KIM

Pubricized:
2024/09/20
LETTER
- Summary
- Free PDF (138.7KB)
Detecting Textual Backdoor Attacks via Class Difference for Text Classification System
Hyun KWON Jun LEE

Pubricized:
2024/09/19
PAPER
- Summary
- Free PDF (680.5KB)
D2PT: Density to Point Transformer with Knowledge Distillation for Crowd Counting and Localization
Fan LI Enze YANG Chao LI Shuoyan LIU Haodong WANG

Pubricized:
2024/09/17
LETTER
- Summary
- Free PDF (1.8MB)
Incremental learning for network traffic classification using generative adversarial networks
Guangjin Ouyang Yong Guo Yu Lu Fang He

Pubricized:
2024/09/13
PAPER
- Summary
- Free PDF (1.3MB)
Multi-Scale Rail Surface Anomaly Detection Based on Weighted Multivariate Gaussian Distribution
Yuyao LIU Qingyong LI Shi BAO Wen WANG

Pubricized:
2024/09/12
PAPER
- Summary
- Free PDF (7.8MB)
BP-CRN: A Lightweight Two-Stage Convolutional Recurrent Network For Multi-channel Speech Enhancement
Cong PANG Ye NI Jia Ming CHENG Lin ZHOU Li ZHAO

Pubricized:
2024/09/10
LETTER
- Summary
- Free PDF (2.6MB)
Building Defect Prediction Models by Online Learning Considering Defect Overlooking
Nikolay FEDOROV Yuta YAMASAKI Masateru TSUNODA Akito MONDEN Amjed TAHIR Kwabena Ebo BENNIN Koji TODA Keitaro NAKASAI

Pubricized:
2024/09/09
LETTER
- Summary
- Free PDF (96.5KB)
The Impact of Defect (Re) Prediction on Software Testing
Yukasa MURAKAMI Yuta YAMASAKI Masateru TSUNODA Akito MONDEN Amjed TAHIR Kwabena Ebo BENNIN Koji TODA Keitaro NAKASAI

Pubricized:
2024/09/09
LETTER
- Summary
- Free PDF (148.4KB)
Deterministic and Probabilistic Certified Defenses for Content-Based Image Retrieval
Kazuya KAKIZAKI Kazuto FUKUCHI Jun SAKUMA

Pubricized:
2024/09/05
PAPER
- Summary
- Free PDF (3.4MB)
Fault-tolerant Routing in Bicubes
Yitong WANG Htoo Htoo Sandi KYAW Kunihiro FUJIYOSHI Keiichi KANEKO

Pubricized:
2024/09/05
PAPER
- Summary
- Free PDF (759.2KB)
Integrating Cyber-Physical Modeling for Pandemic Surveillance: A Graph-Based Approach for Disease Hotspot Prediction and Public Awareness
Waqas NAWAZ Muhammad UZAIR Kifayat ULLAH KHAN Iram FATIMA

Pubricized:
2024/08/29
PAPER
- Summary
- Free PDF (2.2MB)
Real-time Interactions with Photos and Texts in Large Classrooms
Haeyoung Lee

Pubricized:
2024/08/28
LETTER
- Summary
- Free PDF (372.3KB)
CNN-based feature integration network for speech enhancement in microphone arrays
Ji XI Pengxu JIANG Yue XIE Wei JIANG Hao DING

Pubricized:
2024/08/26
LETTER
- Summary
- Free PDF (2MB)
Partial Enhancement and Channel Aggregation for Visible-Infrared Person Re-Identification
Weiwei JING Zhonghua LI

Pubricized:
2024/08/26
PAPER
- Summary
- Free PDF (1.4MB)
Practical APT Group Hash Unit Profiling Framework Using TTPs
Sena LEE Chaeyoung KIM Hoorin PARK

Pubricized:
2024/08/20
LETTER
- Summary
- Free PDF (716.5KB)
Bilaterally Colored Finite Automata and Bilaterally Colored Regular Expressions
Akira ITO Yoshiaki TAKAHASHI

Pubricized:
2024/08/20
PAPER
- Summary
- Free PDF (652.7KB)
Strategies and Equilibria on Indistinguishability of Winning Objectives and Related Decision Problems
Rindo NAKANISHI Yoshiaki TAKATA Hiroyuki SEKI

Pubricized:
2024/08/20
PAPER
- Summary
- Free PDF (1.5MB)
Computational Complexity of Yajisan-Kazusan and Stained Glass
Chuzo IWAMOTO Ryo TAKAISHI

Pubricized:
2024/08/16
PAPER
- Summary
- Free PDF (682.2KB)
A clustering-based deep learning method for water level prediction
Chih-Ping Wang Duen-Ren Liu

Pubricized:
2024/08/14
LETTER
- Summary
- Free PDF (778.9KB)
Stochastic Dual Coordinate Ascent for Learning Sign Constrained Linear Predictors
Yuya TAKADA Rikuto MOCHIDA Miya NAKAJIMA Syun-suke KADOYA Daisuke SANO Tsuyoshi KATO

Pubricized:
2024/08/08
PAPER
- Summary
- Free PDF (483.5KB)
Multi-dimensional and Multi-task Facial Expression Recognition for Academic Outcomes Prediction
Yi Huo Yun Ge

Pubricized:
2024/08/08
LETTER
- Summary
- Free PDF (458.2KB)
Mixup SVM Learning for Compound Toxicity Prediction Using Human Pluripotent Stem Cells
Rikuto MOCHIDA Miya NAKAJIMA Haruki ONO Takahiro ANDO Tsuyoshi KATO

Pubricized:
2024/08/08
LETTER
- Summary
- Free PDF (170.9KB)
A Bigram Based ILP Formulation for Break Minimization in Sports Scheduling Problems
Koichi FUJII Tomomi MATSUI

Pubricized:
2024/08/08
PAPER
- Summary
- Free PDF (1MB)
Dendritic Learning-based Feature Fusion for Deep Networks
Yaotong SONG Zhipeng LIU Zhiming ZHANG Jun TANG Zhenyu LEI Shangce GAO

Pubricized:
2024/08/07
LETTER
- Summary
- Free PDF (289.3KB)
Applying Run-Length Compression to the Configuration Data of SLM Fine-Grained Reconfigurable Logic
Souhei TAKAGI Takuya KOJIMA Hideharu AMANO Morihiro KUGA Masahiro IIDA

Pubricized:
2024/08/07
PAPER
- Summary
- Free PDF (2.6MB)
Imperceptible Trojan Attacks to the Graph-based Big Data Processing in Smart Society
Jun ZHOU Masaaki KONDO

Pubricized:
2024/08/07
PAPER
- Summary
- Free PDF (825.5KB)
Feasibility Study of Applying Spatial Crowd Smoothing Without Economic Incentives on Ticket Reservation System that Applies Nudges
Tetsuya MANABE Wataru UNUMA

Pubricized:
2024/08/05
PAPER
- Summary
- Free PDF (1.3MB)
(15/14)n Flips are (almost) Sufficient to Sort Heydari and Sudborough's Pancake Stack
Kazuyuki AMANO

Pubricized:
2024/08/05
LETTER
- Summary
- Free PDF (288.7KB)
Overlapping of Lattice Unfolding for Cuboids
Takumi SHIOTA Tonan KAMATA Ryuhei UEHARA

Pubricized:
2024/08/05
PAPER
- Summary
- Free PDF (722.9KB)
An FPT Algorithm for the Exact Matching Problem and NP-hardness of Related Problems
Hitoshi MURAKAMI Yutaro YAMAGUCHI

Pubricized:
2024/08/01
PAPER
- Summary
- Free PDF (704.5KB)
Recognition of Vibration Dampers Based on Deep Learning Method in UAV Images
Jingjing Liu Chuanyang Liu Yiquan Wu Zuo Sun

Pubricized:
2024/07/30
PAPER
- Summary
- Free PDF (3.2MB)
Temporal correlation-based end-to-end rate control in DCVC
Zhenglong YANG Weihao DENG Guozhong WANG Tao FAN Yixi LUO

Pubricized:
2024/07/29
LETTER
- Summary
- Free PDF (574.4KB)
A Subclass of Mu-Calculus with the Freeze Quantifier Equivalent to Büchi Register Automata
Yoshiaki TAKATA Akira ONISHI Ryoma SENDA Hiroyuki SEKI

Pubricized:
2024/07/26
LETTER
- Summary
- Free PDF (137.6KB)
Degraded image classification using knowledge distillation and robust data augmentations
Dinesh DAULTANI Masayuki TANAKA Masatoshi OKUTOMI Kazuki ENDO

Pubricized:
2024/07/26
PAPER
- Summary
- Free PDF (4.4MB)
Escape from the Room
Kento KIMURA Tomohiro HARAMIISHI Kazuyuki AMANO Shin-ichi NAKANO

Pubricized:
2024/07/11
PAPER
- Summary
- Free PDF (1.1MB)
Online combinatorial linear optimization via a Frank-Wolfe-based metarounding algorithm
Ryotaro MITSUBOSHI Kohei HATANO Eiji TAKIMOTO

Pubricized:
2024/07/11
PAPER
- Summary
- Free PDF (1.1MB)
A Flip-count-based Dynamic Temperature Control Method for Constrained Combinatorial Optimization by Parallel Annealing Algorithms
Genta INOUE Daiki OKONOGI Satoru JIMBO Thiem Van CHU Masato MOTOMURA Kazushi KAWAMURA

Pubricized:
2024/07/11
PAPER
- Summary
- Free PDF (1.7MB)
Performance evaluation of CAIN model frame interpolation using training data limited by fixed camera scene detection
Hikaru USAMI Yusuke KAMEDA

Pubricized:
2024/07/11
LETTER
- Summary
- Free PDF (708.9KB)
Towards Superior Pruning Performance in Federated Learning with Discriminative Data
Yinan YANG

Pubricized:
2024/06/27
- Summary
- Free PDF (7.9MB)
Design and implementation of opto-electrical hybrid floating-point multipliers
Takumi INABA Takatsugu ONO Koji INOUE Satoshi KAWAKAMI

Pubricized:
2024/06/26
- Summary
- Free PDF (2.5MB)
HDR-VDA: A Full Stage Data Augmentation Method for HDR Video Reconstruction
Fengshan ZHAO Qin LIU Takeshi IKENAGA

Pubricized:
2024/06/17
- Summary
- Free PDF (1.2MB)
Space-efficient FPT Algorithms for Degeneracy
Naohito MATSUMOTO Kazuhiro KURITA Masashi KIYOMI

Pubricized:
2024/05/31
- Summary
- Free PDF (101.3KB)
The Least Core of Routing Game Without Triangle Inequality
Tomohiro KOBAYASHI Tomomi MATSUI

Pubricized:
2024/05/30
- Summary
- Free PDF (232.9KB)
Enumerating floorplans with Aligned Columns
Shin-ichi NAKANO

Pubricized:
2024/05/30
- Summary
- Free PDF (365.4KB)
An IP Core Protection Scheme Based on Hybrid Lightweight Encryption for Neuromorphic Computing System
Ming PAN

The aritcle processing charge of this paper has not been paid.

Pubricized:
2022/09/14
- Summary

Whole issue

Volume E78-D No.6 (Publication Date:1995/06/25)

Special Issue on Spoken Language Processing

FOREWORD
Kazuyo TANAKA

FOREWORD

Page(s):
607-608
- HTML
- PDF(92.3KB) >> Buy this Article
Multimodal Interaction in Human Communication
Keiko WATANUKI Kenji SAKAMOTO Fumio TOGAWA

PAPER

Page(s):
609-615
We are developing multimodal man-machine interfaces through which users can communicate by integrating speech, gaze, facial expressions, and gestures such as nodding and finger pointing. Such multimodal interfaces are expected to provide more flexible, natural and productive communications between humans and computers. To achieve this goal, we have taken the approach of modeling human behavior in the context of ordinary face-to-face conversations. As the first step, we have implemented a system which utilizes video and audio recording equipment to capture verbal and nonverbal information in interpersonal communications. Using this system, we have collected data from a task-oriented conversation between a guest (subject) and a receptionist at company reception desk, and quantitatively analyzed this data with respect to multi-modalities which would be functional in fluid interactions. This paper presents detailed analyses of the data collected: (1) head nodding and eye-contact are related to the beginning and end of speaking turns, acting to supplement speech information; (2) listener responses occur after an average of 0.35 sec. from the receptionist's utterance of a keyword, and turn-taking for tag-questions occurs after an average of 0.44 sec.; and (3) there is a rhythmical coordination between speakers and listeners.
A Speech Dialogue System with Multimodal Interface for Telephone Directory Assistance
Osamu YOSHIOKA Yasuhiro MINAMI Kiyohiro SHIKANO

PAPER

Page(s):
616-621
This paper describes a multimodal dialogue system employing speech input. This system uses three input methods (through a speech recognizer, a mouse, and a keyboard) and two output methods (through a display and using sound). For the speech recognizer, an algorithm is employed for large-vocabulary speaker-independent continuous speech recognition based on the HMM-LR technique. This system is implemented for telephone directory assistance to evaluate the speech recognition algorithm and to investigate the variations in speech structure that users utter to computers. Speech input is used in a multimodal environment. The collecting of dialogue data between computers and users is also carried out. Twenty telephone-number retrieval tasks are used to evaluate this system. In the experiments, all the users are equally trained in using the dialogue system with an interactive guidance system implemented on a workstation. Simplified city maps that indicate subscriber names and addresses are used to reduce the implicit restrictions imposed by written sentences, thus allowing each user to develop his own forms of expression. The task completion rate is 99.0% and approximately 75% of the users say that they prefer this system to using a telephone book. Moreover, there is a significant decrease in nonkeyword usage, i.e., the usage of words other than names and addresses, for users who receive more utterance practice.
An Utterance Prediction Method Based on the Topic Transition Model
Yoichi YAMASHITA Takashi HIRAMATSU Osamu KAKUSHO Riichiro MIZOGUCHI

PAPER

Page(s):
622-628
This paper describes a method for predicting the user's next utterances in spoken dialog based on the topic transition model, named TPN. Some templates are prepared for each utterance pair pattern modeled by SR-plan. They are represented in terms of five kinds of topic-independent constituents in sentences. The topic of an utterance is predicted based on the TPN model and it instantiates the templates. The language processing unit analyzes the speech recognition result using the templates. An experiment shows that the introduction of the TPN model improves the performance of utterance recognition and it drastically reduces the search space of candidates in the input bunsetsu lattice.
Cooperative Spoken Dialogue Model Using Bayesian Network and Event Hierarchy
Masahiro ARAKI Shuji DOSHITA

PAPER

Page(s):
629-635
In this paper, we propose a dialogue model that reflects two important aspects of spoken dialogue system: to be robust' and to be cooperative'. For this purpose, our model has two main inference spaces: Conversational Space (CS) and Problem Solving Space (PSS). CS is a kind of dynamic Bayesian network that represents a meaning of utterance and general dialogue rule. Robust' aspect is treated in CS. PSS is a network so called Event Hierarchy that represents the structure of task domain problems. Cooperative' aspect is mainly treated in PSS. In constructing CS and making inference on PSS, system's process, from meaning understanding through response generation, is modeled by dividing into five steps. These steps are (1) meaning understanding, (2) intention understanding, (3) communicative effect, (4) reaction generation, and (5) response generation. Meaning understanding step constructs CS and response generation step composes a surface expression of system's response from the part of CS. Intention understanding step makes correspondence utterance type in CS with action in PSS. Reaction generation step selects a cooperative reaction in PSS and expands a reaction to utterance type of CS. The status of problem solving and declared user's preference are recorded in mental state by communicative effect step. Then from our point of view, cooperative problem solving dialogue is regarded as a process of constructing CS and achieving goal in PSS through these five steps.
Error Analysis of Field Trial Results of a Spoken Dialogue System for Telecommunications Applications
Shingo KUROIWA Kazuya TAKEDA Masaki NAITO Naomi INOUE Seiichi YAMAMOTO

PAPER

Page(s):
636-641
We carried out a one year field trial of a voice-activated automatic telephone exchange service at KDD Laboratories which has about 200 branch phones. This system has DSP-based continuous speech recognition hardware which can process incoming calls in real time using a vocabulary of 300 words. The recognition accuracy was found to be 92.5% for speech read from a written text under laboratory conditions independent of the speaker. In this paper, we describe the performance of the system obtained as a result of the field trial. Apart from recognition accuracy, there was about 20% error due to out-of-vocabulary input and incorrect detection of speech endpoints which had not been allowed for in the laboratory experiments. Also, we found that the recognition accuracy for actual speech was about 18% lower than for speech read from text even if there were no out-of-vocabulary words. In this paper, we examine error variations for individual data in order to try and pinpoint the cause of incorrect recognition. It was found from experiments on the collected data that the pause model used, filled pause grammar and differences of channel frequency response seriously affected recognition accuracy. With the help of simple techniques to overcome these problems, we finally obtained a recognition accuracy of 88.7% for real data.
Automatic Determination of the Number of Mixture Components for Continuous HMMs Based a Uniform Variance Criterion
Tetsuo KOSAKA Shigeki SAGAYAMA

PAPER

Page(s):
642-647
We discuss how to determine automatically the number of mixture components in continuous mixture density HMMs (CHMMs). A notable trend has been the use of CHMMs in recent years. One of the major problems with a CHMM is how to determine its structure, that is, how many mixture components and states it has and its optimal topology. The number of mixture components has been determined heuristically so far. To solve this problem, we first investigate the influence of the number of mixture components on model parameters and the output log likelihood value. As a result, in contrast to the mixture number uniformity" which is applied in conventional approaches to determine the number of mixture components, we propose the principle of distribution size uniformity". An algorithm is introduced for automatically determining the number of mixture components. The performance of this algorithm is shown through recognition experiments involving all Japanese phonemes. Two types of experiments are carried out. One assumes that the number of mixture components for each state is the same within a phonetic model but may vary between states belonging to different phonemes. The other assumes that each state has a variable number of mixture components. These two experiments give better results than the conventional method.
An HMM State Duration Control Algorithm Applied to Large-Vocabulary Spontaneous Speech Recognition
Satoshi TAKAHASHI Yasuhiro MINAMI Kiyohiro SHIKANO

PAPER

Page(s):
648-653
Although Hidden Markov Modeling (HMM) is widely and successfully used in many speech recognition applications, duration control for HMMs is still an important issue in improving recognition accuracy since a HMM places no constraints on duration. For compensating this defect, some duration control algorithms that employ precise duration models have been proposed. However, they suffer from greatly increased computational complexity. This paper proposes a new state duration control algorithm for limiting both the maximum and the minimum state durations. The algorithm is for the HMM trellis likelihood calculation, not for the Viterbi calculation. The amount of computation required by this algorithm is only order one (O(1)) for the maximum state duration n; that is, the computation amount is independent of the maximum state duration while many conventional duration control algorithm require computation in the amount of order n or order n². Thus, the algorithm can drastically reduce the computation needed for duration control. The algorithm uses the property that the trellis likelihood calculation is a summation of many path likelihoods. At each frame, the path likelihood that exceeds the maximum likelihood is subtracted, and the path likelihood that satisfies the minimum likelihood is added to the forward probability. By iterating this procedure, the algorithm calculates the trellis likelihood efficiently. The algorithm was evaluated using a large-vocabulary speaker-independent spontaneous speech recognition system for telephone directory assistance. The average reduction in error rate for sentence understanding was about 7% when using context-independent HMMs, and 3% when using context-dependent HMMs. We could confirm the improvement by using the proposed state duration control algorithm even though the maximum and the minimum state durations were not optimized for the task (speaker-independent duration settings obtained from a different task were used).
Duration Modeling with Decreased Intra-Group Temporal Variation for HMM-Based Phoneme Recognition
Nobuaki MINEMATSU Keikichi HIROSE

PAPER

Page(s):
654-661
A new clustering method was proposed to increase the effect of duration modeling on the HMM-based phoneme recognition. A precise observation on the temporal correspondences between a phoneme HMM with output probabilities by single Gaussian modeling and its training data indicated that there were two extreme cases, one with several types of correspondences in a phoneme class completely different from each other, and the other with only one type of correspondence. Although duration modeling was commonly used to incorporate the temporal information in the HMMs, a good modeling could not be obtained for the former case. Further observation for phoneme HMMs with output probabilities by Gaussian mixture modeling also showed that some HMMs still had multiple temporal correspondences, though the number of such phonemes was reduced as compared to the case of single Gaussian modeling. An appropriate duration modeling cannot be obtained for these phoneme HMMs by the conventional methods, where the duration distribution for each HMM state is represented by a distribution function. In order to cope with the problem, a new method was proposed which was based on the clustering of phoneme classes with plural types of temporal correspondences into sub-classes. The clustering was conducted so as to reduce the variations of the temporal correspondences in sub-classes. After the clustering, an HMM was constructed for each sub-class. Using the proposed method, speaker dependent recognition experiments were performed for phonemes segmented from isolated words. A few-percent increase was realized in the recognition rate, which was not obtained by another method based on the duration modeling with a Gaussian mixture.
A New HMnet Construction Algorithm Requiring No Contextual Factors
Motoyuki SUZUKI Shozo MAKINO Akinori ITO Hirotomo ASO Hiroshi SHIMODAIRA

PAPER

Page(s):
662-668
Many methods have been proposed for constructing context-dependent phoneme models using Hidden Markov Models (HMMs) to improve performance. These conventional methods require previously defined contextual factors. If these factors are deficient, the method exhibit poor recognition performance. In this paper, we propose a new construction algorithm for HMnet which does not require pre-defined contextual factors. Experiments demonstrated that the new algorithm could construct the HMnet even for the case that the Successive State Splitting (SSS) algorithm could not. The new algorithm produced better phoneme recognition characteristics than the SSS algorithm.
A Comparative Study of Output Probability Functions in HMMs
Seiichi NAKAGAWA Li ZHAO Hideyuki SUZUKI

PAPER

Page(s):
669-675
One of the most effective methods in speech recognition is the HMM which has been used to model speech statistically. The discrete distribution and the continuos distribution HMMs have been widely used in various applications. However, in recent years, HMMs with various output probability functions have been proposed to further improve recognition performance, e.g. the Gaussian mixture continuous and the semi-continuous distributed HMMs. We recently have also proposed the RBF (radial basis function)-based HMM and the VQ-distortion based HMM which use a RBF function and VQ-distortion measure at each state instead of an output probability density function used by traditional HMMs. In this paper, we describe the RBF-based HMM and the VQ-distortion based HMM and compare their performance with the discrete distributed, the Gaussian mixture distributed and the semi-continuous distributed HMMs based on their speech recognition performance rates through experiments on speaker-independent spoken digit recognition. Our results confirmed that the RBF-based and VQ-distortion based HMMs are more robust and superior to traditional HMMs.
Neural Predictive Hidden Markov Model for Speech Recognition
Eiichi TSUBOKA Yoshihiro TAKADA

PAPER

Page(s):
676-684
This paper describes new modeling methods combining neural network and hidden Markov model applicable to modeling a time series such as speech signal. The idea assumes that the sequence is nonstationary and is a nonlinear autoregressive process whose parameters are controlled by a hidden Markov chain. One is the model where a non-linear predictor composed of a multi-layered neural network is defined at each state, another is the model where a multi-layered neural network is defined so that the path from the input layer to the output layer is divided into path-groups each of which corresponds to the state of the Markov chain. The latter is an extended model of the former. The parameter estimation methods for these models are shown, and other previously proposed models--one called Neural Prediction Model and another called Linear Predictive HMM--are shown to be special cases of the NPHMM proposed here. The experimental result affirms the justification of these proposed models.
Tone Recognition of Chinese Dissyllables Using Hidden Markov Models
Xinhui HU Keikichi HIROSE

PAPER

Page(s):
685-691
A method of tone recognition has been developed for dissyllabic speech of Standard Chinese based on discrete hidden Markov modeling. As for the feature parameters of recognition, combination of macroscopic and microscopic parameters of fundamental frequency contours was shown to give a better result as compared to the isolated use of each parameter. Speaker normalization was realized by introducing an offset to the fundamental frequency. In order to avoid recognition errors due to syllable segmentation, a scheme of concatenated learning was adopted for training hidden Markov models. Based on the observations of fundamental frequency contours of dissyllables, a scheme was introduced to the method, where a contour was represented with a series of three syllabic tone models, two for the first and the second syllables and one for the transition part around the syllabic boundary. Corresponding to the voiceless consonant of the second syllable, fundamental frequency contour of a dissyllable may include a part without fundamental frequencies. This part was linearly interpolated in the current method. To prove the validity of the proposed method, it was compared with other methods, such as representing all of the dissyllabic contours as the concatenation of two models, assigning a special code to the voiceless part, and so on. Tone sandhi was also taken into account by introducing two additional models for the half-third tone and for the first 4th tone of the combination of two 4th tones. With the proposed method, average recognition rate of 96% was achieved for 5 male and 5 female speakers.
Speech Recognition Using Function-Word N-Grams and Content-Word N-Grams
Ryosuke ISOTANI Shoichi MATSUNAGA Shigeki SAGAYAMA

PAPER

Page(s):
692-697
This paper proposes a new stochastic language model for speech recognition based on function-word N-grams and content-word N-grams. The conventional word N-gram models are effective for speech recognition, but they represent only local constraints within a few successive words and lack the ability to capture global syntactic or semantic relationships between words. To represent more global constraints, the proposed language model gives the N-gram probabilities of word sequences, with attention given only to function words or to content words. The sequences of function words and of content words are expected to represent syntactic and semantic constraints, respectively. Probabilities of function-word bigrams and content-word bigrams were estimated from a 10,000-sentence text database, and analysis using information theoretic measure showed that expected constraints were extracted appropriately. As an application of this model to speech recognition, a post-processor was constructed to select the optimum sentence candidate from a phrase lattice obtained by a phrase recognition system. The phrase candidate sequence with the highest total acoustic and linguistic score was sought by dynamic programming. The results of experiments carried out on the utterances of 12 speakers showed that the proposed method is more accurate than a CFG-based method, thus demonstrating its effectiveness in improving speech recognition performance.
Relationship among Recognition Rate, Rejection Rate and False Alarm Rate in a Spoken Word Recognition System
Atsuhiko KAI Seiichi NAKAGAWA

PAPER

Page(s):
698-704
Detection of an unknown word or non-vocabulary word uttered by the user is necessary in realizing a practical spoken language user-interface. This paper describes the evaluation of an unknown word processing method for a subword unit based spoken word recognizer. We have assessed the relationship between the word recognition accuracy of a system and the detection rate of unknown words both by simulation and by experiment of the unknown word processing method. We found that the resultant detection accuracies using the unknown word processing are significantly influenced by the original word recognition accuracy while the degree of such effect depends on the vocabulary size.
Automatic Language Identification Using Sequential Information of Phonemes
Takayuki ARAI

PAPER

Page(s):
705-711
In this paper approaches to language identification based on the sequential information of phonemes are described. These approaches assume that each language can be identified from its own phoneme structure, or phonotactics. To extract this phoneme structure, we use phoneme classifiers and grammars for each language. The phoneme classifier for each language is implemented as a multi-layer perceptron trained on quasi-phonetic hand-labeled transcriptions. After training the phoneme classifiers, the grammars for each language are calculated as a set of transition probabilities for each phoneme pair. Because of the interest in automatic language identification for worldwide voice communication, we decided to use telephone speech for this study. The data for this study were drawn from the OGI (Oregon Graduate Institute)-TS (telephone speech) corpus, a standard corpus for this type of research. To investigate the basic issues of this approach, two languages, Japanese and English, were selected. The language classification algorithms are based on Viterbi search constrained by a bigram grammar and by minimum and maximum durations. Using a phoneme classifier trained only on English phonemes, we achieved 81.1% accuracy. We achieved 79.3% accuracy using a phoneme classifier trained on Japanese phonemes. Using both the English and the Japanese phoneme classifiers together, we obtained our best result: 83.3%. Our results were comparable to those obtained by other methods such as that based on the hidden Markov model.
A Study on Speaker Adaptation for Mandarin Syllable Recognition with Minimum Error Discriminative Training
Chih-Heng LIN Chien-Hsing WU Pao-Chung CHANG

PAPER

Page(s):
712-718
This paper investigates a different method of speaker adaptation for Mandarin syllable recognition. Based on the minimum classification error (MCE) criterion, we use the generalized probabilistic decent (GPD) algorithm to adjust interatively the parameters of the hidden Markov models (HMM). The experiments on the multi-speaker Mandarin syllable database of Telecommunication Laboratories (T.L.) yield the following results: 1) Efficient speaker adaptation can be achieved through discriminative training using the MCE criterion and the GPD algorithm. 2) The computations required can be reduced through the use of the confusion sets in Mandarin base syllables. 3) For the discriminative training, the adjustment on the mean values of the Gaussian mixtures has the most prominent effect on speaker adaptation. 4) The discriminative training approach can be used to enhance the speaker adaptation capability of the maximum a posteriori (MAP) approach.
Speaker-Consistent Parsing for Speaker-Independent Continuous Speech Recognition
Kouichi YAMAGUCHI Harald SINGER Shoichi MATSUNAGA Shigeki SAGAYAMA

PAPER

Page(s):
719-724
This paper describes a novel speaker-independent speech recognition method, called speaker-consistent parsing", which is based on an intra-speaker correlation called the speaker-consistency principle. We focus on the fact that a sentence or a string of words is uttered by an individual speaker even in a speaker-independent task. Thus, the proposed method searches through speaker variations in addition to the contents of utterances. As a result of the recognition process, an appropriate standard speaker is selected for speaker adaptation. This new method is experimentally compared with a conventional speaker-independent speech recognition method. Since the speaker-consistency principle best demonstrates its effect with a large number of training and test speakers, a small-scale experiment may not fully exploit this principle. Nevertheless, even the results of our small-scale experiment show that the new method significantly outperforms the conventional method. In addition, this framework's speaker selection mechanism can drastically reduce the likelihood map computation.
A Scheme for Word Detection in Continuous Speech Using Likelihood Scores of Segments Modified by Their Context Within a Word
Sumio OHNO Keikichi HIROSE Hiroya FUJISAKI

PAPER

Page(s):
725-731
In conventional word-spotting methods for automatic recognition of continuous speech, individual frames or segments of the input speech are assigned labels and local likelihood scores solely on the basis of their own acoustic characteristics. On the other hand, experiments on human speech perception conducted by the present authors and others show that human perception of words in connected speech is based, not only on the acoustic characteristics of individual segments, but also on the acoustic and linguistic contexts in which these segments occurs. In other words, individual segments are not correctly perceive by humans unless they are accompanied by their context. These findings on the process of human speech perception have to be applied in automatic speech recognition in order to improve the performance. From this point of view, the present paper proposes a new scheme for detecting words in continuous speech based on template matching where the likelihood of each segment of a word is determined not only by its own characteristics but also by the likelihood of its context within the framework of a word. This is accomplished by modifying the likelihood score of each segment by the likelihood score of its phonetic context, the latter representing the degree of similarity of the context to that of a candidate word in the lexicon. Higher enhancement is given to the segmental likelihood score if the likelihood score of its context is higher. The advantage of the proposed scheme over conventional schemes is demonstrated by an experiment on constructing a word lattice using connected speech of Japanese uttered by a male speaker. The result indicates that the scheme is especially effective in giving correct recognition in cases where there are two or more candidate words which are almost equal in raw segmental likelihood scores.
Uniform and Non-uniform Normalization of Vocal Tracts Measured by MRI Across Male, Female and Child Subjects
Chang-Sheng YANG Hideki KASUYA

PAPER

Page(s):
732-737
Three-dimensional vocal tract shapes of a male, a female and a child subjects are measured from magnetic resonance (MR) images during sustained phonation of Japanese vowels /a, i, u, e, o/. Non-uniform dimensional differences in the vocal tract shapes of the subjects are quantitatively measured. Vocal tract area functions of the female and child subjects are normalized to those of the male on the basis of non-uniform and uniform scalings of the vocal tract length and compared with each other. A comparison is also made between the formant frequencies computed from the area functions normalized by the two different scalings. It is suggested by the comparisons that non-uniformity in the vocal tract dimensions is not essential in the normalization of the five Japanese vowels.
Simultaneous Estimation of Vocal Tract and Voice Source Parameters Based on an ARX Model
Wen DING Hideki KASUYA Shuichi ADACHI

PAPER

Page(s):
738-743
A novel adaptive pitch-synchronous analysis method is proposed to estimate simultaneously vocal tract (formant/antiformant) and voice source parameters from speech waveforms. We use the parametric Rosenberg-Klatt (RK) model to generate a glottal waveform and an autoregressive-exogenous (ARX) model to represent voiced speech production process. The Kalman filter algorithm is used to estimate the formant/antiformant parameters from the coefficient of the ARX model, and the simulated annealing method is employed as a nonlinear optimization approach to estimate the voice source parameters. The two approaches work together in a system identification procedure to find the best set of the parameters of both the models. The new method has been compared using synthetic speech with some other approaches in terms of accuracy of estimated parameter values and has been proved to be superior. We also show that the proposed method can estimate accurately the parameters from natural speech sounds. A major application of the analysis method lies in a concatenative formant synthesizer which allows us to make flexible control of voice quality of synthetic speech.
Characteristics of Multi-Layer Perceptron Models in Enhancing Degraded Speech
Thanh Tung LE John MASON Tadashi KITAMURA

PAPER

Page(s):
744-750
A multi-layer perceptron (MLP) acting directly in the time-domain is applied as a speech signal enhancer, and the performance examined in the context of three common classes of degradation, namely low bit-rate CELP degradation is non-linear system degradation, additive noise, and convolution by a linear system. The investigation focuses on two topics: (i) the influence of non-linearities within the network and (ii) network topology, comparing single and multiple output structures. The objective is to examine how these characteristics influence network performance and whether this depends on the class of degradation. Experimental results show the importance of matching the enhancer to the class of degradation. In the case of the CELP coder the standard MLP with its inherently non-linear characteristics is shown to be consistently better than any equivalent linear structure (up to 3.2 dB compared with 1.6 dB SNR improvement). In contrast, when the degradation is from additive noise, a linear enhancer is always, superior.
An Objective Measure Based on an Auditory Model for Assessing Low-Rate Coded Speech
Toshiro WATANABE Shinji HAYASHI

PAPER

Page(s):
751-757
We propose an objective measure from assessing low-rate coded speech. The model for this objective measure, in which several known features of the perceptual processing of speech sounds by the human ear are emulated, is based on the Hertz-to-Bark transformation, critical-band filtering with preemphasis to boost higher frequencies, nonlinear conversion for subjective loudness, and temporal (forward) masking. The effectiveness of the measure, called the Bark spectral distortion rating (BSDR), was validated by second-order polynomial regression analysis between the computed BSDR values and subjective MOS ratings obtained for a large number of utterances coded by several versions of CELP coders and one VSELP coder under three degradation conditions: input speech levels, transmission error rates, and background noise levels. The BSDR values correspond better to MOS ratings than several commonly used measures. Thus, BSDR can be used to accurately predict subjective scores.
4 kbps Improved Pitch Prediction CELP Speech Coding with 20 msec Frame
Masahiro SERIZAWA Kazunori OZAWA

PAPER

Page(s):
758-763
This paper proposes a new pitch prediction method for 4 kbps CELP (Code Excited LPC) speech coding with 20 msec frame, for the future ITU-T 4 kbps speech coding standardization. In the conventional CELP speech coding, synthetic speech quality deteriorates rapidly at 4 kbps, especially for female and children's speech with short pitch period. The pitch prediction performance is significantly degraded for such speech. The important reason is that when the pitch period is shorter than the subframe length, the simple repetition of the past excitation signal based on the estimated lag, not the pitch prediction, is usually carried out in the adaptive codebook operation. The proposed pitch prediction method can carry out the pitch prediction without the above approximation by utilizing the current subframe excitation codevector signal, when the pitch prediction parameters are determined. To further improve the performance, a split vector synthesis and perceptually spectral weighting method, and a low-complexity perceptually harmonic and spectral weighting method have also been developed. The informal listening test result shows that the 4 kbps speech coder with 20 msec frame, utilizing all of the proposed improvements, achieves 0.2 MOS higher results than the coder without them.

Regular Section

A Structured Video Handling Technique for Multimedia Systems
Yoshinobu TONOMURA Akihito AKUTSU

PAPER-Image Processing, Computer Graphics and Pattern Recognition

Page(s):
764-777
This paper proposes a functional video handling technique based on structured video. The video handling architecture, which includes a video data structure, file management structure, and visual interface structure, is introduced as the core concept of this technique. One of the key features of this architecture is that the newly proposed video indexing method is performed automatically based on image processing. The video data structure, which plays an important role in the architecture, has two kinds of data structures: content and node. The central idea behind these structures is to separate the video contents from the processing operations and to create links between them. Video indexes work as a backend mechanism in structuring video content. A prototype video handling system called the MediaBENCH, a hypermedia basic environment for computer and human interactions, which demonstrates the actual implementation of the proposed concept and technique, is described. Basic functions such as browsing and editing, which are achieved based on the architecture, exhibit the advantages of structured video handling. The concept and the methods proposed in this paper assure various video-computer applications, which will play major roles in the multimedia field.
A Note on One-way Auxiliary Pushdown Automata
Yue WANG Jian-Liang XU Katsushi INOUE Akira ITO

LETTER-Automata, Languages and Theory of Computing

Page(s):
778-782
This paper establishes a relationship among the accepting powers of deterministic, nondeterministic, and alternating one-way auxiliary pushdown automata, for any tape bound below n. Some other related results are also presented.

IEICE TRANSACTIONS on Information

Advance publication (published online immediately after acceptance)

Volume E78-D No.6 (Publication Date:1995/06/25)

Latest Issue

Links

Call for Papers

Submit to IEICE Trans.

Transactions NEWS

Popular articles