1-1hit |
Sumio OHNO Keikichi HIROSE Hiroya FUJISAKI
In conventional word-spotting methods for automatic recognition of continuous speech, individual frames or segments of the input speech are assigned labels and local likelihood scores solely on the basis of their own acoustic characteristics. On the other hand, experiments on human speech perception conducted by the present authors and others show that human perception of words in connected speech is based, not only on the acoustic characteristics of individual segments, but also on the acoustic and linguistic contexts in which these segments occurs. In other words, individual segments are not correctly perceive by humans unless they are accompanied by their context. These findings on the process of human speech perception have to be applied in automatic speech recognition in order to improve the performance. From this point of view, the present paper proposes a new scheme for detecting words in continuous speech based on template matching where the likelihood of each segment of a word is determined not only by its own characteristics but also by the likelihood of its context within the framework of a word. This is accomplished by modifying the likelihood score of each segment by the likelihood score of its phonetic context, the latter representing the degree of similarity of the context to that of a candidate word in the lexicon. Higher enhancement is given to the segmental likelihood score if the likelihood score of its context is higher. The advantage of the proposed scheme over conventional schemes is demonstrated by an experiment on constructing a word lattice using connected speech of Japanese uttered by a male speaker. The result indicates that the scheme is especially effective in giving correct recognition in cases where there are two or more candidate words which are almost equal in raw segmental likelihood scores.