Author Search Result

[Author] Eiichiro SUMITA (19 hits)

Showing hits 1-19 of 19
  • Introducing a Translation Dictionary into Phrase-Based SMT

    Hideo OKUMA  Hirofumi YAMAMOTO  Eiichiro SUMITA  

     
    PAPER-Natural Language Processing
    Vol: E91-D No:7  Page(s): 2051-2057

    This paper presents a method to effectively introduce a translation dictionary into phrase-based SMT. Though SMT systems can be built with only a parallel corpus, translation dictionaries are more widely available and have many more entries than parallel corpora. A simple and low-cost way to introduce a translation dictionary is to attach its entries to the phrase table. This, however, does not work well: target word order, and even whole target sentences, are often incorrect. To solve this problem, the proposed method uses high-frequency words in the training corpus. The high-frequency words may already be trained well; in other words, they may appear in the phrase table and therefore be translated with correct word order. Experimental results show that the proposed method is far superior to simply attaching dictionary entries to the phrase table.
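
    As an illustration of the simple attachment baseline described above, the following sketch converts dictionary entries into Moses-style phrase-table lines. The uniform feature values and the entries are invented placeholders, not the paper's settings.

    ```python
    # Sketch: append bilingual dictionary entries to a Moses-style phrase table.
    # The four uniform feature scores are placeholder assumptions.

    def dictionary_to_phrase_entries(dictionary, default_score=1.0):
        """Convert (source, target) dictionary pairs into phrase-table lines.

        Moses phrase tables use the line format:
            source ||| target ||| p(t|s) lex(t|s) p(s|t) lex(s|t)
        """
        scores = " ".join([str(default_score)] * 4)
        return [f"{src} ||| {tgt} ||| {scores}" for src, tgt in dictionary]

    for line in dictionary_to_phrase_entries([("辞書", "dictionary"), ("翻訳", "translation")]):
        print(line)
    ```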

  • Development of the “VoiceTra” Multi-Lingual Speech Translation System Open Access

    Shigeki MATSUDA  Teruaki HAYASHI  Yutaka ASHIKARI  Yoshinori SHIGA  Hidenori KASHIOKA  Keiji YASUDA  Hideo OKUMA  Masao UCHIYAMA  Eiichiro SUMITA  Hisashi KAWAI  Satoshi NAKAMURA  

     
    INVITED PAPER
    Publicized: 2017/01/13
    Vol: E100-D No:4  Page(s): 621-632

    This study introduces large-scale field experiments of VoiceTra, the world's first speech-to-speech multilingual translation application for smartphones. Approximately 10 million input utterances have been collected since the experiments commenced, and the usage of the collected data is analyzed and discussed. The study makes several important contributions. First, it explains the system configuration, the communication protocol between clients and servers, and the details of the multilingual automatic speech recognition, multilingual machine translation, and multilingual speech synthesis subsystems. Second, it demonstrates the effects of mid-term system updates using the collected data to improve an acoustic model, a language model, and a dictionary. Third, it analyzes system usage.

  • A Reordering Model Using a Source-Side Parse-Tree for Statistical Machine Translation

    Kei HASHIMOTO  Hirofumi YAMAMOTO  Hideo OKUMA  Eiichiro SUMITA  Keiichi TOKUDA  

     
    PAPER-Machine Translation
    Vol: E92-D No:12  Page(s): 2386-2393

    This paper presents a reordering model using a source-side parse-tree for phrase-based statistical machine translation. The proposed model is an extension of IST-ITG (imposing source tree on inversion transduction grammar) constraints. In the proposed method, the target-side word order is obtained by rotating nodes of the source-side parse-tree. We modeled the node rotation, monotone or swap, using word alignments based on a training parallel corpus and source-side parse-trees. The model efficiently suppresses erroneous target word orderings, especially global orderings. Furthermore, the proposed method conducts a probabilistic evaluation of target word reorderings. In English-to-Japanese and English-to-Chinese translation experiments, the proposed method resulted in a 0.49-point improvement (29.31 to 29.80) and a 0.33-point improvement (18.60 to 18.93) in word BLEU-4 compared with IST-ITG constraints, respectively. This indicates the validity of the proposed reordering model.
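
    A minimal sketch of how monotone/swap statistics could be collected from word alignments, assuming each tree node is a pair of source index spans. This is an illustrative reconstruction, not the paper's exact estimator.

    ```python
    # Sketch: decide, for one source-tree node, whether its children stay
    # monotone or swap on the target side, judged by mean aligned positions.
    from collections import Counter

    def avg_target_pos(span, alignment):
        targets = [t for s in range(*span) for t in alignment.get(s, [])]
        return sum(targets) / len(targets) if targets else None

    def node_orientation(left_span, right_span, alignment):
        l = avg_target_pos(left_span, alignment)
        r = avg_target_pos(right_span, alignment)
        if l is None or r is None:
            return None
        return "monotone" if l <= r else "swap"

    counts = Counter()
    # Toy sentence: the node's children cover source spans [0,2) and [2,4);
    # the alignment maps them to target blocks in the opposite order.
    alignment = {0: {2}, 1: {3}, 2: {0}, 3: {1}}
    o = node_orientation((0, 2), (2, 4), alignment)
    if o:
        counts[o] += 1
    total = sum(counts.values())
    print({k: v / total for k, v in counts.items()})  # {'swap': 1.0}
    ```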

  • A Speech Translation System Applied to a Real-World Task/Domain and Its Evaluation Using Real-World Speech Data

    Atsushi NAKAMURA  Masaki NAITO  Hajime TSUKADA  Rainer GRUHN  Eiichiro SUMITA  Hideki KASHIOKA  Hideharu NAKAJIMA  Tohru SHIMIZU  Yoshinori SAGISAKA  

     
    PAPER-Speech and Hearing
    Vol: E84-D No:1  Page(s): 142-154

    This paper describes the application of a speech translation system to another real-world task/domain using development data collected from real-world interactions. The total cost of this task alteration was calculated to be 9 person-months. The newly applied system was also evaluated using speech data collected from real-world interactions. For real-world speech with a machine-friendly speaking style, the newly applied system could recognize typical sentences with a word accuracy of 90% or better. We also found that, in terms of overall speech translation performance, the system could translate about 80% of the input Japanese speech into acceptable English sentences.

  • Constraining a Generative Word Alignment Model with Discriminative Output

    Chooi-Ling GOH  Taro WATANABE  Hirofumi YAMAMOTO  Eiichiro SUMITA  

     
    PAPER-Natural Language Processing
    Vol: E93-D No:7  Page(s): 1976-1983

    We present a method to constrain a statistical generative word alignment model with the output of a discriminative model. The discriminative model is trained on a small set of hand-aligned data, which ensures higher precision in alignment, while the generative model improves the recall of alignment. By combining the two models, the alignment output becomes more suitable for use in developing the translation model of a phrase-based statistical machine translation (SMT) system. Our experimental results show that the joint alignment model improves translation performance, with average improvements in BLEU and METEOR scores of around 1.0-3.9 points.
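
    One plausible way to picture the combination, assuming alignments are sets of (source, target) link pairs: keep the high-precision discriminative links fixed and admit generative links only where they do not conflict. The conflict rule here is an assumption for illustration, not necessarily the paper's exact formulation.

    ```python
    # Sketch: discriminative links act as fixed constraints; generative links
    # fill in the remaining positions to improve recall.

    def constrained_alignment(discriminative, generative):
        fixed_src = {s for s, _ in discriminative}
        fixed_tgt = {t for _, t in discriminative}
        extra = {(s, t) for s, t in generative
                 if s not in fixed_src and t not in fixed_tgt}
        return discriminative | extra

    disc = {(0, 0), (2, 1)}           # high-precision links
    gen = {(0, 0), (1, 2), (2, 3)}    # high-recall links
    print(sorted(constrained_alignment(disc, gen)))  # [(0, 0), (1, 2), (2, 1)]
    ```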

  • Imposing Constraints from the Source Tree on ITG Constraints for SMT

    Hirofumi YAMAMOTO  Hideo OKUMA  Eiichiro SUMITA  

     
    PAPER-Natural Language Processing
    Vol: E92-D No:9  Page(s): 1762-1770

    In current statistical machine translation (SMT), erroneous word reordering is one of the most serious problems. To resolve this problem, many word-reordering constraint techniques have been proposed. Inversion transduction grammar (ITG) is one of these constraints. Under ITG constraints, the target-side word order is obtained by rotating nodes of a source-side binary tree. In these node rotations, the particular source binary tree instance is not considered. Therefore, stronger constraints for word reordering can be obtained by imposing further constraints, derived from the source tree, on the ITG constraints. For example, for the source word sequence { a b c d }, ITG constraints allow a total of twenty-two target word orderings. However, when the source binary tree instance ((a b) (c d)) is given, our proposed "imposing source tree on ITG" (IST-ITG) constraints allow only eight word orderings. The reduction in the number of word-order permutations under the proposed stronger constraints efficiently suppresses erroneous word orderings. In our experiments with IST-ITG on the NIST MT08 English-to-Chinese translation track's data, the proposed method yielded a 1.8-point improvement in character BLEU-4 (35.2 to 37.0) and a 6.2-point lower CER (74.1% to 67.9%) compared with our baseline condition.
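
    The { a b c d } example can be verified directly: rotating each of the three internal nodes of ((a b) (c d)) independently, monotone or swap, yields 2^3 = 8 orderings, as the sketch below confirms.

    ```python
    # Enumerate target orderings reachable by rotating nodes of a fixed
    # source binary tree (the IST-ITG constraint of the abstract's example).
    from itertools import product

    tree = (("a", "b"), ("c", "d"))  # ((a b) (c d))

    def orderings(node):
        if isinstance(node, str):
            return [[node]]
        left, right = (orderings(child) for child in node)
        results = []
        for l, r in product(left, right):
            results.append(l + r)  # monotone
            results.append(r + l)  # swap
        return results

    perms = {tuple(o) for o in orderings(tree)}
    print(len(perms))  # 8 (plain ITG would allow 22 orderings of 4 words)
    for p in sorted(perms):
        print(" ".join(p))
    ```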

  • Splitting Input for Machine Translation Using N-gram Language Model Together with Utterance Similarity

    Takao DOI  Eiichiro SUMITA  

     
    PAPER-Natural Language Processing
    Vol: E88-D No:6  Page(s): 1256-1264

    In order to boost the translation quality of corpus-based MT systems for speech translation, the technique of splitting an input utterance appears promising. In previous research, many methods used word-sequence characteristics like N-gram clues among splitting positions. In this paper, to supplement splitting methods based on word-sequence characteristics, we introduce another clue using similarity based on edit-distance. In our splitting method, we generate candidates for utterance splitting based on N-grams, and select the best one by measuring the utterance similarity against a corpus. This selection is founded on the assumption that a corpus-based MT system can correctly translate an utterance that is similar to an utterance in its training corpus. We conducted experiments using three MT systems: two EBMT systems, one of which uses a phrase as a translation unit and the other of which uses an utterance, and an SMT system. The translation results under various conditions were evaluated by objective measures and a subjective measure. The experimental results demonstrate that the proposed method is valuable for the three systems. Using utterance similarity can improve the translation quality.
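
    A minimal sketch of the selection step, where candidate generation is reduced to trivial single split points and similarity is the minimum edit distance to any corpus utterance; the real method generates candidates from N-gram clues.

    ```python
    # Sketch: choose the splitting whose pieces are closest (by edit distance)
    # to utterances seen in the training corpus.

    def edit_distance(a, b):
        d = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, d[0] = d[0], i
            for j, cb in enumerate(b, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
        return d[-1]

    def similarity_to_corpus(piece, corpus):
        return min(edit_distance(piece, u) for u in corpus)

    def best_split(words, corpus):
        candidates = [[" ".join(words)]] + [
            [" ".join(words[:i]), " ".join(words[i:])] for i in range(1, len(words))
        ]
        return min(candidates,
                   key=lambda cand: sum(similarity_to_corpus(p, corpus) for p in cand))

    corpus = ["please check in", "where is the bus stop"]
    print(best_split("please check in where is the bus stop".split(), corpus))
    ```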

  • Training Set Selection for Building Compact and Efficient Language Models

    Keiji YASUDA  Hirofumi YAMAMOTO  Eiichiro SUMITA  

     
    PAPER-Natural Language Processing
    Vol: E92-D No:3  Page(s): 506-511

    For statistical language model training, corpora matched to the target domain are required. However, training corpora sometimes include both domain-matched and unmatched sentences. In such a case, training set selection is effective both for reducing model size and for improving model performance. In this paper, a training set selection method for statistical language model training is described. The method provides two advantages for training a language model: it improves language model performance, and it reduces the computational load of language model training. The method has four steps. 1) Sentence clustering is applied to all available corpora. 2) A language model is trained on each cluster. 3) Perplexity on the development set is calculated using each language model. 4) For the final language model training, we use the clusters whose language models yield low perplexities. The experimental results indicate that the language model trained on the data selected by our method gives lower perplexity on an open test set than a language model trained on all available corpora.
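
    The four steps can be sketched as below, with the clustering itself replaced by a fixed grouping and add-one unigram models standing in for the real language models.

    ```python
    # Sketch of cluster-based training set selection by development perplexity.
    import math
    from collections import Counter

    def train_unigram(sentences):
        counts = Counter(w for s in sentences for w in s.split())
        total = sum(counts.values())
        vocab = len(counts) + 1
        return lambda w: (counts[w] + 1) / (total + vocab)  # add-one smoothing

    def perplexity(model, dev):
        words = [w for s in dev for w in s.split()]
        logp = sum(math.log(model(w)) for w in words)
        return math.exp(-logp / len(words))

    # 1) clusters of the available corpora (the clustering step is omitted)
    clusters = {
        "travel": ["where is the station", "i would like a room"],
        "news": ["the government announced a new policy"],
    }
    dev = ["where is a room"]

    # 2)-3) train an LM per cluster and measure development-set perplexity
    ppl = {name: perplexity(train_unigram(sents), dev)
           for name, sents in clusters.items()}

    # 4) keep the lowest-perplexity clusters for the final LM training
    selected = sorted(ppl, key=ppl.get)[:1]
    print(ppl, selected)
    ```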

  • Automatic Induction of Romanization Systems from Bilingual Corpora

    Keiko TAGUCHI  Andrew FINCH  Seiichi YAMAMOTO  Eiichiro SUMITA  

     
    PAPER-Artificial Intelligence, Data Mining
    Publicized: 2014/11/14
    Vol: E98-D No:2  Page(s): 381-393

    In this article we present a novel corpus-based method for inducing romanization systems for languages through a bilingual alignment of transliteration word pairs. First, the word pairs are aligned using a non-parametric Bayesian approach, and then for each grapheme sequence to be romanized, a particular romanization is selected according to a user-specified criterion. As far as we are aware, this paper is the only one to describe a method for automatically deriving complete romanization systems. Unlike existing human-derived romanization systems, the proposed method is able to discover induced romanization systems tailored for specific purposes, for example, for use in data mining or efficient user input methods. Our experiments study the romanization of four totally different languages: Russian, Japanese, Hindi and Myanmar. The first two languages already have standard romanization systems in regular use, Hindi has a large number of diverse systems, and Myanmar has no standard romanization system. We compare our induced romanization systems to the existing systems for Russian and Japanese, and find that the induced system is almost identical to the standard system for Russian, and 69% identical for Japanese. We applied our approach to the task of transliteration mining, using Levenshtein distance as the romanization selection criterion. Our experiments show that the induced romanization system matched the performance of the human-created system for Russian, and offered substantially improved mining performance for Japanese. We provide an analysis of the mechanism our approach uses to improve mining performance, and also analyse the differences in characteristics between the induced system for Japanese and the official Japanese Nihon-shiki system. In order to investigate the limits of our approach, we studied the romanization of Myanmar, a low-resource language with a large inventory of graphemes. We estimate the approximate corpus size required to effectively romanize the k most frequent graphemes in the language for all values of k up to 1800.
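
    A toy sketch of the selection step only: once alignment has produced candidate romanizations with counts for each grapheme, one romanization is chosen per grapheme by the user-specified criterion (here simply the highest count; the candidate counts below are invented).

    ```python
    # Sketch: select one romanization per grapheme from aligned candidates.
    candidates = {
        "し": {"shi": 40, "si": 25},  # Hepburn-like vs Nihon-shiki-like variants
        "つ": {"tsu": 38, "tu": 22},
    }

    def induce_system(candidates):
        # criterion: most frequent alignment; the paper also explores
        # Levenshtein-based selection for transliteration mining
        return {g: max(options, key=options.get) for g, options in candidates.items()}

    print(induce_system(candidates))  # {'し': 'shi', 'つ': 'tsu'}
    ```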

  • Japanese Argument Reordering Based on Dependency Structure for Statistical Machine Translation

    Chooi-Ling GOH  Taro WATANABE  Eiichiro SUMITA  

     
    PAPER-Natural Language Processing
    Vol: E95-D No:6  Page(s): 1668-1675

    While phrase-based statistical machine translation systems prefer to translate with longer phrases, this may cause errors in a free word order language such as Japanese, in which the order of the arguments of a predicate is not solely determined by the predicate and the arguments can be placed quite freely in the text. In this paper, we propose to reorder the arguments, but not the predicates, in Japanese using the dependency structure. Instead of a single deterministically given permutation, we generate multiple reordered phrases for each sentence and translate them independently. Then we apply a re-ranking method using a discriminative approach, Ranking Support Vector Machines (SVMs), to re-score the multiple reordered phrase translations. In our experiments with the travel domain corpus BTEC, we gain a 1.22% BLEU score improvement when only the 1-best reordering is used for re-ranking and a 4.12% BLEU score improvement when the n-best reorderings are used, for Japanese-English translation.
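
    A minimal sketch of the variant generation, assuming the argument chunks of a predicate are already identified: permute the arguments while keeping the head-final predicate in place, then translate each variant and re-rank (the Ranking SVM step is omitted here).

    ```python
    # Sketch: generate argument reorderings around a fixed, sentence-final
    # predicate, as input variants for independent translation.
    from itertools import permutations

    def reorder_arguments(argument_chunks, predicate):
        return [list(p) + [predicate] for p in permutations(argument_chunks)]

    # "私は 昨日 本を 読んだ": arguments {私は, 昨日, 本を}, predicate 読んだ
    variants = reorder_arguments(["私は", "昨日", "本を"], "読んだ")
    for v in variants:
        print(" ".join(v))  # 3! = 6 reordered sentences, each translated separately
    ```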

  • Bilingual Cluster Based Models for Statistical Machine Translation

    Hirofumi YAMAMOTO  Eiichiro SUMITA  

     
    PAPER-Applications
    Vol: E91-D No:3  Page(s): 588-597

    We propose a domain-specific model for statistical machine translation. It is well known that domain-specific language models perform well in automatic speech recognition. We show that domain-specific language and translation models also benefit statistical machine translation. However, there are two problems with using domain-specific models. The first is the data sparseness problem; we employ an adaptation technique to overcome it. The second issue is domain prediction: in order to perform adaptation, the domain must be provided, yet in many cases the domain is not known or changes dynamically. In these cases, not only the translation of the source sentence but also its domain must be predicted. This paper focuses on the domain prediction problem for statistical machine translation. In the proposed method, a bilingual training corpus is automatically clustered into sub-corpora, and each sub-corpus is deemed to be a domain. The domain of a source sentence is predicted from its similarity to the sub-corpora. The language and translation models specific to the predicted domain (sub-corpus) are then used for decoding. This approach gave an improvement of 2.7 BLEU points on the IWSLT05 Japanese-to-English evaluation corpus (improving the score from 52.4 to 55.1). This is a substantial gain and indicates the validity of the proposed bilingual cluster-based models.
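
    A toy sketch of the domain-prediction step, using a bag-of-words cosine as the similarity measure; the clustering and the similarity computation actually used in the paper may differ.

    ```python
    # Sketch: predict the domain of a source sentence by its similarity to
    # automatically derived sub-corpora, then decode with that domain's models.
    import math
    from collections import Counter

    def cosine(a, b):
        num = sum(a[w] * b[w] for w in set(a) & set(b))
        den = (math.sqrt(sum(v * v for v in a.values()))
               * math.sqrt(sum(v * v for v in b.values())))
        return num / den if den else 0.0

    subcorpora = {
        "hotel": Counter("the room is reserved for two nights".split()),
        "transit": Counter("the next train leaves from platform two".split()),
    }

    def predict_domain(sentence):
        vec = Counter(sentence.split())
        return max(subcorpora, key=lambda d: cosine(vec, subcorpora[d]))

    print(predict_domain("which platform does the train leave from"))  # transit
    ```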

  • Example-Based Transfer of Japanese Adnominal Particles into English

    Eiichiro SUMITA  Hitoshi IIDA  

     
    PAPER-Artificial Intelligence and Cognitive Science
    Vol: E75-D No:4  Page(s): 585-594

    This paper deals with the problem of translating Japanese adnominal particles into English following the idea of Example-Based Machine Translation (EBMT) proposed by Nagao. Japanese adnominal particles are important because: (1) they are frequent function words; (2) translating them into English is difficult because their translations are diversified; (3) EBMT's effectiveness for adnominal particles suggests that EBMT is effective for other function words, e.g., prepositions in European languages. In EBMT, (1) a database consisting of examples (pairs of a source language expression and its target language translation) is prepared as knowledge for translation; (2) an example whose source expression is similar to the input phrase or sentence is retrieved from the example database; (3) by replacing the corresponding words in the target expression of the retrieved example, the translation is obtained. The similarity in EBMT is computed as the sum of the distances between words, each multiplied by the weight of the word. The authors' method differs from preceding research in two important points: (1) the authors utilize a general thesaurus to compute the distance between words; (2) the authors propose a weight that changes for every input. The feasibility of our approach has been demonstrated through experiments measuring the success rate.
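
    A small sketch of the retrieval score, assuming a toy thesaurus in which words sharing a class are at distance 0.5, and input-dependent weights supplied externally; both are illustrative stand-ins for the paper's thesaurus distance and weighting.

    ```python
    # Sketch: distance between an input phrase and a stored example is the
    # weighted sum of word-to-word thesaurus distances.
    thesaurus_class = {"hotel": "building", "station": "building",
                       "week": "time", "morning": "time"}

    def word_distance(w1, w2):
        if w1 == w2:
            return 0.0
        c1, c2 = thesaurus_class.get(w1), thesaurus_class.get(w2)
        if c1 is not None and c1 == c2:
            return 0.5  # siblings under the same thesaurus class
        return 1.0

    def example_distance(input_words, example_words, weights):
        return sum(w * word_distance(a, b)
                   for w, a, b in zip(weights, input_words, example_words))

    # input "hotel no yoyaku" vs stored example "station no yoyaku"
    print(example_distance(["hotel", "no", "yoyaku"],
                           ["station", "no", "yoyaku"],
                           weights=[1.0, 0.2, 1.0]))  # 0.5
    ```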

  • Document-Level Neural Machine Translation with Associated Memory Network

    Shu JIANG  Rui WANG  Zuchao LI  Masao UTIYAMA  Kehai CHEN  Eiichiro SUMITA  Hai ZHAO  Bao-liang LU  

     
    PAPER-Natural Language Processing
    Publicized: 2021/06/24
    Vol: E104-D No:10  Page(s): 1712-1723

    Standard neural machine translation (NMT) rests on the assumption that sentences are independent of their document-level context. Most existing document-level NMT approaches settle for a coarse sense of global document-level information, while this work focuses on exploiting detailed document-level context by means of a memory network. The memory network's capacity to detect the parts of memory most relevant to the current sentence offers a natural way to model rich document-level context. In this work, the proposed document-aware memory network is implemented to enhance a Transformer NMT baseline. Experiments on several tasks show that the proposed method significantly improves NMT performance over strong Transformer baselines and other related studies.
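
    A bare-bones numpy sketch of the memory lookup idea: attend from the current sentence vector over vectors of the other document sentences. The paper's memory network and its integration into the Transformer are considerably more involved.

    ```python
    # Sketch: dot-product attention over a memory of document sentences.
    import numpy as np

    def memory_lookup(query, memory):
        """query: (d,), memory: (n, d) -> context vector (d,)."""
        scores = memory @ query             # relevance of each memory slot
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()            # softmax over document sentences
        return weights @ memory             # weighted document context

    rng = np.random.default_rng(0)
    sentence = rng.normal(size=4)
    document = rng.normal(size=(5, 4))      # 5 other sentences in the document
    print(memory_lookup(sentence, document).shape)  # (4,)
    ```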

  • Improving Feature-Rich Transition-Based Constituent Parsing Using Recurrent Neural Networks

    Chunpeng MA  Akihiro TAMURA  Lemao LIU  Tiejun ZHAO  Eiichiro SUMITA  

     
    PAPER-Natural Language Processing
    Publicized: 2017/06/05
    Vol: E100-D No:9  Page(s): 2205-2214

    Conventional feature-rich parsers based on manually tuned features have achieved state-of-the-art performance. However, these parsers are not good at handling long-term dependencies using only the clues captured by a prepared feature template. On the other hand, recurrent neural network (RNN)-based parsers can encode unbounded history information effectively, but they do not perform well on small tree structures, especially when low-frequency words are involved, and they cannot use prior linguistic knowledge. In this paper, we propose a simple but effective framework that combines the merits of feature-rich transition-based parsers and RNNs. Specifically, the proposed framework incorporates RNN-based scores into the feature template used by a feature-rich parser. On the English WSJ treebank and the SPMRL 2014 German treebank, our framework achieves state-of-the-art performance (91.56 F-score for English and 83.06 F-score for German), without requiring any additional unlabeled data.
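
    The combination can be pictured as below: the parser's usual linear score over template features is extended with an RNN-derived score for the same action. The feature names, weights, and RNN score are invented stand-ins.

    ```python
    # Sketch: add an RNN-based score as one more (real-valued) feature in a
    # feature-rich transition-based parser's action scoring.

    def action_score(features, weights, rnn_score, rnn_weight=1.0):
        # conventional feature-rich score over template features ...
        score = sum(weights.get(f, 0.0) for f in features)
        # ... plus the RNN-based score for the same parser action
        return score + rnn_weight * rnn_score

    weights = {"stack0=NP^action=REDUCE": 0.7, "buffer0=VBZ^action=SHIFT": -0.2}
    features = ["stack0=NP^action=REDUCE"]
    print(action_score(features, weights, rnn_score=0.35))  # 1.05
    ```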

  • A Bayesian Model of Transliteration and Its Human Evaluation When Integrated into a Machine Translation System

    Andrew FINCH  Keiji YASUDA  Hideo OKUMA  Eiichiro SUMITA  Satoshi NAKAMURA  

     
    PAPER
    Vol: E94-D No:10  Page(s): 1889-1900

    The contribution of this paper is two-fold. First, we conduct a large-scale real-world evaluation of the effectiveness of integrating an automatic transliteration system with a machine translation system. A human evaluation is usually preferable to an automatic evaluation, and especially so in this case, since common machine translation evaluation methods are affected by the length of the translations they are evaluating, often being biased towards translations on the basis of their length rather than the information they convey. We evaluate our transliteration system on data collected in field experiments conducted all over Japan. Our results conclusively show that using a transliteration system can improve machine translation quality when translating unknown words. Our second contribution is to propose a novel Bayesian model for unsupervised bilingual character sequence segmentation of corpora for transliteration. The system is based on a Dirichlet process model trained using Bayesian inference through blocked Gibbs sampling, implemented with an efficient forward filtering/backward sampling dynamic programming algorithm. The Bayesian approach is able to overcome the overfitting problem inherent in maximum likelihood training. We demonstrate the effectiveness of our Bayesian segmentation by using it to build a translation model for a phrase-based statistical machine translation (SMT) system trained to perform transliteration by monotonic transduction from character sequence to character sequence. The Bayesian segmentation was used to construct a phrase-table, and we compared its quality to one generated in the usual manner by the state-of-the-art GIZA++ word alignment process used in combination with the phrase extraction heuristics of the MOSES statistical machine translation system, using both to perform transliteration generation within an identical framework. In our experiments on English-Japanese data from the NEWS2010 transliteration generation shared task, we used our technique to bilingually co-segment the training corpus. We then derived a phrase-table from the segmentation sampled at the final iteration of the training procedure, and the resulting phrase-table was used to directly substitute for the phrase-table extracted by GIZA++/MOSES. The phrase-table resulting from our Bayesian segmentation model was approximately 30% smaller than that produced by the SMT system's training procedure, and gave an increase in transliteration quality measured in terms of both word accuracy and F-score.
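
    Whichever way the phrase-table is learned, the final transliteration step is monotone decoding over character sequences. The sketch below does this by dynamic programming over split points; the toy phrase-table and its probabilities are invented.

    ```python
    # Sketch: monotone phrase-based transliteration via DP over split points.
    import math

    table = {"to": [("ト", 0.7)], "kyo": [("キョ", 0.6)],
             "o": [("オ", 0.5)], "t": [("ト", 0.2)]}

    def transliterate(s):
        best = [(-math.inf, "")] * (len(s) + 1)   # best (log-prob, output) per prefix
        best[0] = (0.0, "")
        for i in range(len(s)):
            if best[i][0] == -math.inf:
                continue
            for j in range(i + 1, len(s) + 1):
                for tgt, p in table.get(s[i:j], []):
                    cand = (best[i][0] + math.log(p), best[i][1] + tgt)
                    if cand[0] > best[j][0]:
                        best[j] = cand
        return best[-1][1]

    print(transliterate("tokyo"))  # "トキョ" via the "to" and "kyo" phrases
    ```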

  • Integration of Multiple Bilingually-Trained Segmentation Schemes into Statistical Machine Translation

    Michael PAUL  Andrew FINCH  Eiichiro SUMITA  

     
    PAPER-Natural Language Processing
    Vol: E94-D No:3  Page(s): 690-697

    This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous source language text in order to improve the translation quality of statistical machine translation (SMT) approaches. The method can be applied to any language pair in which the source language is unsegmented and the target language segmentation is known. In the first step, an iterative bootstrap method is applied to learn multiple segmentation schemes that are consistent with the phrasal segmentations of an SMT system trained on the resegmented bitext. In the second step, multiple segmentation schemes are integrated into a single SMT system by characterizing the source language side and merging identical translation pairs of differently segmented SMT models. Experimental results translating five Asian languages into English revealed that the proposed method of integrating multiple segmentation schemes outperforms SMT models trained on any of the learned word segmentations and performs comparably to available monolingually built segmentation tools.
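
    The merging in the second step might look like the sketch below: pool translation pairs from differently segmented models, merge identical pairs, and renormalize per source phrase. The counts and the renormalization rule are illustrative assumptions, not the paper's exact procedure.

    ```python
    # Sketch: merge identical translation pairs from SMT models trained on
    # different segmentation schemes into one table.
    from collections import defaultdict

    scheme_a = {("hon", "book"): 10, ("honya", "bookstore"): 3}
    scheme_b = {("hon", "book"): 7, ("hon ya", "bookstore"): 4, ("hon", "this"): 3}

    def merge_schemes(*tables):
        merged = defaultdict(int)
        for table in tables:
            for pair, count in table.items():
                merged[pair] += count          # identical pairs are merged
        totals = defaultdict(int)
        for (src, _), c in merged.items():
            totals[src] += c
        return {pair: c / totals[pair[0]] for pair, c in merged.items()}

    print(merge_schemes(scheme_a, scheme_b))   # p("book"|"hon") = 0.85, etc.
    ```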

  • Class-Dependent Modeling for Dialog Translation

    Andrew FINCH  Eiichiro SUMITA  Satoshi NAKAMURA  

     
    PAPER-Speech and Hearing
    Vol: E92-D No:12  Page(s): 2469-2477

    This paper presents a technique for class-dependent decoding for statistical machine translation (SMT). The approach differs from previous methods of class-dependent translation in that the class-dependent forms of all models are integrated directly into the decoding process. We employ probabilistic mixture weights between models that can change dynamically on a sentence-by-sentence basis depending on the characteristics of the source sentence. The effectiveness of this approach is demonstrated by evaluating its performance on travel conversation data. We used this approach to tackle the translation of questions and declarative sentences using class-dependent models. To achieve this, our system integrated two sets of models specifically built to deal with sentences that fall into one of two classes of dialog sentence: questions and declarations, with a third set of models built with all of the data to handle the general case. The technique was thoroughly evaluated on data from 16 language pairs using 6 machine translation evaluation metrics. We found the results were corpus-dependent, but in most cases our system was able to improve translation performance, and for some languages the improvements were substantial.
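
    A toy sketch of the per-sentence mixture: a classifier produces weights over the sentence classes, and each class-specific model's score is mixed accordingly during decoding. The two-class setup and all scores below are stand-ins (the paper also uses a third, general set of models).

    ```python
    # Sketch: class-dependent mixture of model scores, weighted per sentence.

    def classify(source):
        # toy question detector standing in for the real classifier
        p_question = 0.9 if source.rstrip().endswith("?") else 0.1
        return {"question": p_question, "declaration": 1.0 - p_question}

    def mixture_score(source, hypothesis, models):
        weights = classify(source)
        return sum(weights[c] * models[c](hypothesis) for c in weights)

    models = {
        "question": lambda hyp: -1.0 if hyp.endswith("?") else -3.0,
        "declaration": lambda hyp: -2.0,
    }
    print(mixture_score("駅はどこですか?", "where is the station?", models))  # about -1.1
    ```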

  • Translation of Untranslatable Words -- Integration of Lexical Approximation and Phrase-Table Extension Techniques into Statistical Machine Translation

    Michael PAUL  Karunesh ARORA  Eiichiro SUMITA  

     
    PAPER-Machine Translation
    Vol: E92-D No:12  Page(s): 2378-2385

    This paper proposes a method for handling out-of-vocabulary (OOV) words that cannot be translated using conventional phrase-based statistical machine translation (SMT) systems. For a given OOV word, lexical approximation techniques are utilized to identify spelling and inflectional word variants that occur in the training data. All OOV words in the source sentence are then replaced with appropriate word variants found in the training corpus, thus reducing the number of OOV words in the input. Moreover, in order to increase the coverage of such word translations, the SMT translation model is extended by adding new phrase translations for all source language words that do not have a single-word entry in the original phrase-table but only appear in the context of larger phrases. The effectiveness of the proposed methods is investigated for the translation of Hindi to English, Chinese, and Japanese.
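
    A minimal sketch of the lexical-approximation idea, with plain edit distance standing in for the paper's spelling and inflectional variant matching.

    ```python
    # Sketch: replace an OOV word with its closest variant found in the
    # training vocabulary, if one is close enough.

    def edit_distance(a, b):
        d = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, d[0] = d[0], i
            for j, cb in enumerate(b, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
        return d[-1]

    def approximate_oov(word, vocab, max_dist=2):
        best = min(vocab, key=lambda v: edit_distance(word, v))
        return best if edit_distance(word, best) <= max_dist else word

    vocab = {"colour", "colours", "walking", "walked"}
    print(approximate_oov("colur", vocab))     # colour  (spelling variant)
    print(approximate_oov("walkings", vocab))  # walking (inflectional variant)
    ```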

  • Paraphrase Lattice for Statistical Machine Translation

    Takashi ONISHI  Masao UTIYAMA  Eiichiro SUMITA  

     
    PAPER-Natural Language Processing
    Vol: E94-D No:6  Page(s): 1299-1305

    Lattice decoding in statistical machine translation (SMT) is useful in speech translation and in the translation of German because it can handle input ambiguities such as speech recognition ambiguities and German word segmentation ambiguities. In this paper, we show that lattice decoding is also useful for handling input variations. “Input variations” refers to the differences in input texts with the same meaning. Given an input sentence, we build a lattice which represents paraphrases of the input sentence. We call this a paraphrase lattice. Then, we give the paraphrase lattice as an input to a lattice decoder. The lattice decoder searches for the best path of the paraphrase lattice and outputs the best translation. Experimental results using the IWSLT dataset and the Europarl dataset show that our proposed method obtains significant gains in BLEU scores.
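
    A small sketch of paraphrase-lattice construction, representing the lattice as (start node, end node, word) arcs over the input positions; only single-word alternatives are handled here, and the paraphrase rules are invented.

    ```python
    # Sketch: build a lattice that adds alternative arcs for known paraphrases
    # of input spans, then hand it to a lattice decoder.

    def build_paraphrase_lattice(words, paraphrases):
        arcs = [(i, i + 1, w) for i, w in enumerate(words)]  # base path
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                for alt in paraphrases.get(tuple(words[i:j]), []):
                    if len(alt) == 1:       # multi-word alternatives would need
                        arcs.append((i, j, alt[0]))  # extra intermediate nodes
        return arcs

    paraphrases = {("purchase",): [["buy"]], ("would", "like", "to"): [["want"]]}
    lattice = build_paraphrase_lattice("i would like to purchase a ticket".split(),
                                       paraphrases)
    for arc in lattice:
        print(arc)
    ```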