
Keyword Search Result

[Keyword] N-gram (18 hits)

Results 1-18 of 18
  • Multi-Scale Chroma n-Gram Indexing for Cover Song Identification

    Jin S. SEO  

     
    LETTER
    Publicized: 2019/10/23  Vol: E103-D No:1  Page(s): 59-62

    To enhance cover song identification accuracy on a large music archive, a song-level feature summarization method based on multi-scale representation is proposed. Chroma n-grams are extracted at multiple scales to cope with both global and local tempo changes. We derive an index from the extracted n-grams by clustering, which reduces the storage and computation required for DB search. Experiments on widely used music datasets confirmed that the proposed method achieves state-of-the-art accuracy while reducing the cost of cover song search.
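
    As a rough illustration of the indexing idea above, the sketch below extracts chroma n-grams at several time scales and clusters them into a small set of centroids that can serve as a song-level index. The scales, n-gram length, cluster count, and use of scikit-learn's KMeans are illustrative assumptions, not the paper's actual settings.

      # Hypothetical sketch of song-level indexing by clustering chroma n-grams.
      import numpy as np
      from sklearn.cluster import KMeans

      def chroma_ngrams(chroma, n=3, hop=1):
          """Stack n consecutive 12-dim chroma frames into n-gram vectors."""
          frames = [chroma[i:i + n].ravel() for i in range(0, len(chroma) - n + 1, hop)]
          return np.array(frames)

      def song_index(chroma, scales=(1, 2, 4), n=3, k=16):
          """Cluster n-grams taken at several time scales; the k centroids
          act as a compact, song-level summary for DB search."""
          grams = []
          for s in scales:
              grams.append(chroma_ngrams(chroma[::s], n=n))  # subsample to mimic tempo change
          grams = np.vstack(grams)
          return KMeans(n_clusters=k, n_init=10).fit(grams).cluster_centers_

      # Example: a random "song" of 200 chroma frames
      centroids = song_index(np.random.rand(200, 12))
      print(centroids.shape)  # (16, 36)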

  • IoT Malware Analysis and New Pattern Discovery Through Sequence Analysis Using Meta-Feature Information

    Chun-Jung WU  Shin-Ying HUANG  Katsunari YOSHIOKA  Tsutomu MATSUMOTO  

     
    PAPER-Fundamental Theories for Communications
    Publicized: 2019/08/05  Vol: E103-B No:1  Page(s): 32-42

    A drastic increase in cyberattacks targeting Internet of Things (IoT) devices using telnet protocols has been observed. IoT malware continues to evolve, and the diversity of OSes and environments increases the difficulty of executing malware samples in an observation setting. To address this problem, we sought to develop an alternative means of investigation by using the telnet logs of IoT honeypots and analyzing malware without executing it. In this paper, we present a malware classification method based on malware binaries, command sequences, and meta-features. We employ both unsupervised and supervised learning algorithms, along with text-mining algorithms for handling unstructured data. Clustering analysis is applied to find malware family members and reveal their inherent features for better explainability. First, the malware binaries are grouped using similarity analysis. Then, we extract key patterns of interaction behavior using an N-gram model. We also train a multiclass classifier to identify IoT malware categories based on common infection behavior. For misclassified subclasses, second-stage sub-training is performed using a file meta-feature. Our results demonstrate 96.70% accuracy, with high precision and recall. The clustering results reveal variant attack vectors and one denial-of-service (DoS) attack that used pure Linux commands.
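
    The following minimal sketch illustrates the N-gram step described above: telnet command sequences are turned into n-gram count vectors, whose cosine similarity can then drive clustering or classification. The commands and parameters are invented for the example.

      # Illustrative sketch: turning telnet command sequences into n-gram features.
      from collections import Counter

      def command_ngrams(commands, n=2):
          """Count n-grams over a session's command sequence."""
          return Counter(tuple(commands[i:i + n]) for i in range(len(commands) - n + 1))

      def ngram_similarity(a, b, n=2):
          """Cosine similarity between two sessions' n-gram count vectors."""
          ca, cb = command_ngrams(a, n), command_ngrams(b, n)
          keys = set(ca) | set(cb)
          dot = sum(ca[k] * cb[k] for k in keys)
          norm = (sum(v * v for v in ca.values()) ** 0.5) * (sum(v * v for v in cb.values()) ** 0.5)
          return dot / norm if norm else 0.0

      s1 = ["enable", "shell", "wget", "chmod", "./bot"]
      s2 = ["enable", "shell", "wget", "chmod", "./worm"]
      print(ngram_similarity(s1, s2))  # high: shared infection pattern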

  • Latent Words Recurrent Neural Network Language Models for Automatic Speech Recognition

    Ryo MASUMURA  Taichi ASAMI  Takanobu OBA  Sumitaka SAKAUCHI  Akinori ITO  

     
    PAPER-Speech and Hearing
    Publicized: 2019/09/25  Vol: E102-D No:12  Page(s): 2557-2567

    This paper demonstrates latent words recurrent neural network language models (LW-RNN-LMs) for enhancing automatic speech recognition (ASR). LW-RNN-LMs are constructed so as to pick up the advantages of both recurrent neural network language models (RNN-LMs) and latent words language models (LW-LMs). RNN-LMs can capture long-range context information and offer strong performance, while LW-LMs are robust on out-of-domain tasks thanks to their latent word space modeling. However, RNN-LMs cannot explicitly capture hidden relationships behind observed words since they have no concept of a latent variable space. In addition, LW-LMs cannot take into account long-range relationships between latent words. Our idea is to combine RNN-LM and LW-LM so as to compensate for their individual disadvantages. LW-RNN-LMs can simultaneously support both latent variable space modeling, as in LW-LMs, and long-range relationship modeling, as in RNN-LMs. From the viewpoint of RNN-LMs, an LW-RNN-LM can be considered a soft-class RNN-LM with a vast latent variable space. Conversely, from the viewpoint of LW-LMs, an LW-RNN-LM can be considered an LW-LM that uses an RNN structure instead of an n-gram structure for latent variable modeling. This paper also details a parameter inference method and two implementation methods, an n-gram approximation and a Viterbi approximation, for introducing the LW-RNN-LM to ASR. Our experiments show the effectiveness of LW-RNN-LMs in a perplexity evaluation on the Penn Treebank corpus and an ASR evaluation on Japanese spontaneous speech tasks.
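
    A toy numeric sketch of the soft-class decomposition described above, assuming the RNN supplies a distribution over latent words and each latent word has an emission distribution over surface words; all probability tables here are invented.

      # Next-word probability marginalizes over a latent word l:
      # p(w | history) = sum_l p(w | l) * p_rnn(l | history)
      p_latent_given_history = {"cat": 0.7, "dog": 0.3}          # from the RNN (toy values)
      p_word_given_latent = {
          "cat": {"cat": 0.6, "kitten": 0.4},
          "dog": {"dog": 0.8, "puppy": 0.2},
      }

      def next_word_prob(word):
          return sum(p_latent_given_history[l] * p_word_given_latent[l].get(word, 0.0)
                     for l in p_latent_given_history)

      print(next_word_prob("kitten"))  # 0.7 * 0.4 = 0.28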

  • Error Correction for Search Engine by Mining Bad Case

    Jianyong DUAN  Tianxiao JI  Hao WANG  

     
    PAPER-Natural Language Processing
    Publicized: 2018/03/26  Vol: E101-D No:7  Page(s): 1938-1945

    Automatic correction of users' search terms is an important aspect of improving search engine retrieval efficiency, accuracy, and user experience. In the era of big data, we can analyze and mine massive search engine logs to uncover the hidden intent behind queries. Better results can be obtained by statistically modeling query errors in search engine log data. However, when an erroneous query does not appear in the log, the information in the log cannot be exploited to correct the query result. These undiscovered error queries are called Bad Cases. This paper combines an error correction algorithm model with search engine query log mining analysis. First, we explored Bad Cases in the query error correction process through the search engine query logs. Then we quantified the characteristics of these Bad Cases and built a model that allows search engines to automatically mine Bad Cases with these features. Finally, we applied the Bad Cases to an N-gram error correction algorithm model to check the impact of Bad Case mining on error correction. The experimental results show that error correction based on Bad Case mining noticeably improves the precision and recall of automatic error correction. The user experience is improved and the interaction becomes friendlier.
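
    As a hedged sketch of the N-gram error correction component, the code below trains a toy bigram model on query-log text and ranks candidate rewrites of a query by their smoothed log-probability; the smoothing and candidate generation are simplified stand-ins for the paper's pipeline.

      # Minimal sketch of n-gram-based query correction.
      import math
      from collections import Counter

      log_queries = ["cheap flights to tokyo", "cheap hotels in tokyo", "flights to osaka"]
      unigrams, bigrams = Counter(), Counter()
      for q in log_queries:
          toks = ["<s>"] + q.split()
          unigrams.update(toks)
          bigrams.update(zip(toks, toks[1:]))

      def score(query, alpha=1.0):
          """Additively smoothed bigram log-probability of a candidate query."""
          toks = ["<s>"] + query.split()
          v = len(unigrams)
          return sum(math.log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * v))
                     for a, b in zip(toks, toks[1:]))

      candidates = ["cheap flights to tokyo", "cheap fights to tokyo"]
      print(max(candidates, key=score))  # the in-log phrasing wins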

  • N-gram Approximation of Latent Words Language Models for Domain Robust Automatic Speech Recognition Open Access

    Ryo MASUMURA  Taichi ASAMI  Takanobu OBA  Hirokazu MASATAKI  Sumitaka SAKAUCHI  Satoshi TAKAHASHI  

     
    PAPER-Language modeling
    Publicized: 2016/07/19  Vol: E99-D No:10  Page(s): 2462-2470

    This paper aims to improve the domain robustness of language modeling for automatic speech recognition (ASR). To this end, we focus on applying the latent words language model (LWLM) to ASR. LWLMs are generative models whose structure is based on Bayesian soft class-based modeling with a vast latent variable space. Their flexible attributes help us to efficiently realize the effects of smoothing and dimensionality reduction and so address the data sparseness problem; LWLMs constructed from limited domain data are expected to robustly cover unknown multiple domains in ASR. However, this attribute flexibility seriously increases computational complexity. If we rigorously compute the generative probability for an observed word sequence, we must consider the huge number of all possible latent word assignments. Since this is computationally impractical, some approximation is inevitable for ASR implementation. To solve the problem and apply this approach to ASR, this paper presents an n-gram approximation of the LWLM. The n-gram approximation is a method that approximates the LWLM as a simple back-off n-gram structure, and offers LWLM-based robust one-pass ASR decoding. Our experiments verify the effectiveness of our approach by evaluating perplexity and ASR performance on not only in-domain data sets but also out-of-domain data sets.
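
    One way such an approximation can be realized is sketched below, under the assumption that text is sampled from the generative model and a standard n-gram is then estimated from the samples; the two-stage sampler here is a toy stand-in for an actual LWLM.

      # Hedged sketch: sample a corpus from a generative (latent word -> surface
      # word) model, then estimate an ordinary n-gram from the samples.
      import random
      from collections import Counter

      transitions = {"<s>": ["A", "B"], "A": ["B"], "B": ["A"]}   # toy latent chain
      emissions = {"A": ["alpha", "apple"], "B": ["beta", "berry"]}  # toy emissions

      def sample_sentence(length=10):
          words, latent = [], "<s>"
          for _ in range(length):
              latent = random.choice(transitions.get(latent, ["<s>"]))  # latent word
              words.append(random.choice(emissions[latent]))            # surface word
          return words

      corpus = [sample_sentence() for _ in range(1000)]
      bigram_counts = Counter(pair for s in corpus for pair in zip(s, s[1:]))
      print(bigram_counts.most_common(3))  # train a standard n-gram on the samples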

  • Automated Duplicate Bug Report Detection Using Multi-Factor Analysis

    Jie ZOU  Ling XU  Mengning YANG  Xiaohong ZHANG  Jun ZENG  Sachio HIROKAWA  

     
    PAPER-Software Engineering
    Publicized: 2016/04/01  Vol: E99-D No:7  Page(s): 1762-1775

    Bug reports expressed in natural language text are often voluminous, ambiguous, and poorly written, which makes duplicate bug report detection challenging. Current automatic duplicate bug report detection techniques have mainly focused on textual information and ignored other useful factors. To improve detection accuracy, in this paper we propose a new approach called the LNG (LDA and N-gram) model, which takes advantage of both the topic model LDA and the word-based N-gram model. LNG considers multiple factors, including textual information, semantic correlation, word order, contextual connections, and categorical information, that potentially affect detection accuracy. In addition, the N-gram component adopted in our LNG model is improved by modifying the similarity algorithm. The experiment was conducted on more than 230,000 real bug reports from the Eclipse project. For the evaluation, we propose a new evaluation metric, the exact-accuracy (EA) rate, which enhances the understanding of the performance of duplicate detection. The evaluation results show that the recall rate, precision rate, and EA rate of the proposed method are all higher than those obtained by using LDA and the N-gram model separately. Moreover, the recall rate is improved by 2.96%-10.53% compared to the state-of-the-art approach DBTM.
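
    The sketch below illustrates the general combination, assuming a word-order-aware n-gram similarity is mixed with a topic-level cosine similarity; the topic vectors stand in for LDA output and the weight is invented, so this is not the paper's exact scoring.

      # Simplified sketch: combine n-gram similarity with topic similarity.
      def ngram_set(text, n=2):
          toks = text.lower().split()
          return set(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

      def ngram_sim(a, b, n=2):
          sa, sb = ngram_set(a, n), ngram_set(b, n)
          return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

      def topic_sim(ta, tb):
          dot = sum(x * y for x, y in zip(ta, tb))
          na = sum(x * x for x in ta) ** 0.5
          nb = sum(y * y for y in tb) ** 0.5
          return dot / (na * nb) if na and nb else 0.0

      def duplicate_score(a, b, ta, tb, w=0.5):
          return w * ngram_sim(a, b) + (1 - w) * topic_sim(ta, tb)

      r1 = "app crashes when opening settings dialog"
      r2 = "crash when opening the settings dialog"
      print(duplicate_score(r1, r2, [0.9, 0.1], [0.8, 0.2]))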

  • Diagnosis of Stochastic Discrete Event Systems Based on N-Gram Models with Wildcard Characters

    Kunihiko HIRAISHI  Koichi KOBAYASHI  

     
    PAPER
    Vol: E99-A No:2  Page(s): 462-467

    In previous papers by the authors, a new scheme for the diagnosis of stochastic discrete event systems, called sequence profiling (SP), was proposed. From given event logs, N-gram models that approximate the behavior of the target system are extracted. The N-gram models are used to discover discrepancies between observed event logs and the behavior of the system in the normal situation. However, when the target system is a distributed system consisting of several subsystems, event sequences from the subsystems may be interleaved, and SP cannot separate the faulty event sequence from the interleaved sequence. In this paper, we introduce wildcard characters into event patterns. This helps remove the effects of subsystems that may be unrelated to the faults.
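
    A minimal sketch of wildcard matching over event logs: a '*' position matches any event, letting a pattern skip interleaved events from unrelated subsystems. The pattern and log are invented.

      # Toy sketch of wildcard n-gram matching over an event log.
      def matches(pattern, gram):
          return all(p == "*" or p == g for p, g in zip(pattern, gram))

      def count_pattern(log, pattern):
          n = len(pattern)
          return sum(matches(pattern, tuple(log[i:i + n])) for i in range(len(log) - n + 1))

      log = ["a", "x", "b", "a", "y", "b", "a", "b"]
      print(count_pattern(log, ("a", "*", "b")))  # 'a ? b' with any event in between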

  • Diagnosis of Stochastic Discrete Event Systems Based on N-gram Models

    Miwa YOSHIMOTO  Koichi KOBAYASHI  Kunihiko HIRAISHI  

     
    PAPER
    Vol: E98-A No:2  Page(s): 618-625

    In this paper, we present a new method for the diagnosis of stochastic discrete event systems. The method is based on anomaly detection for sequences; we call it sequence profiling (SP). SP requires neither a system model nor system-specific knowledge: the only information necessary is event logs from the target system. Using event logs from the system in the normal situation, N-gram models are learned, where the N-gram model serves as an approximation of the system behavior. Based on the N-gram model, the diagnoser estimates what kind of fault has occurred in the system, or may conclude that no fault has occurred. The effectiveness of the proposed method is demonstrated by applying it to the diagnosis of a multi-processor system.
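
    The sketch below captures the core of this idea under simple assumptions: bigram counts are learned from normal logs, and a new log is scored by its average smoothed log-likelihood, with low scores indicating anomalies; the smoothing constants and logs are illustrative.

      # Minimal sketch of sequence profiling with an n-gram model.
      import math
      from collections import Counter

      def train(logs, n=2):
          grams, ctx = Counter(), Counter()
          for log in logs:
              for i in range(len(log) - n + 1):
                  grams[tuple(log[i:i + n])] += 1
                  ctx[tuple(log[i:i + n - 1])] += 1
          return grams, ctx

      def avg_logprob(log, grams, ctx, n=2, alpha=0.1, vocab=10):
          lp, m = 0.0, max(len(log) - n + 1, 1)
          for i in range(len(log) - n + 1):
              g = tuple(log[i:i + n])
              lp += math.log((grams[g] + alpha) / (ctx[g[:-1]] + alpha * vocab))
          return lp / m

      normal = [["start", "read", "write", "stop"]] * 20
      grams, ctx = train(normal)
      print(avg_logprob(["start", "read", "write", "stop"], grams, ctx))
      print(avg_logprob(["start", "crash", "crash", "stop"], grams, ctx))  # much lower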

  • Link Analysis Based on Rhetorical Relations for Multi-Document Summarization

    Nik Adilah Hanin BINTI ZAHRI  Fumiyo FUKUMOTO  Suguru MATSUYOSHI  

     
    PAPER-Natural Language Processing
    Vol: E96-D No:5  Page(s): 1182-1191

    This paper presents link analysis based on rhetorical relations, with the aim of performing extractive summarization of multiple documents. We first extracted sentences with salient terms from individual documents using a statistical model. We then ranked the extracted sentences by measuring their relative importance according to their connectivity among the sentences in the document set, using PageRank based on the rhetorical relations. The rhetorical relations were examined beforehand to determine which relations are crucial to this task, and the relations among sentences from the documents were automatically identified by SVMs. We used the relations to emphasize important sentences during sentence ranking by PageRank and to eliminate redundancy from the summary candidates. Our framework requires no sentences fully annotated by humans, and the evaluation results show that combining PageRank with rhetorical relations does help to improve the quality of extractive summarization.
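
    As an illustration of the ranking step, the sketch below runs PageRank by power iteration over a small sentence graph whose edges stand in for identified rhetorical relations; the adjacency matrix and damping factor are invented.

      # Sketch of ranking sentences with PageRank over a relation graph.
      import numpy as np

      def pagerank(adj, d=0.85, iters=50):
          n = len(adj)
          col_sums = adj.sum(axis=0)
          col_sums[col_sums == 0] = 1          # avoid division by zero
          m = adj / col_sums                   # column-stochastic transition matrix
          r = np.full(n, 1.0 / n)
          for _ in range(iters):
              r = (1 - d) / n + d * m @ r
          return r

      adj = np.array([[0, 1, 1, 0],            # 1 = a rhetorical relation holds
                      [1, 0, 1, 0],
                      [1, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
      print(pagerank(adj))  # sentence 2, the most connected, ranks highest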

  • Class-Based N-Gram Language Model for New Words Using Out-of-Vocabulary to In-Vocabulary Similarity

    Welly NAPTALI  Masatoshi TSUCHIYA  Seiichi NAKAGAWA  

     
    PAPER-Speech and Hearing
    Vol: E95-D No:9  Page(s): 2308-2317

    Out-of-vocabulary (OOV) words create serious problems for automatic speech recognition (ASR) systems. Not only are they misrecognized as in-vocabulary (IV) words with similar phonetics, but the error also causes further errors in nearby words. Language models (LMs) for most open-vocabulary ASR systems treat OOV words as a single entity, ignoring their linguistic information. In this paper we present a class-based n-gram LM that is able to deal with OOV words by treating each of them individually, without retraining all the LM parameters. OOV words are assigned to IV classes whose IV words have similar semantic meanings. The World Wide Web is used to acquire additional data for finding the relation between the OOV and IV words. An evaluation based on adjusted perplexity and word error rate was carried out on the Wall Street Journal corpus. The results suggest that multiple classes for OOV words are preferable to a single unknown class.
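
    A toy sketch of the class-based decomposition this relies on: p(word | history) = p(class | history) x p(word | class), so a new OOV word only needs a class assignment and an emission probability, not full retraining. All tables and the class inventory are invented.

      # Class-based n-gram handling of an OOV word (toy values throughout).
      p_class_given_history = {"WEEKDAY": 0.3, "CITY": 0.1}
      p_word_given_class = {"WEEKDAY": {"monday": 0.3, "friday": 0.3},
                            "CITY": {"tokyo": 0.5, "osaka": 0.5}}
      word_class = {"monday": "WEEKDAY", "friday": "WEEKDAY",
                    "tokyo": "CITY", "osaka": "CITY"}

      def add_oov(word, cls, prob):
          """Give a new word a share of its class's emission mass."""
          word_class[word] = cls
          p_word_given_class[cls][word] = prob

      def p_word(word):
          cls = word_class[word]
          return p_class_given_history[cls] * p_word_given_class[cls][word]

      add_oov("wednesday", "WEEKDAY", 0.1)   # OOV word joins a similar IV class
      print(p_word("wednesday"))             # 0.3 * 0.1 = 0.03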

  • Analysis of Eye Movements and Linguistic Boundaries in a Text for the Investigation of Japanese Reading Processes

    Akemi TERA  Kiyoaki SHIRAI  Takaya YUIZONO  Kozo SUGIYAMA  

     
    PAPER-Knowledge Acquisition
    Vol: E91-D No:11  Page(s): 2560-2567

    In order to investigate the reading processes of Japanese language learners, we conducted an experiment that recorded eye movements during Japanese text reading using an eye-tracking system. We previously showed that Japanese native speakers frequently use "forward and backward jumping eye movements" [13], [14]. In this paper, we further analyzed the same eye-tracking data. Our goal is to examine whether Japanese learners fix their eyes at boundaries of linguistic units such as words, phrases, or clauses when they start or end "backward jumping". We consider conventional linguistic boundaries as well as boundaries defined empirically from the entropy of an N-gram model. Another goal is to examine the relation between the entropy of the N-gram model and the depth of the syntactic structures of sentences. Our analysis shows that (1) Japanese learners often fix their eyes at linguistic boundaries, and (2) the average entropy is greatest at the fifth depth of the syntactic structures.
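
    The entropy computation implied above can be sketched as follows: under a character n-gram model, compute the entropy of the next-character distribution at each position, expecting peaks at boundaries. The toy corpus and the use of a bigram (rather than a longer n-gram) are simplifying assumptions.

      # Entropy of the next-character distribution under a toy bigram model.
      import math
      from collections import Counter, defaultdict

      corpus = "the cat sat on the mat the cat ran"
      nxt = defaultdict(Counter)
      for a, b in zip(corpus, corpus[1:]):
          nxt[a][b] += 1

      def next_entropy(ch):
          counts = nxt[ch]
          total = sum(counts.values())
          return -sum((c / total) * math.log2(c / total) for c in counts.values())

      for ch in "tea ":
          print(repr(ch), round(next_entropy(ch), 2))  # ' ' (a boundary) is highest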

  • Statistical Language Models for On-Line Handwriting Recognition

    Freddy PERRAUD  Christian VIARD-GAUDIN  Emmanuel MORIN  Pierre-Michel LALLICAN  

     
    PAPER-On-line Word Recognition
    Vol: E88-D No:8  Page(s): 1807-1814

    This paper incorporates statistical language models into an on-line handwriting recognition system for devices with limited memory and computational resources. The objective is to minimize the recognition error rate by taking the sentence context into account to disambiguate poorly written texts. Probabilistic word n-grams were investigated first; then, to fight the curse-of-dimensionality problem induced by such an approach and to significantly decrease the size of the language model, an extension to class-based n-grams was developed. In the latter case, the classes result either from a syntactic criterion or from a contextual criterion. Finally, a composite model is proposed; it combines both previous kinds of classes and exhibits superior performance compared with the word n-gram model. We report on many experiments involving different European languages (English, French, and Italian); they concern both language model evaluation, based on the classical perplexity measurement on test text corpora, and the evolution of the word error rate on test handwriting databases. These experiments show that the proposed approach significantly improves on state-of-the-art n-gram models, and that its integration into an on-line handwriting recognition system yields a substantial performance improvement.
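
    For reference, the sketch below shows the classical perplexity measurement mentioned above, computed for a toy smoothed bigram model; the training text, smoothing, and vocabulary size are invented.

      # Perplexity = 2 ** (average negative log2-probability per predicted word).
      import math
      from collections import Counter

      train = "a b a c a b a b".split()
      uni, bi = Counter(train), Counter(zip(train, train[1:]))

      def bigram_p(a, b, alpha=0.5, vocab=3):
          return (bi[(a, b)] + alpha) / (uni[a] + alpha * vocab)

      def perplexity(test):
          logp = sum(math.log2(bigram_p(a, b)) for a, b in zip(test, test[1:]))
          return 2 ** (-logp / (len(test) - 1))

      print(perplexity("a b a c".split()))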

  • Splitting Input for Machine Translation Using N-gram Language Model Together with Utterance Similarity

    Takao DOI  Eiichiro SUMITA  

     
    PAPER-Natural Language Processing
    Vol: E88-D No:6  Page(s): 1256-1264

    In order to boost the translation quality of corpus-based MT systems for speech translation, the technique of splitting an input utterance appears promising. In previous research, many methods used word-sequence characteristics such as N-gram clues around splitting positions. In this paper, to supplement splitting methods based on word-sequence characteristics, we introduce another clue: similarity based on edit distance. In our splitting method, we generate candidates for utterance splitting based on N-grams, and select the best one by measuring the utterance similarity against a corpus. This selection is founded on the assumption that a corpus-based MT system can correctly translate an utterance that is similar to an utterance in its training corpus. We conducted experiments using three MT systems: two EBMT systems, one of which uses a phrase as a translation unit and the other of which uses an utterance, and an SMT system. The translation results under various conditions were evaluated by objective measures and a subjective measure. The experimental results demonstrate that the proposed method is valuable for all three systems: using utterance similarity can improve the translation quality.
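
    The selection step can be sketched as below, assuming each candidate split is scored by the summed edit distance of its pieces to their nearest corpus utterances, with the lowest total winning; the corpus and candidates are invented, and the paper's actual similarity measure may differ in detail.

      # Choose the split whose pieces best match the training corpus.
      def edit_distance(a, b):
          """Single-row Levenshtein distance."""
          d = list(range(len(b) + 1))
          for i, ca in enumerate(a, 1):
              prev, d[0] = d[0], i
              for j, cb in enumerate(b, 1):
                  prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
          return d[len(b)]

      corpus = ["good morning", "could you help me"]

      def split_score(pieces):
          return sum(min(edit_distance(p, u) for u in corpus) for p in pieces)

      candidates = [["good morning could you help me"],
                    ["good morning", "could you help me"]]
      print(min(candidates, key=split_score))  # the corpus-like split wins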

  • Language Modeling Using Patterns Extracted from Parse Trees for Speech Recognition

    Takatoshi JITSUHIRO  Hirofumi YAMAMOTO  Setsuo YAMADA  Genichiro KIKUI  Yoshinori SAGISAKA  

     
    PAPER-Speech and Speaker Recognition
    Vol: E86-D No:3  Page(s): 446-453

    We propose new language models that represent phrasal structures by patterns extracted from parse trees. First, modified word trigram models are proposed. They are extracted from sentences analyzed in preprocessing by a knowledge-based parser. Since the sentences are analyzed into sub-trees of a few words, these trigram models can represent relations among a few neighboring words more strongly than conventional word trigram models. Second, word pattern models are used on top of these modified word trigram models. The word patterns are extracted from parse trees and can represent phrasal structures and much longer word dependencies than trigram models. Experimental results show that the modified trigram models are more effective than traditional trigram models and that the pattern models attain slight improvements over the modified trigram models. Furthermore, additional experiments show that the pattern models are more effective for long sentences.
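
    A rough sketch of the "modified trigram" idea under one simple reading: trigram statistics are collected only inside parser-produced sub-trees rather than across whole sentences. The sub-tree segmentation below is invented.

      # Count trigrams only within sub-trees, so statistics follow phrasal units.
      from collections import Counter

      # Each inner list is one sub-tree's word sequence from the parser (toy data).
      subtrees = [["the", "big", "dog"], ["chased", "the", "cat"]]

      trigrams = Counter()
      for words in subtrees:
          padded = ["<s>"] + words + ["</s>"]
          trigrams.update(zip(padded, padded[1:], padded[2:]))

      print(trigrams.most_common(3))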

  • N-Gram Modeling Based on Recognized Phonemes in Automatic Language Identification

    Hingkeung KWAN  Keikichi HIROSE  

     
    PAPER-Speech Processing and Acoustics
    Vol: E81-D No:11  Page(s): 1224-1231

    Due to the rather low phoneme recognition rate for noisy telephone speech, large differences may arise between N-grams built upon recognized phoneme labels and those built upon the originally attached phoneme labels, which in turn affects the performance of N-gram-based language identification methods. The use of N-grams built upon phoneme labels recognized from the training data was evaluated and shown to be more effective for language identification. The performance of a mixed phoneme recognizer, which includes both language-dependent and language-independent phonemes, was also evaluated. Results showed that its performance was better than that of parallel language-dependent phoneme recognizers, in which bias exists due to the different numbers of phonemes among languages.
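
    The identification scheme itself can be sketched as follows, assuming one smoothed phoneme bigram model per language trained on recognized labels and an argmax over per-language log-likelihoods; the phoneme strings and smoothing constants are invented.

      # N-gram language identification over (toy) recognized phoneme labels.
      import math
      from collections import Counter

      def train_bigram(seqs):
          uni, bi = Counter(), Counter()
          for s in seqs:
              uni.update(s)
              bi.update(zip(s, s[1:]))
          return uni, bi

      def loglik(seq, model, alpha=0.5, vocab=20):
          uni, bi = model
          return sum(math.log((bi[(a, b)] + alpha) / (uni[a] + alpha * vocab))
                     for a, b in zip(seq, seq[1:]))

      models = {"ja": train_bigram([list("katakanata")]),
                "en": train_bigram([list("thisisenglish")])}

      test = list("kata")
      print(max(models, key=lambda lang: loglik(test, models[lang])))  # 'ja'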

  • Two-Step Extraction of Bilingual Collocations by Using Word-Level Sorting

    Masahiko HARUNO  Satoru IKEHARA  

     
    PAPER-Artificial Intelligence and Cognitive Science
    Vol: E81-D No:10  Page(s): 1103-1110

    This paper describes a new method for learning bilingual collocations from sentence-aligned parallel corpora. Our method comprises two steps: (1) extracting useful word chunks (n-grams) in each language by word-level sorting, and (2) constructing bilingual collocations by combining the word chunks acquired in step (1). We apply the method to two kinds of Japanese-English texts: (1) scientific articles that comprise relatively literal translations, and (2) more challenging texts, a stock market bulletin in Japanese and its abstract in English. In both cases, domain-specific collocations are well captured even when they are not contained in dictionaries of specialized terms.
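
    Step (1) can be sketched with word-level suffix sorting, as below: sorting all word-level suffixes brings repeated chunks together, and the longest common word prefix of adjacent suffixes yields candidate n-gram chunks. The text is a toy monolingual stand-in.

      # Extract repeated word chunks by sorting word-level suffixes.
      words = "the stock market rose as the stock market rallied".split()
      suffixes = sorted(range(len(words)), key=lambda i: words[i:])

      chunks = set()
      for a, b in zip(suffixes, suffixes[1:]):
          # longest common word-prefix of two adjacent suffixes
          k = 0
          while a + k < len(words) and b + k < len(words) and words[a + k] == words[b + k]:
              k += 1
          if k >= 2:
              chunks.add(tuple(words[a:a + k]))

      print(chunks)  # repeated chunks like ('the', 'stock', 'market')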

  • Robust n-Gram Model of Japanese Character and Its Application to Document Recognition

    Hiroki MORI  Hirotomo ASO  Shozo MAKINO  

     
    PAPER-Postprocessing
    Vol: E79-D No:5  Page(s): 471-476

    A new postprocessing method for Japanese documents using an interpolated n-gram model is proposed. The method has advantages over conventional approaches in that it enables high-speed, knowledge-free processing. When estimating the parameters of an n-gram model for a large vocabulary, it is difficult to obtain sufficient training samples. To overcome this scarcity of samples, two smoothing methods for a Japanese character trigram model are evaluated, and the superiority of the deleted interpolation method is shown using perplexity. A document recognition system based on the trigram model is constructed, which finds maximum likelihood solutions with the Viterbi algorithm. Experimental results for three kinds of documents show that performance is high when the deleted interpolation method is used for smoothing: 90% of OCR errors are corrected for documents similar to the training text data, and 75% of errors are corrected for documents less similar to the training text data.
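
    The interpolated trigram at the heart of the method can be sketched as below, with fixed mixture weights standing in for the weights that deleted interpolation would estimate on held-out portions of the training data; the character data are toy input.

      # Interpolated trigram: mix trigram, bigram, and unigram relative frequencies.
      from collections import Counter

      chars = list("abracadabra")
      uni = Counter(chars)
      bi = Counter(zip(chars, chars[1:]))
      tri = Counter(zip(chars, chars[1:], chars[2:]))
      N = len(chars)

      def p_interp(c1, c2, c3, lams=(0.6, 0.3, 0.1)):
          p3 = tri[(c1, c2, c3)] / bi[(c1, c2)] if bi[(c1, c2)] else 0.0
          p2 = bi[(c2, c3)] / uni[c2] if uni[c2] else 0.0
          p1 = uni[c3] / N
          return lams[0] * p3 + lams[1] * p2 + lams[2] * p1

      print(p_interp("a", "b", "r"))  # high: 'abr' is frequent in the data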

  • Speech Recognition Using Function-Word N-Grams and Content-Word N-Grams

    Ryosuke ISOTANI  Shoichi MATSUNAGA  Shigeki SAGAYAMA  

     
    PAPER
    Vol: E78-D No:6  Page(s): 692-697

    This paper proposes a new stochastic language model for speech recognition based on function-word N-grams and content-word N-grams. Conventional word N-gram models are effective for speech recognition, but they represent only local constraints within a few successive words and lack the ability to capture global syntactic or semantic relationships between words. To represent more global constraints, the proposed language model gives the N-gram probabilities of word sequences, with attention given only to function words or only to content words. The sequences of function words and of content words are expected to represent syntactic and semantic constraints, respectively. Probabilities of function-word bigrams and content-word bigrams were estimated from a 10,000-sentence text database, and analysis using an information-theoretic measure showed that the expected constraints were extracted appropriately. As an application of this model to speech recognition, a post-processor was constructed to select the optimum sentence candidate from a phrase lattice obtained by a phrase recognition system. The phrase candidate sequence with the highest total acoustic and linguistic score was sought by dynamic programming. The results of experiments carried out on the utterances of 12 speakers showed that the proposed method is more accurate than a CFG-based method, demonstrating its effectiveness in improving speech recognition performance.
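
    The two-stream constraint can be sketched as follows, assuming a fixed function-word list: each sentence is split into its function-word and content-word subsequences, and a separate bigram model is estimated for each stream. The word list and sentence are invented.

      # Separate bigram statistics for function-word and content-word streams.
      from collections import Counter

      FUNCTION_WORDS = {"the", "a", "of", "in", "is", "to"}

      def split_streams(tokens):
          func = [w for w in tokens if w in FUNCTION_WORDS]
          cont = [w for w in tokens if w not in FUNCTION_WORDS]
          return func, cont

      train = "the cat is in the garden of the house".split()
      f, c = split_streams(train)
      func_bi, cont_bi = Counter(zip(f, f[1:])), Counter(zip(c, c[1:]))

      print(func_bi.most_common(2))  # e.g., ('the', 'is'), ('is', 'in') ...
      print(cont_bi.most_common(2))  # e.g., ('cat', 'garden'), ('garden', 'house')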