
Keyword Search Result

[Keyword] suffix tree (8 hits)

  • Packed Compact Tries: A Fast and Efficient Data Structure for Online String Processing

    Takuya TAKAGI  Shunsuke INENAGA  Kunihiko SADAKANE  Hiroki ARIMURA  

     
    PAPER

    Vol: E100-A No:9, Page(s): 1785-1793

    We present a new data structure called the packed compact trie (packed c-trie) which stores a set S of k strings of total length n in n log σ + O(k log n) bits of space and supports fast pattern matching queries and updates, where σ is the alphabet size. Assume that α = log_σ n letters are packed in a single machine word on the standard word RAM model, and let f(k,n) denote the query and update times of the dynamic predecessor/successor data structure of our choice which stores k integers from universe [1,n] in O(k log n) bits of space. Then, given a string of length m, our packed c-tries support pattern matching queries and insert/delete operations in $O(\frac{m}{\alpha} f(k,n))$ worst-case time and in $O(\frac{m}{\alpha} + f(k,n))$ expected time. Our experiments show that our packed c-tries are faster than the standard compact tries (a.k.a. Patricia trees) on real data sets. We also discuss applications of our packed c-tries.
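
    A minimal sketch (Python) of the word-packing idea behind the packed c-trie: with alphabet size σ, about α = log_σ n letters fit in one machine word, so two packed strings can be compared word by word rather than letter by letter. The DNA alphabet, word width, and function names below are illustrative assumptions, not taken from the paper.

        # Word-packing sketch: encode each letter in a few bits and pack
        # several letters into one fixed-width integer ("machine word").
        SIGMA = 4                                  # toy alphabet {A, C, G, T}
        CODE = {c: i for i, c in enumerate("ACGT")}
        BITS = max(1, (SIGMA - 1).bit_length())    # bits needed per letter

        def pack(s, word_bits=64):
            """Pack the string s into a list of word_bits-wide integers."""
            per_word = word_bits // BITS           # letters per word (the alpha above)
            words = []
            for i in range(0, len(s), per_word):
                w = 0
                for c in s[i:i + per_word]:
                    w = (w << BITS) | CODE[c]
                words.append(w)
            return words, per_word

        # Comparing packed strings touches O(m / alpha) words instead of m letters.
        a, alpha = pack("ACGTACGTACGT")
        b, _ = pack("ACGTACGTACGA")
        first_mismatch = next((i for i, (x, y) in enumerate(zip(a, b)) if x != y), None)
        print(alpha, first_mismatch)               # 32 0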

  • A Fast Parallel Algorithm for Indexing Human Genome Sequences

    Woong-Kee LOH  Kyoung-Soo HAN  

     
    LETTER-Data Engineering, Web Information Systems

    Vol: E97-D No:5, Page(s): 1345-1348

    A suffix tree is widely adopted for indexing genome sequences. While supporting highly efficient search, the suffix tree has a few shortcomings such as very large size and very long construction time. In this paper, we propose a very fast parallel algorithm to construct a disk-based suffix tree for human genome sequences. Our algorithm constructs a suffix array for part of the suffixes in the human genome sequence and then converts it into a suffix tree very quickly. It outperformed the previous algorithms by Loh et al. and Barsky et al. by up to 2.09 and 3.04 times, respectively.
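
    For illustration, the toy Python below builds the two intermediate structures such a conversion typically starts from: a suffix array and its LCP array. The naive O(n^2 log n) construction and the function names are assumptions of this sketch; the paper's parallel disk-based method is far more involved.

        def suffix_array(s):
            """Toy suffix array: indices of suffixes in lexicographic order."""
            return sorted(range(len(s)), key=lambda i: s[i:])

        def lcp_array(s, sa):
            """lcp[i] = longest common prefix length of suffixes sa[i-1] and sa[i]."""
            def lcp(i, j):
                k = 0
                while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
                    k += 1
                return k
            return [0] + [lcp(sa[i - 1], sa[i]) for i in range(1, len(sa))]

        seq = "GATTACA"                    # stand-in for a genome fragment
        sa = suffix_array(seq)
        print(sa)                          # [6, 4, 1, 5, 0, 3, 2]
        print(lcp_array(seq, sa))          # [0, 1, 1, 0, 0, 0, 1]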

  • A Storage-Efficient Suffix Tree Construction Algorithm for Human Genome Sequences

    Woong-Kee LOH  Heejune AHN  

     
    LETTER-Biological Engineering

    Vol: E94-D No:12, Page(s): 2557-2560

    The suffix tree is one of the most widely adopted indexes for genome sequence alignment. Although it supports very fast alignment, it has a couple of shortcomings, such as a very long construction time and a very large size. Loh et al. [7] proposed a suffix tree construction algorithm with dramatically improved performance; however, the size still remains a challenging problem. We propose an algorithm that extends the one by Loh et al. to reduce the suffix tree size. In our experiments, our algorithm constructed a suffix tree approximately 60% of the size in almost the same time.

  • A Fast Divide-and-Conquer Algorithm for Indexing Human Genome Sequences

    Woong-Kee LOH  Yang-Sae MOON  Wookey LEE  

     
    PAPER-Fundamentals of Information Systems

    Vol: E94-D No:7, Page(s): 1369-1377

    Since the release of the human genome sequences, one of the most important research issues has been indexing the genome sequences, and the suffix tree is the most widely adopted structure for that purpose. Traditional suffix tree construction algorithms suffer from severe performance degradation due to the memory bottleneck problem. Recent disk-based algorithms also provide only limited performance improvement due to random disk accesses. Moreover, they do not fully utilize recent CPUs with multiple cores. In this paper, we propose a fast algorithm based on a divide-and-conquer strategy for indexing the human genome sequences. Our algorithm nearly eliminates random disk accesses by accessing the disk in units of contiguous chunks. In addition, our algorithm fully utilizes multi-core CPUs by dividing the genome sequences into multiple partitions and then assigning each partition to a different core for parallel processing. Experimental results show that our algorithm outperforms the previous fastest DIGEST algorithm by up to 10.5 times.
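
    A minimal sketch (Python) of the partition-and-parallelize idea, assuming suffixes are grouped by their first letter and each partition is handled by a separate worker process; the sequence, partition scheme, and function names are illustrative, and the paper's chunked disk I/O is not modeled here.

        from multiprocessing import Pool

        SEQ = "ACGTACGGTACACGT"            # stand-in for a genome sequence

        def index_partition(prefix):
            """Index only the suffixes that start with the given prefix."""
            starts = [i for i in range(len(SEQ)) if SEQ.startswith(prefix, i)]
            # Each partition is small enough to sort (or build a subtree) in memory;
            # in practice the result would be written to disk sequentially.
            return prefix, sorted(starts, key=lambda i: SEQ[i:])

        if __name__ == "__main__":
            partitions = ["A", "C", "G", "T"]          # one partition per core
            with Pool(processes=len(partitions)) as pool:
                for prefix, order in pool.map(index_partition, partitions):
                    print(prefix, order)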

  • On the Construction of an Antidictionary with Linear Complexity Using the Suffix Tree

    Takahiro OTA  Hiroyoshi MORITA  

     
    PAPER-Algorithms and Data Structures

    Vol: E90-A No:11, Page(s): 2533-2539

    The antidictionary of a string is the set of all words of minimal length that never appear in this string. Antidictionaries are particularly useful for source coding. We present a fast and memory-efficient algorithm to construct an antidictionary using a suffix tree. It is proved that the complexity of this algorithm is linear in space and time, and its effectiveness is demonstrated by simulation results.
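
    A naive sketch (Python) of what an antidictionary contains, namely the minimal absent words of a string; this quadratic enumeration is only meant to make the definition concrete and is not the paper's linear-time suffix-tree construction.

        def minimal_absent_words(s):
            """A word w = a+u+b is a minimal absent word of s iff w never occurs
            in s while both a+u and u+b do occur in s."""
            alphabet = sorted(set(s))
            factors = {s[i:j] for i in range(len(s)) for j in range(i, len(s) + 1)}
            maws = set()
            for u in factors:              # u ranges over all factors, incl. ""
                for a in alphabet:
                    for b in alphabet:
                        w = a + u + b
                        if w not in factors and (a + u) in factors and (u + b) in factors:
                            maws.add(w)
            return sorted(maws)

        print(minimal_absent_words("abba"))    # ['aa', 'aba', 'bab', 'bbb']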

  • Substring Count Estimation in Extremely Long Strings

    Jinuk BAE  Sukho LEE  

     
    PAPER-Database

    Vol: E89-D No:3, Page(s): 1148-1156

    To estimate the number of substring matches against string data, count suffix trees (CS-trees) have been used as a kind of alphanumeric histogram. Although such trees are useful for substring count estimation in short data strings (e.g. names or titles), they reveal several drawbacks when the target is extremely long strings. First, it becomes too hard, or at least slow, to build CS-trees, because their underlying structure, the suffix tree, has a memory bottleneck problem with long strings. Second, some CS-tree node counts are incorrect due to frequent pruning of nodes. Therefore, we propose the count q-gram tree (CQ-tree) as an alphanumeric histogram for long strings. By adopting q-grams (or length-q substrings), CQ-trees can be created quickly and correctly within a small amount of available memory. Furthermore, we mathematically provide the lower and upper bounds that the count estimation can reach. To the best of our knowledge, our work is the first to present such bounds among research efforts to estimate alphanumeric selectivity. Our experimental study shows that the CQ-tree outperforms the CS-tree in terms of building time and accuracy.
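
    A minimal sketch (Python) of q-gram-based substring-count estimation; it illustrates the general idea of answering long-pattern counts from length-q statistics, not the CQ-tree structure or the paper's exact bounds, and the names below are illustrative.

        from collections import Counter

        def build_qgram_counts(text, q):
            """Exact occurrence counts for every substring of length q."""
            return Counter(text[i:i + q] for i in range(len(text) - q + 1))

        def estimate_count(pattern, counts, q):
            """For a pattern longer than q, every q-gram inside it occurs at least
            as often as the pattern itself, so the minimum q-gram count is an
            upper-bound estimate (patterns shorter than q are not handled here)."""
            if len(pattern) == q:
                return counts[pattern]     # exact for length-q patterns
            grams = [pattern[i:i + q] for i in range(len(pattern) - q + 1)]
            return min(counts[g] for g in grams)

        text = "abracadabra"
        counts = build_qgram_counts(text, q=2)
        print(estimate_count("abra", counts, q=2))   # 2 (here the bound is tight)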

  • Constructing the Suffix Tree of a Tree with a Large Alphabet

    Tetsuo SHIBUYA  

     
    PAPER

    Vol: E86-A No:5, Page(s): 1061-1066

    The problem of constructing the suffix tree of a tree is a generalization of the problem of constructing the suffix tree of a string. It has many applications, such as in minimizing the size of sequential transducers and in tree pattern matching. The best-known algorithm for this problem is Breslauer's O(n log |Σ|) time algorithm, where n is the size of the CS-tree and |Σ| is the alphabet size, which requires O(n log n) time if |Σ| is large. We improve this bound by giving an optimal linear time algorithm for integer alphabets. We also describe a new data structure, the Bsuffix tree, which enables efficient queries for patterns of completely balanced k-ary trees from a k-ary tree or forest. We also propose an optimal O(n) algorithm for constructing the Bsuffix tree for integer alphabets.

  • A Linear Time Pattern Matching Algorithm between a String and a Tree

    Tatsuya AKUTSU  

     
    PAPER-Algorithm and Computational Complexity

    Vol: E77-D No:3, Page(s): 281-287

    This paper presents a linear time algorithm for testing whether or not there is a path <v1, ..., vm> of an undirected tree T (|V(T)| = n) that coincides with a string s = s1...sm (i.e., label(v1)...label(vm) = s1...sm). Since any path of the tree is allowed, linear time substring matching algorithms cannot be directly applied, and a new method is developed. In the algorithm, O(n/m) vertices are selected from V(T) such that any path of length more than m/2 must contain at least one of the selected vertices. A search is performed using the selected vertices as 'bases', and two tables of size O(m) are constructed for each of the selected vertices. A suffix tree, which is a well-known data structure in string matching, is used effectively in the algorithm. From each of the selected vertices, a search is performed by traversing the suffix tree associated with s. Although the size of the alphabet is assumed to be bounded by a constant in this paper, the algorithm can be applied to the case of unbounded alphabets by increasing the time complexity to O(n log m).
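
    For concreteness, the brute-force Python below checks the same property, i.e. whether some path of a labeled undirected tree spells a given string; the toy tree, labels, and function names are assumptions of this sketch, and the exhaustive search is nowhere near the paper's linear time bound.

        def path_matches(adj, label, s):
            """True iff some simple path v1..vm satisfies label(v1)...label(vm) == s."""
            m = len(s)

            def dfs(v, depth, visited):
                if label[v] != s[depth]:
                    return False
                if depth == m - 1:
                    return True
                visited.add(v)
                found = any(u not in visited and dfs(u, depth + 1, visited)
                            for u in adj[v])
                visited.discard(v)
                return found

            return any(dfs(v, 0, set()) for v in adj)

        # Toy tree:  0(a) -- 1(b) -- 2(c),  1(b) -- 3(a)
        adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
        label = {0: "a", 1: "b", 2: "c", 3: "a"}
        print(path_matches(adj, label, "abc"))   # True  (e.g. the path 0-1-2)
        print(path_matches(adj, label, "ca"))    # False (no adjacent c-a pair)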