This article presents an algorithm that solves an on-line version of the longest common subsequence (LCS) problem for two strings over a constant alphabet in O(d+n) time and O(m+d) space. Here m is the length of the shorter string, which is given to the algorithm in full in advance; n is the length of the longer string, which is given as a data stream; and d is the number of dominant matches between the two strings. A new upper bound on d, O(p(m-q)), is also presented, where p is the length of the LCS of the two strings and q is the length of the LCS of the shorter string and the length-m prefix of the longer string.
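To make the quantities p and d concrete, the following is a minimal sketch that computes them with the textbook quadratic dynamic program. It is not the on-line O(d+n) algorithm of the article; it only illustrates the definitions. It assumes the standard characterization of a dominant match as a cell (i, j) whose LCS value is strictly larger than the values of both the cell above it and the cell to its left. The function name `lcs_dominant_matches` is hypothetical.

```python
def lcs_dominant_matches(a, b):
    """Return (p, dominant), where p is the LCS length of a and b and
    dominant lists the dominant matches as 1-based (i, j) pairs.

    Assumption: (i, j) is dominant when L[i][j] exceeds both L[i-1][j]
    and L[i][j-1], where L[i][j] is the LCS length of a[:i] and b[:j].
    This is an offline O(mn) illustration, not the article's algorithm.
    """
    m, n = len(a), len(b)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    dominant = []
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
            # A strict increase over both neighbours implies a[i-1] == b[j-1].
            if L[i][j] > L[i - 1][j] and L[i][j] > L[i][j - 1]:
                dominant.append((i, j))
    return L[m][n], dominant


p, dominant = lcs_dominant_matches("ab", "ba")
d = len(dominant)
```

With `a` as the shorter string, q can be obtained in the same way by calling the function on `a` and `b[:len(a)]` and taking the first component of the result.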
In this paper, we propose a method for detecting conserved domains in a set of amino acid sequences that belong to a protein family. The method detects the domains as follows: first, generate fixed-length subsequences from the sequences; second, construct a weighted graph that connects any two of the subsequences (vertices) whose similarity exceeds a predefined threshold; third, search for the maximum-density subgraph of each connected component of the graph; finally, explore conserved domains in the sequences by combining the results of the previous step. Applying the method to several protein families that have complex conserved domains, we found that it was able to detect those domains even though some of them were weakly conserved.
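The four steps can be sketched roughly as follows. This is an illustration under several assumptions, not the authors' implementation: the similarity measure (fraction of identical positions) is a placeholder, since the abstract does not name one; the exact maximum-density subgraph search is replaced by a simple greedy peeling heuristic; density is taken to be total edge weight divided by the number of vertices; and the final combining step is reduced to reporting the windows of each dense subgraph. All function names are hypothetical.

```python
from itertools import combinations


def windows(seqs, w):
    """Step 1: every length-w window as (sequence index, offset, text)."""
    return [(s, i, seq[i:i + w])
            for s, seq in enumerate(seqs)
            for i in range(len(seq) - w + 1)]


def identity(x, y):
    """Placeholder similarity (assumption): fraction of identical positions."""
    return sum(c == d for c, d in zip(x, y)) / len(x)


def build_graph(nodes, threshold):
    """Step 2: connect windows whose similarity exceeds the threshold."""
    adj = {v: {} for v in range(len(nodes))}
    for u, v in combinations(range(len(nodes)), 2):
        sim = identity(nodes[u][2], nodes[v][2])
        if sim > threshold:
            adj[u][v] = adj[v][u] = sim
    return adj


def components(adj):
    """Connected components by iterative depth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        comps.append(comp)
    return comps


def dense_subgraph_greedy(adj, comp):
    """Step 3 (heuristic swap-in): peel the lowest-degree vertex repeatedly
    and keep the vertex set with the highest weight-per-vertex density."""
    alive = set(comp)
    deg = {v: sum(adj[v].values()) for v in alive}
    total = sum(deg.values()) / 2
    best_density, best_set = total / len(alive), set(alive)
    while len(alive) > 1:
        v = min(alive, key=deg.get)
        alive.remove(v)
        total -= deg[v]
        for u, weight in adj[v].items():
            if u in alive:
                deg[u] -= weight
        if total / len(alive) > best_density:
            best_density, best_set = total / len(alive), set(alive)
    return best_set, best_density


def detect(seqs, w, threshold):
    """Step 4 (simplified): report (sequence index, offset) per dense group."""
    nodes = windows(seqs, w)
    adj = build_graph(nodes, threshold)
    found = []
    for comp in components(adj):
        best, _ = dense_subgraph_greedy(adj, comp)
        if len(best) >= 2:
            found.append({nodes[v][:2] for v in best})
    return found


groups = detect(["QQCDEFQQ", "WWWCDEFW", "CDEFYYYY"], w=4, threshold=0.9)
```

In this toy input the only windows that exceed the threshold are the three copies of `CDEF`, so `groups` holds a single set with their positions. A real application would substitute a substitution-matrix score for `identity` and an exact maximum-density subgraph algorithm for the greedy peeling.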