1-11hit |
Hiroki ARIMURA Akihiro YAMAMOTO
Inductive Logic Programming (ILP) is a study of machine learning systems that use clausal theories in first-order logic as a representation language. In this paper, we survey theoretical foundations of ILP from the viewpoints of Logic of Discovery and Machine Learning, and try to unify these two views with the support of the modern theory of Logic Programming. Firstly, we define several hypothesis construction methods in ILP and give their proof-theoretic foundations by treating them as a procedure which complets incomplete proofs. Next, we discuss the design of individual learning algorithms using these hypothesis construction methods. We review known results on learning logic programs in computational learning theory, and show that these algorithms are instances of a generic learning strategy with proof completion methods.
Kunihiro WASA Katsuhisa YAMANAKA Hiroki ARIMURA
Given two feasible solutions A and B, a reconfiguration problem asks whether there exists a reconfiguration sequence (A0=A, A1,...,Aℓ=B) such that (i) A0,...,Aℓ are feasible solutions and (ii) we can obtain Ai from Ai-1 under the prescribed rule (the reconfiguration rule) for each i ∈ {1,...,ℓ}. In this paper, we address the reconfiguration problem for induced trees, where an induced tree is a connected and acyclic induced subgraph of an input graph. We consider the following two rules as the prescribed rules: Token Jumping: removing u from an induced tree and adding v to the tree, and Token Sliding: removing u from an induced tree and adding v adjacent to u to the tree, where u and v are vertices of an input graph. As the main results, we show that (I) the reconfiguration problemis PSPACE-complete even if the input graph is of bounded maximum degree, (II) the reconfiguration problem is W[1]-hard when parameterized by both the size of induced trees and the length of the reconfiguration sequence, and (III) there exists an FPT algorithm when the problem is parameterized by both the size of induced trees and the maximum degree of an input graph under Token Jumping and Token Sliding.
Yoichi SASAKI Tetsuo SHIBUYA Kimihito ITO Hiroki ARIMURA
In this paper, we study the approximate point set matching (APSM) problem with minimum RMSD score under translation, rotation, and one-to-one correspondence in d-dimension. Since most of the previous works about APSM problems use similality scores that do not especially care about one-to-one correspondence between points, such as Hausdorff distance, we cannot easily apply previously proposed methods to our APSM problem. So, we focus on speed-up of exhaustive search algorithms that can find all approximate matches. First, we present an efficient branch-and-bound algorithm using a novel lower bound function of the minimum RMSD score for the enumeration version of APSM problem. Then, we modify this algorithm for the optimization version. Next, we present another algorithm that runs fast with high probability when a set of parameters are fixed. Experimental results on both synthetic datasets and real 3-D molecular datasets showed that our branch-and-bound algorithm achieved significant speed-up over the naive algorithm still keeping the advantage of generating all answers.
Tatsuya ASAI Kenji ABE Shinji KAWASOE Hiroshi SAKAMOTO Hiroki ARIMURA Setsuo ARIKAWA
In this paper, we consider a data mining problem for semi-structured data. Modeling semi-structured data as labeled ordered trees, we present an efficient algorithm for discovering frequent substructures from a large collection of semi-structured data. By extending the enumeration technique developed by Bayardo (SIGMOD'98) for discovering long itemsets, our algorithm scales almost linearly in the total size of maximal tree patterns contained in an input collection depending mildly on the size of the longest pattern. We also developed several pruning techniques that significantly speed-up the search. Experiments on Web data show that our algorithm runs efficiently on real-life datasets combined with proposed pruning techniques in the wide range of parameters.
Takuya TAKAGI Shunsuke INENAGA Kunihiko SADAKANE Hiroki ARIMURA
We present a new data structure called the packed compact trie (packed c-trie) which stores a set S of k strings of total length n in nlog σ+O(klog n) bits of space and supports fast pattern matching queries and updates, where σ is the alphabet size. Assume that α=logσn letters are packed in a single machine word on the standard word RAM model, and let f(k,n) denote the query and update times of the dynamic predecessor/successor data structure of our choice which stores k integers from universe [1,n] in O(klog n) bits of space. Then, given a string of length m, our packed c-tries support pattern matching queries and insert/delete operations in $O(rac{m}{alpha} f(k,n))$ worst-case time and in $O(rac{m}{alpha} + f(k,n))$ expected time. Our experiments show that our packed c-tries are faster than the standard compact tries (a.k.a. Patricia trees) on real data sets. We also discuss applications of our packed c-tries.
Yusaku KANETA Shingo YOSHIZAWA Shin-ichi MINATO Hiroki ARIMURA Yoshikazu MIYANAGA
In this paper, we propose a novel architecture for large-scale regular expression matching, called dynamically reconfigurable bit-parallel NFA architecture (Dynamic BP-NFA), which allows dynamic loading of regular expressions on-the-fly as well as efficient pattern matching for fast data streams. This is the first dynamically reconfigurable hardware with guaranteed performance for the class of extended patterns, which is a subclass of regular expressions consisting of union of characters and its repeat. This class allows operators such as character classes, gaps, optional characters, and bounded and unbounded repeats of character classes. The key to our architecture is the use of bit-parallel pattern matching approach, in which the information of an input non-deterministic finite automaton (NFA) is first compactly encoded in bit-masks stored in a collection of registers and block RAMs. Then, the NFA is efficiently simulated by a fixed circuitry using bitwise Boolean and arithmetic operations consuming one input character per clock regardless of the actual contents of an input text. Experimental results showed that our hardwares for both string and extended patterns were comparable to previous dynamically reconfigurable hardwares in their performances.
Kazuhiro KURITA Kunihiro WASA Takeaki UNO Hiroki ARIMURA
In this study, we address a problem pertaining to the induced matching enumeration. An edge set M is an induced matching of a graph G=(V,E). The enumeration of matchings has been widely studied in literature; however, there few studies on induced matching. A straightforward algorithm takes O(Δ2) time for each solution that is coming from the time to generate a subproblem, where Δ is the maximum degree in an input graph. To generate a subproblem, an algorithm picks up an edge e and generates two graphs, the one is obtained by removing e from G, the other is obtained by removing e, adjacent edge to e, and edges adjacent to adjacent edge of e. Since this operation needs O(Δ2) time, a straightforward algorithm enumerates all induced matchings in O(Δ2) time per solution. We investigated local structures that enable us to generate subproblems within a short time and proved that the time complexity will be O(1) if the input graph is C4-free. A graph is C4-free if and only if none of its subgraphs have a cycle of length four.
Iku OHAMA Takuya KIDA Hiroki ARIMURA
Latent variable models for relational data enable us to extract the co-cluster structure underlying observed relational data. The Infinite Relational Model (IRM) is a well-known relational model for discovering co-cluster structures with an unknown number of clusters. The IRM and several related models commonly assume that the link probability between two objects depends only on their cluster assignment. However, relational models based on this assumption often lead us to extract many non-informative and unexpected clusters. This is because the cluster structures underlying real-world relationships are often blurred by biases of individual objects. To overcome this problem, we propose a multi-layered framework, which extracts a clear de-blurred co-cluster structure in the presence of object biases. Then, we propose the Multi-Layered Infinite Relational Model (MLIRM) which is a special instance of the proposed framework incorporating the IRM as a co-clustering model. Furthermore, we reveal that some relational models can be regarded as special cases of the MLIRM. We derive an efficient collapsed Gibbs sampler to perform posterior inference for the MLIRM. Experiments conducted using real-world datasets have confirmed that the proposed model successfully extracts clear and interpretable cluster structures from real-world relational data.
Iku OHAMA Hiromi IIDA Takuya KIDA Hiroki ARIMURA
Latent variable models for relational data enable us to extract the co-cluster structures underlying observed relational data. The Infinite Relational Model (IRM) is a well-known relational model for discovering co-cluster structures with an unknown number of clusters. The IRM assumes that the link probability between two objects (e.g., a customer and an item) depends only on their cluster assignment. However, relational models based on this assumption often lead us to extract many non-informative and unexpected clusters. This is because the underlying co-cluster structures in real-world relationships are often destroyed by structured noise that blurs the cluster structure stochastically depending on the pair of related objects. To overcome this problem, in this paper, we propose an extended IRM that simultaneously estimates denoised clear co-cluster structure and a structured noise component. In other words, our proposed model jointly estimates cluster assignment and noise level for each object. We also present posterior probabilities for running collapsed Gibbs sampling to infer the model. Experiments on real-world datasets show that our model extracts a clear co-cluster structure. Moreover, we confirm that the estimated noise levels enable us to extract representative objects for each cluster.
Hiroki ARIMURA Takeshi SHINOHARA Setsuko OTSUKI
In this paper, we consider the polynomial time inferability from positive data for unions of two tree pattern languages. A tree pattern is a structured pattern known as a term in logic programming, and a tree pattern language is the set of all ground instances of a tree pattern. We present a polynomial time algorithm to find a minimal union of two tree pattern languages containing given examples. Our algorithm can be considered as a natural extension of Plotkin's least generalization algorithm, which finds a minimal single tree pattern language. By using this algorithm, we can realize a consistent and conservative polynomial time inference machine that identifies unions of two tree pattern languages from positive data in the limit.
Kunihiro WASA Yusaku KANETA Takeaki UNO Hiroki ARIMURA
By the motivation to discover patterns in massive structured data in the form of graphs and trees, we study a special case of the k-subtree enumeration problem with a tree of n nodes as an input graph, which is originally introduced by (Ferreira, Grossi, and Rizzi, ESA'11, 275-286, 2011) for general graphs. Based on reverse search technique (Avis and Fukuda, Discrete Appl. Math., vol.65, pp.21-46, 1996), we present the first constant delay enumeration algorithm that lists all k-subtrees of an input rooted tree in O(1) worst-case time per subtree. This result improves on the straightforward application of Ferreira et al.'s algorithm with O(k) amortized time per subtree when an input is restricted to tree. Finally, we discuss an application of our algorithm to a modification of the graph motif problem for trees.