1-10hit |
Iku OHAMA Takuya KIDA Hiroki ARIMURA
Latent variable models for relational data enable us to extract the co-cluster structure underlying observed relational data. The Infinite Relational Model (IRM) is a well-known relational model for discovering co-cluster structures with an unknown number of clusters. The IRM and several related models commonly assume that the link probability between two objects depends only on their cluster assignment. However, relational models based on this assumption often lead us to extract many non-informative and unexpected clusters. This is because the cluster structures underlying real-world relationships are often blurred by biases of individual objects. To overcome this problem, we propose a multi-layered framework, which extracts a clear de-blurred co-cluster structure in the presence of object biases. Then, we propose the Multi-Layered Infinite Relational Model (MLIRM) which is a special instance of the proposed framework incorporating the IRM as a co-clustering model. Furthermore, we reveal that some relational models can be regarded as special cases of the MLIRM. We derive an efficient collapsed Gibbs sampler to perform posterior inference for the MLIRM. Experiments conducted using real-world datasets have confirmed that the proposed model successfully extracts clear and interpretable cluster structures from real-world relational data.
Iku OHAMA Hiromi IIDA Takuya KIDA Hiroki ARIMURA
Latent variable models for relational data enable us to extract the co-cluster structures underlying observed relational data. The Infinite Relational Model (IRM) is a well-known relational model for discovering co-cluster structures with an unknown number of clusters. The IRM assumes that the link probability between two objects (e.g., a customer and an item) depends only on their cluster assignment. However, relational models based on this assumption often lead us to extract many non-informative and unexpected clusters. This is because the underlying co-cluster structures in real-world relationships are often destroyed by structured noise that blurs the cluster structure stochastically depending on the pair of related objects. To overcome this problem, in this paper, we propose an extended IRM that simultaneously estimates denoised clear co-cluster structure and a structured noise component. In other words, our proposed model jointly estimates cluster assignment and noise level for each object. We also present posterior probabilities for running collapsed Gibbs sampling to infer the model. Experiments on real-world datasets show that our model extracts a clear co-cluster structure. Moreover, we confirm that the estimated noise levels enable us to extract representative objects for each cluster.
In-Joong KIM Kyu-Young WHANG Hyuk-Yoon KWON
A top-k keyword query in relational databases returns k trees of tuples — where the tuples containing the query keywords are connected via primary key-foreign key relationships — in the order of relevance to the query. Existing works are classified into two categories: 1) the schema-based approach and 2) the schema-free approach. We focus on the former utilizing database schema information for more effective ranking of the query results. Ranking measures used in existing works can be classified into two categories: 1) the size of the tree (i.e., the syntactic score) and 2) ranking measures, such as TF-IDF, borrowed from the information retrieval field. However, these measures do not take into account semantic relevancy among relations containing the tuples in the query results. In this paper, we propose a new ranking method that ranks the query results by utilizing semantic relevancy among relations containing the tuples at the schema level. First, we propose a structure of semantically strongly related relations, which we call the strongly related tree (SRT). An SRT is a tree that maximally connects relations based on the lossless join property. Next, we propose a new ranking method, SRT-Rank, that ranks the query results by a new scoring function augmenting existing ones with the concept of the SRT. SRT-Rank is the first research effort that applies semantic relevancy among relations to ranking the results of keyword queries. To show the effectiveness of SRT-Rank, we perform experiments on synthetic and real datasets by augmenting the representative existing methods with SRT-Rank. Experimental results show that, compared with existing methods, SRT-Rank improves performance in terms of four quality measures — the mean normalized discounted cumulative gain (nDCG), the number of queries whose top-1 result is relevant to the query, the mean reciprocal rank, and the mean average precision — by up to 46.9%, 160.0%, 61.7%, and 63.8%, respectively. In addition, we show that the query performance of SRT-Rank is comparable to or better than those of existing methods.
Mi-Young CHOI Chang-Joo MOON Doo-Kwon BAIK
The Semantic Web uses RDF/RDFS, which can enable a machine to understand web data without human interference. But most web data is not available in RDF/RDFS documents because most web data is still stored in databases. It is much more favorable to use stored data in a database to build the Semantic Web. This paper proposes an enhanced relational RDF/RDFS interoperable data model (ER2iDM) and a transformation procedure from relational data model (RDM) to RDF/RDFS based on ER2iDM. The ER2iDM is a data model that plays the role of an inter-mediator between RDM and RDF/RDFS during a transformation procedure. The data and schema information in the database are migrated to the ER2iDM according to the proposed translation procedures without incurring loss of meaning of the entities, relationships, and data. The RDF/RDFS generation tool makes a RDF/RDFS XML document automatically from the ER2iDM. The proposed ER2iDM and transformation procedure provides detailed guidelines for transformation from RDM to RDF/RDFS unlike existing studies; therefore, we can more efficiently build up the Semantic Web using database stored data.
Hideaki MISAWA Keiichi HORIO Nobuo MOROTOMI Kazumasa FUKUDA Hatsumi TANIGUCHI
In the present paper, we address the problem of extrapolating group proximities from member relations, which we refer to as the group proximity problem. We assume that a relational dataset consists of several groups and that pairwise relations of all members can be measured. Under these assumptions, the goal is to estimate group proximities from pairwise relations. In order to solve the group proximity problem, we present a method based on embedding and distribution mapping, in which all relational data, which consist of pairwise dissimilarities or dissimilarities between members, are transformed into vectorial data by embedding methods. After this process, the distributions of the groups are obtained. Group proximities are estimated as distances between distributions by distribution mapping methods, which generate a map of distributions. As an example, we apply the proposed method to document and bacterial flora datasets. Finally, we confirm the feasibility of using the proposed method to solve the group proximity problem.
In 1982, Buckles and Petry proposed fuzzy relational database for incorporating non-ideal or fuzzy information in a relational database. The fuzzy relational database relies on the specification of similarity relation in order to distinguish each scalar domain in the fuzzy database. These relations are reflexive, symmetric, and max-min transitive. In 1989, Shenoi and Melton extended the fuzzy relational database model of Buckles and Petry to deal with proximity relation for scalar domain. Since reflexivity and symmetry are the only constraints placed on proximity relations, proximity relation is considered as a generalization of similarity relation. However, we realized that naturally relation between fuzzy information is not symmetric. Here, we consider using conditional probability relation to represent similarity between two fuzzy data. Related to the properties of conditional probability relation, we introduce an interesting mathematical relation, called weak similarity relation, as generalization of similarity relation as well as proximity relation in which conditional probability relation is regarded as a concrete example of the weak similarity relation. In this paper, we propose design of fuzzy relational database to deal with conditional probability relation for scalar domain. These relations are reflexive and not symmetric. In addition, we define a notion of asymmetric redundant tuple based on two interpretations generalizing the concept of redundancy in classical relational database. In the relation to data querying, we discuss partitioning of domains with the objective of developing similarity class. Finally, we propose a new definition of partial fuzzy functional dependency (PFFD). Fuzzy functional dependency (FFD) as an extension of functional dependency (FD), usually used in design of fuzzy relational database, can be generated by the PFFD. Inference rules that are similar to Armstrong's Axioms for the FFD are both sound and complete.
Hanmin JUNG Gary Geunbae LEE Won Seug CHOI KyungKoo MIN Jungyun SEO
This paper describes a highly-portable multilingual question answering system on multiple relational databases. We apply techniques which were verified on open-domain text-based question answering, such as semantic category and pattern-based grammars, into natural language interfaces to relational databases. Lexico-semantic pattern (LSP) and multi-level grammars achieve portability of languages, domains, and DB management systems. The LSP-based linguistic processing does not require deep analysis that sacrifices robustness and flexibility, but can handle delicate natural language questions. To maximize portability, we drive three dependency factors into the following two parts: language-dependent part into front linguistic analysis, and domain-dependent and database-dependent parts into backend SQL query generation. We also support session-based dialog by preserving SQL queries created from previous user's question, and then re-generating new SQL query for the successive questions. Experiments with 779 queries generate only constraint-missing errors, which can be easily corrected by adding new terms, of 2.25% for English and 5.67% for Korean.
Takayuki TAMURA Masato OGUCHI Masaru KITSUREGAWA
We developed a PC cluster system which consists of 100 PCs as a test bed for massively parallel query processing. Each PC employs the 200 MHz Pentium Pro CPU and is connected with others through an ATM switch. Because the query processing applications are insensitive to the communication latency and mainly perform integer operations, the ATM connected PC cluster approach can be considered a reasonable solution for high performance database servers with low costs. However, there has been no challenge to construct large scale PC clusters for database applications, as far as the authors know. Though we employed commodity components as much as possible, we developed the DBMS itself, because that was a key component for obtaining high performance in parallel query processing, and there seemed no system which could meet our demand. On each PC node, a server program which acts as a database kernel is running to process the queries in cooperation with other nodes. The kernel was designed to execute pipelined operators and handle voluminous data efficiently, to achieve high performance on complex decision support type queries. We used the standard benchmark, TPC-D, on a 100 GB database to verify the feasibility of our approach, through comparison of our system with commercial parallel systems. As a whole, our system exhibited sufficiently high performance which was competitive with the current TPC-D top records, in spite of not using indices. For some heavy queries in the benchmark, which have high selectivity and joinability, our system performed much better. In addition, we applied transposed file organization to the database for further performance improvement. The transposed file organization vertically partitions the tuples, enabling attribute-by-attribute access to the relations. This resulted in significant performance improvement by reducing the amount of disk I/O and shifting the bottleneck to computation.
Masaru KITSUREGAWA Weikang YANG Satoshi HIRANO Masanobu HARADA Minoru NAKAMURA Kazuhiro SUZUKI TaKayuki TAMURA Mikio TAKAGI
This paper presents an overview of the SDC-I (Super Database Computer I) developed at the University of Tokyo, Japan. The purpose of the project was to build a high performance SQL server which emphasizes query processing over transaction processing. Recently relational database systems tend to be used for heavy decision support queries, which include many join, aggregation, and order-by operations. At present high-end mainframes are used for these applications requiring several hours in some cases. While the system architecture for high traffic transaction processing systems is well established, that for adhoc query processing has not yet adequately understood. SDC-I proved that a parallel machine could attain significant performance improvements over a coventional sequential machine through the exploitation of the high degree of parallelism present in relational query processing. A unique bucket spreading parallel hash join algorithm is employed in SDC, which makes the system very robust in the presense of data skew and allows SDC to attain almost linear performance scalability. SDC adopts a hybrid parallel architecture, where globally it is a shared nothing architecture, that is, modules are connected through the multistage network, but each module itself is a symmetric multiprocessor system. Although most of the hardware elements use commodity microprocessors for improved performance to cost, only the interconnection network incorporates the special function to support our parallel relational algorithm. Data movement over the memory and the network, rather than computation, is heavy for I/O intensive database processing. A dedicated software system was carefully designed for efficient data movement. The implemented prototype consists of two modules. Its hardware and software organization is described. The performance monitoring tool was developed to visualize the system activities, which showed that SDC-I works very efficiently.
Jeong Uk KIM Jae Moon LEE Myunghwan KIM
A tag-partitioned join algorithm is described. The algorithm partitions only one relation, while other partition-based algorithms partition both relations. It is performed as the joinable tuples of one relation are rearranged and some of them are duplicated according to the original sequence of the join attribute values of the other relation. To do this, the algorithm first finds the positions of all the tuples of the other relation which are joinable with each tuple of one relation, and then partitions joinable tuples of one relation into buckets by using the positions found. Final joining is performed on the partitioned relation and the other relation. We analyze and compare the performance of the algorithm with that of other partition-based join algorithms. The comparison shows that our method is better than other partition-based methods under the practical values of the analysis parameters.