1-8hit |
Woong-Kee LOH Yang-Sae MOON Jun-Gyu KANG
In this paper, we emphasize the need for data cleansing when clustering large-scale transaction databases and propose a new data cleansing method that improves clustering quality and performance. We evaluate our data cleansing method through a series of experiments. As a result, the clustering quality and performance were significantly improved by up to 165% and 330%, respectively.
Woong-Kee LOH Yang-Sae MOON Heejune AHN
We propose a robust and efficient algorithm called ROCKET for clustering large-scale transaction databases. ROCKET is a divisive hierarchical algorithm that makes the most of recent hardware architecture. ROCKET handles the cases with the small and the large number of similar transaction pairs separately and efficiently. Through experiments, we show that ROCKET achieves high-quality clustering with a dramatic performance improvement.
Woong-Kee LOH Yang-Sae MOON Wookey LEE
Since the release of human genome sequences, one of the most important research issues is about indexing the genome sequences, and the suffix tree is most widely adopted for that purpose. The traditional suffix tree construction algorithms suffer from severe performance degradation due to the memory bottleneck problem. The recent disk-based algorithms also provide limited performance improvement due to random disk accesses. Moreover, they do not fully utilize the recent CPUs with multiple cores. In this paper, we propose a fast algorithm based on `divide-and-conquer' strategy for indexing the human genome sequences. Our algorithm nearly eliminates random disk accesses by accessing the disk in the unit of contiguous chunks. In addition, our algorithm fully utilizes the multi-core CPUs by dividing the genome sequences into multiple partitions and then assigning each partition to a different core for parallel processing. Experimental results show that our algorithm outperforms the previous fastest DIGEST algorithm by up to 10.5 times.
Byung-Kwon PARK Woong-Kee LOH Jeong-Joon LEE Chong-Mok PARK Kyu-Young WHANG
In this paper, we describe the design and implementation of a hypermedia system that has the following characteristics. First, being designed according to the Dexter hypertext reference model, it has a layered architecture and thus maintains commonality with other hypermedia systems based on the Dextermodel. Second, being designed based on the MHEG standard, it has data structures that are inherently suitable for data interchange and synchronization. Third, adopting the MIME protocol, it provides multimedia mail services. Finally, being built on top of an object-oriented DBMS, it makes it easy to represent Dexter and MHEG models and also provides efficient storage and search capabilities. The contribution of this paper is combining these characteristics to build an integrated hypermedia system reflecting reference architectures and international standard efforts.
The suffix tree is one of most widely adopted indexes in the application of genome sequence alignment. Although it supports very fast alignment, it has a couple of shortcomings, such as a very long construction time and a very large volume size. Loh et al. [7] proposed a suffix tree construction algorithm with dramatically improved performance; however, the size still remains as a challenging problem. We propose an algorithm by extending the one by Loh et al. to reduce the suffix tree size. As a result of our experiments, our algorithm constructed a suffix tree of approximately 60% of the size within almost the same time period.
Woong-Kee LOH Sang-Wook KIM Kyu-Young WHANG
In this paper we propose a subsequence matching algorithm that supports moving average transform of arbitrary order in time-series databases. Moving average transform reduces the effect of noise and has been used in many areas such as econometrics since it is useful in finding the overall trends. The proposed algorithm extends the existing subsequence matching algorithm proposed by Faloutsos et al. (SUB94 in short). If we applied the algorithm without any extension, we would have to generate an index for each moving average order and would have serious storage and CPU time overhead. In this paper we tackle the problem using the notion of index interpolation. Index interpolation is defined as a searching method that uses one or more indexes generated for a few selected cases and performs searching for all the cases satisfying some criteria. The proposed algorithm, which is based on index interpolation, can use only one index for a pre-selected moving average order k and performs subsequence matching for arbitrary order m ( k). We prove that the proposed algorithm causes no false dismissal. The proposed algorithm can also use more than one index to improve search performance. The algorithm works better with smaller selectivities. For selectivities less than 10-2, the degradation of search performance compared with the fully-indexed case--which is equivalent to SUB94--is no more than 33.0% when one index is used, and 17.2% when two indexes are used. Since the queries with smaller selectivities are much more frequent in general database applications, the proposed algorithm is suitable for practical situations.
A suffix tree is widely adopted for indexing genome sequences. While supporting highly efficient search, the suffix tree has a few shortcomings such as very large size and very long construction time. In this paper, we propose a very fast parallel algorithm to construct a disk-based suffix tree for human genome sequences. Our algorithm constructs a suffix array for part of the suffixes in the human genome sequence and then converts it into a suffix tree very quickly. It outperformed the previous algorithms by Loh et al. and Barsky et al. by up to 2.09 and 3.04 times, respectively.
Woong-Kee LOH Yang-Sae MOON Young-Ho PARK
Due to the recent technical advances, GPUs are used for general applications as well as screen display. Many research results have been proposed to the performance of previous CPU-based algorithms by a few hundred times using the GPUs. In this paper, we propose a density-based clustering algorithm called GSCAN, which reduces the number of unnecessary distance computations using a grid structure. As a result of our experiments, GSCAN outperformed CUDA-DClust [2] and DBSCAN [3] by up to 13.9 and 32.6 times, respectively.