The join operation is one of the most expensive operations in relational database systems. So far many researchers have proposed several hash-based algorithms for the join operation. In a hash-based algorithm, a large relation is first partitioned into several clusters. When clusters overflow, that is, when the size of the cluster exceeds the size of main memory, the performance of hash-based algorithms degrade substantially. Previously we proposed the GN hash algorithm which is robust in the presence of overflown clusters. The GN hash join algorithm combines the Grace hash join and hash-based nested-loop join algorithms. We analyze the performance of the GN hash join algorithm when applied to relations with a non-uniform Zipf-like data distribution. The performance is compared with other hash-based join algorithms: Grace, Hybrid, nested-loop, and simple hash join. The GN hash join algorithm is found to have higher performance on non-uniformly distributed relations. In this paper, the robustness of the GN hash algorithm from the point of choosing a run time method is verified. In the GN hash algorithm, the criterion for selecting a run time method from the two algorithm is determined by using the value calculated from the I/O cost formula of the two algorithms. This criterion cannot be guaranteed to be optimal under every data distribution, that is, the optimal criterion may change depending on the data distribution. When the data distribution is unknown, all data has to be repartitioned in order to get an accurate optimal criterion. However, from the view of choosing a method at run time, it is necessary for the GN hash algorithm to determine an appropriate criterion regardless of the data distribution. Thus, we inspect the criterion adopted in our algorithm under a simulation environment. From simulation results, we find that the range of the criterion is very wide under any data distribution and assure that the criterion determined with the assumption of a uniform data distribution can be used even when the data is highly skewed. Consequently, we can conclude that the GN hash algorithm which dynamically selects the nested-loop and Grace hash algorithms provides good performance in the presence of data skew and its performance is not sensitive to the criterion.
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Copy
Miyuki NAKANO, Masaru KITSUREGAWA, "Examination of Criterion for Choosing a Run Time Method in GN Hash Join Algorithm" in IEICE TRANSACTIONS on Information,
vol. E79-D, no. 11, pp. 1561-1569, November 1996, doi: .
Abstract: The join operation is one of the most expensive operations in relational database systems. So far many researchers have proposed several hash-based algorithms for the join operation. In a hash-based algorithm, a large relation is first partitioned into several clusters. When clusters overflow, that is, when the size of the cluster exceeds the size of main memory, the performance of hash-based algorithms degrade substantially. Previously we proposed the GN hash algorithm which is robust in the presence of overflown clusters. The GN hash join algorithm combines the Grace hash join and hash-based nested-loop join algorithms. We analyze the performance of the GN hash join algorithm when applied to relations with a non-uniform Zipf-like data distribution. The performance is compared with other hash-based join algorithms: Grace, Hybrid, nested-loop, and simple hash join. The GN hash join algorithm is found to have higher performance on non-uniformly distributed relations. In this paper, the robustness of the GN hash algorithm from the point of choosing a run time method is verified. In the GN hash algorithm, the criterion for selecting a run time method from the two algorithm is determined by using the value calculated from the I/O cost formula of the two algorithms. This criterion cannot be guaranteed to be optimal under every data distribution, that is, the optimal criterion may change depending on the data distribution. When the data distribution is unknown, all data has to be repartitioned in order to get an accurate optimal criterion. However, from the view of choosing a method at run time, it is necessary for the GN hash algorithm to determine an appropriate criterion regardless of the data distribution. Thus, we inspect the criterion adopted in our algorithm under a simulation environment. From simulation results, we find that the range of the criterion is very wide under any data distribution and assure that the criterion determined with the assumption of a uniform data distribution can be used even when the data is highly skewed. Consequently, we can conclude that the GN hash algorithm which dynamically selects the nested-loop and Grace hash algorithms provides good performance in the presence of data skew and its performance is not sensitive to the criterion.
URL: https://global.ieice.org/en_transactions/information/10.1587/e79-d_11_1561/_p
Copy
@ARTICLE{e79-d_11_1561,
author={Miyuki NAKANO, Masaru KITSUREGAWA, },
journal={IEICE TRANSACTIONS on Information},
title={Examination of Criterion for Choosing a Run Time Method in GN Hash Join Algorithm},
year={1996},
volume={E79-D},
number={11},
pages={1561-1569},
abstract={The join operation is one of the most expensive operations in relational database systems. So far many researchers have proposed several hash-based algorithms for the join operation. In a hash-based algorithm, a large relation is first partitioned into several clusters. When clusters overflow, that is, when the size of the cluster exceeds the size of main memory, the performance of hash-based algorithms degrade substantially. Previously we proposed the GN hash algorithm which is robust in the presence of overflown clusters. The GN hash join algorithm combines the Grace hash join and hash-based nested-loop join algorithms. We analyze the performance of the GN hash join algorithm when applied to relations with a non-uniform Zipf-like data distribution. The performance is compared with other hash-based join algorithms: Grace, Hybrid, nested-loop, and simple hash join. The GN hash join algorithm is found to have higher performance on non-uniformly distributed relations. In this paper, the robustness of the GN hash algorithm from the point of choosing a run time method is verified. In the GN hash algorithm, the criterion for selecting a run time method from the two algorithm is determined by using the value calculated from the I/O cost formula of the two algorithms. This criterion cannot be guaranteed to be optimal under every data distribution, that is, the optimal criterion may change depending on the data distribution. When the data distribution is unknown, all data has to be repartitioned in order to get an accurate optimal criterion. However, from the view of choosing a method at run time, it is necessary for the GN hash algorithm to determine an appropriate criterion regardless of the data distribution. Thus, we inspect the criterion adopted in our algorithm under a simulation environment. From simulation results, we find that the range of the criterion is very wide under any data distribution and assure that the criterion determined with the assumption of a uniform data distribution can be used even when the data is highly skewed. Consequently, we can conclude that the GN hash algorithm which dynamically selects the nested-loop and Grace hash algorithms provides good performance in the presence of data skew and its performance is not sensitive to the criterion.},
keywords={},
doi={},
ISSN={},
month={November},}
Copy
TY - JOUR
TI - Examination of Criterion for Choosing a Run Time Method in GN Hash Join Algorithm
T2 - IEICE TRANSACTIONS on Information
SP - 1561
EP - 1569
AU - Miyuki NAKANO
AU - Masaru KITSUREGAWA
PY - 1996
DO -
JO - IEICE TRANSACTIONS on Information
SN -
VL - E79-D
IS - 11
JA - IEICE TRANSACTIONS on Information
Y1 - November 1996
AB - The join operation is one of the most expensive operations in relational database systems. So far many researchers have proposed several hash-based algorithms for the join operation. In a hash-based algorithm, a large relation is first partitioned into several clusters. When clusters overflow, that is, when the size of the cluster exceeds the size of main memory, the performance of hash-based algorithms degrade substantially. Previously we proposed the GN hash algorithm which is robust in the presence of overflown clusters. The GN hash join algorithm combines the Grace hash join and hash-based nested-loop join algorithms. We analyze the performance of the GN hash join algorithm when applied to relations with a non-uniform Zipf-like data distribution. The performance is compared with other hash-based join algorithms: Grace, Hybrid, nested-loop, and simple hash join. The GN hash join algorithm is found to have higher performance on non-uniformly distributed relations. In this paper, the robustness of the GN hash algorithm from the point of choosing a run time method is verified. In the GN hash algorithm, the criterion for selecting a run time method from the two algorithm is determined by using the value calculated from the I/O cost formula of the two algorithms. This criterion cannot be guaranteed to be optimal under every data distribution, that is, the optimal criterion may change depending on the data distribution. When the data distribution is unknown, all data has to be repartitioned in order to get an accurate optimal criterion. However, from the view of choosing a method at run time, it is necessary for the GN hash algorithm to determine an appropriate criterion regardless of the data distribution. Thus, we inspect the criterion adopted in our algorithm under a simulation environment. From simulation results, we find that the range of the criterion is very wide under any data distribution and assure that the criterion determined with the assumption of a uniform data distribution can be used even when the data is highly skewed. Consequently, we can conclude that the GN hash algorithm which dynamically selects the nested-loop and Grace hash algorithms provides good performance in the presence of data skew and its performance is not sensitive to the criterion.
ER -