In the current era of data science, data quality has a critical impact on business operations, and meteorological data are no exception. However, conventional methods of meteorological data quality control mainly focus on error detection and null-value detection; that is, they consider only the data output and ignore the quality problems that may also arise in the workflow. To rectify this issue, this paper proposes the Total Meteorological Data Quality (TMDQ) framework based on the Total Quality Management (TQM) perspective, with particular attention to the systematic nature of data warehousing and the need for a process focus. In practical applications, the proposed framework serves as the basis for developing a system that helps meteorological observers improve and maintain the quality of meteorological data in a timely and efficient manner. To verify the feasibility of the proposed framework and demonstrate its capabilities and usage, it was implemented at the Tamsui Meteorological Observatory (TMO) in Taiwan. The four quality dimension indicators established through the proposed framework help meteorological observers grasp the various characteristics of meteorological data from different aspects. The application and research limitations of the proposed framework are discussed, and possible directions for future research are presented.
Woo-Lam KANG Hyeon-Gyu KIM Yoon-Joon LEE
This paper presents a method to reduce I/O cost in MapReduce when online analytical processing (OLAP) queries are used for data analysis. The proposed method consists of two basic ideas. First, to reduce network transmission cost, mappers are organized to receive only the data necessary to perform a map task, not the entire set of input data. Second, to reduce storage consumption, only record IDs are stored for checkpointing, not the raw records. Experiments conducted with the TPC-H benchmark show that the proposed method is about 40% faster than Hive, the well-known data warehouse solution for MapReduce, while reducing the size of data stored for checkpointing to about 80%.
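The second idea in the abstract above, checkpointing record IDs rather than raw records, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the in-memory dictionary standing in for the input data, and the JSON checkpoint format are all assumptions made for illustration.

```python
import json


def checkpoint_ids(records, predicate):
    """Return a JSON checkpoint holding only the IDs of the selected records.

    `records` is a hypothetical in-memory stand-in for the input data,
    mapping a record ID to the raw record.
    """
    return json.dumps([rid for rid, rec in records.items() if predicate(rec)])


def restore_from_checkpoint(records, checkpoint):
    """Rebuild the selected records by looking the saved IDs up in the input."""
    return [records[rid] for rid in json.loads(checkpoint)]


# Hypothetical sample data, not taken from TPC-H.
records = {
    1: {"region": "ASIA", "price": 100.0},
    2: {"region": "EUROPE", "price": 250.0},
    3: {"region": "ASIA", "price": 75.5},
}
checkpoint = checkpoint_ids(records, lambda rec: rec["region"] == "ASIA")
restored = restore_from_checkpoint(records, checkpoint)
```

The trade-off in this sketch is that recovery must re-read the input to resolve the IDs, in exchange for a checkpoint that is much smaller than a copy of the raw records.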
Seokjin HONG Bongki MOON Sukho LEE
A range top-k query returns the topmost k records in the order set by a measure attribute within a specified region of multi-dimensional data. The range top-k query is a powerful tool for analysis in spatial databases and data warehouse environments. In this paper, we propose an algorithm to answer the query by selectively traversing an aggregate R-tree having MAX as the aggregate values. The algorithm can execute the query by accessing only a small part of the leaf nodes within a query region. Therefore, it shows good query performance regardless of the size of the query region. We suggest an efficient pruning technique for the priority queue, which reduces the cost of handling the priority queue, and also propose an efficient technique for leaf node organization to reduce the number of node accesses to execute the range top-k queries.
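The traversal described in the abstract above can be sketched as a best-first search over a tree whose nodes store the MAX of the measure attribute beneath them. The sketch below is a simplified illustration under stated assumptions: the `Node` and `Record` classes, the two-dimensional boxes, and the helper names are hypothetical, and the paper's priority-queue pruning and leaf-node organization techniques are not reproduced.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Record:
    x: float
    y: float
    value: float  # the measure attribute


@dataclass
class Node:
    box: tuple        # (xmin, ymin, xmax, ymax)
    max_value: float  # MAX aggregate over everything below this node
    children: list = field(default_factory=list)  # child nodes (internal node)
    records: list = field(default_factory=list)   # records (leaf node)


def make_leaf(records):
    xs = [r.x for r in records]
    ys = [r.y for r in records]
    box = (min(xs), min(ys), max(xs), max(ys))
    return Node(box, max(r.value for r in records), records=list(records))


def make_internal(children):
    box = (min(c.box[0] for c in children), min(c.box[1] for c in children),
           max(c.box[2] for c in children), max(c.box[3] for c in children))
    return Node(box, max(c.max_value for c in children), children=list(children))


def intersects(box, region):
    return (box[0] <= region[2] and region[0] <= box[2]
            and box[1] <= region[3] and region[1] <= box[3])


def contains(region, rec):
    return region[0] <= rec.x <= region[2] and region[1] <= rec.y <= region[3]


def range_top_k(root, region, k):
    """Return up to k records inside `region`, in descending order of value."""
    tie = count()  # tie-breaker so the heap never compares nodes or records
    heap = [(-root.max_value, next(tie), root)]
    result = []
    while heap and len(result) < k:
        _, _, item = heapq.heappop(heap)
        if isinstance(item, Record):
            # No remaining entry has a larger upper bound, so this is next.
            result.append(item)
            continue
        for child in item.children:
            if intersects(child.box, region):
                heapq.heappush(heap, (-child.max_value, next(tie), child))
        for rec in item.records:
            if contains(region, rec):
                heapq.heappush(heap, (-rec.value, next(tie), rec))
    return result


# Hypothetical sample data.
leaf_a = make_leaf([Record(1, 1, 10), Record(2, 2, 50), Record(3, 1, 30)])
leaf_b = make_leaf([Record(8, 8, 90), Record(9, 9, 20), Record(7, 9, 60)])
root = make_internal([leaf_a, leaf_b])
top3 = [r.value for r in range_top_k(root, (0, 0, 10, 10), 3)]
```

Because a node's MAX is an upper bound on every record beneath it, a leaf whose MAX is lower than the k-th result found so far is never opened, which is why only part of the leaf nodes in the query region needs to be accessed.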
Materialized views, which are derived from base relations and stored in the database, offer opportunities for significant performance gains in query evaluation by providing quick access to pre-computed data. A materialized view can be utilized in evaluating a query if it holds the pre-computed result of some part of the query plan. Although many approaches to utilizing materialized views in evaluating a query have been proposed, they impose several restrictions on which views can be selected. This paper proposes new ways of utilizing materialized views in answering an aggregate query. Views that include relations not referred to in the given query are utilized. Attributes missing from a view can be recovered under certain conditions. We identify the conditions under which a view may be used in evaluating a query and present an algorithm that searches for the most efficient query among the equivalent ones. We also report on a simulation based on the TPC-H and GRID databases. Simulation results show that our approach provides impressive performance improvements in data warehousing environments, where aggregate views are often pre-computed and materialized.
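As background for the abstract above, the basic idea of answering an aggregate query from a pre-computed aggregate view can be sketched as a roll-up. The sketch covers only the standard case of a coarser grouping over the same data. It does not cover the paper's extensions to views with extra relations or missing attributes. The table layout and function names are hypothetical.

```python
from collections import defaultdict


def materialize(rows):
    """Pre-compute SUM and COUNT of amount grouped by (region, month)."""
    cells = defaultdict(lambda: [0.0, 0])
    for region, month, amount in rows:
        cell = cells[(region, month)]
        cell[0] += amount
        cell[1] += 1
    return {key: (total, n) for key, (total, n) in cells.items()}


def sum_by_region(view):
    """Answer SUM(amount) GROUP BY region from the view, not the base rows."""
    totals = defaultdict(float)
    for (region, _month), (total, _n) in view.items():
        totals[region] += total
    return dict(totals)


def avg_by_region(view):
    """AVG is derived from the stored SUM and COUNT, not from stored averages."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (region, _month), (total, n) in view.items():
        sums[region] += total
        counts[region] += n
    return {region: sums[region] / counts[region] for region in sums}


# Hypothetical base rows: (region, month, amount).
rows = [("north", "jan", 10.0), ("north", "feb", 30.0),
        ("south", "jan", 5.0), ("north", "jan", 20.0)]
view = materialize(rows)
```

Storing COUNT alongside SUM is the design choice that keeps AVG recoverable at coarser groupings, since averages of averages are not correct in general.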
To speed up on-line analytical processing (OLAP), a data warehouse, which is usually derived from operational databases, is introduced. When the operational databases change, the data warehouse becomes stale. To maintain the freshness of the data warehouse, changes to the operational databases need to be propagated into it frequently and concurrently. However, if several update transactions are allowed to execute concurrently without appropriate concurrency control, inconsistency between the data warehouse and the operational databases could arise from incorrect propagation of those changes. In this paper, we propose a new concurrency control scheme that can execute a number of update transactions in a consistent way. Whenever an update transaction tries to update a data item that is being used by OLAP transactions, our scheme allows the update transaction to create a new version of the data item. To investigate the applicable areas of our scheme, its performance is evaluated by simulation. Our experimental results show that the proposed scheme enables OLAP transactions to continuously read very fresh data without spending much time finding an appropriate version of the data in the version pool.
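The versioning idea in the abstract above, where an update creates a new version instead of overwriting data that readers are using, can be illustrated with a generic multiversion store. This is a textbook-style sketch, not the scheme proposed in the paper: the `VersionedStore` class, its logical clock, and its method names are assumptions made for illustration.

```python
import bisect


class VersionedStore:
    """Keeps every committed version of each key, ordered by commit timestamp."""

    def __init__(self):
        self._versions = {}  # key -> (commit timestamps, values), parallel lists
        self._clock = 0      # logical clock, advanced on every update

    def begin_read(self):
        """Start a read-only (OLAP-style) transaction; returns its snapshot time."""
        return self._clock

    def update(self, key, value):
        """Commit a new version of `key` without touching older versions."""
        self._clock += 1
        stamps, values = self._versions.setdefault(key, ([], []))
        stamps.append(self._clock)
        values.append(value)
        return self._clock

    def read(self, key, snapshot):
        """Return the newest version committed at or before `snapshot`."""
        stamps, values = self._versions.get(key, ([], []))
        i = bisect.bisect_right(stamps, snapshot)
        return values[i - 1] if i else None


store = VersionedStore()
store.update("sales", 100)
snapshot = store.begin_read()   # a long-running OLAP read starts here
store.update("sales", 120)      # a concurrent update creates a new version
old_view = store.read("sales", snapshot)
new_view = store.read("sales", store.begin_read())
```

In this sketch the binary search over commit timestamps keeps version lookup cheap. A real version pool would also need garbage collection of versions that no reader can still see, which is omitted here.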
The paper investigates several approaches for designing and implementing integration environments. Such an environment is developed to allow cooperative interactions between distributed and heterogeneous systems. A possible approach to achieving system integration is to use warehousing technology, which leads to the development of data warehousing environments. These environments are information repositories that are available for queries and analysis. To manage a data warehouse efficiently, software agents enhanced with mobility mechanisms are introduced. Software agents are autonomous entities able to collaborate with one another and to answer users' needs. Furthermore, to perform their operations, software agents can migrate off their hosts and roam the network to gather relevant information. This research is part of the SAMDW project, which aims at developing a new generation of data warehouses.