1-7hit |
Rachasak SOMYANONTHANAKUL Thanaruk THEERAMUNKONG
Objective interestingness measures play a vital role in association rule mining of a large-scaled database because they are used for extracting, filtering, and ranking the patterns. In the past, several measures have been proposed but their similarities or relations are not sufficiently explored. This work investigates sixty-one objective interestingness measures on the pattern of A → B, to analyze their similarity and dissimilarity as well as their relationship. Three-probability patterns, P(A), P(B), and P(AB), are enumerated in both linear and exponential scales and each measure's values of those conditions are calculated, forming synthesis data for investigation. The behavior of each measure is explored by pairwise comparison based on these three-probability patterns. The relationship among the sixty-one interestingness measures has been characterized with correlation analysis and association rule mining. In the experiment, relationships are summarized using heat-map and association rule mined. As the result, selection of an appropriate interestingness measure can be realized using the generated heat-map and association rules.
Shohei IKEDA Akinori IHARA Raula Gaikovina KULA Kenichi MATSUMOTO
Contemporary software projects often utilize a README.md to share crucial information such as installation and usage examples related to their software. Furthermore, these files serve as an important source of updated and useful documentation for developers and prospective users of the software. Nonetheless, both novice and seasoned developers are sometimes unsure of what is required for a good README file. To understand the contents of README, we investigate the contents of 43,900 JavaScript packages. Results show that these packages contain common content themes (i.e., ‘usage’, ‘install’ and ‘license’). Furthermore, we find that application-specific packages more frequently included content themes such as ‘options’, while library-based packages more frequently included other specific content themes (i.e., ‘install’ and ‘license’).
Takashi WATANABE Akito MONDEN Zeynep YÜCEL Yasutaka KAMEI Shuji MORISAKI
Association rule mining discovers relationships among variables in a data set, representing them as rules. These are expected to often have predictive abilities, that is, to be able to predict future events, but commonly used rule interestingness measures, such as support and confidence, do not directly assess their predictive power. This paper proposes a cross-validation -based metric that quantifies the predictive power of such rules for characterizing software defects. The results of evaluation this metric experimentally using four open-source data sets (Mylyn, NetBeans, Apache Ant and jEdit) show that it can improve rule prioritization performance over conventional metrics (support, confidence and odds ratio) by 72.8% for Mylyn, 15.0% for NetBeans, 10.5% for Apache Ant and 0 for jEdit in terms of SumNormPre(100) precision criterion. This suggests that the proposed metric can provide better rule prioritization performance than conventional metrics and can at least provide similar performance even in the worst case.
Jaekwang KIM KwangHo YOON Jee-Hyong LEE
Clickstreams in users' navigation logs have various data which are related to users' web surfing. Those are visit counts, stay times, product types, etc. When we observe these data, we can divide clickstreams into sub-clickstreams so that the pages in a sub-clickstream share more contexts with each other than with the pages in other sub-clickstreams. In this paper, we propose a method which extracts more informative rules from clickstreams for web page recommendation based on genetic programming and association rules. First, we split clickstreams into sub-clickstreams by contexts for generating more informative rules. In order to split clickstreams in consideration of context, we extract six features from users' navigation logs. A set of split rules is generated by combining those features through genetic programming, and then informative rules for recommendation are extracted with the association rule mining algorithm. Through experiments, we verify that the proposed method is more effective than the other methods in various conditions.
Nichnan KITTIPHATTANABAWON Thanaruk THEERAMUNKONG Ekawit NANTAJEEWARAWAT
Recently, to track and relate news documents from several sources, association rule mining has been applied due to its performance and scalability. This paper presents an empirical investigation on how term representation basis, term weighting, and association measure affects the quality of relations discovered among news documents. Twenty four combinations initiated by two term representation bases, four term weightings, and three association measures are explored with their results compared to human judgment of three-level relations: completely related, somehow related, and unrelated relations. The performance evaluation is conducted by comparing the top-k results of each combination to those of the others using so-called rank-order mismatch (ROM). The experimental results indicate that a combination of bigram (BG), term frequency with inverse document frequency (TFIDF) and confidence (CONF), as well as a combination of BG, TFIDF and conviction (CONV), achieves the best performance to find the related documents by placing them in upper ranks with 0.41% ROM on top-50 mined relations. However, a combination of unigram (UG), TFIDF and lift (LIFT) performs the best by locating irrelevant relations in lower ranks (top-1100) with 9.63% ROM. A detailed analysis on the number of the three-level relations with regard to their rankings is also performed in order to examine the characteristic of the resultant relations. Finally, a discussion and an error analysis are given.
Abedelaziz MOHAISEN Nam-Su JHO Dowon HONG DaeHun NYANG
Privacy preserving association rule mining algorithms have been designed for discovering the relations between variables in data while maintaining the data privacy. In this article we revise one of the recently introduced schemes for association rule mining using fake transactions (fs). In particular, our analysis shows that the fs scheme has exhaustive storage and high computation requirements for guaranteeing a reasonable level of privacy. We introduce a realistic definition of privacy that benefits from the average case privacy and motivates the study of a weakness in the structure of fs by fake transactions filtering. In order to overcome this problem, we improve the fs scheme by presenting a hybrid scheme that considers both privacy and resources as two concurrent guidelines. Analytical and empirical results show the efficiency and applicability of our proposed scheme.
Tatsuya ASAI Kenji ABE Shinji KAWASOE Hiroshi SAKAMOTO Hiroki ARIMURA Setsuo ARIKAWA
In this paper, we consider a data mining problem for semi-structured data. Modeling semi-structured data as labeled ordered trees, we present an efficient algorithm for discovering frequent substructures from a large collection of semi-structured data. By extending the enumeration technique developed by Bayardo (SIGMOD'98) for discovering long itemsets, our algorithm scales almost linearly in the total size of maximal tree patterns contained in an input collection depending mildly on the size of the longest pattern. We also developed several pruning techniques that significantly speed-up the search. Experiments on Web data show that our algorithm runs efficiently on real-life datasets combined with proposed pruning techniques in the wide range of parameters.