Jun ZENG Brendan FLANAGAN Sachio HIROKAWA Eisuke ITO
Web page segmentation has a variety of benefits and potential web applications. Early web page segmentation techniques are mainly based on machine learning algorithms and rule-based heuristics, which cannot be used for large-scale page segmentation. In this paper, we propose a formulated page segmentation method using visual semantics. Instead of analyzing the visual cues of web pages, this method utilizes three measures to formulate the visual semantics: the layout tree is used to recognize visually similar blocks; the seam degree is used to describe how neatly the blocks are arranged; and content similarity is used to describe the degree of content coherence between blocks. A comparison experiment was conducted using the VIPS algorithm as a baseline. The experimental results show that the proposed method can divide a Web page into appropriate semantic segments.
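The abstract above does not define how content similarity is computed. As a rough illustration only, the sketch below assumes it can be approximated by the cosine similarity between the term-count vectors of two blocks' text; the function names `term_counts` and `content_similarity` are hypothetical and not taken from the paper.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def content_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two blocks.

    Assumption: cosine similarity stands in for the paper's measure,
    which the abstract does not specify.
    """
    a, b = term_counts(text_a), term_counts(text_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    # Empty blocks have no terms, so treat their similarity as zero.
    return dot / norm if norm else 0.0
```

Under this assumption, two blocks with identical wording score close to 1 and blocks with no shared terms score 0, so a segmenter could merge neighboring blocks whose score exceeds a chosen threshold.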
Jun ZENG Feng LI Brendan FLANAGAN Sachio HIROKAWA
Content extraction from deep Web pages has received great attention in recent years. However, the increasingly complicated HTML structure of Web documents makes it more difficult to recognize data records by analyzing the HTML source code alone. In this paper, we propose a method named LTDE to extract data records from a deep Web page. Instead of analyzing the HTML source code, LTDE utilizes the visual features of data records in deep Web pages. A Web page is considered as a finite set of visual blocks. The data records are the visual blocks that have similar layouts. We also propose a pattern recognition method named layout tree to cluster visual blocks with similar layouts. The weight of each cluster is calculated, and the visual blocks in the cluster with the highest weight are chosen as the data records to be extracted. The experimental results show that LTDE is more effective and more robust for Web data extraction than previous work.
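The selection step described above (cluster blocks by layout, weigh each cluster, keep the heaviest) might be sketched as follows. This is not LTDE itself: the `Block` type, the tuple-valued layout signature, and the use of total rendered area as the cluster weight are all assumptions made for illustration, since the abstract does not give the layout tree construction or the weight formula.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    layout: tuple  # hypothetical layout signature, e.g. shapes of child elements
    width: int
    height: int


def pick_data_records(blocks):
    """Group blocks by layout signature and return the heaviest group."""
    clusters = defaultdict(list)
    for block in blocks:
        clusters[block.layout].append(block)
    if not clusters:
        return []
    # Assumed weight: total rendered area covered by the cluster.
    return max(
        clusters.values(),
        key=lambda cluster: sum(b.width * b.height for b in cluster),
    )


blocks = [
    Block(("img", "text"), 300, 100),
    Block(("img", "text"), 300, 100),
    Block(("img", "text"), 300, 100),
    Block(("nav",), 800, 40),
]
records = pick_data_records(blocks)
```

Here the three blocks sharing the `("img", "text")` signature cover more area than the single navigation block, so they are returned as the candidate data records.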
Wentao LI Min GAO Hua LI Jun ZENG Qingyu XIONG Sachio HIROKAWA
Collaborative filtering (CF) has been widely used in recommender systems to generate personalized recommendations. However, recommender systems using CF are vulnerable to shilling attacks, in which attackers inject fake profiles to manipulate recommendation results. Thus, shilling attacks pose a threat to the credibility of recommender systems. Previous studies mainly derive features from characteristics of item ratings in user profiles to detect attackers, but the methods suffer from low accuracy when attackers adopt new rating patterns. To overcome this drawback, we derive features from properties of item popularity in user profiles, which are determined by users' different selecting patterns. This feature extraction method is based on the prior knowledge that attackers select items to rate with man-made rules while normal users do this according to their inner preferences. Then, machine learning classification approaches are exploited to make use of these features to detect and remove attackers. Experiment results on the MovieLens dataset and Amazon review dataset show that our proposed method improves detection performance. In addition, the results justify the practical value of features derived from selecting patterns.
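As a hedged illustration of deriving features from item popularity, the sketch below summarizes each user profile by the mean and spread of the popularity of the items it contains. The popularity definition (number of profiles that rated an item) and the two summary statistics are assumptions; the abstract does not list the exact features used.

```python
from collections import Counter
from statistics import mean, pstdev


def item_popularity(profiles):
    """Count how many user profiles contain each item."""
    counts = Counter()
    for items in profiles.values():
        counts.update(set(items))
    return counts


def popularity_features(profiles):
    """Return {user: (mean popularity, population std of popularity)}.

    Assumption: these two statistics stand in for the paper's features.
    """
    popularity = item_popularity(profiles)
    features = {}
    for user, items in profiles.items():
        values = [popularity[item] for item in set(items)]
        features[user] = (mean(values), pstdev(values)) if values else (0.0, 0.0)
    return features


profiles = {"u1": ["a", "b"], "u2": ["a"], "u3": ["a", "c"]}
features = popularity_features(profiles)
```

Feature vectors of this kind could then be passed to any standard classifier to separate suspected attack profiles from genuine ones.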
Jie ZOU Ling XU Mengning YANG Xiaohong ZHANG Jun ZENG Sachio HIROKAWA
Bug reports written in natural language are often voluminous, ambiguous, and poorly written, which makes duplicate bug report detection challenging. Current automatic duplicate bug report detection techniques have mainly focused on textual information and ignored other useful factors. To improve detection accuracy, in this paper we propose a new approach called the LNG (LDA and N-gram) model, which takes advantage of both the topic model LDA and the word-based model N-gram. The LNG model considers multiple factors that potentially affect detection accuracy, including textual information, semantic correlation, word order, contextual connections, and categorial information. In addition, the N-gram model adopted in LNG is improved by modifying the similarity algorithm. The experiment is conducted on more than 230,000 real bug reports from the Eclipse project. In the evaluation, we propose a new evaluation metric, the exact-accuracy (EA) rate, which can be used to better understand the performance of duplicate detection. The evaluation results show that the recall rate, precision rate, and EA rate of the proposed method are all higher than those obtained by treating the models separately. Also, the recall rate is improved by 2.96%-10.53% compared to the state-of-the-art approach DBTM.
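To make the word-based N-gram component concrete, here is a minimal sketch that compares two report summaries by the Jaccard overlap of their word bigrams. Plain Jaccard overlap is used only as a stand-in: the modified similarity algorithm mentioned in the abstract is not described there, and the function names are hypothetical.

```python
def word_ngrams(text, n=2):
    """Return the set of word n-grams in a whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(text_a, text_b, n=2):
    """Jaccard overlap of the two texts' word n-gram sets (an assumption,
    not the paper's modified similarity)."""
    grams_a, grams_b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


score = ngram_similarity("editor crashes on save", "editor crashes on startup")
```

Because n-grams preserve local word order, this kind of score reacts to phrasing that a bag-of-words or topic-level comparison would miss, which is the motivation the abstract gives for combining the two models.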
Xibin WANG Fengji LUO Chunyan SANG Jun ZENG Sachio HIROKAWA
With the rapid development of information and Web technologies, people are facing ‘information overload’ in their daily lives. The personalized recommendation system (PRS) is an effective tool to help users extract meaningful information from big data. Collaborative filtering (CF) is one of the most widely used personalized recommendation techniques for recommending personalized products to users. However, the conventional CF technique has some limitations, such as the low accuracy of similarity calculation and the cold start problem. In this paper, a PRS model based on the Support Vector Machine (SVM) is proposed. The proposed model considers not only the items' content information but also the users' demographic and behavior information, in order to fully capture the users' interests and preferences. An improved Particle Swarm Optimization (PSO) algorithm is also proposed to improve the performance of the model. The efficiency of the proposed method is verified on multiple benchmark datasets.
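For readers unfamiliar with PSO, the sketch below shows the textbook global-best variant minimizing a simple test function. It is not the improved algorithm proposed in the paper, whose modifications the abstract does not describe; the inertia and acceleration constants, the `pso_minimize` name, and the sphere objective are all illustrative assumptions.

```python
import random


def pso_minimize(objective, dim, bounds, particles=20, iterations=100, seed=0):
    """Textbook global-best PSO; returns (best position, best value)."""
    rng = random.Random(seed)
    lo, hi = bounds
    w, c1, c2 = 0.7, 1.5, 1.5  # assumed inertia and acceleration constants
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    best_pos = [p[:] for p in pos]
    best_val = [objective(p) for p in pos]
    g = min(range(particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]
    for _ in range(iterations):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Pull each particle toward its own best and the swarm's best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Clamp the new position to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_position, best_value = pso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
```

In a setting like the one in the abstract, the objective would presumably be a validation error of the SVM as a function of its hyperparameters, but that pairing is an assumption rather than something the abstract states.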
Zhuo JIANG Junhao WEN Jun ZENG Yihao ZHANG Xibin WANG Sachio HIROKAWA
The success of heuristic search in AI planning largely depends on the design of the heuristic. On the other hand, previous experience contains potential domain information that can assist the planning process. In this context, we have studied dynamic macro-based heuristic planning through action relationship analysis. We present an approach for analyzing action relationships and design an algorithm that learns macros from solved cases. We then propose a dynamic macro-based heuristic that appropriately reuses the macros rather than immediately assigning them to domains. These ideas are incorporated into a working planning system called the Dynamic Macro-based Fast Forward planner. Finally, we evaluate our method in a series of experiments. Our method effectively optimizes planning, since it reduces the result length by an average of 10% relative to FF in a time-economic manner. The efficiency is especially improved when invoking an action consumes time.
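One simple way to picture learning macros from solved cases is to collect action pairs that appear consecutively in several solved plans. The sketch below does exactly that with a frequency threshold; it is a stand-in for the paper's action relationship analysis, which the abstract does not detail, and `learn_macros` and `min_support` are hypothetical names.

```python
from collections import Counter


def learn_macros(plans, min_support=2):
    """Return consecutive action pairs found in at least min_support plans.

    Assumption: frequency of adjacency approximates a useful macro.
    """
    support = Counter()
    for plan in plans:
        # Count each pair once per plan so long plans do not dominate.
        support.update(set(zip(plan, plan[1:])))
    return {pair for pair, count in support.items() if count >= min_support}


plans = [
    ["pick", "move", "drop"],
    ["pick", "move", "stack"],
    ["move", "drop"],
]
macros = learn_macros(plans)
```

With these example plans, `("pick", "move")` and `("move", "drop")` each occur in two plans and are kept, while `("move", "stack")` occurs once and is discarded.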