Leilei KONG Zhongyuan HAN Haoliang QI Zhimao LU
This paper addresses text matching for plagiarism detection. The task is to identify the matching plagiarized segments in a pair consisting of a suspicious document and its plagiarism source document. Heuristic-based methods have long been the main approach to this problem. However, heuristics rely on expert experience and cannot easily integrate additional features to detect highly obfuscated plagiarism matches. This paper proposes a statistical machine learning approach, the Ranking-based Text Matching Approach for Plagiarism Detection, to address high-obfuscation plagiarism detection. Plagiarism text matching is formalized as a ranking problem, and a pairwise learning-to-rank algorithm is used to identify the most probable plagiarism matches for a given suspicious segment. In particular, the Meteor evaluation metrics from machine translation are incorporated into the proposed method to capture lexical and semantic text similarity. The proposed method is evaluated on the PAN12 and PAN13 text alignment corpora for plagiarism detection and compared with the methods that achieved the best performance in PAN12, PAN13 and PAN14. Experimental results demonstrate that the proposed method achieves statistically significantly better performance than the baseline methods on all twelve document collections, which belong to five different plagiarism categories. Notably, on the PAN12 Artificial-high Obfuscation sub-corpus and the PAN13 Summary Obfuscation sub-corpus, the proposed method improves the main evaluation metric, PlagDet, by 22% and 43%, respectively, relative to the baselines. Moreover, the proposed method is also more efficient than the baseline methods.
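As a rough illustration of how a Meteor-style similarity could order candidate source segments for one suspicious segment, the sketch below computes a simplified score based only on exact unigram matches. It is not the paper's method: the stemming, synonym matching, and fragmentation penalty of the full Meteor metric are omitted, the learning-to-rank step is skipped, and the names `unigram_fmean` and `rank_candidates` and the default `alpha = 0.9` are assumptions made for this example.

```python
from collections import Counter
from typing import List, Tuple


def unigram_fmean(candidate: str, reference: str, alpha: float = 0.9) -> float:
    """Simplified Meteor-style score using exact unigram matches only (assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Clipped count of unigrams that occur in both segments.
    matches = sum((Counter(cand) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    # Weighted harmonic mean; alpha close to 1 weights recall more heavily.
    return precision * recall / (alpha * precision + (1 - alpha) * recall)


def rank_candidates(suspicious: str, sources: List[str]) -> List[Tuple[float, str]]:
    """Order hypothetical source segments by similarity to the suspicious segment."""
    scored = [(unigram_fmean(src, suspicious), src) for src in sources]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


suspicious_segment = "the cat sat on the mat"
source_segments = [
    "a dog ran in the park",
    "the cat sat on a mat",
    "stock prices fell sharply",
]
ranked = rank_candidates(suspicious_segment, source_segments)
for similarity, segment in ranked:
    print(f"{similarity:.3f}  {segment}")
```

In a learning-to-rank setting, a score of this kind would more plausibly serve as one feature among several than as the final ranking criterion.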
Leilei KONG Yong HAN Haoliang QI Zhongyuan HAN
Source retrieval is the primary task of plagiarism detection. It searches for documents that may be the sources of plagiarism in a suspicious document. State-of-the-art approaches usually rely on classical information retrieval models, such as the probabilistic model or the vector space model, to find the plagiarism sources. However, the goal of source retrieval is to obtain the source documents that contain the plagiarized parts of the suspicious document, rather than to rank documents by their relevance to the whole suspicious document. To model this “partial matching” between documents, this paper proposes a Partial Matching Convolution Neural Network (PMCNN) for source retrieval. Specifically, PMCNN uses a sequential convolutional neural network to extract the plagiarism patterns of contiguous text segments. Experimental results on the PAN 2013 and PAN 2014 plagiarism source retrieval corpora show that PMCNN significantly improves source retrieval performance, outperforming other state-of-the-art document models.
Leilei KONG Zhimao LU Zhongyuan HAN Haoliang QI
This paper addresses source retrieval in plagiarism detection. The task of source retrieval is to retrieve all plagiarized sources of a suspicious document from a source document corpus while minimizing retrieval costs. Classification-based methods have achieved the best performance in current research on source retrieval. This paper points out that it is more important to cast the problem as ranking and to employ learning-to-rank methods to perform source retrieval. Specifically, it employs RankBoost and Ranking SVM to obtain the candidate plagiarism source documents. Experimental results on the PAN@CLEF 2013 Source Retrieval dataset show that the ranking-based methods significantly outperform the classification-based baseline methods. We argue that treating source retrieval as a ranking problem is better than treating it as a classification problem.
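The pairwise idea behind Ranking SVM can be sketched as follows: every pair of candidates with different relevance labels becomes a feature-difference vector, and a linear model is trained so that the more relevant candidate scores higher. This is a minimal sketch under stated assumptions, not the paper's implementation: it uses plain subgradient descent on a regularized hinge loss rather than an SVM solver, and the two toy features, the labels, and all function names are hypothetical.

```python
import random


def pairwise_differences(features, labels):
    """Build x_i - x_j for every pair in which item i is more relevant than item j."""
    diffs = []
    for xi, yi in zip(features, labels):
        for xj, yj in zip(features, labels):
            if yi > yj:
                diffs.append([a - b for a, b in zip(xi, xj)])
    return diffs


def train_linear_ranker(diffs, dim, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Subgradient descent on the L2-regularized pairwise hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * dim
    diffs = list(diffs)
    for _ in range(epochs):
        rng.shuffle(diffs)
        for d in diffs:
            margin = sum(wk * dk for wk, dk in zip(w, d))
            for k in range(dim):
                # The hinge term contributes only while the pair violates the margin.
                grad = reg * w[k] - (d[k] if margin < 1.0 else 0.0)
                w[k] -= lr * grad
    return w


def score(w, x):
    """Linear ranking score of one candidate document."""
    return sum(wk * xk for wk, xk in zip(w, x))


# Hypothetical candidates retrieved for one suspicious document.
# Each row holds two made-up features; label 1 marks a true plagiarism source.
features = [[0.9, 0.2], [0.7, 0.4], [0.2, 0.8], [0.1, 0.3]]
labels = [1, 1, 0, 0]

diffs = pairwise_differences(features, labels)
w = train_linear_ranker(diffs, dim=2)
order = sorted(range(len(features)), key=lambda i: score(w, features[i]), reverse=True)
print(order)
```

A classifier would threshold each candidate independently, whereas this formulation only constrains the relative order of candidates retrieved for the same suspicious document.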