Measuring the amount of information shared between two documents is key to addressing a number of Natural Language Processing (NLP) challenges, such as Information Retrieval (IR), Semantic Textual Similarity (STS), Sentiment Analysis (SA) and Plagiarism Detection (PD). In this paper, we report a plagiarism detection system based on two layers of assessment: 1) Fingerprinting, which simply compares the documents' fingerprints to detect verbatim reproduction; 2) Word embedding, which uses the semantic and syntactic properties of words to detect much more complicated reproductions. Moreover, Word Alignment (WA), Inverse Document Frequency (IDF) and Part-of-Speech (POS) weighting are applied to the examined documents to support the identification of the words that are most descriptive in each textual unit. In the present work, we focused on Arabic documents and evaluated the performance of the system on a dataset containing three types of plagiarism: 1) Simple reproduction (copy and paste); 2) Word and phrase shuffling; 3) Intelligent plagiarism, including synonym substitution, diacritics insertion and paraphrasing. The results show a recall of 88% and a precision of 86%. Compared with the results obtained by the systems participating in the Arabic Plagiarism Detection Shared Task 2015, our system outperforms all of them with a plagiarism detection score (Plagdet) of 83%.
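As an illustration of the first layer only, the following is a minimal sketch of fingerprint comparison. It assumes word trigrams hashed with MD5 and compared by Jaccard overlap; the abstract does not specify the fingerprinting scheme, so the n-gram size, the hash function, the similarity measure and the function names (`word_ngrams`, `fingerprint`, `fingerprint_similarity`) are all hypothetical choices. English strings stand in for Arabic text, and no text normalization is shown.

```python
import hashlib


def word_ngrams(text, n=3):
    """Return the set of word n-grams (as tuples) of a whitespace-tokenized text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def fingerprint(text, n=3):
    """Hash every word n-gram; the set of hashes serves as the document fingerprint.

    Assumption: MD5 over word trigrams. The paper may use a different scheme.
    """
    return {
        hashlib.md5(" ".join(gram).encode("utf-8")).hexdigest()
        for gram in word_ngrams(text, n)
    }


def fingerprint_similarity(doc_a, doc_b, n=3):
    """Jaccard overlap of the two fingerprints, a value in [0, 1]."""
    fp_a, fp_b = fingerprint(doc_a, n), fingerprint(doc_b, n)
    if not fp_a or not fp_b:
        return 0.0  # a text shorter than n words has no fingerprint
    return len(fp_a & fp_b) / len(fp_a | fp_b)


source = "the quick brown fox jumps over the lazy dog"
verbatim_copy = "the quick brown fox jumps over the lazy dog"
shuffled = "dog lazy the over jumps fox brown quick the"

print(fingerprint_similarity(source, verbatim_copy))  # verbatim copy
print(fingerprint_similarity(source, shuffled))       # word order reversed
```

In this sketch the verbatim copy shares every trigram with the source, while the reordered text shares none. That is consistent with the abstract's use of a second, embedding-based layer for reproductions that a fingerprint comparison alone would miss.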