Journal: International Journal on Digital Libraries

A crowdsourcing approach to construct mono-lingual plagiarism detection corpus



Abstract

Plagiarism detection deals with detecting plagiarized fragments among textual documents. The availability of digital documents in online libraries makes plagiarism easier to commit and, on the other hand, easier to detect with automatic plagiarism detection systems. Large-scale plagiarism corpora with a wide variety of plagiarism cases are needed to evaluate different detection methods in different languages. Plagiarism detection corpora play an important role in evaluating and tuning plagiarism detection systems. Despite their importance, few corpora have been developed for low-resource languages. In this paper, we propose HAMTA, a Persian plagiarism detection corpus. To simulate real cases of plagiarism, manually paraphrased texts are used to compile the corpus. To obtain the manual plagiarism cases, a crowdsourcing platform is developed and crowd workers are asked to paraphrase fragments of text. Moreover, artificial methods are used to scale up the proposed corpus by automatically generating cases of text re-use. The evaluation results indicate a high correlation between the proposed corpus and the state-of-the-art PAN English plagiarism detection corpus.
