Journal: Procedia Computer Science

Detecting Near-duplicates in Russian Documents through Using Fingerprint Algorithm Simhash


Abstract

Plagiarism is one of the major problems of the communication age. For many languages, such as English, the issue is taken very seriously, and many powerful tools have been developed to prevent it. This article aims to detect plagiarism in Russian texts using fingerprint algorithms. Fingerprint algorithms detect plagiarism quickly because they reduce each document to a compact set of features, so comparing an original document with a suspect one only requires comparing those features. To increase the power and accuracy of plagiarism detection, stop words must be removed and words reduced to their stems, in addition to pre-processing steps such as word separation, number replacement, and normalization. Four Simhash algorithms are used in this article. Their implementations were tested on 800 articles on scientific topics and produced satisfactory results.
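To make the fingerprinting idea concrete, here is a minimal Python sketch of one common Simhash formulation, with documents compared by Hamming distance. It is an illustration under stated assumptions, not necessarily any of the four variants evaluated in the article: the names `preprocess`, `simhash`, and `hamming_distance`, the tiny `STOP_WORDS` set, the `NUM` placeholder, the use of MD5 as the token hash, and the 64-bit fingerprint size are all choices made here. Stemming is omitted for brevity; a Russian stemmer could be applied to each token before hashing.

```python
import hashlib
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# a real system would use a full Russian stop-word list.
STOP_WORDS = {"и", "в", "на", "не", "что", "с", "по", "это"}


def preprocess(text):
    """Normalize text, replace numbers, split into words, drop stop words."""
    text = text.lower().replace("ё", "е")   # simple normalization (assumption)
    text = re.sub(r"\d+", " NUM ", text)    # every number becomes one placeholder
    tokens = re.findall(r"[^\W_]+", text)   # runs of Unicode letters and digits
    return [t for t in tokens if t not in STOP_WORDS]


def simhash(tokens, bits=64):
    """Build a fingerprint: each token's hash votes +1 or -1 on every bit."""
    counts = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:                           # bit is set where the votes are positive
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


original = "Плагиат является одной из основных проблем в эпоху коммуникации."
suspect = "Плагиат является одной из главных проблем в эпоху коммуникации."
distance = hamming_distance(simhash(preprocess(original)),
                            simhash(preprocess(suspect)))
print(distance)  # a small distance suggests the texts are near-duplicates
```

In practice a document pair would be flagged as near-duplicate when the distance falls below a threshold; the threshold itself is not given in the abstract and would need to be tuned on labeled data.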