Detecting Near-duplicates in Russian Documents through Using Fingerprint Algorithm Simhash

N. Rezaeian; G.M. Novikova

Abstract

Plagiarism is one of the major problems in the age of communication. In many languages, such as English, the issue is taken seriously, and many powerful tools have been developed to prevent it. This article aims to detect plagiarism in Russian texts using a fingerprint algorithm. Fingerprint algorithms detect plagiarism quickly because they reduce each document to a compact set of features, so that checking a suspicious document against an original requires comparing only those features. To increase the power and accuracy of plagiarism detection, common words (stop words) must be removed and words reduced to their roots (stemming), along with pre-processing steps such as word separation, number replacement, and homogenization. Four Simhash algorithms are used in this article. Their implementation was tested on 800 articles on scientific topics and gave satisfactory results.
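The pipeline described in the abstract can be illustrated with a minimal sketch of the generic Simhash scheme. This is not one of the four variants used by the authors, which the abstract does not specify. The stop-word list, the `NUM` placeholder, the use of MD5 as the token hash, the 64-bit fingerprint size, and the function names `preprocess`, `simhash`, and `hamming_distance` are all assumptions made for illustration. Stemming is omitted to keep the example free of third-party dependencies.

```python
import hashlib
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"и", "в", "на", "не", "что", "с", "по", "это"}


def preprocess(text):
    """Normalize text and return a list of tokens.

    Assumed steps: lowercasing and folding 'ё' to 'е' (homogenization),
    replacing digit runs with a placeholder (number replacement),
    splitting into words (word separation), and dropping stop words.
    """
    text = text.lower().replace("ё", "е")
    text = re.sub(r"\d+", " NUM ", text)
    tokens = re.findall(r"[^\W_]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


def simhash(tokens, bits=64):
    """Return a Simhash fingerprint of the token list as an integer.

    `bits` must be a multiple of 8 and at most 128, because the
    per-token hash is taken from the first bytes of an MD5 digest.
    """
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 for every set bit and -1 for every clear bit.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "Плагиат является одной из главных проблем в эпоху коммуникации."
    doc_b = "Плагиат является одной из основных проблем в эпоху коммуникации."
    fp_a = simhash(preprocess(doc_a))
    fp_b = simhash(preprocess(doc_b))
    print(f"{fp_a:016x} {fp_b:016x} distance={hamming_distance(fp_a, fp_b)}")
```

Two documents would then be flagged as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. The threshold is a tuning parameter and is not given in the abstract.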


Bibliographic record

Source: Procedia Computer Science | 2017, Issue 1 | 5 pages
Authors: N. Rezaeian; G.M. Novikova
Original format: PDF
Chinese Library Classification: Computing technology, computer technology

