Source: U.S. National Institutes of Health literature > PLoS Clinical Trials

On the Reconstruction of Text Phylogeny Trees: Evaluation and Analysis of Textual Relationships

Abstract

Throughout human history, textual records have changed, sometimes through transcription mistakes and sometimes on purpose, as a way to rewrite facts and reinterpret history. Classical cases include logarithmic tables and the transmission of ancient and medieval scholarship. Today, text documents are widely edited and redistributed on the Web. Articles on news portals and collaborative platforms (such as Wikipedia), source code, posts on social networks, and even scientific publications or literary works are examples in which textual content can change through an evolutionary process. In this scenario, given a set of near-duplicate documents, it is worthwhile to find which one is the original and the history of changes that produced the whole set. Such functionality would have immediate applications in news tracking services, plagiarism detection, textual criticism, and copyright enforcement, for instance. However, this is not an easy task, as the textual features that point to the documents' evolutionary direction may not be evident and are often dataset dependent. Moreover, side information, such as time stamps, is not always available or reliable. In this paper, we propose a framework for reliably reconstructing text phylogeny trees and for seamlessly exploring new approaches across a wide range of text reuse scenarios. We employ distinct combinations of dissimilarity measures and reconstruction strategies within the proposed framework and evaluate each approach with extensive experiments, including a set of artificial near-duplicate documents with known phylogeny and documents collected from Wikipedia, whose modifications were made by Internet users. We also present results from qualitative experiments in two different applications: text plagiarism and the reconstruction of evolutionary trees for manuscripts (stemmatology).
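
The pairing of a dissimilarity measure with a reconstruction strategy can be illustrated with a minimal sketch. The abstract does not name the specific measures or strategies, so everything below is an assumption chosen for illustration: a word-level dissimilarity built on Python's `difflib.SequenceMatcher`, and Prim's minimum spanning tree as the reconstruction step. The root is also assumed to be known here, whereas identifying the original document is part of the problem the paper addresses. The names `dissimilarity` and `reconstruct_tree` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def dissimilarity(a: str, b: str) -> float:
    """Illustrative measure: 1 minus the word-level similarity ratio."""
    return 1.0 - SequenceMatcher(None, a.split(), b.split()).ratio()


def reconstruct_tree(docs, root=0):
    """Return {node: parent} for a minimum spanning tree grown from `root`.

    Assumption: the root (original document) is given. This is a stand-in
    strategy, not the method evaluated in the paper.
    """
    n = len(docs)
    # Compute each pair once and mirror it, so the matrix is symmetric.
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = dissimilarity(docs[i], docs[j])

    parent = {root: None}
    # best[v] = (cheapest known cost to attach v, node it would attach to)
    best = {v: (dist[root][v], root) for v in range(n) if v != root}
    while best:
        v = min(best, key=lambda k: best[k][0])
        _, p = best.pop(v)
        parent[v] = p
        # Attaching v may offer cheaper links to the remaining nodes.
        for u in best:
            if dist[v][u] < best[u][0]:
                best[u] = (dist[v][u], v)
    return parent


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog",    # assumed original
        "the quick brown fox leaps over the lazy dog",    # edit of doc 0
        "the quick brown fox leaps over the sleepy dog",  # edit of doc 1
        "a quick brown fox jumps over the lazy dog",      # edit of doc 0
    ]
    for node, par in sorted(reconstruct_tree(docs).items()):
        print(node, "<-", par)
```

A symmetric measure such as this one cannot indicate the direction of change by itself, which is why the sketch needs the root supplied; an asymmetric measure or reliable side information would be needed to orient the edges without it.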