Probing the Statistical Properties of Unknown Texts: Application to the Voynich Manuscript

Abstract

While the use of statistical physics methods to analyze large corpora has helped unveil many patterns in texts, no comprehensive investigation has been performed on the interdependence between syntactic and semantic factors. In this study we propose a framework for determining whether a text (e.g., written in an unknown alphabet) is compatible with a natural language and to which language it could belong. The approach is based on three types of statistical measurements, namely those obtained from first-order statistics of word properties in a text, from the topology of complex networks representing texts, and from intermittency concepts where text is treated as a time series. Comparative experiments were performed with the New Testament in 15 different languages and with distinct books in English and Portuguese in order to quantify the dependency of the different measurements on the language and on the story being told in the book. The metrics found to be informative in distinguishing real texts from their shuffled versions include assortativity, degree and selectivity of words. As an illustration, we analyze an undeciphered medieval manuscript known as the Voynich Manuscript. We show that it is mostly compatible with natural languages and incompatible with random texts. We also obtain candidates for keywords of the Voynich Manuscript which could be helpful in efforts to decipher it. Because we were able to identify statistical measurements that are more dependent on the syntax than on the semantics, the framework may also serve for text analysis in language-dependent applications.
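
To make two of the measurement families named in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes an undirected word adjacency network (neighbouring words are linked), degree assortativity taken as the Pearson correlation of degrees across edges, and intermittency taken as the standard deviation of a word's recurrence gaps divided by their mean. All function names (`adjacency_network`, `degree_assortativity`, `intermittency`, `shuffled`) are hypothetical, and the paper's exact definitions may differ.

```python
import random
import statistics
from collections import defaultdict


def adjacency_network(words):
    """Undirected word adjacency network: link each pair of neighbouring words.

    Assumption: direction and edge weights are ignored for simplicity.
    """
    neighbours = defaultdict(set)
    for left, right in zip(words, words[1:]):
        if left != right:  # skip self-loops caused by immediately repeated words
            neighbours[left].add(right)
            neighbours[right].add(left)
    return neighbours


def degree_assortativity(neighbours):
    """Pearson correlation between the degrees at the two ends of each edge."""
    degree = {w: len(ns) for w, ns in neighbours.items()}
    # Every undirected edge is listed once per direction, so xs and ys
    # contain the same multiset of values and the measure is symmetric.
    xs = [degree[u] for u, ns in neighbours.items() for _ in ns]
    ys = [degree[v] for _, ns in neighbours.items() for v in ns]
    if len(xs) < 2 or statistics.pstdev(xs) == 0:
        return None  # undefined when every edge end has the same degree
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys))


def intermittency(words, target):
    """Std/mean of the gaps between successive occurrences of `target`.

    Assumption: boundary effects at the start and end of the text are ignored.
    """
    positions = [i for i, w in enumerate(words) if w == target]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return None  # too few occurrences to estimate a spread
    return statistics.pstdev(gaps) / statistics.fmean(gaps)


def shuffled(words, seed=0):
    """Return a shuffled copy of the text, used as a null-model baseline."""
    copy = list(words)
    random.Random(seed).shuffle(copy)
    return copy


if __name__ == "__main__":
    text = "the cat saw the dog and the dog saw the cat".split()
    for label, sample in (("original", text), ("shuffled", shuffled(text))):
        net = adjacency_network(sample)
        print(label, degree_assortativity(net), intermittency(sample, "the"))
```

Comparing each value on the original token sequence with the same value on a shuffled copy mirrors the real-versus-shuffled comparison described in the abstract: shuffling preserves word frequencies but destroys word order, so a measurement that changes under shuffling is capturing structure beyond frequency alone.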