Journal: Frontiers in Physics

Rank Dynamics of Word Usage at Multiple Scales


Abstract

The recent dramatic increase in online data availability has allowed researchers to explore human culture with unprecedented detail, such as the growth and diversification of language. In particular, it provides statistical tools to explore whether word use is similar across languages, and if so, whether these generic features appear at different scales of language structure. Here we use the Google Books $N$-grams dataset to analyze the temporal evolution of word usage in several languages. We apply measures proposed recently to study rank dynamics, such as the diversity of $N$-grams in a given rank, the probability that an $N$-gram changes rank between successive time intervals, the rank entropy, and the rank complexity. Using different methods, results show that there are generic properties for different languages at different scales, such as a core of words necessary to minimally understand a language. We also propose a null model to explore the relevance of linguistic structure across multiple scales, concluding that $N$-gram statistics cannot be reduced to word statistics. We expect our results to be useful in improving text prediction algorithms, as well as in shedding light on the large-scale features of language use, beyond linguistic and cultural differences across human populations.
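Two of the measures named in the abstract can be illustrated with a minimal sketch. It assumes one common reading of them: the diversity of a rank is the number of distinct $N$-grams that occupy that rank over the observation period, divided by the number of time slices, and the change probability is the fraction of consecutive time slices in which the occupant of that rank differs. These definitions, the function names, and the toy counts are illustrative assumptions; none of them come from the paper or from the Google Books dataset.

```python
def rankings_from_counts(counts_by_year):
    """Map {year: {ngram: count}} to {year: [ngram, ...]}, most frequent first.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    return {
        year: [w for w, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
        for year, counts in counts_by_year.items()
    }


def rank_diversity(rankings, k):
    """Distinct occupants of rank k (1-based), divided by the number of time slices.

    Assumed definition; slices with fewer than k entries contribute no occupant.
    """
    years = sorted(rankings)
    occupants = {rankings[y][k - 1] for y in years if len(rankings[y]) >= k}
    return len(occupants) / len(years)


def change_probability(rankings, k):
    """Fraction of consecutive time slices in which the occupant of rank k changes.

    Assumed definition; pairs where either slice has fewer than k entries are skipped.
    """
    years = sorted(rankings)
    pairs = [
        (rankings[a][k - 1], rankings[b][k - 1])
        for a, b in zip(years, years[1:])
        if len(rankings[a]) >= k and len(rankings[b]) >= k
    ]
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


# Toy yearly counts, invented purely for illustration.
toy_counts = {
    2000: {"the": 100, "of": 80, "cat": 5, "dog": 4},
    2001: {"the": 110, "of": 85, "dog": 6, "cat": 5},
    2002: {"the": 120, "of": 90, "cat": 7, "dog": 3},
}
toy_rankings = rankings_from_counts(toy_counts)

for k in (1, 3):
    print(k, rank_diversity(toy_rankings, k), change_probability(toy_rankings, k))
```

In this toy data the top rank is always held by the same word, so its diversity is low and it never changes, while rank 3 alternates between two words. That contrast is the kind of stable core versus volatile tail that the abstract describes.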
