Published in: Advances in knowledge discovery and management (conference proceedings)

Statistically Valid Links and Anti-links Between Words and Between Documents: Applying TourneBool Randomization Test to a Reuters Collection

Abstract

Neighborhood is a central concept in data mining, and many definitions of it have been implemented, mainly rooted in geometrical or topological considerations. We propose here a statistical definition of neighborhood: our TourneBool randomization test processes an objects × attributes binary table in order to establish which inter-attribute relations are fortuitous and which are meaningful, without requiring any pre-defined statistical model, while taking into account the empirical distributions. The result is a robust and statistically validated graph. We present a full-scale experiment on one of the publicly available Reuters test corpora. We characterize the resulting word graph by a series of indicators, such as clustering coefficients, degree distribution and correlation, and cluster modularity and size distribution. Another graph structure stems from this process: the one conveying the negative "counter-relations" between words, i.e. words which "steer clear" of one another. We characterize the counter-relation graph in the same way. Finally, we generate the pair of valid document graphs (i.e. links and anti-links) and evaluate them by taking into account the Reuters document categories.
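The abstract does not spell out how the randomization is carried out, so the following is only a minimal sketch of the general idea, not the authors' TourneBool implementation. It assumes that randomized tables are produced by margin-preserving "checkerboard" swaps (so every object keeps its number of attributes and every attribute keeps its frequency), and that a pair of attributes is declared a link or an anti-link when its observed co-occurrence count is unusually high or low relative to the randomized tables. The function names, the number of swaps, and the `alpha` threshold are all hypothetical choices made for illustration.

```python
import random
from itertools import combinations


def swap_randomize(rows, n_swaps, rng):
    """Return a shuffled copy of a binary table with unchanged row and column sums.

    rows: list of sets; rows[i] holds the attribute indices equal to 1 for object i.
    Assumption: randomization is done by checkerboard swaps (not stated in the abstract).
    """
    rows = [set(r) for r in rows]
    ones = [(i, a) for i, r in enumerate(rows) for a in r]  # all cells equal to 1
    for _ in range(n_swaps):
        k1, k2 = rng.randrange(len(ones)), rng.randrange(len(ones))
        (i, a), (j, b) = ones[k1], ones[k2]
        # Swap only if cells (i, b) and (j, a) are currently 0.
        if b in rows[i] or a in rows[j]:
            continue
        rows[i].remove(a)
        rows[i].add(b)
        rows[j].remove(b)
        rows[j].add(a)
        ones[k1], ones[k2] = (i, b), (j, a)
    return rows


def cooccurrence(rows):
    """Count, for each attribute pair (a, b) with a < b, the objects holding both."""
    counts = {}
    for r in rows:
        for pair in combinations(sorted(r), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def classify_pairs(rows, n_attrs, n_random=200, alpha=0.05, seed=0):
    """Split attribute pairs into links and anti-links (hypothetical helper).

    A pair is a link if few randomized tables reach its observed count,
    and an anti-link if few randomized tables fall as low as its observed count.
    """
    rng = random.Random(seed)
    observed = cooccurrence(rows)
    n_ones = sum(len(r) for r in rows)
    pairs = list(combinations(range(n_attrs), 2))
    at_least = dict.fromkeys(pairs, 0)  # randomized count >= observed count
    at_most = dict.fromkeys(pairs, 0)   # randomized count <= observed count
    current = rows
    for _ in range(n_random):
        # Chain the shuffles: each randomized table starts from the previous one.
        current = swap_randomize(current, 5 * n_ones, rng)
        simulated = cooccurrence(current)
        for p in pairs:
            obs, sim = observed.get(p, 0), simulated.get(p, 0)
            at_least[p] += sim >= obs
            at_most[p] += sim <= obs
    links, anti_links = [], []
    for p in pairs:
        # Empirical one-sided p-values, each tested at alpha / 2.
        if (1 + at_least[p]) / (n_random + 1) <= alpha / 2:
            links.append(p)
        elif (1 + at_most[p]) / (n_random + 1) <= alpha / 2:
            anti_links.append(p)
    return links, anti_links


# Toy table: attributes 0 and 1 always appear together, as do 2 and 3.
table = [{0, 1}] * 10 + [{2, 3}] * 10
links, anti_links = classify_pairs(table, n_attrs=4)
print("links:", links)
print("anti-links:", anti_links)
```

In this toy table the pairs that always co-occur should come out as links, and the pairs that never co-occur as anti-links. Keeping the margins fixed is what lets the test account for the empirical distributions without a pre-defined statistical model: a frequent word is expected to co-occur with many others by chance alone, and the randomized tables reflect that.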
