
Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining



Abstract

When can reliable inference be drawn in the “Big Data” context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large-scale inference. In large-scale data applications like genomics, connectomics, and eco-informatics, the dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far smaller than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much recent work has focused on understanding the computational complexity of proposed methods for “Big Data”. Sample complexity, however, has received relatively less attention, especially in the setting where the sample size n is fixed and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime, where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime, where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high-dimensional asymptotic regime, where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche, but only the latter regime applies to exascale data dimension. We illustrate this high-dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that is of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high-dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
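The purely high-dimensional regime described above (n fixed, p large) can be illustrated with a minimal numerical sketch. The snippet below, a hypothetical example not drawn from the paper, generates data under a null model in which all population correlations are exactly zero, then “mines” the sample correlation matrix for large entries. With n small and p large, many spuriously high sample correlations survive even a fairly aggressive threshold, which is precisely why sample complexity, rather than computational complexity alone, governs reliable inference in this regime. The sample size, dimension, and threshold values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# High-dimensional regime: sample size n fixed and small, dimension p large.
n, p = 20, 2000

# Null model: i.i.d. standard normal variables, so every true
# pairwise population correlation is exactly zero.
X = rng.standard_normal((n, p))

# p x p sample correlation matrix of the variables (columns of X).
R = np.corrcoef(X, rowvar=False)

# Mine for variable pairs whose sample correlation magnitude exceeds
# a threshold; only the upper triangle is scanned to avoid duplicates.
rho = 0.7
iu = np.triu_indices(p, k=1)
n_discoveries = int(np.count_nonzero(np.abs(R[iu]) > rho))

# Despite zero true correlation, many pairs exceed the threshold purely
# by chance because the number of candidate pairs grows like p^2.
print(f"{n_discoveries} pairs exceed |r| > {rho} among {len(iu[0])} candidates")
```

Increasing n while holding p fixed drives the spurious-discovery count toward zero, whereas increasing p at fixed n inflates it, matching the qualitative distinction between the classical and purely high-dimensional asymptotic regimes.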

Bibliographic record

  • Journal: other
  • Year (volume), issue: -1(104),1
  • Year: -1
  • Pages: 93–110
  • Total pages: 36
  • Format: PDF

