Journal of Clinical and Translational Science

3145 An Evaluation of Machine Learning and Traditional Statistical Methods for Discovery in Large-Scale Translational Data



Abstract

OBJECTIVES/SPECIFIC AIMS: To examine the claims in Bzdok, Altman, and Krzywinski under a broader set of conditions, using unbiased methods of comparison. To explore how to use various machine learning and traditional statistical methods accurately in large-scale translational research by estimating their accuracy statistics, and then to identify the methods with the best performance characteristics.

METHODS/STUDY POPULATION: We conducted a simulation study of microarray gene expression data, maintaining the original structure proposed by Bzdok, Altman, and Krzywinski: 40 genes measured in 20 people, of whom 10 are phenotype positive and 10 are phenotype negative. To create a true difference to detect, 25% of the genes (10 of 40) were set to be dysregulated across phenotype, so that the positive and negative phenotypes have different population mean expression for those genes. Additional variance was included to simulate genetic variation across the population. We also allowed for within-person correlation across genes, which was not done in the original simulations. The following methods were used to determine the number of dysregulated genes in each simulated data set: unadjusted p-values, Benjamini-Hochberg adjusted p-values, Bonferroni adjusted p-values, random forest importance levels, neural net prediction weights, and second-generation p-values.

RESULTS/ANTICIPATED RESULTS: Results vary depending on whether a pre-specified significance level is used or the top 10 ranked values are taken. When all methods are given the same prior information of 10 dysregulated genes, the Benjamini-Hochberg adjusted p-values and the second-generation p-values generally outperform all other methods. We were not able to reproduce or validate the finding that random forest importance levels, obtained via a machine learning algorithm, outperform classical methods. Almost uniformly, the machine learning methods did not yield improved accuracy statistics, and they depend heavily on the a priori chosen number of dysregulated genes.

DISCUSSION/SIGNIFICANCE OF IMPACT: In this context, machine learning methods do not outperform standard methods. Because of this, and because of their additional complexity, machine learning approaches would not be preferable. Of all the approaches, the second-generation p-value appears to offer significant benefit, at the cost of defining a priori a region of trivially null effect sizes. The choice of an analysis method for large-scale translational data is critical to the success of any statistical investigation, and our simulations clearly highlight the various tradeoffs among the available methods.
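
The difference between selecting genes by a pre-specified significance level and taking the top-ranked values can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the p-values are invented and are not results from the study, the function names are hypothetical, and the textbook Bonferroni and Benjamini-Hochberg step-up adjustments are used because the abstract does not describe the authors' implementation.

```python
# Illustrative sketch only: invented p-values, hypothetical function names.

def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_by_threshold(values, alpha=0.05):
    """Genes whose (adjusted) p-value is at or below a pre-specified level."""
    return {i for i, v in enumerate(values) if v <= alpha}


def select_top_k(values, k):
    """The k genes with the smallest values, regardless of any threshold."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return set(order[:k])


def sensitivity_and_fdp(selected, truth):
    """Share of true genes found, and share of selected genes that are false."""
    true_pos = len(selected & truth)
    sensitivity = true_pos / len(truth)
    fdp = (len(selected) - true_pos) / len(selected) if selected else 0.0
    return sensitivity, fdp


# Made-up example: 8 genes, of which genes 0, 1, and 2 are truly dysregulated.
example_pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.55, 0.74, 0.90]
truly_dysregulated = {0, 1, 2}

methods = {
    "unadjusted": example_pvals,
    "Bonferroni": bonferroni(example_pvals),
    "Benjamini-Hochberg": benjamini_hochberg(example_pvals),
}
for name, values in methods.items():
    by_alpha = select_by_threshold(values)
    by_rank = select_top_k(values, k=len(truly_dysregulated))
    print(name, sorted(by_alpha), sensitivity_and_fdp(by_alpha, truly_dysregulated),
          sorted(by_rank), sensitivity_and_fdp(by_rank, truly_dysregulated))
```

With these invented inputs, the 0.05 threshold selects four genes from the unadjusted p-values (including one false positive), two from the Benjamini-Hochberg values, and one from the Bonferroni values. Taking the top three ranked genes gives the same set for all three, because neither adjustment reorders the p-values. In this toy case the p-value-based methods differ only when a threshold is applied, which is one way the two evaluation modes described above can diverge.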
