
An empirical study of Classification and Regression Tree and Random Forests.



Abstract

The explosion of data, and of efforts to explore it, fuels the demand for self-learning methodologies that extract hidden patterns from a training data set and provide predictive information for future data sets. This thesis is concerned exclusively with the Classification and Regression Tree (CART) and its successor, Random Forests (RF), and especially with their classification aspect.

Unlike many traditional classification approaches, such as K Nearest Neighbor, Discriminant Analysis, and Neural Networks, CART provides information about the structure of the data and makes its decisions interpretable. Using bagging and majority vote, RF combines a large number of random trees into an accurate "black-box" model. Its accuracy is comparable to that of other well-known classification methods such as AdaBoost and Support Vector Machines, while RF is more stable.

After a comprehensive study, some extensions of CART and RF are proposed to derive a probability scoring system for improved classification accuracy. A well-known drawback of CART is that it is a hard classifier, which limits its use in practical applications that require a confidence level or a posterior probability estimate. In this thesis, we investigate several scoring methods and contribute s-CART, a new version of CART with a built-in posterior probability scoring system. Analysis of both traditional machine learning benchmark data sets and newly emerged proteomic data sets shows that s-CART has better prediction performance than the traditional CART, at a competitive speed.

Random Forests, an extension of CART in which the final classification decision is made by taking the majority vote of multiple CARTs generated via bootstrap resampling, is hailed as the most promising classifier developed to date. After an intensive literature search and study to understand its mechanism and properties, we introduce a third source of randomness to RF: a random splitting method at each non-leaf node throughout the entire tree. The resulting RF is called Forest-RS. The advantage of this randomness is studied theoretically and numerically, and comparisons are made between Forest-RS and the traditional RF. Additionally, we study the stability of RF, especially the diversity of randomness, which appears to be an untouched research field.

A major contribution of this thesis is fast, platform-independent software that implements the traditional CART (both the classification tree and the regression tree), the traditional RF (both Forest-RI and Forest-RC), and the newly developed s-CART and Forest-RS. Compared with other known software, whether public or commercial, it provides much more model information, more flexibility for parameter tuning, and more advanced features such as variable ranking, missing-data processing, and robustness studies. Above all, it implements the ideas of this thesis, which makes it unique.

Key Words: Classification and Regression Tree, Decision Tree, Random Forests, Resampling, Scoring.
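The abstract describes RF as many trees grown on bootstrap resamples and combined by majority vote, and notes that a hard classifier gives no confidence level. The following is a minimal Python sketch of that general idea only, not the thesis software: it does not implement s-CART or Forest-RS, whose details are not given here. As a simplifying assumption, one-split decision stumps stand in for full CART trees, and the fraction of votes per class is used as a crude soft score. All function names (`fit_stump`, `fit_bagged`, `vote_fractions`, `predict`) are hypothetical.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Fit a one-split tree: pick the (feature, threshold) with fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            if not left or not right:
                continue  # a split must send data to both sides
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    if best is None:
        # No valid split (all rows identical): always predict the majority class.
        maj = Counter(y).most_common(1)[0][0]
        return (0, float("inf"), maj, maj)
    return best[1:]


def predict_stump(stump, row):
    j, t, l_lab, r_lab = stump
    return l_lab if row[j] <= t else r_lab


def fit_bagged(X, y, n_trees=25, seed=0):
    """Grow each stump on a bootstrap resample (n draws with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest


def vote_fractions(forest, row):
    """Share of trees voting for each class: a rough soft score, not a calibrated posterior."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return {lab: count / len(forest) for lab, count in votes.items()}


def predict(forest, row):
    """Hard decision by majority vote."""
    fractions = vote_fractions(forest, row)
    return max(fractions, key=fractions.get)


# Toy one-feature data with two well-separated classes (made up for illustration).
X = [[0.0], [1.0], [2.0], [3.0], [7.0], [8.0], [9.0], [10.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_bagged(X, y)
print(predict(forest, [-1.0]), vote_fractions(forest, [-1.0]))
```

Returning the vote fractions alongside the hard label is one simple way to attach a confidence to each prediction; the scoring methods investigated in the thesis may differ substantially from this.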

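Bibliographic details

  • Author

    Xu, Bin

  • Author affiliation

    State University of New York at Stony Brook

  • Degree-granting institution: State University of New York at Stony Brook
  • Subject: Statistics
  • Degree: Ph.D.
  • Year: 2006
  • Pages: 95 p.
  • Total pages: 95
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Statistics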