Fast nonparametric machine learning algorithms for high-dimensional massive data and applications.

Abstract

Nonparametric methods have become increasingly popular in the statistics and probabilistic AI communities. One well-known nonparametric method is "nearest-neighbor". It uses the observations in the training set T closest in input space to a query q to form the prediction for q. Specifically, when k of the observations in T are considered, it is called k-nearest-neighbor (or k-NN).

Despite its simplicity, k-NN and its variants have been successful in many machine learning problems, including pattern recognition, text categorization, information retrieval, computational statistics, databases, and data mining. It is also used for estimating sample distributions and Bayes error.

However, k-NN and many related nonparametric methods remain hampered by their computational complexity. Many spatial methods, such as metric-trees, have been proposed to alleviate the computational cost, but their effectiveness decreases as the dimensionality of the feature vectors increases. From another direction, researchers are developing ways to find approximate answers. The premise of this research is that in many cases it is not necessary to insist on the exact answer; an approximate one is sufficient. Indeed, some approximate methods perform well in a number of applications, and some enjoy very good theoretical soundness. However, when facing hundreds or thousands of dimensions, many algorithms still do not work well in practice.

I propose four new spatial methods for fast k-NN and its variants, namely KNS2, KNS3, IOC, and spill-tree. The first three algorithms are designed to speed up k-NN classification, and they all share the same insight: finding the majority class among the k nearest neighbors of q does not require explicitly finding those k nearest neighbors. Spill-tree is designed for approximate k-NN search. By relaxing metric-trees into a more flexible data structure, spill-tree adapts to the distribution of the data and scales well even to huge high-dimensional data sets. Significant efficiency improvements have been observed compared to LSH (locality-sensitive hashing), the state-of-the-art approximate k-NN algorithm. I applied spill-tree to three real-world applications: video shot segmentation, drug activity detection, and image clustering, all of which are described in the thesis.
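To make the basic setting concrete, here is a minimal brute-force k-NN classifier in Python, assuming a Euclidean input space and NumPy arrays; the function and variable names are illustrative, not from the thesis.

```python
import numpy as np

def knn_predict(T_X, T_y, q, k=5):
    """Predict the label of query q by majority vote among the k
    observations in the training set (T_X, T_y) closest to q."""
    # Brute force: one distance per training point. This O(n) scan
    # is exactly the cost the thesis's spatial methods try to avoid.
    dists = np.linalg.norm(T_X - q, axis=1)
    nn = np.argsort(dists)[:k]          # indices of the k nearest points
    labels, counts = np.unique(T_y[nn], return_counts=True)
    return labels[np.argmax(counts)]
```

The insight shared by KNS2, KNS3 and IOC can be illustrated in the binary case: for odd k, class 1 holds the majority among the k nearest neighbors exactly when its m-th nearest point, with m = (k+1)/2, is closer to q than the (k-m+1)-th nearest point of class 0, so the neighbor set itself never has to be identified. Below is a toy sketch of that decision rule only; the actual algorithms obtain the two order statistics cheaply from metric-tree distance bounds rather than sorting all distances as done here.

```python
import numpy as np

def knn_majority_binary(pos_X, neg_X, q, k=5):
    """Return the k-NN majority class (k odd) without materializing
    the neighbor set. Assumes at least m class-1 points and at least
    k - m + 1 class-0 points in the training data."""
    m = (k + 1) // 2
    d_pos = np.sort(np.linalg.norm(pos_X - q, axis=1))
    d_neg = np.sort(np.linalg.norm(neg_X - q, axis=1))
    # Class 1 wins iff its m-th nearest point beats the
    # (k - m + 1)-th nearest class-0 point.
    return 1 if d_pos[m - 1] < d_neg[k - m] else 0
```

Because only two order statistics are compared, a tree search can stop as soon as its bounds on those two distances separate, which is where the speedup over exact neighbor retrieval comes from.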
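The spill-tree itself is described publicly (Liu, Moore, Gray and Yang, NIPS 2004) as a metric-tree whose two children are allowed to overlap: points within a buffer of width tau around the splitting plane go to both children, which lets approximate search descend greedily to the nearer child only ("defeatist" search) with a good chance the true neighbors were spilled along the path. The following is a minimal sketch of that idea under Euclidean distance; names like SpillNode, tau and leaf_size are illustrative, and the real structure adds hybrid non-overlapping nodes and random projections for very high dimensions.

```python
import numpy as np

class SpillNode:
    """Metric-tree node with overlapping children (spill-tree sketch)."""

    def __init__(self, points, tau=0.1, leaf_size=32):
        self.points = None
        if len(points) <= leaf_size:
            self.points = points                  # leaf: store the points
            return
        # Approximate farthest pair (a, b) to define the split direction.
        a = points[0]
        b = points[np.argmax(np.linalg.norm(points - a, axis=1))]
        a = points[np.argmax(np.linalg.norm(points - b, axis=1))]
        if np.linalg.norm(b - a) == 0:            # all points identical
            self.points = points
            return
        self.u = (b - a) / np.linalg.norm(b - a)  # split direction
        self.mid = (a + b) / 2                    # point on the midplane
        proj = (points - self.mid) @ self.u       # signed plane distance
        left, right = proj < tau, proj > -tau     # overlap of width 2*tau
        if left.all() or right.all():
            # Overlap swallowed a whole side: fall back to a plain
            # non-overlapping split (roughly the hybrid rule).
            left, right = proj < 0, proj >= 0
        self.left = SpillNode(points[left], tau, leaf_size)
        self.right = SpillNode(points[right], tau, leaf_size)

    def search(self, q, k=1):
        """Defeatist search: follow only the nearer child, no backtracking."""
        if self.points is not None:
            d = np.linalg.norm(self.points - q, axis=1)
            return self.points[np.argsort(d)[:k]]
        child = self.left if (q - self.mid) @ self.u < 0 else self.right
        return child.search(q, k)
```

A query thus descends a single root-to-leaf path, so search cost is roughly tree depth plus one leaf scan, at the price of occasionally missing a true neighbor that falls outside every buffer along the path.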

Bibliographic record

  • Author: Liu, Ting.
  • Affiliation: Carnegie Mellon University.
  • Awarding institution: Carnegie Mellon University.
  • Subject: Computer Science.
  • Degree: Ph.D.
  • Year: 2006
  • Pagination: 124 p.
  • Total pages: 124
  • Format: PDF
  • Language: eng
  • CLC classification: Automation and computer technology.
  • Keywords:
