Finding multiple clustering structures in data, with applications to DNA microarrays.

Abstract

Cluster analysis is the art of discovering classes in data. Traditionally, the goal of cluster analysis has been to uncover the unknown clustering structure by partitioning the observations into a single set of clusters such that the observations within each cluster are more similar to one another than those assigned to different clusters. However, as the number of variables gets larger, it becomes increasingly unlikely for any pair of observations to be similar across all the variables simultaneously. In contrast, the observations tend to group better on small subsets of the variables. Moreover, different subsets of variables might induce different and potentially useful clustering structures of observations. In this work, the standard clustering problem of finding a single clustering structure of observations is first generalized to the problem of discovering multiple clustering structures and finding variables that induce them. Three dissimilarity measures based on entropy, empirical measures and interpoint-distance based graphs are proposed for clustering variables and their performance is compared to the widely used correlation-based dissimilarity. A procedure based on binning that makes the computation of the first two of these dissimilarity measures feasible is developed. We also propose a weighted distance two-way clustering method for discovering multiple clustering structures in the data and give a randomization test for similarity of clustering structures. The motivating application is to gene expression data.
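
The abstract names a correlation-based dissimilarity and an entropy-based dissimilarity computed through binning, but it gives no formulas. The sketch below is therefore only an illustration under stated assumptions, not the thesis's definitions: it takes the correlation-based dissimilarity to be 1 - |r|, and it swaps in a normalized variation of information, 1 - I(X;Y)/H(X,Y), estimated from an equal-width 2-D histogram, as a stand-in for the entropy-based measure. The function names and the `bins` parameter are hypothetical.

```python
import numpy as np


def correlation_dissimilarity(x, y):
    """Assumed form: 1 - |Pearson r|, near 0 for linearly related variables."""
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - abs(r)


def entropy(p):
    """Shannon entropy (in nats) of a probability array; empty cells are skipped."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))


def binned_entropy_dissimilarity(x, y, bins=8):
    """Assumed form: 1 - I(X;Y) / H(X,Y), estimated from a 2-D histogram.

    Binning turns the continuous variables into discrete ones so that
    the entropies can be computed from cell frequencies.
    """
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    joint = counts / counts.sum()
    h_xy = entropy(joint)
    if h_xy == 0.0:  # all mass in one cell: the variables carry no information
        return 0.0
    h_x = entropy(joint.sum(axis=1))  # marginal of x
    h_y = entropy(joint.sum(axis=0))  # marginal of y
    mutual_info = h_x + h_y - h_xy
    return 1.0 - mutual_info / h_xy


# Synthetic data (not from the thesis): y_quadratic depends on x nonlinearly.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)
y_quadratic = x ** 2 + rng.normal(scale=0.01, size=x.size)
y_independent = rng.uniform(-1.0, 1.0, size=x.size)

d_corr_quadratic = correlation_dissimilarity(x, y_quadratic)
d_entropy_quadratic = binned_entropy_dissimilarity(x, y_quadratic)
d_entropy_independent = binned_entropy_dissimilarity(x, y_independent)
```

On this synthetic pair, the correlation-based value treats `x` and `y_quadratic` as nearly unrelated, while the binned entropy-based value separates the dependent pair from the independent one. This is one plausible reason to compare entropy-based measures against correlation when clustering variables, though the thesis's own comparison may rest on different definitions.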

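Bibliographic record

  • Author

    Belitskaya, Ilana Yolyevna

  • Author affiliation

    Stanford University

  • Degree-granting institution: Stanford University
  • Discipline: Statistics
  • Degree: Ph.D.
  • Year: 2003
  • Page: p.1319
  • Total pages: 148
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Statistics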