A data-driven approach to estimating the number of clusters in hierarchical clustering

Abstract

DNA microarray and gene expression problems often require researchers to cluster their data in order to better understand its structure. When the number of clusters is not known, one can resort to hierarchical clustering methods. However, very few automated algorithms currently exist for determining the true number of clusters in the data. We propose two new methods (mode and maximum difference) for estimating the number of clusters in a hierarchical clustering framework, creating a fully automated process with no human intervention. These methods are compared to the established elbow and gap statistic algorithms using simulated datasets and the Biobase Gene ExpressionSet. We also explore a data mixing procedure inspired by cross-validation techniques. We find that, in multi-cluster scenarios, the overall performance of the maximum difference method is comparable to or better than that of the gap statistic, and that it achieves this performance at a fraction of the computational cost. The maximum difference method also responds well to our mixing procedure, which opens the door to future research. We conclude that both the mode and maximum difference methods warrant further study of their mixing and cross-validation potential. We particularly recommend the maximum difference method in multi-cluster scenarios, given its accuracy and execution time, and present it as an alternative to existing algorithms.
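The abstract does not define the mode or maximum difference methods, so the following is only a minimal sketch of a related, widely used heuristic: cut the dendrogram at the largest jump between successive merge heights. Whether this matches the authors' maximum difference method is an assumption. The function name `estimate_k_by_largest_gap`, the choice of average linkage, and the synthetic three-blob data are all hypothetical and chosen purely for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def estimate_k_by_largest_gap(data, method="average"):
    """Estimate the cluster count from the largest jump in merge heights.

    Illustrative heuristic only; not taken from the paper.
    """
    Z = linkage(data, method=method)   # (n - 1) x 4 linkage matrix
    heights = Z[:, 2]                  # merge heights, one per merge
    gaps = np.diff(heights)            # gaps[i] = heights[i + 1] - heights[i]
    i = int(np.argmax(gaps))           # cut between merge i and merge i + 1
    n = data.shape[0]
    k = n - (i + 1)                    # clusters remaining after i + 1 merges
    labels = fcluster(Z, t=k, criterion="maxclust")
    return k, labels


# Hypothetical test data: three well-separated Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in centers])

k, labels = estimate_k_by_largest_gap(data)
print("estimated number of clusters:", k)
```

Because the gap is taken between two consecutive merges, this sketch can return at most n - 1 and at least 2 clusters; it never reports a single cluster, so it is only meaningful when more than one cluster is expected.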
