
Quality-driven early stopping for explorative cluster analysis for big data

Abstract

Data analysis has become a critical success factor for companies in all areas. Hence, it is necessary to gain knowledge quickly from available datasets, which is becoming especially challenging in times of big data. Typical data mining tasks like cluster analysis are very time-consuming even if they run in highly parallel environments like Spark clusters. To support data scientists in explorative data analysis processes, we need techniques that make data mining tasks even more efficient. To this end, we introduce a novel approach to stop clustering algorithms as early as possible while still achieving an adequate quality of the detected clusters. Our approach exploits the iterative nature of many clustering algorithms and uses a metric to decide after which iteration the mining task should stop. We present experimental results based on a Spark cluster using multiple huge datasets. The experiments show that our approach is able to accelerate the clustering by up to a factor of more than 800 by skipping many iterations that provide only little gain in quality. This way, we are able to find a good balance between the time required for data analysis and the quality of the analysis results.
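
The abstract does not say which metric the authors use or how it is evaluated, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes plain Lloyd's k-means in pure Python (no Spark) and a hypothetical stopping rule: stop once an iteration reduces the sum of squared errors (SSE) by less than a relative threshold. The names `kmeans_early_stop` and `min_rel_gain` are illustrative assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_early_stop(points, k, min_rel_gain=0.01, max_iter=100, seed=0):
    """Lloyd's k-means with a quality-driven early stop.

    ASSUMPTION: the stopping metric is the relative SSE improvement
    between consecutive iterations; the paper's actual metric may differ.
    Returns (centroids, labels, iterations_run, final_sse).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    prev_sse = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        sse = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))

        # Early-stop check: quit when the last iteration gained too little.
        if prev_sse is not None:
            gain = (prev_sse - sse) / prev_sse if prev_sse > 0 else 0.0
            if gain < min_rel_gain:
                break
        prev_sse = sse

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels, iteration, sse


# Two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels, iterations, sse = kmeans_early_stop(points, k=2)
```

A larger `min_rel_gain` stops earlier and accepts a coarser clustering; a smaller one runs closer to full convergence. This is the time-versus-quality trade-off the abstract describes.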

Bibliographic details

  • Source
    Computer science | 2019, Issue 3 | pp. 129-140 | 12 pages
  • Author affiliations

    Institute for Parallel and Distributed Systems University of Stuttgart Universitatsstr. 38 70569 Stuttgart Germany;

    Institute for Parallel and Distributed Systems University of Stuttgart Universitatsstr. 38 70569 Stuttgart Germany;

    Institute for Parallel and Distributed Systems University of Stuttgart Universitatsstr. 38 70569 Stuttgart Germany;

  • Indexing information
  • Original format: PDF
  • Text language: English
  • Chinese Library Classification
  • Keywords

    Clustering; Big data; Early stop; Convergence; Regression;
