Conference: International Conference on Big Data Analytics

Parallel Evolving Clustering Method for Big Data Analytics Using Apache Spark: Applications to Banking and Physics

Abstract

This paper proposes a novel parallel implementation of the Evolving Clustering Method (ECM). The original serial ECM is an online clustering method that processes the data in a single pass. The parallel version (Parallel ECM, or PECM) is implemented in the Apache Spark framework, which allows it to work in real time. The parallelization is intended to handle large-volume datasets. Many extant clustering algorithms lack a parallel one-pass variant; the proposed method addresses this shortcoming. Its effectiveness is demonstrated on a credit card fraud dataset (297 MB) and on a Higgs dataset from physics, pertaining to particle detectors in an accelerator (1.4 GB). The experimental setup was a cluster of 10 machines, each with 32 GB of RAM, running the Hadoop Distributed File System (HDFS) and the Spark computational environment. The main result is a dramatic reduction in computation time compared with the serial version of the ECM. In the future, PECM is to be hybridized with other machine learning algorithms to solve large-scale regression and classification problems.
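To make "online, single-pass" clustering concrete, here is a minimal serial sketch in the spirit of ECM as it is commonly described in the literature. It is not the paper's code and does not attempt the Spark parallelization, which the abstract does not detail. The function name `ecm`, the threshold parameter `dthr`, and the sample points are illustrative assumptions.

```python
import math

def ecm(points, dthr):
    """One-pass, distance-threshold clustering (illustrative sketch).

    Each point is seen exactly once. `dthr` (an assumed parameter name)
    bounds how large a cluster radius may grow.
    """
    centers, radii = [], []
    for x in points:
        if not centers:
            # The first point seeds the first cluster with radius 0.
            centers.append(list(x))
            radii.append(0.0)
            continue
        dists = [math.dist(x, c) for c in centers]
        # Case 1: the point already lies inside an existing cluster.
        if any(d <= r for d, r in zip(dists, radii)):
            continue
        # Case 2: pick the cluster minimizing (distance + current radius).
        s, a = min((d + r, j) for j, (d, r) in enumerate(zip(dists, radii)))
        if s > 2 * dthr:
            # Too far from every cluster: start a new one at this point.
            centers.append(list(x))
            radii.append(0.0)
        else:
            # Grow cluster `a`: the new radius is s / 2, and the center moves
            # toward x so that x sits exactly on the new boundary.
            new_r = s / 2
            t = (dists[a] - new_r) / dists[a]
            centers[a] = [c + t * (xi - c) for c, xi in zip(centers[a], x)]
            radii[a] = new_r
    return centers, radii

centers, radii = ecm([(0, 0), (1, 0), (10, 0), (0.5, 0)], dthr=1.0)
print(centers, radii)
```

Because each point is examined once and only the cluster centers and radii are kept in memory, the loop suits streaming input. How such state is partitioned and merged across Spark workers is the paper's contribution and is not reproduced here.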
