A parallel k-means clustering algorithm based on redundance elimination and extreme points optimization employing MapReduce

Zhuo Tang; Kunkun Liu; Jinbo Xiao; Li Yang; Zheng Xiao

Abstract

On massive statistical data sets, the k-means algorithm struggles to meet data processing needs because it lacks an effective parallel mechanism. This paper proposes an improved k-means algorithm (IMR-KCA) for clustering analysis of medical data on the MapReduce computing framework. First, after analyzing the large amount of redundant computation in traditional k-means algorithms, we propose a selection model that simplifies the computation with multiple cluster centers, and we prove the correctness of this model with several theorems. Second, we provide a method for calculating the distances from extreme points to center points, in which the original Euclidean distance is replaced with the Manhattan distance; a further group of theorems proves the correctness of this simplification. Next, we provide a group of implementation algorithms that parallelize the clustering computation on the MapReduce framework. Finally, experimental results show that IMR-KCA is more reliable and efficient than a direct MapReduce parallelization of traditional clustering algorithms.
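The abstract gives no pseudocode, so the following is only a minimal single-process sketch of how one k-means iteration can be split into map, combine, and reduce steps with Manhattan distance used for assignment. It is not the IMR-KCA algorithm: the selection model and the extreme-point optimization are not reproduced, and all function names (`manhattan`, `map_phase`, `combine`, `reduce_phase`, `kmeans`) are hypothetical. Updating each center to the mean of its assigned points is also an assumption.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Point = Tuple[float, ...]


def manhattan(a: Point, b: Point) -> float:
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def map_phase(points: Iterable[Point],
              centers: List[Point]) -> Iterator[Tuple[int, Point]]:
    """Map: emit (index of the nearest center, point) for every point."""
    for p in points:
        nearest = min(range(len(centers)),
                      key=lambda i: manhattan(p, centers[i]))
        yield nearest, p


def combine(pairs: Iterable[Tuple[int, Point]]) -> Dict[int, Tuple[List[float], int]]:
    """Combine: per-center partial coordinate sums and point counts."""
    partials: Dict[int, Tuple[List[float], int]] = {}
    for key, p in pairs:
        acc, count = partials.get(key, ([0.0] * len(p), 0))
        for j, x in enumerate(p):
            acc[j] += x
        partials[key] = (acc, count + 1)
    return partials


def reduce_phase(partials: Dict[int, Tuple[List[float], int]],
                 centers: List[Point]) -> List[Point]:
    """Reduce: new center = mean of its points (assumed update rule).

    A center with no assigned points keeps its previous position.
    """
    new_centers = list(centers)
    for key, (acc, count) in partials.items():
        new_centers[key] = tuple(x / count for x in acc)
    return new_centers


def kmeans(points: List[Point], centers: List[Point],
           max_iter: int = 100) -> List[Point]:
    """Repeat map -> combine -> reduce until the centers stop changing."""
    for _ in range(max_iter):
        new_centers = reduce_phase(combine(map_phase(points, centers)), centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)]))
```

In a real MapReduce job the map and combine steps would run per input split and the reduce step per center key; here they are chained in one process purely to show the data flow.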


Bibliographic details

Source
Concurrency and Computation, 2017, Issue 20, e4109.1-e4109.18 (18 pages)
Authors
Zhuo Tang; Kunkun Liu; Jinbo Xiao; Li Yang; Zheng Xiao
Author affiliations

College of Information Science and Engineering, Hunan University, Hunan 410082, China;

College of Information Science and Engineering, Hunan University, Hunan 410082, China;

College of Information Science and Engineering, Hunan University, Hunan 410082, China;

Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation, School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha 410004, P. R. China;

College of Information Science and Engineering, Hunan University, Hunan 410082, China;

Original format: PDF
Language: English
Keywords
clustering algorithms; extreme point; k-means; MapReduce; redundant distance;


Similar literature

1. An analysis of MapReduce efficiency in document clustering using parallel K-means algorithm [J]. Tanvir Habib Sardar, Zahid Ansari. Future Computing and Informatics Journal. 2018, Issue 2.
2. Data Categorization Using Hadoop MapReduce-Based Parallel K-Means Clustering [J]. Zahid Ansari, Asif Afzal, Tanvir Habib Sardar. Journal of The Institution of Engineers (India): Series B. 2019, Issue 2.
3. A MapReduce-based parallel K-means clustering for large-scale CIM data verification [J]. Deng Chuang, Liu Yang, Xu Lixiong. Concurrency and Computation: Practice and Experience. 2016, Issue 11.
4. Genetic Algorithm Based Parallel K-Means Data Clustering Algorithm Using MapReduce Programming Paradigm on Hadoop Environment (GAPKCA) [C]. Sayer Alshammari, Maslina Binti Zolkepli, Rusli Bin Abdullah. International Conference on Soft Computing and Data Mining. 2020.
5. A K-means based watershed imaging segmentation algorithm for banana cluster quality inspection [D]. Castillo Cepin, Gregorio Alfonso. 2016.
6. Big Data: A Parallel Particle Swarm Optimization-Back-Propagation Neural Network Algorithm Based on MapReduce [O]. Jianfang Cao, Hongyan Cui, Hao Shi.
7. K-means Clustering Optimization Algorithm Based on MapReduce [O]. Zhihua Li, Xudong Song, Wenhui Zhu. 2015.
