Building on prior research in frequent itemset mining, this paper proposes SubApr, a subset-based parallel improvement of the Apriori algorithm for the Hadoop distributed computing framework. SubApr scans the database only twice: it partitions the data into blocks, assigns the blocks to different Hadoop compute nodes, and prunes candidates using the Apriori property together with characteristics of the MapReduce framework itself. Compared with similar algorithms, it reduces the data each compute node must store and the number of candidate itemsets emitted, which cuts the data communication generated when mining large data sets and thereby improves the efficiency of parallel mining. Experimental results show that the algorithm is effective and feasible.
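The abstract does not spell out SubApr's internals, so the following is only a minimal sketch of how a two-scan, subset-based scheme of this general kind might look, simulated in a single Python process rather than on Hadoop. Every name here (`split_blocks`, `map_item_counts`, `map_subset_counts`, `reduce_counts`, `mine`) and the `max_len` cap on itemset size are assumptions made for illustration and are not taken from the paper. The first scan counts single items; the second scan drops infrequent items from each transaction (the Apriori property) before emitting subsets, so fewer candidates reach the reducer.

```python
from collections import Counter
from itertools import combinations


def split_blocks(transactions, n_blocks):
    """Deal transactions into n_blocks blocks, standing in for input splits."""
    return [transactions[i::n_blocks] for i in range(n_blocks)]


def map_item_counts(block):
    """Scan 1 mapper: count each item once per transaction in this block."""
    counts = Counter()
    for transaction in block:
        counts.update(set(transaction))
    return counts


def map_subset_counts(block, frequent_items, max_len):
    """Scan 2 mapper: drop infrequent items, then count subsets of the rest."""
    counts = Counter()
    for transaction in block:
        # Apriori property: an itemset that contains an infrequent item
        # cannot itself be frequent, so such items are pruned up front.
        kept = sorted(set(transaction) & frequent_items)
        for k in range(2, min(max_len, len(kept)) + 1):
            counts.update(combinations(kept, k))
    return counts


def reduce_counts(partials, min_count):
    """Reducer: sum per-block counts and keep keys that reach min_count."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return {key: n for key, n in total.items() if n >= min_count}


def mine(transactions, min_count, n_blocks=2, max_len=3):
    """Run both scans and return {itemset tuple: support count}."""
    blocks = split_blocks(transactions, n_blocks)
    # Scan 1: frequent single items.
    frequent_1 = reduce_counts([map_item_counts(b) for b in blocks], min_count)
    frequent_items = set(frequent_1)
    # Scan 2: frequent itemsets of size 2..max_len built from frequent items.
    frequent_k = reduce_counts(
        [map_subset_counts(b, frequent_items, max_len) for b in blocks],
        min_count,
    )
    result = {(item,): n for item, n in frequent_1.items()}
    result.update(frequent_k)
    return result
```

For example, `mine([["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"], ["a", "b", "c"]], min_count=3)` keeps `a`, `b`, and `c` after the first scan and reports `("a", "b")` and `("a", "c")` as the frequent pairs. Because the reducer only sums per-block counts, the result does not depend on how the transactions are split into blocks. The number of subsets grows quickly with transaction length, which is why this sketch caps itemset size with `max_len`.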