Journal: 《计算机技术与发展》 (Computer Technology and Development)

Research on the Parallelization of Association Rule Mining Algorithms Based on Spark

         

Abstract

Association rule mining is an important data mining task: its algorithms uncover latent associations in data, and the Apriori algorithm is a typical representative. Spark is a distributed, in-memory big data framework well suited to iterative computation. To improve the efficiency of mining strong association rules, this paper designs a parallelization scheme for the Apriori algorithm on Spark. The scheme uses Spark's distributed architecture and cluster scheduling mechanism to distribute the transaction data set across multiple worker nodes. Each worker node invokes transformation operations to compute local candidate itemsets and their support counts, and keeps the results in memory. The local candidate itemsets are then aggregated to produce the global candidate itemsets and the global frequent itemsets. This process is iterated until no candidate itemsets exist at the next level. Performance experiments show that the Spark-based parallel Apriori algorithm can effectively find frequent itemsets in large data sets and extract strong association rules, with high accuracy and timeliness.
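The iteration described in the abstract can be illustrated with a small plain-Python simulation. This is only a sketch under stated assumptions, not the authors' implementation: it does not use Spark at all, the partitions are ordinary lists standing in for worker nodes, and the function names (`local_counts`, `next_candidates`, `parallel_apriori`, `strong_rules`), the sample transactions, and the thresholds are all hypothetical. In a real Spark job, the per-partition counting would roughly correspond to a transformation such as `mapPartitions`, and the merge to an aggregation such as `reduceByKey`.

```python
from collections import Counter
from itertools import combinations


def local_counts(partition, candidates):
    """Count, within one partition, how many transactions contain each candidate."""
    counts = Counter()
    for transaction in partition:
        for cand in candidates:
            if cand <= transaction:  # candidate is a subset of the transaction
                counts[cand] += 1
    return counts


def next_candidates(frequent, k):
    """Build k-itemsets whose every (k-1)-subset is frequent (Apriori property)."""
    items = sorted({item for itemset in frequent for item in itemset})
    result = set()
    for combo in combinations(items, k):
        if all(frozenset(sub) in frequent for sub in combinations(combo, k - 1)):
            result.add(frozenset(combo))
    return result


def parallel_apriori(partitions, min_support):
    """Return {itemset: global support count} for all frequent itemsets."""
    # Level-1 candidates: every single item seen in any partition.
    candidates = {frozenset([i]) for part in partitions for t in part for i in t}
    frequent_all = {}
    k = 1
    while candidates:
        # "Worker" step: each partition counts the candidates independently.
        partials = [local_counts(part, candidates) for part in partitions]
        # "Aggregation" step: merge local counts into global support counts.
        total = Counter()
        for partial in partials:
            total.update(partial)
        frequent = {c: n for c, n in total.items() if n >= min_support}
        frequent_all.update(frequent)
        k += 1
        candidates = next_candidates(set(frequent), k)
    return frequent_all


def strong_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) for rules meeting the threshold."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so its support count is available in `frequent`.
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


# Hypothetical sample data: two partitions of two transactions each.
partitions = [
    [frozenset({"milk", "bread"}), frozenset({"milk", "bread", "butter"})],
    [frozenset({"bread", "butter"}), frozenset({"milk", "bread"})],
]
frequent = parallel_apriori(partitions, min_support=2)
rules = strong_rules(frequent, min_confidence=0.8)
```

The sketch counts a shared candidate set in every partition and then sums the counts, which is one common way to realize the "local support, then global aggregation" step; the paper's own scheme may generate or merge local candidates differently.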
