Conference: Annual Conference on Neural Information Processing Systems

A Communication-Efficient Parallel Algorithm for Decision Tree

Abstract

Decision tree (and its extensions such as Gradient Boosting Decision Trees and Random Forest) is a widely used machine learning algorithm, due to its practical effectiveness and model interpretability. With the emergence of big data, there is an increasing need to parallelize the training process of decision tree. However, most existing attempts along this line suffer from high communication costs. In this paper, we propose a new algorithm, called Parallel Voting Decision Tree (PV-Tree), to tackle this challenge. After partitioning the training data onto a number of (e.g., M) machines, this algorithm performs both local voting and global voting in each iteration. For local voting, the top-κ attributes are selected from each machine according to its local data. Then, globally top-2κ attributes are determined by a majority voting among these local candidates. Finally, the full-grained histograms of the globally top-2κ attributes are collected from local machines in order to identify the best (most informative) attribute and its split point. PV-Tree can achieve a very low communication cost (independent of the total number of attributes) and thus can scale out very well. Furthermore, theoretical analysis shows that this algorithm can learn a near optimal decision tree, since it can find the best attribute with a large probability. Our experiments on real-world datasets show that PV-Tree significantly outperforms the existing parallel decision tree algorithms in the trade-off between accuracy and efficiency.
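The two-stage voting described above can be sketched in a few lines. This is a minimal simulation, not the paper's implementation: the per-machine attribute scores (standing in for local information gain) and the attribute names are made up for illustration, and the histogram-collection step is reduced to a comment.

```python
from collections import Counter

def local_top_k(scores, k):
    """Local voting: rank attributes by local score (e.g., information
    gain on this machine's data partition) and nominate the top k."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def pv_tree_select(machine_scores, k):
    """One PV-Tree attribute-selection round.

    machine_scores: one dict per machine, mapping attribute -> local gain.
    Returns the globally selected top-2k candidate attributes.
    """
    # Each machine casts k votes based only on its local data.
    votes = Counter()
    for scores in machine_scores:
        votes.update(local_top_k(scores, k))
    # Global majority voting: keep the 2k attributes with the most votes.
    # Only the full-grained histograms of these candidates are then
    # communicated, so the cost is independent of the total attribute count.
    return [attr for attr, _ in votes.most_common(2 * k)]

# Toy example: 3 machines, 6 attributes, k = 2 (scores are illustrative).
machines = [
    {"a0": 0.9,  "a1": 0.8, "a2": 0.1, "a3": 0.2, "a4": 0.05, "a5": 0.3},
    {"a0": 0.7,  "a1": 0.4, "a2": 0.5, "a3": 0.1, "a4": 0.2,  "a5": 0.05},
    {"a0": 0.85, "a1": 0.4, "a2": 0.3, "a3": 0.6, "a4": 0.1,  "a5": 0.2},
]
print(pv_tree_select(machines, k=2))  # a0 wins on every machine
```

Note that when machines largely agree (as with `a0` here), fewer than 2k distinct candidates may be nominated, which only reduces communication further.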
