Conference: Annual Conference on Neural Information Processing Systems

A Communication-Efficient Parallel Algorithm for Decision Tree

Abstract

Decision tree (and its extensions such as Gradient Boosting Decision Trees and Random Forest) is a widely used machine learning algorithm, due to its practical effectiveness and model interpretability. With the emergence of big data, there is an increasing need to parallelize the training process of decision tree. However, most existing attempts along this line suffer from high communication costs. In this paper, we propose a new algorithm, called Parallel Voting Decision Tree (PV-Tree), to tackle this challenge. After partitioning the training data onto a number of (e.g., M) machines, this algorithm performs both local voting and global voting in each iteration. For local voting, the top-κ attributes are selected from each machine according to its local data. Then, globally top-2κ attributes are determined by a majority voting among these local candidates. Finally, the full-grained histograms of the globally top-2κ attributes are collected from local machines in order to identify the best (most informative) attribute and its split point. PV-Tree can achieve a very low communication cost (independent of the total number of attributes) and thus can scale out very well. Furthermore, theoretical analysis shows that this algorithm can learn a near optimal decision tree, since it can find the best attribute with a large probability. Our experiments on real-world datasets show that PV-Tree significantly outperforms the existing parallel decision tree algorithms in the trade-off between accuracy and efficiency.
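The two-stage voting described above can be sketched in a few lines. This is a minimal simulation, not the paper's implementation: the per-machine attribute scores (standing in for local information gain) and the attribute names are made up for illustration, and the histogram-collection step is reduced to a comment.

```python
from collections import Counter

def local_top_k(scores, k):
    """Local voting: rank attributes by local score (e.g., information
    gain on this machine's data partition) and nominate the top k."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def pv_tree_select(machine_scores, k):
    """One PV-Tree attribute-selection round.

    machine_scores: one dict per machine, mapping attribute -> local gain.
    Returns the globally selected top-2k candidate attributes.
    """
    # Each machine casts k votes based only on its local data.
    votes = Counter()
    for scores in machine_scores:
        votes.update(local_top_k(scores, k))
    # Global majority voting: keep the 2k attributes with the most votes.
    # Only the full-grained histograms of these candidates are then
    # communicated, so the cost is independent of the total attribute count.
    return [attr for attr, _ in votes.most_common(2 * k)]

# Toy example: 3 machines, 6 attributes, k = 2 (scores are illustrative).
machines = [
    {"a0": 0.9,  "a1": 0.8, "a2": 0.1, "a3": 0.2, "a4": 0.05, "a5": 0.3},
    {"a0": 0.7,  "a1": 0.4, "a2": 0.5, "a3": 0.1, "a4": 0.2,  "a5": 0.05},
    {"a0": 0.85, "a1": 0.4, "a2": 0.3, "a3": 0.6, "a4": 0.1,  "a5": 0.2},
]
print(pv_tree_select(machines, k=2))  # a0 wins on every machine
```

Note that when machines largely agree (as with `a0` here), fewer than 2k distinct candidates may be nominated, which only reduces communication further.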
