2016 IEEE International Conference on Cloud Engineering Workshop

A Collective Communication Layer for the Software Stack of Big Data Analytics

Abstract

The landscape of distributed computing is rapidly evolving, with computers exhibiting ever greater processing capability through many-core architectures. Almost every field of science is now data driven and requires the analysis of massive datasets. Analytics algorithms such as machine learning can be used to discover properties of a given dataset and to make predictions based on it. However, there is still a lack of simple and unified programming frameworks for these data-intensive applications, and many existing efforts rely on specialized means to speed up individual algorithms. In this thesis research, a distributed programming model, MapCollective, is defined so that it can be applied easily to many machine learning algorithms. Specifically, algorithms that fit the iterative computation model can be parallelized easily with a dedicated collective communication layer for efficient synchronization. In contrast to traditional parallelization strategies that focus on handling high-volume input data, a lesser-known challenge is that the model data shared between parallel workers is equally high volume, multidimensional, and must be communicated continually during the entire execution. This extends the understanding of the data aspects of computation from in-memory caching of input data (e.g., the iterative MapReduce model) to fine-grained synchronization of model data (e.g., the MapCollective model). A library called Harp is developed as a Hadoop plugin to demonstrate that sophisticated machine learning algorithms can be abstracted simply with the MapCollective model and developed conveniently on top of the MapReduce framework. K-means and Multi-Dimensional Scaling (MDS) are tested over 4096 threads on the IU Big Red II supercomputer. The results show linear speedup with an increasing number of parallel units.
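
To make the MapCollective pattern concrete, below is a minimal sketch of one K-means step under this model: each worker computes a partial model update from its local data partition, then an allreduce-style collective merges the partials so every worker starts the next iteration with the same synchronized centroids. The sketch is plain, self-contained Java that simulates the workers in a single process; it does not use Harp's actual API, and all class and method names in it are illustrative assumptions rather than the library's interface.

import java.util.Arrays;

/**
 * A minimal sketch of the MapCollective idea: workers compute partial
 * updates to the shared model (K-means centroids) from their own data
 * partitions, then an allreduce-style collective merges the partials so
 * every worker sees the same new model. Workers are simulated
 * sequentially in one process here; in Harp they would be Hadoop map
 * tasks synchronized by the collective communication layer.
 */
public class MapCollectiveKMeansSketch {

    // One worker's contribution for a single iteration: per-centroid
    // coordinate sums and point counts from its local partition.
    static class PartialModel {
        final double[][] sums;   // [k][dim] coordinate sums
        final int[] counts;      // [k] points assigned to each centroid

        PartialModel(int k, int dim) {
            sums = new double[k][dim];
            counts = new int[k];
        }
    }

    // "Map" phase: assign local points to the nearest centroid and
    // accumulate partial sums. No communication happens here.
    static PartialModel localUpdate(double[][] points, double[][] centroids) {
        PartialModel p = new PartialModel(centroids.length, centroids[0].length);
        for (double[] x : points) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double d = 0;
                for (int j = 0; j < x.length; j++) {
                    double diff = x[j] - centroids[c][j];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            for (int j = 0; j < x.length; j++) p.sums[best][j] += x[j];
            p.counts[best]++;
        }
        return p;
    }

    // "Collective" phase: an allreduce over the partial models. Summing
    // the partials and recomputing centroid means gives every worker an
    // identical model, i.e. the fine-grained model synchronization the
    // abstract describes.
    static double[][] allreduce(PartialModel[] partials, int k, int dim) {
        double[][] merged = new double[k][dim];
        int[] counts = new int[k];
        for (PartialModel p : partials) {
            for (int c = 0; c < k; c++) {
                counts[c] += p.counts[c];
                for (int j = 0; j < dim; j++) merged[c][j] += p.sums[c][j];
            }
        }
        for (int c = 0; c < k; c++)
            for (int j = 0; j < dim; j++)
                if (counts[c] > 0) merged[c][j] /= counts[c];
        return merged;
    }

    public static void main(String[] args) {
        // Two simulated workers, each holding a partition of 1-D points.
        double[][][] partitions = { {{1.0}, {1.2}}, {{8.8}, {9.0}} };
        double[][] centroids = { {0.0}, {10.0} };  // initial model, k = 2

        for (int iter = 0; iter < 3; iter++) {
            PartialModel[] partials = new PartialModel[partitions.length];
            for (int w = 0; w < partitions.length; w++)
                partials[w] = localUpdate(partitions[w], centroids);
            centroids = allreduce(partials, 2, 1);
            System.out.println("iter " + iter + ": " + Arrays.deepToString(centroids));
        }
    }
}

Note that what travels in the collective step is the model (centroid sums and counts), not the input points. This is precisely the continually communicated model data that distinguishes the MapCollective model from input-caching approaches such as iterative MapReduce, where the communication cost scales with the size of the shared model rather than the input.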