2016 IEEE International Conference on Cloud Engineering Workshop

A Collective Communication Layer for the Software Stack of Big Data Analytics

Abstract

The landscape of distributed computing is rapidly evolving, with computers exhibiting ever greater processing capability through many-core architectures. Almost every field of science is now data driven and requires the analysis of massive datasets. Analytics algorithms such as machine learning can be used to discover properties of a given dataset and to make predictions based on it. However, there is still a lack of simple and unified programming frameworks for these data-intensive applications, and many existing efforts rely on specialized means to speed up individual algorithms. In this thesis research, a distributed programming model, MapCollective, is defined so that it can be applied easily to many machine learning algorithms. Specifically, algorithms that fit the iterative computation model can be parallelized easily with a dedicated collective communication layer for efficient synchronization. In contrast to traditional parallelization strategies that focus on handling high-volume input data, a lesser-known challenge is that the model data shared between parallel workers is equally high volume, multidimensional, and must be communicated continually during the entire execution. This extends the understanding of the data aspects of computation from in-memory caching of input data (e.g., the iterative MapReduce model) to fine-grained synchronization of model data (e.g., the MapCollective model). A library called Harp is developed as a Hadoop plugin to demonstrate that sophisticated machine learning algorithms can be abstracted simply with the MapCollective model and developed conveniently on top of the MapReduce framework. K-means and Multi-Dimensional Scaling (MDS) are tested over 4096 threads on the IU Big Red II supercomputer. The results show linear speedup with an increasing number of parallel units.
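
To make the MapCollective pattern concrete, below is a minimal sketch of one K-means step under this model: each worker computes a partial model update from its local data partition, then an allreduce-style collective merges the partials so every worker starts the next iteration with the same synchronized centroids. The sketch is plain, self-contained Java that simulates the workers in a single process; it does not use Harp's actual API, and all class and method names in it are illustrative assumptions rather than the library's interface.

import java.util.Arrays;

/**
 * A minimal sketch of the MapCollective idea: workers compute partial
 * updates to the shared model (K-means centroids) from their own data
 * partitions, then an allreduce-style collective merges the partials so
 * every worker sees the same new model. Workers are simulated
 * sequentially in one process here; in Harp they would be Hadoop map
 * tasks synchronized by the collective communication layer.
 */
public class MapCollectiveKMeansSketch {

    // One worker's contribution for a single iteration: per-centroid
    // coordinate sums and point counts from its local partition.
    static class PartialModel {
        final double[][] sums;   // [k][dim] coordinate sums
        final int[] counts;      // [k] points assigned to each centroid

        PartialModel(int k, int dim) {
            sums = new double[k][dim];
            counts = new int[k];
        }
    }

    // "Map" phase: assign local points to the nearest centroid and
    // accumulate partial sums. No communication happens here.
    static PartialModel localUpdate(double[][] points, double[][] centroids) {
        PartialModel p = new PartialModel(centroids.length, centroids[0].length);
        for (double[] x : points) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.length; c++) {
                double d = 0;
                for (int j = 0; j < x.length; j++) {
                    double diff = x[j] - centroids[c][j];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = c; }
            }
            for (int j = 0; j < x.length; j++) p.sums[best][j] += x[j];
            p.counts[best]++;
        }
        return p;
    }

    // "Collective" phase: an allreduce over the partial models. Summing
    // the partials and recomputing centroid means gives every worker an
    // identical model, i.e. the fine-grained model synchronization the
    // abstract describes.
    static double[][] allreduce(PartialModel[] partials, int k, int dim) {
        double[][] merged = new double[k][dim];
        int[] counts = new int[k];
        for (PartialModel p : partials) {
            for (int c = 0; c < k; c++) {
                counts[c] += p.counts[c];
                for (int j = 0; j < dim; j++) merged[c][j] += p.sums[c][j];
            }
        }
        for (int c = 0; c < k; c++)
            for (int j = 0; j < dim; j++)
                if (counts[c] > 0) merged[c][j] /= counts[c];
        return merged;
    }

    public static void main(String[] args) {
        // Two simulated workers, each holding a partition of 1-D points.
        double[][][] partitions = { {{1.0}, {1.2}}, {{8.8}, {9.0}} };
        double[][] centroids = { {0.0}, {10.0} };  // initial model, k = 2

        for (int iter = 0; iter < 3; iter++) {
            PartialModel[] partials = new PartialModel[partitions.length];
            for (int w = 0; w < partitions.length; w++)
                partials[w] = localUpdate(partitions[w], centroids);
            centroids = allreduce(partials, 2, 1);
            System.out.println("iter " + iter + ": " + Arrays.deepToString(centroids));
        }
    }
}

Note that what travels in the collective step is the model (centroid sums and counts), not the input points. This is precisely the continually communicated model data that distinguishes the MapCollective model from input-caching approaches such as iterative MapReduce, where the communication cost scales with the size of the shared model rather than the input.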