IEEE International Congress on Big Data

YinMem: A distributed parallel indexed in-memory computation system for large scale data analytics

Abstract

Machine learning and graph analytics typically process data iteratively, reading the same data multiple times and sharing intermediate results across the worker nodes in a cluster. Hadoop MapReduce and Spark are two popular open-source cluster computing frameworks for large-scale data analytics. Apache Spark is currently the state-of-the-art in-memory computation model, extending MapReduce by transforming data into resilient distributed datasets (RDDs) stored in memory. One limitation of Spark, however, is that data transformation and distribution are implicitly managed by HDFS. Data locality is not guaranteed for iterative machine learning algorithms that read the same data multiple times. For example, data needed for operations on one worker node may reside in RDDs stored on other worker nodes. The resulting data shuffling becomes a bottleneck when such RDDs are read iteratively. We propose YinMem, a parallel distributed indexed in-memory computation system that bridges the gap between the Hadoop ecosystem and HPC by replacing MapReduce with MPI while retaining the advantages of distributed data storage. YinMem achieves fair load balancing prior to computation on large sparse matrices by scheduling and distributing indexed data from a NoSQL database to the RAM of the worker nodes. YinMem adopts Alluxio as its in-memory storage system, enabling efficient sharing of intermediate results. Preliminary results show that YinMem achieves a 3× speedup over Spark when computing the eigenvalues and eigenvectors of a 16-million-scale sparse matrix.
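
To make the computation pattern concrete, the following is a minimal sketch, not the authors' implementation, of the iterative, data-local workload the abstract describes: each MPI rank keeps an indexed block of sparse-matrix rows resident in its own memory and repeatedly multiplies it against a shared vector, exchanging only small collective messages instead of MapReduce-style shuffles. The row partitioning, the toy random matrix, and the use of plain power iteration with mpi4py and SciPy are illustrative assumptions; YinMem's scheduling of indexed data from a NoSQL store and its Alluxio-backed sharing of intermediate results are not modeled here.

# Illustrative sketch only: each rank keeps its indexed block of sparse rows in local
# RAM and iterates with MPI collectives, so the matrix itself is never re-shuffled
# between iterations. The matrix contents, the block partitioning, and the
# power-iteration solver are assumptions for demonstration, not YinMem's pipeline.
import numpy as np
import scipy.sparse as sp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 4000                                              # global matrix dimension (toy size)
counts = [n // size + (1 if r < n % size else 0) for r in range(size)]
displs = [sum(counts[:r]) for r in range(size)]       # starting row of each rank's block
lo, hi = displs[rank], displs[rank] + counts[rank]    # this rank's contiguous row range

# In YinMem the indexed rows would be loaded from distributed storage; here each rank
# simply generates its own block so the example is self-contained.
local_A = sp.random(hi - lo, n, density=1e-2, format="csr", random_state=rank)

# Power iteration for the dominant eigenvalue: multiply the local row block by the
# shared vector, accumulate the Rayleigh quotient with an allreduce, and reassemble
# the full product vector with Allgatherv.
x = np.ones(n) / np.sqrt(n)
lam = 0.0
for _ in range(50):
    y_local = local_A @ x                                        # uses local rows only
    lam = comm.allreduce(float(x[lo:hi] @ y_local), op=MPI.SUM)  # x^T A x, with ||x|| = 1
    y = np.empty(n)
    comm.Allgatherv(y_local, [y, counts, displs, MPI.DOUBLE])    # rebuild the shared vector
    x = y / np.linalg.norm(y)                                    # y is replicated; normalize locally

if rank == 0:
    print("estimated dominant eigenvalue:", lam)

Run with, for example, mpiexec -n 4 python yinmem_sketch.py (the script name is hypothetical). The point of the pattern is that only the length-n vector moves between nodes on each iteration, while the far larger row blocks stay resident in each worker's RAM.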