IEEE International Congress on Big Data

YinMem: A distributed parallel indexed in-memory computation system for large scale data analytics

Abstract

Machine learning and graph analytics typically process data iteratively, reading the same data multiple times and sharing intermediate results across the worker nodes in a cluster. Hadoop MapReduce and Spark are two popular open-source cluster computing frameworks for large-scale data analytics. Apache Spark is currently the state-of-the-art in-memory computation model, extending MapReduce by transforming data into resilient distributed datasets (RDDs) stored in memory. One limitation of Spark, however, is that data transformation and distribution are implicitly managed by HDFS. Data locality is not guaranteed for iterative machine learning algorithms that read the same data multiple times. For example, data needed for operations on one worker node may reside in RDDs stored on other worker nodes. The resulting data shuffling becomes a bottleneck when such RDDs are read iteratively. We propose YinMem, a parallel distributed indexed in-memory computation system that bridges the gap between the Hadoop ecosystem and HPC by replacing MapReduce with MPI while retaining the advantages of distributed data storage. YinMem achieves fair load balancing prior to computation on large sparse matrices by scheduling and distributing indexed data from a NoSQL database to the RAM of the worker nodes. YinMem adopts Alluxio as its in-memory storage system, enabling efficient sharing of intermediate results. Preliminary results show that YinMem achieves a 3× speedup over Spark when computing the eigenvalues and eigenvectors of a 16-million-scale sparse matrix.
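
To make the computation pattern concrete, the following is a minimal sketch, not the authors' implementation, of the iterative, data-local workload the abstract describes: each MPI rank keeps an indexed block of sparse-matrix rows resident in its own memory and repeatedly multiplies it against a shared vector, exchanging only small collective messages instead of MapReduce-style shuffles. The row partitioning, the toy random matrix, and the use of plain power iteration with mpi4py and SciPy are illustrative assumptions; YinMem's scheduling of indexed data from a NoSQL store and its Alluxio-backed sharing of intermediate results are not modeled here.

# Illustrative sketch only: each rank keeps its indexed block of sparse rows in local
# RAM and iterates with MPI collectives, so the matrix itself is never re-shuffled
# between iterations. The matrix contents, the block partitioning, and the
# power-iteration solver are assumptions for demonstration, not YinMem's pipeline.
import numpy as np
import scipy.sparse as sp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 4000                                              # global matrix dimension (toy size)
counts = [n // size + (1 if r < n % size else 0) for r in range(size)]
displs = [sum(counts[:r]) for r in range(size)]       # starting row of each rank's block
lo, hi = displs[rank], displs[rank] + counts[rank]    # this rank's contiguous row range

# In YinMem the indexed rows would be loaded from distributed storage; here each rank
# simply generates its own block so the example is self-contained.
local_A = sp.random(hi - lo, n, density=1e-2, format="csr", random_state=rank)

# Power iteration for the dominant eigenvalue: multiply the local row block by the
# shared vector, accumulate the Rayleigh quotient with an allreduce, and reassemble
# the full product vector with Allgatherv.
x = np.ones(n) / np.sqrt(n)
lam = 0.0
for _ in range(50):
    y_local = local_A @ x                                        # uses local rows only
    lam = comm.allreduce(float(x[lo:hi] @ y_local), op=MPI.SUM)  # x^T A x, with ||x|| = 1
    y = np.empty(n)
    comm.Allgatherv(y_local, [y, counts, displs, MPI.DOUBLE])    # rebuild the shared vector
    x = y / np.linalg.norm(y)                                    # y is replicated; normalize locally

if rank == 0:
    print("estimated dominant eigenvalue:", lam)

Run with, for example, mpiexec -n 4 python yinmem_sketch.py (the script name is hypothetical). The point of the pattern is that only the length-n vector moves between nodes on each iteration, while the far larger row blocks stay resident in each worker's RAM.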