Journal: Big Data Research

Programming with BIG Data in R: Scaling Analytics from One to Thousands of Nodes


Abstract

We present a tutorial overview showing how to achieve scalable performance with R. We do so using several package extensions, including those from the pbdR project. These packages consist of high-performance, high-level interfaces to, and extensions of, MPI, PBLAS, ScaLAPACK, I/O libraries, profiling libraries, and more. While these libraries shine brightest on large distributed platforms, they also work rather well on small clusters and often, surprisingly, even on a laptop with only two cores. Our tutorial begins with recommendations on how to get more performance out of your R code before considering parallel implementations. Because R is a high-level language, a single function call can hide a deep hierarchy of operations, which for big data can easily lead to inefficiency. Profiling is an important tool for understanding the performance of R code, for both serial and parallel improvements. The pbdR packages provide a highly scalable capability for developing novel distributed data analysis algorithms. This level of scalability is unmatched in other analysis software. Interactive speeds (seconds) are achieved for complex analysis algorithms on data of 100 GB and more. This is possible because the interfaces add little overhead to the scalable libraries and their extensions. Furthermore, this is often achieved with little or no change to serial R code. Our overview includes code examples of varying complexity, illustrating how to read data in parallel, how to change a serial code into a distributed parallel code, and how to engage distributed matrix computation from within R.
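
The abstract's advice to tune serial R code before parallelizing can be illustrated with a small base-R sketch. It is not taken from the tutorial: the functions `center_loop` and `center_vec`, the matrix size, and the choice of column centering as the task are all assumptions made for illustration. It uses `system.time` as a coarse timer; a real profiling session would more likely use a sampling profiler such as base R's `Rprof`.

```r
# Illustrative sketch only; names and sizes are hypothetical, not from the paper.

# Center each column of a matrix with explicit element-wise loops.
# Every iteration goes through R's interpreter, so this is slow.
center_loop <- function(x) {
  out <- x
  for (j in seq_len(ncol(x))) {
    m <- mean(x[, j])
    for (i in seq_len(nrow(x))) {
      out[i, j] <- x[i, j] - m
    }
  }
  out
}

# The same computation using vectorized base-R functions, which do
# the per-element work in compiled code.
center_vec <- function(x) {
  sweep(x, 2, colMeans(x))
}

set.seed(1)
x <- matrix(rnorm(2000 * 50), nrow = 2000, ncol = 50)

# Coarse timing of both versions (values depend on the machine).
t_loop <- system.time(res_loop <- center_loop(x))
t_vec  <- system.time(res_vec  <- center_vec(x))

# Both versions should agree up to floating-point tolerance.
same_result <- isTRUE(all.equal(res_loop, res_vec))

print(t_loop[["elapsed"]])
print(t_vec[["elapsed"]])
print(same_result)
```

Checking that the rewritten function reproduces the original result, as `all.equal` does here, is a cheap safeguard before moving on to a parallel version.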