
A Data Skew Oriented Reduce Placement Algorithm Based on Sampling


Abstract

Because of frequent disk I/O and large data transfers among racks and physical nodes, intermediate data communication has become the most important performance bottleneck in most running Hadoop systems. This paper proposes a reduce placement algorithm called CORP that schedules related map and reduce tasks on nearby nodes of clusters or racks to improve data locality. Because the number of keys cannot be counted until the input data are processed by map tasks, the paper applies a reservoir algorithm to sample the input data, so that the sampled distribution of keys/values approximates that of the original data. Based on the distribution matrix of the intermediate results in each partition, and by calculating the distance and cost matrices of cross-node communication, the related map and reduce tasks can be scheduled to relatively nearby physical nodes for data locality. We implement CORP in Hadoop 2.4.0 and evaluate its performance using three widely used benchmarks: Sort, Grep, and Join. In these experiments, an evaluation model is proposed for selecting appropriate sample rates; it jointly considers the importance of cost, effect, and variance in sampling. Experimental results show that CORP not only improves the balance of reduce tasks effectively but also decreases job execution time because of the lower inner data communication. Compared with some other reduce scheduling algorithms, the average data transmission of the entire system on the core switch is reduced substantially.
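The abstract names a reservoir algorithm for sampling the input but does not say which variant is used. As a minimal sketch, assuming the classic single-pass reservoir scheme (Algorithm R) and records shaped as `(key, value)` pairs, the idea of estimating the key distribution from a fixed-size sample might look as follows. The function names `reservoir_sample` and `estimate_key_distribution` are hypothetical and are not taken from CORP or Hadoop.

```python
import random
from collections import Counter


def reservoir_sample(records, k, rng=None):
    """Return up to k records drawn uniformly from an iterable of unknown length.

    Assumption: classic Algorithm R; the paper's exact variant is not stated.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


def estimate_key_distribution(sample):
    """Estimate the relative frequency of each key from sampled (key, value) pairs."""
    counts = Counter(key for key, _ in sample)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical skewed input: key "a" is nine times as frequent as key "b".
    records = [("a", i) for i in range(900)] + [("b", i) for i in range(100)]
    sample = reservoir_sample(records, 50, random.Random(0))
    print(estimate_key_distribution(sample))
```

A fixed-size reservoir keeps memory use constant regardless of input size, which is why this kind of sampling suits a setting where key counts are unknown before the map phase runs.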

Bibliographic information

  • Source
    Cloud Computing, IEEE Transactions on | 2020, Issue 4 | pp. 1149-1161 | 13 pages
  • Author affiliations

    Hunan Univ Coll Informat Sci & Engn Changsha 410082 Hunan Peoples R China|Natl Supercomp Ctr Changsha Changsha 410082 Hunan Peoples R China;

    Hunan Univ Coll Informat Sci & Engn Changsha 410082 Hunan Peoples R China|Natl Supercomp Ctr Changsha Changsha 410082 Hunan Peoples R China;

    Hunan Univ Coll Informat Sci & Engn Changsha 410082 Hunan Peoples R China|Natl Supercomp Ctr Changsha Changsha 410082 Hunan Peoples R China;

    Hunan Univ Coll Informat Sci & Engn Changsha 410082 Hunan Peoples R China|Natl Supercomp Ctr Changsha Changsha 410082 Hunan Peoples R China|SUNY Coll New Paltz Dept Comp Sci New Paltz NY 12561 USA;

  • Indexing information
  • Original format: PDF
  • Text language: English
  • Chinese Library Classification
  • Keywords

    Data sampling; data skew; inner communication; MapReduce; reduce placement;

