Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects.

Abstract

Accelerators (such as NVIDIA GPUs) and coprocessors (such as Intel MIC/Xeon Phi) are fueling the growth of next-generation ultra-scale systems with high compute density and high performance per watt. Application developers use a hierarchy of programming models to extract maximum performance from these heterogeneous systems.

Computation and communication overlap has been a critical requirement for applications to achieve peak performance on large-scale systems. Communication overheads have a magnified impact on heterogeneous clusters because of their higher compute density and, hence, the greater waste of compute power. Modern interconnects like InfiniBand, with their Remote DMA capabilities, enable asynchronous progress of communication, freeing up the cores to do useful computation. MPI and PGAS models offer lightweight, one-sided communication primitives that minimize process synchronization overheads and enable better computation and communication overlap.

This dissertation has targeted several of these challenges for programming on GPU and Intel MIC clusters. Our work with MVAPICH2-GPU enabled the use of MPI in a unified manner for communication from host and GPU device memories. It takes advantage of the unified virtual addressing (UVA) provided by CUDA. We proposed designs in the MVAPICH2-GPU runtime that significantly improve the performance of internode and intranode GPU-GPU communication by pipelining and overlapping memory, PCIe, and network transfers. We take advantage of CUDA features such as IPC, GPUDirect RDMA, and CUDA kernels to further reduce communication overheads. MVAPICH2-GPU improves programmability by removing the need for developers to combine CUDA and MPI for GPU-GPU communication, while improving performance through runtime-level optimizations that are transparent to the user. We have shown up to 69% and 45% improvements in point-to-point latency for data movement with 4-byte and 4-MB messages, respectively. Likewise, the solutions improve bandwidth by 2x and 56% for 4-KByte and 64-KByte messages, respectively. Our work has been released as part of the MVAPICH2 packages, making it the first MPI library to support direct GPU-GPU communication. It is currently deployed and used on several large GPU clusters across the world, including Tsubame 2.0 and Keeneland.

We proposed novel extensions to the OpenSHMEM PGAS model that enable unified communication from host and GPU memories. We present designs for optimized internode and intranode one-sided communication on GPU clusters, using asynchronous threads and DMA-based techniques. The proposed extensions, coupled with an efficient runtime, improve 4-byte shmem_getmem latency by 90%, 40%, and 17% for intra-IOH, inter-IOH, and inter-node GPU configurations with CUDA, respectively. They improve the performance of the Stencil2D and BFS kernels by 65% and 12% on clusters of 192 and 96 GPUs, respectively.

Through MVAPICH2-MIC, we proposed designs for an efficient MPI runtime on clusters with Intel Xeon Phi coprocessors. These designs improve the performance of intra-MIC, intra-node, and inter-node communication in various cluster configurations, while hiding the system complexity from the user. Our designs take advantage of SCIF, Intel's low-level communication API, in addition to standard communication channels such as shared memory and IB verbs, to deliver substantial gains in the performance of the MVAPICH2 MPI library.
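To make the unified host/GPU communication model described above for MVAPICH2-GPU concrete, the following is a minimal CUDA-aware MPI sketch: a device pointer obtained from cudaMalloc is passed directly to MPI_Send/MPI_Recv, and the runtime detects (via UVA) that the buffer lives in GPU memory and handles the staging and pipelining internally. This is an illustrative example, not code from the dissertation; the message size, tag, and two-rank setup are assumptions, and error checking is omitted for brevity.

    /* Illustrative CUDA-aware MPI point-to-point exchange between two ranks.
     * The GPU buffer is passed directly to MPI; a CUDA-aware runtime such as
     * MVAPICH2-GPU recognizes the device pointer and performs the memory,
     * PCIe, and network transfers on the user's behalf. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const size_t nbytes = 4 * 1024 * 1024;  /* 4 MB message (illustrative) */
        void *d_buf = NULL;
        cudaMalloc(&d_buf, nbytes);             /* buffer resides in GPU memory */

        if (rank == 0) {
            /* Device pointer goes straight into MPI_Send; no explicit
             * cudaMemcpy staging through host memory is required. */
            MPI_Send(d_buf, (int)nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, (int)nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

Launched with, for example, mpirun -np 2 on GPU-equipped nodes; without a CUDA-aware MPI build, the same exchange would require the application to copy the data to host buffers before and after each MPI call.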
PRISM, a proxy-based multi-channel design in MVAPICH2-MIC, allows applications to overcome the performance bottlenecks imposed by state-of-the-art processor architectures and to extract the full compute potential of the MIC coprocessors. The proposed designs deliver up to a 70% improvement in point-to-point latency and more than a 6x improvement in peak uni-directional bandwidth from the Xeon Phi to the host. Using PRISM, we improve inter-node latency between MICs by up to 65% and bandwidth by up to 5x. PRISM improves the performance of the MPI_Alltoall operation by up to 65% with 256 processes. It improves the performance of a 3D Stencil communication kernel and the P3DFFT library by 56% and 22% with 1024 and 512 processes, respectively.

We have shown the potential benefits of using MPI one-sided communication semantics to overlap computation and communication in a real-world seismic modeling application, AWP-ODC. We have shown a 12% improvement in overall application performance on 4,096 cores. This effort was also part of the application's entry as a Gordon Bell finalist at SC'2010.

We demonstrated the potential performance benefits of using one-sided communication semantics on GPU clusters. We presented an efficient design for the MPI-3 RMA model on NVIDIA GPU clusters with GPUDirect RDMA and proposed minor extensions to the model that can further reduce synchronization overheads. The proposed extension to the RMA model enables an inter-node ping-pong latency of 2.78 usec between GPUs, a 60% improvement over the latency offered by send/recv operations. One-sided communication provides 2x the message rate achieved using MPI Send/Recv operations. One-sided semantics improve the latency of a 3DStencil communication kernel by up to 27%. (Abstract shortened by UMI.)
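As a companion illustration of the one-sided (RMA) semantics discussed above, here is a minimal MPI-3 sketch using MPI_Win_create and a passive-target MPI_Put: the target rank posts no matching receive and does not participate in the transfer, which is what reduces synchronization overhead and leaves its cores free for computation. The window size, ranks, and payload are assumptions for illustration only, and the GPUDirect-RDMA-specific setup from the dissertation is not shown.

    /* Illustrative MPI-3 one-sided (RMA) put with passive-target locking.
     * Rank 0 writes into rank 1's exposed window without rank 1 posting a
     * receive or otherwise taking part in the data movement. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1024;                        /* illustrative size */
        double *buf = malloc(count * sizeof(double));
        for (int i = 0; i < count; i++)
            buf[i] = (double)rank;                     /* illustrative payload */

        /* Every process exposes its local buffer as an RMA window. */
        MPI_Win win;
        MPI_Win_create(buf, count * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        if (rank == 0) {
            /* Passive-target epoch: lock rank 1's window, put the data,
             * and unlock; the unlock completes the transfer at the target. */
            MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
            MPI_Put(buf, count, MPI_DOUBLE, 1, 0, count, MPI_DOUBLE, win);
            MPI_Win_unlock(1, win);
        }

        MPI_Barrier(MPI_COMM_WORLD);  /* make the put visible before teardown */
        MPI_Win_free(&win);
        free(buf);
        MPI_Finalize();
        return 0;
    }

The key property is that only the origin process drives the data movement, which matches the overlap argument in the abstract; in the dissertation's setting the window would cover GPU-accessible memory serviced via GPUDirect RDMA.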

Record details

  • Author

    Potluri, Sreeram

  • Affiliation

    The Ohio State University

  • Degree grantor: The Ohio State University
  • Subjects: Computer science; Computer engineering
  • Degree: Ph.D.
  • Year: 2014
  • Pages: 210 p.
  • Total pages: 210
  • Format: PDF
  • Language: eng
