Conference: International Conference on Parallel Processing

Toward Harnessing DOACROSS Parallelism for Multi-GPGPUs

Abstract

To exploit the full potential of GPGPUs for general-purpose computing, the DOACROSS parallelism abundant in scientific and engineering applications must be harnessed. However, the cross-iteration data dependences in DOACROSS loops are an obstacle to executing their computations concurrently with a massive number of fine-grained threads. This work focuses on iterative PDE solvers rich in DOACROSS parallelism to identify optimization principles and strategies that allow them to be mapped efficiently to GPGPUs. Our main finding is that certain DOACROSS loops can be accelerated further on GPGPUs if they are algorithmically restructured (by a domain expert) to be more amenable to GPGPU parallelization, judiciously optimized (by the compiler), and carefully tuned (by a performance-tuning tool). We substantiate this finding with a case study, presenting a new parallel SSOR method that admits more efficient data-parallel SIMD execution than red-black SOR on GPGPUs. Our solution is obtained unconventionally: we start from a K-layer SSOR method and parallelize it with a non-dependence-preserving scheme consisting of a new domain decomposition technique followed by a generalized loop tiling. Despite its relatively slower convergence, our new method outperforms red-black SOR by striking a better balance between data reuse and parallelism and by trading convergence rate for SIMD parallelism. Our experimental results highlight the importance of synergy between domain expertise, compiler optimization, and performance tuning in maximizing the performance of applications, particularly PDE-based DOACROSS loops, on GPGPUs.
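For readers unfamiliar with the terms, the following is a minimal sketch of why a plain SOR sweep is a DOACROSS loop and how the red-black SOR baseline named in the abstract removes the dependence. It does not show the paper's new SSOR method. The test problem (a 2D Laplace equation on a square grid with the top boundary held at 1.0), the function names, and the relaxation factor 1.5 are all assumptions chosen for illustration.

```python
def make_grid(n):
    """Hypothetical test problem: n x n grid, top boundary 1.0, rest 0.0."""
    u = [[0.0] * n for _ in range(n)]
    u[0] = [1.0] * n
    return u


def sor_lexicographic(u, omega=1.5, sweeps=1):
    """SOR sweeps over the interior points in natural (row-major) order."""
    n = len(u)
    for _ in range(sweeps):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                # u[i-1][j] and u[i][j-1] were already updated in this
                # sweep: a cross-iteration dependence, so the loop nest
                # is DOACROSS rather than fully parallel.
                gs = 0.25 * (u[i - 1][j] + u[i + 1][j]
                             + u[i][j - 1] + u[i][j + 1])
                u[i][j] += omega * (gs - u[i][j])
    return u


def sor_red_black(u, omega=1.5, sweeps=1):
    """SOR sweeps with red-black (checkerboard) ordering."""
    n = len(u)
    for _ in range(sweeps):
        for color in (0, 1):
            # A point of one color reads only neighbors of the other
            # color, so all updates within one color phase are mutually
            # independent and could run as parallel threads.
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    if (i + j) % 2 == color:
                        gs = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                     + u[i][j - 1] + u[i][j + 1])
                        u[i][j] += omega * (gs - u[i][j])
    return u
```

Both orderings converge to the same discrete solution on this problem, but they visit the points in a different order, so red-black reordering does not preserve the dependences of the original sweep. In that sense it is a simple example of the kind of algorithmic restructuring the abstract describes.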