International Journal of High Performance Computing Applications

A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems



Abstract

Among the algorithms that are likely to play a major role in future exascale computing, the fast multipole method (FMM) appears as a rising star. Our recent work demonstrated scaling of an FMM on GPU clusters, with problem sizes on the order of billions of unknowns. That work led to an extremely parallel FMM, scaling to thousands of GPUs or tens of thousands of CPUs. This paper reports on a campaign of performance tuning and scalability studies using multi-core CPUs, on the Kraken supercomputer. All kernels in the FMM were parallelized using OpenMP, and a test using 10^7 particles randomly distributed in a cube showed 78% efficiency on 8 threads. Tuning of the particle-to-particle kernel using single-instruction multiple-data (SIMD) instructions resulted in a 4× speed-up of the overall algorithm on single-core tests with 10^3 to 10^7 particles. Parallel scalability was studied in both strong and weak scaling. The strong scaling test used 10^8 particles and resulted in 93% parallel efficiency on 2048 processes for the non-SIMD code and 54% for the SIMD-optimized code (which was still 2× faster). The weak scaling test used 10^6 particles per process and resulted in 72% efficiency on 32,768 processes, with the largest calculation taking about 40 seconds to evaluate more than 32 billion unknowns. This work builds up evidence for our view that FMM is poised to play a leading role in exascale computing, and we end the paper with a discussion of the features that make it a particularly favorable algorithm for the emerging heterogeneous and massively parallel architectural landscape. The code is open for unrestricted use under the MIT license.
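To make the two optimizations named in the abstract concrete, here is a minimal sketch of a direct particle-to-particle (P2P) Laplace kernel with the target loop threaded by OpenMP, in the spirit of the paper's kernel-level parallelization. The Particle struct and the p2p signature are illustrative assumptions, not the authors' actual data structures.

```cpp
#include <cmath>
#include <vector>

// Illustrative particle type; the paper's implementation uses its own layout.
struct Particle {
  double x, y, z; // position
  double q;       // source strength
  double p;       // potential accumulator
};

// Direct P2P kernel for the Laplace potential, p_i = sum_j q_j / |r_i - r_j|,
// with the target loop threaded by OpenMP as in the paper's approach of
// parallelizing every FMM kernel.
void p2p(std::vector<Particle>& targets, const std::vector<Particle>& sources) {
  const long long nt = static_cast<long long>(targets.size());
#pragma omp parallel for
  for (long long i = 0; i < nt; ++i) {
    double pi = 0.0;
    for (const Particle& s : sources) {
      const double dx = targets[i].x - s.x;
      const double dy = targets[i].y - s.y;
      const double dz = targets[i].z - s.z;
      const double r2 = dx * dx + dy * dy + dz * dz;
      if (r2 > 0.0) pi += s.q / std::sqrt(r2); // skip self-interaction
    }
    targets[i].p = pi;
  }
}
```

The reported 4× single-core speed-up comes from SIMD tuning of this same inner loop. Below is a sketch of what such tuning can look like with SSE intrinsics, assuming single precision, a structure-of-arrays layout (sx, sy, sz, sq), and a source count that is a multiple of 4; it is an illustrative reconstruction, not the paper's tuned kernel.

```cpp
#include <xmmintrin.h> // SSE intrinsics
#include <cstddef>

// Accumulates the potential at one target from n sources, 4 sources per
// iteration, using the fast approximate reciprocal square root.
float p2p_simd(float tx, float ty, float tz,
               const float* sx, const float* sy, const float* sz,
               const float* sq, std::size_t n) {
  const __m128 xi = _mm_set1_ps(tx), yi = _mm_set1_ps(ty), zi = _mm_set1_ps(tz);
  __m128 acc = _mm_setzero_ps();
  for (std::size_t j = 0; j < n; j += 4) {
    const __m128 dx = _mm_sub_ps(xi, _mm_loadu_ps(sx + j));
    const __m128 dy = _mm_sub_ps(yi, _mm_loadu_ps(sy + j));
    const __m128 dz = _mm_sub_ps(zi, _mm_loadu_ps(sz + j));
    const __m128 r2 = _mm_add_ps(_mm_mul_ps(dx, dx),
                      _mm_add_ps(_mm_mul_ps(dy, dy), _mm_mul_ps(dz, dz)));
    __m128 invr = _mm_rsqrt_ps(r2); // approximate 1/sqrt(r2), 4 lanes at once
    // Mask out zero-distance lanes so the self-interaction contributes nothing.
    invr = _mm_and_ps(invr, _mm_cmpgt_ps(r2, _mm_setzero_ps()));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(sq + j), invr));
  }
  float out[4];
  _mm_storeu_ps(out, acc);
  return out[0] + out[1] + out[2] + out[3];
}
```

Because the P2P kernel is a dense O(n^2) interaction over nearby cells, it is both the dominant cost at typical tree depths and the part of the FMM most amenable to this kind of vectorization, which is why tuning it alone can speed up the whole algorithm severalfold.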
