
On the Trend of Resilience for GPU-Dense Systems


Abstract

Emerging high-performance computing (HPC) systems are trending towards heterogeneous nodes that are dense with accelerators such as GPUs. Such nodes offer higher computational power at lower energy and cost than homogeneous CPU-only nodes. While an accelerator-rich machine reduces the total number of compute nodes required to reach a performance target, each node becomes more susceptible to accelerator failures and must share its intra-node resources among many accelerators. Such failures must be recovered by end-to-end resilience schemes such as checkpoint-restart. However, preserving the large amount of local state held within accelerators for checkpointing incurs significant overhead. This trend reveals a new challenge for resilience in accelerator-dense systems. We study its impact on multi-level checkpointing systems, including those with burst buffers. We quantify system-level efficiency under resilience overheads, sweeping the failure rate, system scale, and GPU density. Our multi-level checkpoint-restart model shows that efficiency begins to drop at a 16:1 GPU-to-CPU ratio in a 3.6 EFLOP system, and that a ratio of 64:1 degrades overall system efficiency by 5%. Furthermore, we quantify the system-level impact of possible design considerations that could mitigate this challenge in GPU-dense systems.
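
The abstract does not give the multi-level model itself. As a rough illustration of why denser nodes can hurt checkpoint-restart efficiency, the sketch below uses a much simpler single-level, first-order model based on Young's approximation for the checkpoint interval. It is not the paper's model: the function names, the restart-cost term, and the assumption that checkpoint time scales linearly with the number of GPUs sharing one node's I/O bandwidth are all hypothetical choices made for illustration.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal checkpoint interval: sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def efficiency(checkpoint_cost_s: float, mtbf_s: float,
               restart_cost_s: float = 0.0) -> float:
    """First-order fraction of time spent on useful work.

    Waste = time writing checkpoints (C / tau) plus expected rework and
    restart per failure ((tau / 2 + R) / M). Only reasonable when C << M.
    """
    tau = young_interval(checkpoint_cost_s, mtbf_s)
    waste = checkpoint_cost_s / tau + (tau / 2.0 + restart_cost_s) / mtbf_s
    return max(0.0, 1.0 - waste)


def checkpoint_cost_s(gpus_per_node: int, state_per_gpu_gb: float,
                      node_bw_gb_per_s: float) -> float:
    """Hypothetical cost model: all GPUs on a node drain their state
    through one shared node-level link, so cost grows with GPU density."""
    return gpus_per_node * state_per_gpu_gb / node_bw_gb_per_s
```

Under these assumptions, raising `gpus_per_node` while holding the per-node bandwidth fixed increases the checkpoint cost linearly, and `efficiency` falls accordingly. A multi-level scheme with burst buffers, as studied in the paper, would need separate cost and failure-rate terms per level.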
