On the Trend of Resilience for GPU-Dense Systems

Abstract

Emerging high-performance computing (HPC) systems are trending towards heterogeneous nodes that are dense with accelerators such as GPUs. Such nodes offer higher computational power at lower energy and cost than homogeneous CPU-only nodes. While an accelerator-rich machine reduces the total number of compute nodes required to achieve a performance target, each node becomes susceptible to accelerator failures and must share its intra-node resources among many accelerators. Such failures must be recovered by end-to-end resilience schemes such as checkpoint-restart. However, preserving the large amount of local state held within accelerators for checkpointing incurs significant overhead. This trend reveals a new challenge for resilience in accelerator-dense systems. We study its impact in multi-level checkpointing systems and with burst buffers. We quantify system-level resilience efficiency while sweeping the failure rate, system scale, and GPU density. Our multi-level checkpoint-restart model shows that efficiency begins to drop at a 16:1 GPU-to-CPU ratio in a 3.6 EFLOP system, and that a ratio of 64:1 degrades overall system efficiency by 5%. Furthermore, we quantify the system-level impact of possible design considerations for resilience in GPU-dense systems that could mitigate this challenge.
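
The trade-off described in the abstract can be illustrated with a rough sketch. The code below is not the paper's multi-level model: it is a single-level, first-order checkpoint-restart estimate that uses Young's approximation for the checkpoint interval. Every function name and parameter value (GPU count, state size per GPU, checkpoint bandwidth, MTBF figures) is a hypothetical placeholder chosen only for illustration, and the results should not be compared with the figures reported in the abstract.

```python
import math


def young_interval(ckpt_time_s, system_mtbf_s):
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * ckpt_time_s * system_mtbf_s)


def efficiency(ckpt_time_s, restart_time_s, system_mtbf_s):
    """First-order fraction of time spent on useful work.

    Waste = checkpoint overhead per interval
          + expected rework (half an interval) and restart per failure.
    """
    tau = young_interval(ckpt_time_s, system_mtbf_s)
    waste = ckpt_time_s / tau + (tau / 2.0 + restart_time_s) / system_mtbf_s
    return max(0.0, 1.0 - waste)


def dense_node_efficiency(
    gpus_per_node,
    total_gpus=100_000,        # hypothetical system size
    state_per_gpu_gb=32.0,     # hypothetical state checkpointed per GPU
    node_ckpt_bw_gbs=10.0,     # hypothetical per-node bandwidth, shared by all GPUs
    gpu_mtbf_h=50_000.0,       # hypothetical MTBF of one GPU
    host_mtbf_h=100_000.0,     # hypothetical MTBF of the rest of the node
):
    """Estimate efficiency for a fixed GPU count packed at a given density."""
    nodes = total_gpus / gpus_per_node
    # Failures per hour for one node: host components plus every GPU in it.
    node_failure_rate = 1.0 / host_mtbf_h + gpus_per_node / gpu_mtbf_h
    system_mtbf_s = 3600.0 / (nodes * node_failure_rate)
    # All GPUs in a node drain their state through the same shared bandwidth.
    ckpt_time_s = gpus_per_node * state_per_gpu_gb / node_ckpt_bw_gbs
    # Assume restart takes as long as reading the checkpoint back.
    return efficiency(ckpt_time_s, ckpt_time_s, system_mtbf_s)


if __name__ == "__main__":
    for density in (1, 4, 16, 64):
        print(f"{density:>2} GPUs/node: efficiency ~ {dense_node_efficiency(density):.3f}")
```

Under these assumed parameters, denser nodes lower the node count but lengthen each checkpoint, because more accelerator state shares the same intra-node bandwidth, so the estimated efficiency falls as density rises. A multi-level scheme or a burst buffer would change the checkpoint-time term and is not modelled here.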


Bibliographic details

Source
2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks – Supplemental Volume | 2019 | pp. 29-34 | 6 pages
Conference location: Portland (US)
Authors
Kyushick Lee; Michael B. Sullivan; Siva Kumar Sastry Hari; Timothy Tsai; Stephen W. Keckler; Mattan Erez;
Author affiliations

University of Texas at Austin;

NVIDIA;

NVIDIA;

NVIDIA;

NVIDIA;

University of Texas at Austin;

Original format: PDF
Language: English
Keywords
Graphics processing units; Checkpointing; Resilience; Bandwidth; Acceleration; Market research; Reliability;

