Conference: IEEE International Conference on Cluster Computing

Towards Building Resilient Scientific Applications: Resilience Analysis on the Impact of Soft Error and Transient Error Tolerance with the CLAMR Hydrodynamics Mini-App


Abstract

In this paper, we present a resilience analysis of the impact of soft errors on CLAMR, a hydrodynamics mini-app for high-performance computing (HPC). Leveraging the law of conservation of mass, we design a fault detection mechanism and a checkpoint/restart fault tolerance approach to enhance the resilience of CLAMR. Overall, our approach can detect up to 88.3% of faults that propagate into silent data corruption (SDC) or crashes, with minimal (less than 1%) overhead for the optimal configuration. We show that CLAMR's fault tolerance depends on when a fault is injected into the simulation, and we also evaluate how the frequency of detection and checkpointing affects performance.
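The idea of pairing a mass-conservation check with checkpoint/restart can be illustrated with a small sketch. The code below is not CLAMR's implementation: it assumes a toy one-dimensional periodic diffusion update as a stand-in for the hydrodynamics solver, and every name in it (`flip_bit`, `step`, `mass_ok`, `run`, the tolerance `rel_tol`, the interval `check_every`) is hypothetical. It injects a single transient bit flip, checks total mass at a fixed interval, and rolls back to the last verified checkpoint when the check fails.

```python
import struct


def flip_bit(value, bit):
    """Flip one bit of a float64 to emulate a soft error (hypothetical helper)."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def step(h, nu=0.25):
    """One diffusion step on a periodic 1D grid (toy stand-in, not CLAMR).

    The neighbor differences cancel when summed over the periodic grid,
    so total mass is conserved up to floating-point rounding.
    """
    n = len(h)
    return [h[i] + nu * (h[(i + 1) % n] - 2.0 * h[i] + h[i - 1]) for i in range(n)]


def mass_ok(h, reference_mass, rel_tol=1e-9):
    """Detector: total mass must match the initial mass within rel_tol.

    A NaN or infinite total also fails this comparison, so it counts as detected.
    """
    return abs(sum(h) - reference_mass) <= rel_tol * abs(reference_mass)


def run(steps, check_every, fault_step=None, fault_cell=0, fault_bit=52):
    """Advance the toy simulation with periodic detection and checkpointing."""
    h = [1.0] * 64
    h[32] = 2.0                      # small bump so the state evolves
    reference_mass = sum(h)
    checkpoint = (0, list(h))        # last state that passed the mass check
    detections = 0
    t = 0
    while t < steps:
        h = step(h)
        t += 1
        if t == fault_step:
            h[fault_cell] = flip_bit(h[fault_cell], fault_bit)
            fault_step = None        # transient fault: does not recur on re-execution
        if t % check_every == 0:
            if mass_ok(h, reference_mass):
                checkpoint = (t, list(h))
            else:
                detections += 1
                t, h = checkpoint[0], list(checkpoint[1])  # restart from checkpoint
    return h, detections


if __name__ == "__main__":
    clean, _ = run(40, 10)
    recovered, detections = run(40, 10, fault_step=5)
    print("detections:", detections, "recovered matches clean run:", recovered == clean)
```

In this sketch, a smaller `check_every` shortens the amount of work that is redone after a detection but adds more checks and checkpoint copies, which is the kind of trade-off the abstract refers to when it evaluates detection and checkpointing frequency. Bit flips that change the total mass by less than `rel_tol` would pass the check undetected.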
