Venue: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Online-ABFT: An Online Algorithm Based Fault Tolerance Scheme for Soft Error Detection in Iterative Methods



Abstract

Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Large supercomputers are especially susceptible to soft errors because of their large number of components. Soft errors can generally be detected offline through the comparison of the final computation results of two duplicated computations, but this approach often introduces significant overhead. This paper presents Online-ABFT, a simple but efficient online soft error detection technique that can detect soft errors in the widely used Krylov subspace iterative methods in the middle of the program execution so that the computation efficiency can be improved through the termination of the corrupted computation in a timely manner soon after a soft error occurs. Based on a simple verification of orthogonality and residual, Online-ABFT is easy to implement and highly efficient. Experimental results demonstrate that, when this online error detection approach is used together with checkpointing, it improves the time to obtain correct results by up to several orders of magnitude over the traditional offline approach.
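The abstract does not spell out the checks themselves, so the following is only an illustrative sketch of what "verification of orthogonality and residual" could look like for one Krylov method, the conjugate gradient (CG) iteration. It relies on two standard CG properties: the recurrence residual r should equal b - A x, and consecutive search directions should be A-conjugate. The function name `cg_with_checks`, the `inject_at` fault-injection parameter, and the tolerance values are assumptions made for this example and are not taken from the paper.

```python
import math

def matvec(A, x):
    # Dense matrix-vector product on plain Python lists.
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cg_with_checks(A, b, tol=1e-10, max_iter=100, check_tol=1e-6, inject_at=None):
    """CG for a symmetric positive definite A with two consistency checks.

    Returns (x, iterations, detected_at); detected_at is None when no
    check failed. inject_at is a hypothetical test hook that corrupts x
    at the given iteration to simulate a soft error.
    """
    x = [0.0] * len(b)
    r = list(b)          # residual for the initial guess x = 0
    p = list(r)          # first search direction
    rs = dot(r, r)
    for k in range(1, max_iter + 1):
        q = matvec(A, p)
        alpha = rs / dot(p, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        if inject_at == k:
            x[0] += 1.0  # simulated soft error in the solution vector
        rs_new = dot(r, r)

        # Residual check: the recurrence residual should match b - A x.
        true_r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        gap = norm([t - ri for t, ri in zip(true_r, r)]) / norm(b)
        if gap > check_tol:
            return x, k, k

        if math.sqrt(rs_new) < tol:
            return x, k, None  # converged with no check failing

        # Orthogonality check: the new direction should be A-conjugate
        # to the old one, so p_new . (A p) should be close to zero.
        p_new = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        ortho = abs(dot(p_new, q)) / (norm(p_new) * norm(q))
        if ortho > check_tol:
            return x, k, k

        p, rs = p_new, rs_new
    return x, max_iter, None

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
clean = cg_with_checks(A, b)                 # no fault injected
faulty = cg_with_checks(A, b, inject_at=1)   # fault injected at iteration 1
```

In this sketch a failed check simply stops the solve and reports the iteration, which is where a caller could roll back to a checkpoint, in the spirit of the checkpointing combination the abstract mentions. The residual check costs an extra matrix-vector product, so a practical variant would likely run it only every few iterations; that interval is a design choice this example leaves out.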
