Published in: Proceedings of the 2011 ACM international conference on supercomputing

High Performance Linpack Benchmark: A Fault Tolerant Implementation without Checkpointing

Abstract

The probability that a failure will occur before a computation finishes increases with the number of processors used in a high performance computing application. For long-running applications on a large number of processors, fault tolerance is essential to prevent the total loss of all finished computation after a failure. While checkpointing has long been a useful way to tolerate failures, it often introduces considerable overhead, especially when applications modify a large amount of memory between checkpoints and the number of processors is large. In this paper, we propose an algorithm-based recovery scheme for the High Performance Linpack benchmark (which modifies a large amount of memory in each iteration) to tolerate fail-stop failures without checkpointing. Huang and Abraham proved that a checksum added to a matrix is maintained after the matrix is factored. We demonstrate that, for the right-looking LU factorization algorithm, the checksum is also maintained at each step of the computation. Based on this checksum relationship, which holds at every intermediate step, we demonstrate that fail-stop process failures in High Performance Linpack can be tolerated without checkpointing. Because no periodic checkpointing is necessary during computation and no rollback is necessary during recovery, the proposed recovery scheme is highly scalable and has good potential to scale to extreme-scale computing and beyond. Experimental results on the supercomputer Jaguar demonstrate that the fault tolerance overhead introduced by the proposed recovery scheme is negligible.
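
The checksum property described in the abstract can be illustrated with a small serial sketch. This is not the authors' distributed implementation: it assumes a single NumPy array in place of a process grid, skips pivoting (the test matrix is made diagonally dominant so that none is needed), and discards the L multipliers instead of storing them in place. The helper names `lu_steps_with_checksum` and `recover_column` are hypothetical. The sketch appends one checksum column holding each row's sum, runs right-looking elimination over the extended matrix, and checks after every step that the checksum column still equals the sum of the data columns, so a single lost column can be rebuilt without rolling back.

```python
import numpy as np

def lu_steps_with_checksum(a):
    """Yield the working matrix [A^(k) | checksum] after each elimination step."""
    n = a.shape[0]
    # Append one checksum column holding the sum of each row.
    w = np.hstack([a.astype(float), a.sum(axis=1, keepdims=True)])
    for k in range(n - 1):
        # Multipliers for the rows below the pivot (no pivoting in this sketch).
        mult = w[k + 1:, k] / w[k, k]
        # Right-looking update of the trailing rows, checksum column included.
        w[k + 1:, k:] -= np.outer(mult, w[k, k:])
        yield k, w.copy()

def recover_column(w, lost):
    """Rebuild one lost data column from the checksum and the surviving columns."""
    n = w.shape[0]
    survivors = [j for j in range(n) if j != lost]
    return w[:, n] - w[:, survivors].sum(axis=1)

rng = np.random.default_rng(0)
n = 6
# Assumption: a diagonally dominant test matrix, so elimination is safe without pivoting.
a = rng.random((n, n)) + n * np.eye(n)

checksum_held = []
recovery_ok = []
for k, w in lu_steps_with_checksum(a):
    # Invariant: the data columns of every row still sum to the checksum entry.
    checksum_held.append(np.allclose(w[:, :n].sum(axis=1), w[:, n]))
    # Simulate a fail-stop loss of one column at this step, then rebuild it.
    lost = 2
    truth = w[:, lost].copy()
    damaged = w.copy()
    damaged[:, lost] = np.nan
    recovery_ok.append(np.allclose(recover_column(damaged, lost), truth))

print(all(checksum_held), all(recovery_ok))
```

Because each elimination step is a row operation applied to the data and checksum columns alike, the relation is preserved mid-factorization, which is what allows recovery from the current state rather than from an earlier checkpoint.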
