High Performance Linpack Benchmark: A Fault Tolerant Implementation without Checkpointing

Abstract

The probability that a failure will occur before the end of the computation increases as the number of processors used in a high performance computing application increases. For long running applications using a large number of processors, it is essential that fault tolerance be used to prevent a total loss of all finished computations after a failure. While checkpointing has been very useful to tolerate failures for a long time, it often introduces a considerable overhead especially when applications modify a large amount of memory between checkpoints and the number of processors is large. In this paper, we propose an algorithm-based recovery scheme for the High Performance Linpack benchmark (which modifies a large amount of memory in each iteration) to tolerate fail-stop failures without checkpointing. It was proved by Huang and Abraham that a checksum added to a matrix will be maintained after the matrix is factored. We demonstrate that, for the right-looking LU factorization algorithm, the checksum is maintained at each step of the computation. Based on this checksum relationship maintained at each step in the middle of the computation, we demonstrate that fail-stop process failures in High Performance Linpack can be tolerated without checkpointing. Because no periodical checkpoint is necessary during computation and no roll-back is necessary during recovery, the proposed recovery scheme is highly scalable and has a good potential to scale to extreme scale computing and beyond. Experimental results on the supercomputer Jaguar demonstrate that the fault tolerance overhead introduced by the proposed recovery scheme is negligible.
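
The checksum relationship described in the abstract can be illustrated with a small single-process sketch. It is not the paper's implementation: it assumes no pivoting and no data distribution (a real HPL run involves both), it uses one checksum column of row sums, and it uses exact `Fraction` arithmetic so the invariant can be tested with equality. The helper names (`with_row_checksum`, `lu_step`, `checksum_holds`, `recover_entry`) are hypothetical and chosen only for this example.

```python
from fractions import Fraction


def with_row_checksum(a):
    """Append to each row the sum of its entries (one checksum column)."""
    return [[Fraction(x) for x in row] + [Fraction(sum(row))] for row in a]


def lu_step(m, k):
    """One right-looking elimination step (no pivoting) on the augmented matrix."""
    n = len(m)
    for i in range(k + 1, n):
        m[i][k] /= m[k][k]              # store the multiplier l_ik in place
        for j in range(k + 1, n + 1):   # update trailing row, checksum included
            m[i][j] -= m[i][k] * m[k][j]


def checksum_holds(m, k):
    """Check every row checksum after k elimination steps.

    The active part of row i starts at column min(i, k); entries to the left
    hold L multipliers and count as zeros of U or of the trailing matrix.
    """
    n = len(m)
    return all(sum(m[i][min(i, k):n]) == m[i][n] for i in range(n))


def recover_entry(m, i, j, k):
    """Rebuild a lost active entry m[i][j] from the checksum of row i."""
    n = len(m)
    others = sum(m[i][c] for c in range(min(i, k), n) if c != j)
    return m[i][n] - others


# Illustrative matrix (an assumption of this sketch); its pivots are nonzero.
a = [[4, 3, 2], [8, 7, 9], [4, 6, 5]]
m = with_row_checksum(a)
for k in range(len(a)):
    assert checksum_holds(m, k)   # invariant holds before step k
    lu_step(m, k)
assert checksum_holds(m, len(a))  # and after the final step
```

Because the invariant holds after every step, a lost entry can be rebuilt from the surviving entries of its row and the checksum at the point of failure, without rolling back to an earlier state. In the sketch, `recover_entry(m, i, j, k)` does this for a single element after `k` completed steps.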

Bibliographic record

Source
Proceedings of the 2011 ACM International Conference on Supercomputing | 2011 | pp. 162-171 | 10 pages
Conference location: Tucson, AZ (US)
Authors
Teresa Davies; Christer Karlsson; Hui Liu; Chong Ding; Zizhong Chen
Author affiliation
Colorado School of Mines, Golden, CO, USA (all five authors)
Original format: PDF
Language: English
Chinese Library Classification: Computing technology, computer technology
Keywords
high performance linpack benchmark; LU factorization; fault tolerance; algorithm-based recovery
