On-line soft error correction in matrix-matrix multiplication

Abstract

Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. They normally do not interrupt the execution of the affected program, but the affected computation results can no longer be trusted. A well-known technique for correcting soft errors in matrix-matrix multiplication is algorithm-based fault tolerance (ABFT). Although ABFT is much more efficient than triple modular redundancy (TMR), a traditional general-purpose technique for correcting soft errors, both ABFT and TMR detect errors off-line, after the computation has finished. This paper extends the traditional ABFT technique from off-line to on-line, so that soft errors in matrix-matrix multiplication can be detected in the middle of the computation, while the program is executing, and higher efficiency can be achieved by correcting the corrupted computations in a timely manner. Experimental results demonstrate that the proposed technique can correct one error every ten seconds with a negligible (i.e., less than 1%) performance penalty relative to the ATLAS dgemm().
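The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes the classic checksum encoding for ABFT (a column-checksum row appended to A, a row-checksum column appended to B) and an outer-product formulation, under which the checksum relationship holds after every rank-1 update and can therefore be verified in the middle of the computation. The function names, the `check_every` parameter, and the `fault` injection hook are all hypothetical, and only a single corrupted data element per check is handled.

```python
def encode(A, B):
    """Append a column-checksum row to A and a row-checksum column to B."""
    inner = len(B)
    Ac = [row[:] for row in A] + [[sum(row[k] for row in A) for k in range(inner)]]
    Br = [row[:] + [sum(row)] for row in B]
    return Ac, Br


def check_and_correct(C, tol=1e-9):
    """Verify row/column checksums of C; repair one corrupted data element.

    Returns None if the checksums agree, or (i, j) of the repaired element.
    Assumption of this sketch: at most one data element is corrupted.
    """
    m, n = len(C) - 1, len(C[0]) - 1
    row_delta = [sum(C[i][:n]) - C[i][n] for i in range(m)]
    col_delta = [sum(C[i][j] for i in range(m)) - C[m][j] for j in range(n)]
    bad_rows = [i for i, d in enumerate(row_delta) if abs(d) > tol]
    bad_cols = [j for j, d in enumerate(col_delta) if abs(d) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        C[i][j] -= row_delta[i]  # the row discrepancy equals the injected error
        return (i, j)
    raise RuntimeError("error pattern not correctable by this sketch")


def online_abft_matmul(A, B, check_every=1, fault=None):
    """Multiply A by B with checksum verification during the computation.

    fault is an optional (step, i, j, delta) tuple that simulates a soft
    error by adding delta to C[i][j] right after the given rank-1 update.
    Returns the product and a list of (step, i, j) corrections.
    """
    Ac, Br = encode(A, B)
    m, n, inner = len(A), len(B[0]), len(B)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    corrections = []
    for k in range(inner):
        # Rank-1 update: C += Ac[:, k] * Br[k, :], checksums included.
        for i in range(m + 1):
            for j in range(n + 1):
                C[i][j] += Ac[i][k] * Br[k][j]
        if fault is not None and fault[0] == k:
            C[fault[1]][fault[2]] += fault[3]  # simulated soft error
        # On-line check: verify periodically instead of only at the end.
        if (k + 1) % check_every == 0 or k == inner - 1:
            fixed = check_and_correct(C)
            if fixed is not None:
                corrections.append((k,) + fixed)
    return [row[:n] for row in C[:m]], corrections
```

Calling `online_abft_matmul(A, B, fault=(0, 1, 2, 1000))` injects an error after the first update and lets the next check locate and repair it. A larger `check_every` trades detection latency for lower checking overhead; floating-point inputs would need a tolerance chosen to match the expected rounding error.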

Bibliographic record

  • Source
    Journal of Computational Science | 2013, Issue 6 | pp. 465-472 | 8 pages
  • Author affiliations

    Department of Computer Science and Engineering, University of California, Riverside, 900 University Avenue, Riverside, CA 92521, United States;

    Department of Electrical Engineering and Computer Science, Colorado School of Mines, Golden, CO 80401, United States;

    Department of Computer Science and Engineering, University of California, Riverside, 900 University Avenue, Riverside, CA 92521, United States;

    Department of Electrical Engineering and Computer Science, Colorado School of Mines, Golden, CO 80401, United States;

    Department of Electrical Engineering and Computer Science, Colorado School of Mines, Golden, CO 80401, United States;

    Department of Computer Science and Engineering, University of California, Riverside, 900 University Avenue, Riverside, CA 92521, United States;

  • Original format: PDF
  • Text language: English
  • Keywords

    Algorithm-based fault tolerance; Matrix multiplication; Fault tolerant linear algebra; On-line algorithm based fault tolerance;
