Venue: Annual IEEE/IFIP International Conference on Dependable Systems and Networks – Supplemental Volume

Towards Predicting the Impact of Roll-Forward Failure Recovery for HPC Applications



Abstract

Roll-forward recovery schemes on HPC systems implicitly accept higher risk in exchange for a faster time to solution: because such a scheme usually performs a probabilistic repair, the repair itself may cause further failures such as silent data corruptions (SDCs). It is essential for users to be able to reason about the impact of a particular repair exercised by the scheme. Towards this goal, we identify two research questions that aim to determine the outcome of a repair either at the failure point or at the end of the execution. For the former, we propose a promising hybrid approach that combines machine learning and error propagation analysis techniques.
