Published in: 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks – Supplemental Volume

Towards Predicting the Impact of Roll-Forward Failure Recovery for HPC Applications


Abstract

Roll-forward recovery schemes on HPC systems implicitly trade a faster time to solution for higher risk: because such a scheme usually performs a probabilistic repair, the repair itself may cause further failures such as silent data corruptions (SDCs). It is essential for users to be able to reason about the impact of a particular repair exercised by the scheme. Towards this goal, we identify two research questions that aim to determine the outcome of a repair, either at the failure point or at the end of the execution. For the former, we propose a promising hybrid approach that combines machine learning and error propagation analysis techniques.
