Foreign-language conference proceedings: High performance computing systems and applications

Failure Data-Driven Selective Node-Level Duplication to Improve MTTF in High Performance Computing Systems

Abstract

This paper presents our analysis of the failure behavior of large-scale systems using the failure logs collected by Los Alamos National Laboratory on 22 of its computing clusters. We note that not all nodes in these systems show similar failure behavior. Our objective, therefore, was to arrive at an ordering in which nodes are incrementally selected (one by one) for duplication, so as to achieve a target mean time to failure (MTTF) for the system while duplicating the fewest nodes. We arrived at a model for the fault coverage provided by duplicating each node and ordered the nodes according to that coverage. Compared to the traditional approach of randomly choosing nodes for duplication, our model-driven approach provides improvements ranging from 82% to 1700%, depending on the targeted improvement in MTTF and the failure distribution of the nodes in the system.
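The selection procedure described above can be illustrated with a minimal sketch. The abstract does not give the paper's actual coverage model, so the code below substitutes a deliberately simple stand-in: a node's coverage is taken to be its observed failure count, the system is assumed to fail whenever any non-duplicated node fails, and a duplicated node is assumed to contribute no further system failures. The node names, failure counts, and observation window are hypothetical and are not taken from the Los Alamos logs.

```python
import random


def system_mttf(failure_counts, observation_hours, duplicated):
    """Estimate system MTTF as observation time / residual failures.

    Assumption (not from the paper): a duplicated node no longer
    causes system failures, so only non-duplicated nodes are counted.
    """
    residual = sum(count for node, count in failure_counts.items()
                   if node not in duplicated)
    return float("inf") if residual == 0 else observation_hours / residual


def coverage_order(failure_counts):
    """Rank nodes by the stand-in coverage measure (failure count), highest first."""
    return sorted(failure_counts, key=failure_counts.get, reverse=True)


def nodes_to_duplicate(failure_counts, observation_hours, target_mttf, order):
    """Duplicate nodes one by one in `order` until the target MTTF is reached."""
    duplicated = set()
    picked = []
    for node in order:
        if system_mttf(failure_counts, observation_hours, duplicated) >= target_mttf:
            break
        duplicated.add(node)
        picked.append(node)
    return picked


# Hypothetical data: failures per node over a 10,000-hour window.
counts = {"n0": 50, "n1": 10, "n2": 5, "n3": 30, "n4": 5}
hours = 10_000

baseline = system_mttf(counts, hours, set())
target = 2.5 * baseline

# Coverage-driven selection.
ranked_pick = nodes_to_duplicate(counts, hours, target, coverage_order(counts))

# Random selection for comparison (seeded so the run is repeatable).
random_order = list(counts)
random.Random(0).shuffle(random_order)
random_pick = nodes_to_duplicate(counts, hours, target, random_order)

print("baseline MTTF (hours):", baseline)
print("coverage-ordered picks:", ranked_pick)
print("random-order picks:", random_pick)
```

Under these assumptions, sorting by failure count reaches any MTTF target with the fewest duplicated nodes, because the top-ranked nodes always remove the largest possible share of failures. A random order can never need fewer nodes and usually needs more, and the gap widens as failures become more concentrated in a few nodes.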