E-Science Workshops, 2009

Nonparametric multivariate anomaly analysis in support of HPC resilience

Abstract

Large-scale computing systems provide great potential for scientific exploration. However, the complexity that accompanies these enormous machines raises challenges for both users and operators. The effective use of such systems is often hampered by failures encountered when running applications on systems containing tens of thousands of nodes and hundreds of thousands of compute cores capable of yielding petaflops of performance. In systems of this size, failure detection is complicated and root-cause diagnosis is difficult. This paper describes our recent work on identifying anomalies in monitoring data and system logs to provide further insight into machine status, runtime behavior, failure modes, and failure root causes. It discusses the details of an initial prototype that gathers the data and uses statistical techniques for analysis.
