Towards Increasing the Error Handling Time Window in Large-Scale Distributed Systems Using Console and Resource Usage Logs

Abstract

Resource-intensive applications, such as scientific applications, require the architecture or system on which they execute to be highly dependable so that the impact of faults is reduced. Typically, the state of the underlying system is captured through messages recorded in a log file, which has proven useful to system administrators in understanding the root causes of system failures (and in subsequently debugging them). However, the time window between the first error message appearing in the log file and the ensuing failure may not be large enough for administrators to save the state of the running application, which results in lost execution time. We thus address a fundamental question: is it possible to extend this time window? The answer is positive. We show that, by using (i) resource usage logs to track anomalous resource usage and (ii) error logs to identify the root causes of system failures, it is possible to increase the time window by 50 minutes on average. We achieve this by applying anomaly detection techniques to resource usage data and conducting a root-cause analysis on error log files. The logs were obtained from the Ranger supercomputer at TACC.
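
The idea of acting on a resource usage anomaly before the first error message can be illustrated with a minimal sketch. The abstract does not say which anomaly detection technique is used, so the z-score test below is only an assumed stand-in. The function names, the threshold, and the sample data are all hypothetical and are not taken from the Ranger logs.

```python
from statistics import mean, stdev


def first_anomaly_time(samples, baseline_size=5, threshold=3.0):
    """Return the timestamp of the first anomalous sample, or None.

    `samples` is a list of (minute, usage) pairs. The first `baseline_size`
    samples define normal behaviour; a later sample is flagged when it lies
    more than `threshold` standard deviations from the baseline mean.
    (Hypothetical z-score detector, not the paper's method.)
    """
    baseline = [usage for _, usage in samples[:baseline_size]]
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline gives no scale for a z-score.
        return None
    for minute, usage in samples[baseline_size:]:
        if abs(usage - mu) / sigma > threshold:
            return minute
    return None


def window_gain(anomaly_minute, first_error_minute):
    """Extra minutes gained by reacting to the anomaly instead of the first error."""
    return first_error_minute - anomaly_minute


# Made-up memory usage (percent) sampled every 10 minutes.
usage_log = [(0, 40), (10, 41), (20, 39), (30, 40), (40, 42),
             (50, 41), (60, 43), (70, 55), (80, 63), (90, 78)]
first_error_minute = 100  # made-up time of the first error message in the console log

anomaly_minute = first_anomaly_time(usage_log)
gain = window_gain(anomaly_minute, first_error_minute)
print(f"anomaly at minute {anomaly_minute}, window extended by {gain} minutes")
```

In this toy data the usage spike is flagged at minute 70, 30 minutes before the assumed first error message. The figures are purely illustrative and unrelated to the 50-minute average reported in the abstract.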
