Towards Increasing the Error Handling Time Window in Large-Scale Distributed Systems Using Console and Resource Usage Logs

Abstract

Resource-intensive applications, such as scientific applications, require the architecture or system on which they execute to be highly dependable so that the impact of faults is reduced. Typically, the state of the underlying system is captured through messages recorded in a log file, which has proven useful to system administrators in understanding the root causes of system failures (and in their subsequent debugging). However, the time window between the first error message appearing in the log file and the ensuing failure may not be large enough for administrators to save the state of the running application, which results in lost execution time. We thus address this fundamental question: is it possible to extend this time window? The answer is positive: we show that, by using (i) resource usage logs to track anomalous resource usage and (ii) error logs to identify the root causes of system failures, it is possible to increase the time window by 50 minutes on average. We achieve this by applying anomaly detection techniques to resource usage data and conducting a root-cause analysis on error log files. The logs were obtained from the Ranger supercomputer at TACC.
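
The abstract mentions anomaly detection on resource usage data, and PCA appears among the keywords. As a rough illustration only, the sketch below scores resource usage samples by their squared distance from the first principal component of a baseline window. It is not the authors' pipeline: the two counters (CPU and memory), the sample values, the function names, and the threshold are all hypothetical, and a single component found by power iteration stands in for whatever PCA variant the paper actually uses.

```python
import math

def fit_top_component(rows, iterations=100):
    """Return (column means, unit vector of the first principal component)."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix of the centered baseline data.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: converges to the dominant eigenvector of cov,
    # provided the start vector is not orthogonal to it.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return means, v

def residual_score(sample, means, component):
    """Squared distance of a sample from the principal-component line."""
    centered = [x - m for x, m in zip(sample, means)]
    proj = sum(x * c for x, c in zip(centered, component))
    return sum((x - proj * c) ** 2 for x, c in zip(centered, component))

# Hypothetical baseline window: [cpu_usage, memory_usage] per interval.
baseline = [[10.0, 20.0], [20.0, 40.0], [30.0, 60.0],
            [40.0, 80.0], [50.0, 100.0]]
means, pc = fit_top_component(baseline)

THRESHOLD = 50.0  # hypothetical cut-off, would need tuning on real logs
new_samples = [[35.0, 70.0],   # follows the baseline correlation
               [35.0, 20.0]]   # breaks it
scores = [residual_score(s, means, pc) for s in new_samples]
flags = [score > THRESHOLD for score in scores]
```

A sample that follows the usual correlation between the counters gets a score near zero, while one that breaks it gets a large score and is flagged. In the setting the abstract describes, such a flag could be raised before the first error message appears in the console log, which is where the extra time would come from.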

Bibliographic details

Source
IEEE International Conference on Trust, Security and Privacy in Computing and Communications; IEEE International Conference on Big Data Science and Engineering; IEEE International Symposium on Parallel and Distributed Processing with Applications | 2015 | pp. 61-68 | 8 pages
Authors
Gurumdimma Nentawe; Jhumka Arshad; Liakata Maria; Chuah Edward; Browne James;
Original format: PDF
Keywords
Anomaly detection; Error logs; PCA; Resource usage data; Root-cause analysis; large-scale HPC systems;

