2016 IEEE 35th Symposium on Reliable Distributed Systems

CRUDE: Combining Resource Usage Data and Error Logs for Accurate Error Detection in Large-Scale Distributed Systems

Abstract

Console logs have proven useful to system administrators for error detection in large-scale distributed systems. However, such logs are typically redundant and incomplete, which makes accurate detection very difficult. To increase this accuracy, we complement the incomplete console logs with resource usage data, which captures the resource utilisation of every job in the system. We then develop a novel error detection methodology, the CRUDE approach, that makes use of both the resource usage data and the console logs. We make the following specific technical contributions: we develop (i) a clustering algorithm to group nodes with similar behaviour, (ii) an anomaly detection algorithm to identify jobs with anomalous resource usage, and (iii) an algorithm that links jobs with anomalous resource usage to erroneous nodes. We evaluate our approach using console logs and resource usage data from the Ranger supercomputer. Our results are positive: (i) our approach detects errors with a true positive rate of about 80%, and (ii) compared with the well-known Nodeinfo error detection algorithm, our algorithm provides an average improvement of around 85%, with a best-case improvement of 250%.
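The abstract names three stages (node clustering, job anomaly detection, and linking anomalous jobs to nodes) but does not describe the algorithms themselves. The following is only an illustrative sketch of how such a pipeline could fit together. The greedy distance-threshold grouping, the median-absolute-deviation rule, and the "not in the largest cluster" linking rule are stand-ins chosen for brevity and are not the CRUDE algorithms. All node names, job names, and values are hypothetical.

```python
from statistics import median

# Hypothetical data, invented for illustration only.
# Per-node console-log event counts: node -> (count of type A, count of type B).
node_events = {"n1": (5, 0), "n2": (6, 1), "n3": (40, 9), "n4": (5, 1)}
# Per-job resource usage: job -> (nodes the job ran on, memory used in GB).
jobs = {
    "j1": (["n1", "n2"], 10.0),
    "j2": (["n2", "n4"], 11.0),
    "j3": (["n3"], 48.0),
    "j4": (["n1", "n4"], 9.5),
    "j5": (["n1"], 10.5),
}

def cluster_nodes(events, threshold):
    """Stage (i) stand-in: greedily group nodes whose event-count vectors
    lie within `threshold` (Euclidean) of a cluster's first member."""
    clusters = []  # list of (representative vector, member names)
    for node, vec in sorted(events.items()):
        for rep, members in clusters:
            dist = sum((a - b) ** 2 for a, b in zip(vec, rep)) ** 0.5
            if dist <= threshold:
                members.append(node)
                break
        else:
            clusters.append((vec, [node]))
    return [members for _, members in clusters]

def anomalous_jobs(job_table, cutoff=3.0):
    """Stage (ii) stand-in: flag jobs whose memory usage deviates from the
    median by more than `cutoff` median absolute deviations (MAD)."""
    usage = {job: mem for job, (_, mem) in job_table.items()}
    med = median(usage.values())
    mad = median(abs(v - med) for v in usage.values())
    if mad == 0:
        return []  # no spread, so nothing can be ranked as anomalous
    return sorted(j for j, v in usage.items() if abs(v - med) / mad > cutoff)

def suspect_nodes(job_table, clusters, flagged):
    """Stage (iii) stand-in: report nodes that ran a flagged job and fall
    outside the largest behaviour cluster."""
    largest = max(clusters, key=len)
    used = {n for j in flagged for n in job_table[j][0]}
    return sorted(n for n in used if n not in largest)

clusters = cluster_nodes(node_events, threshold=5.0)
flagged = anomalous_jobs(jobs)
suspects = suspect_nodes(jobs, clusters, flagged)
print(clusters)  # [['n1', 'n2', 'n4'], ['n3']]
print(flagged)   # ['j3']
print(suspects)  # ['n3']
```

In this toy data, node `n3` logs far more events than its peers and runs the only job with outlying memory usage, so both sources of evidence point to the same node. That is the intuition behind combining console logs with resource usage data, even though the real system would need far richer features and thresholds.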