A planning-based approach to failure recovery in distributed systems.

Abstract

Automated failure recovery in distributed systems is difficult because of the many requirements and dependencies among a system's components. Moreover, failure scenarios are usually unpredictable, so it is not practical to enumerate every possible failure scenario together with a way to recover the system from it. For this reason, present failure recovery techniques are highly manual and incur considerable downtime. In this dissertation, we develop a planning-based approach to automated failure recovery in distributed component-based systems. The approach continuously monitors the system, so a failure monitor always holds the exact system state. When a failure is detected, the monitor performs various checks to rule out false positives and false negatives. A dependency analyzer then determines the effects of the failure on other parts of the system. Next, an offline planning procedure, performed by an artificial intelligence (AI) planner, computes a plan that takes the system from the failed state to a working state. Because it relies on planning, the approach can recover from a variety of failed states and reach any of several acceptable states, from minimal functionality to complete recovery. Once a plan is computed, it is executed on the system to bring it back to a working state. We evaluated this technique through various online and synthetic experiments on various distributed applications. Our results show that it is an effective technique for automatically recovering component-based distributed systems from failure, and that it scales to large distributed systems.
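The recovery pipeline described in the abstract (dependency analysis, then planning from a failed state to an acceptable goal state) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the component names, the `DEPENDS_ON` map, and the functions `affected_by` and `plan_recovery` are hypothetical, and a plain breadth-first search stands in for the AI planner, which the abstract does not specify.

```python
from collections import deque

# Hypothetical system model: component -> components it depends on.
DEPENDS_ON = {
    "database": [],
    "app_server": ["database"],
    "web_frontend": ["app_server"],
}

def affected_by(failed, depends_on):
    """Return the failed component plus everything that transitively depends on it."""
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for comp, deps in depends_on.items():
            if comp not in affected and any(d in affected for d in deps):
                affected.add(comp)
                changed = True
    return affected

def plan_recovery(running, goal, depends_on):
    """Breadth-first search for a shortest sequence of 'start' actions.

    A state is the frozenset of running components. A component may be
    started only when all of its dependencies are already running.
    Returns the list of components to start, or None if no plan exists.
    """
    start = frozenset(running)
    goal = frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if goal <= state:  # every goal component is running
            return plan
        for comp in sorted(depends_on):
            deps = depends_on[comp]
            if comp not in state and all(d in state for d in deps):
                nxt = state | {comp}
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, plan + [comp]))
    return None

all_components = set(DEPENDS_ON)
down = affected_by("database", DEPENDS_ON)      # dependency analysis
running = all_components - down                 # what is still up
full_plan = plan_recovery(running, all_components, DEPENDS_ON)
minimal_plan = plan_recovery(running, {"database"}, DEPENDS_ON)
print(full_plan)
print(minimal_plan)
```

Passing different goal sets to `plan_recovery` mirrors the abstract's point that planning can target any of several acceptable states, from minimal functionality to complete recovery. A real planner would also model richer actions and constraints than the single "start" action assumed here.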

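Bibliographic record

  • Author

    Arshad, Naveed

  • Author affiliation

    University of Colorado at Boulder

  • Degree-granting institution: University of Colorado at Boulder
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2006
  • Pages: 215 p.
  • Total pages: 215
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification: Automation technology, computer technology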