A planning-based approach to failure recovery in distributed systems.

Abstract

Automated failure recovery in distributed systems is difficult because of the many requirements of, and dependencies among, system components. Moreover, failure scenarios are usually unpredictable, so it is not practical to enumerate every possible failure scenario together with a way to recover the system from each one. For this reason, present failure recovery techniques are highly manual and incur considerable downtime. In this dissertation, we develop a planning-based approach to automated failure recovery in distributed component-based systems. The approach automates failure recovery through continuous monitoring of the system, so a failure monitor always holds an exact system state. When a failure is detected, the monitor performs various checks to ensure that it is not a false positive or false negative. A dependency analyzer then determines the effects of the failure on other parts of the system. Next, an offline planning procedure, carried out by an artificial intelligence (AI) planner, computes a plan that takes the system from the failed state to a working state. Because it relies on planning, the approach can recover from a variety of failed states and reach any of several acceptable states, from minimal functionality to complete recovery. Once a plan is computed, it is executed on the system to bring it back to a working state. We evaluated this technique through various online and synthetic experiments on various distributed applications. Our results show that it is an effective technique for automatically recovering component-based distributed systems from failure, and that it scales to large distributed systems.
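
The recovery pipeline described in the abstract (dependency analysis, then planning from a failed state to an acceptable state) can be illustrated with a toy sketch. This is not the dissertation's implementation: the component names, the `DEPENDS_ON` table, and the functions `affected_by` and `plan_recovery` are hypothetical, and a plain breadth-first search stands in for the AI planner the abstract mentions. It also assumes that every component depending on a failed component goes down with it, and that the only recovery action is starting a component whose dependencies are already running.

```python
from collections import deque

# Hypothetical component model: each component lists the components it depends on.
DEPENDS_ON = {
    "db": [],
    "app": ["db"],
    "web": ["app"],
}


def affected_by(failed, depends_on):
    """Dependency analysis: components that transitively depend on `failed`."""
    affected = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for comp, deps in depends_on.items():
            if current in deps and comp not in affected:
                affected.add(comp)
                frontier.append(comp)
    return affected


def applicable_actions(running, depends_on):
    """A component may be started only when all of its dependencies are running."""
    for comp, deps in depends_on.items():
        if comp not in running and all(d in running for d in deps):
            yield ("start", comp)


def plan_recovery(running, goal, depends_on):
    """Breadth-first search from the current state to any state that contains `goal`.

    Returns a list of ("start", component) actions, or None if no plan exists.
    """
    start = frozenset(running)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if goal <= state:  # every goal component is running
            return plan
        for action in applicable_actions(state, depends_on):
            next_state = state | {action[1]}
            if next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, plan + [action]))
    return None


if __name__ == "__main__":
    failed = "db"
    down = {failed} | affected_by(failed, DEPENDS_ON)
    running = set(DEPENDS_ON) - down
    # Two acceptable goal states: minimal functionality and complete recovery.
    print(plan_recovery(running, frozenset({"db", "app"}), DEPENDS_ON))
    print(plan_recovery(running, frozenset(DEPENDS_ON), DEPENDS_ON))
```

Passing different goal sets to `plan_recovery` mirrors the abstract's point that planning can target any of several acceptable states rather than a single pre-enumerated recovery script. A real planner would also model action costs, action failures, and richer component state, none of which this sketch attempts.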

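Bibliographic record

Author: Arshad, Naveed
Author affiliation: University of Colorado at Boulder
Degree-granting institution: University of Colorado at Boulder
Subject: Computer Science
Degree: Ph.D.
Year: 2006
Pages: 215
Original format: PDF
Language: English
Chinese Library Classification: Automation technology, computer technology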

Similar literature

1. Fault Tolerant Environment Using Hardware Failure Detection, Roll Forward Recovery Approach and Microrebooting For Distributed Systems [J] . Bhushan Sapre, Anup Garje, Dr. B. B. Mesharm International Journal of Engineering Research and Applications . 2011, No. 3
2. Failure-recovery model with competition between failures in complex networks: a dynamical approach [J] . Valdez L. D., Di Muro M. A., Braunstein L. A. Journal of Statistical Mechanics: Theory and Experiment . 2016, No. 1
3. Fast Failure Recovery in Vertex-Centric Distributed Graph Processing Systems [J] . Lu Wei, Shen Yanyan, Wang Tongtong, IEEE Transactions on Knowledge and Data Engineering . 2019, No. 4
4. Robust control of a class of nonlinear uncertain systems. Fault tolerance against sensor failures and subsequent self recovery [C] . Zhihua Qu, Ihlefeld, C.M. 2001
5. An improved crash recovery approach for distributed systems. [D] . Ramidi, Harika Reddy. 2010
6. The Healthcare Administrators Associate: an experiment in distributed healthcare information systems. [O] . J. Fowler, G. Martin 1997
7. Dealing with Failures During Failure Recovery of Distributed Systems ; CU-CS-1009-06 [O] . Arshad Naveed, Heimbigner Dennis, Wolf Alexander L 2006
8. Dealing with Failures During Failure Recovery of Distributed Systems. [R] . Arshad, N., Heimbigner, D., Wolf, A. 2006