Multi-Layer Fault Tolerance Techniques for High Reliability and Performance: Devices, Systems and Data Centers.

Abstract

In cloud computing data centers, failures may propagate quickly and widely, affecting many physical machines and users. Soft errors in particular are a major cause of failures in computer systems. They may occur in various data center components, such as the CPU, main memory, and flash storage, causing system failures, virtual machine (VM) failures, abnormal application aborts, or even silent data corruption (SDC).

Fault tolerance techniques have been proposed at various levels in cloud computing data centers. However, they have limitations in handling soft errors. At the device level, due to the increasing error rates in flash storage, stronger error-correction code (ECC) is required to handle multi-bit errors, resulting in additional overhead in performance, energy, and area. At the system level, while VM checkpointing is effective at protecting selected VMs, the virtualization infrastructure that provides this functionality is itself not well protected from soft errors. Soft errors may still cause failures in the virtualization infrastructure, affecting all VMs within it. At the data center level, dynamic voltage/frequency scaling (DVFS) may change the CPU voltage to reduce CPU power consumption. However, the reduced voltage increases soft error rates in the CPU and potentially causes more soft-error-induced failures. If DVFS is not utilized properly, it may increase the overall operation cost.

To improve the reliability of cloud computing systems against soft errors, we design techniques at the device, virtualization system, and data center levels to address these limitations. At the device level, we propose a new flash storage architecture, SoftFlash, which aims to reduce the need for strong ECC by leveraging the inherent error tolerance of application data. Our results show that for many data-centric applications, the proposed SoftFlash system can achieve acceptable results (or better in certain cases), with 40% performance improvement and a third of the energy consumption.

At the data center level, we propose a data center management framework, DUAL, which consists of new virtual machine power and reliability analysis tools. The framework is designed to balance the dual needs of a data center, that is, reducing energy consumption and providing high reliability. The evaluations show that DUAL can help maintain the desired reliability and reduce power consumption, which in turn lowers the overall operational cost of a data center.

At the virtualization system level, we conduct an in-depth analysis of the reliability risks of the virtualization infrastructure. Based on the analysis, we design Xentry, which focuses on limiting error propagation within and from the hypervisor. The experimental results show that Xentry incurs very small performance overhead and detects over 99% of the injected faults. To further improve the reliability of the hypervisor, we design and implement redundant hypervisor execution, DualVisor, to provide both recovery and detection capability. We evaluate various design parameters and selectively replicate hypervisor executions. DualVisor covers 87% of all hypervisor executions with less than 6% overhead (with 2 to 4 VMs).

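Bibliographic record

  • Author: Xu, Xin
  • Author affiliation: The George Washington University
  • Degree-granting institution: The George Washington University
  • Subject: Computer engineering
  • Degree: Ph.D.
  • Year: 2015
  • Pages: 174 p.
  • Total pages: 174
  • Original format: PDF
  • Language: English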
