
Two-dimensional memory system protection.


Abstract

In modern computer systems, the memory system plays a key role in determining overall performance and power consumption. However, the memory system is also the system's most vulnerable component, and it directly affects overall manufacturing cost and run-time reliability. As fabrication process technologies scale into the deep nanometer regime, both the frequency and the scale of manufacturing defects (mostly caused by variability errors) and run-time errors (mostly caused by soft errors and wearout) will increase. These errors will cause high manufacturing costs, information loss, and physical failures. Conventional memory protection techniques such as error correcting codes (ECC) and memory redundancy cannot handle errors that occur with such increasing frequency, and they cannot scale without incurring high VLSI overheads.

This thesis first proposes 2D error coding, a scalable multi-bit error protection technique applied 'within' embedded memory arrays. It combines in-line small-scale error correction with off-line large-scale error correction to detect and correct large-scale information loss (e.g., multi-bit upsets) at minimum VLSI overhead. This thesis evaluates the scheme in the cache hierarchies of two chip multiprocessor designs and shows that 2D error coding can correct clustered errors of up to 32x32 bits at run time with significantly smaller performance, area, and power overheads than conventional techniques.

Next, this thesis investigates how this increased resilience can be traded off for higher-density bitcells, higher cell performance, greater cell stability, and lower-power design by correcting variability-induced manufacture-time hard errors in embedded memory arrays while maintaining ~100% yield. Through a series of Monte Carlo simulations of scaled cell models with device variability, this thesis first identifies a strong potential for using multi-bit ECC for variability tolerance. It then proposes 2D erasure coding, a low-overhead multi-bit ECC that uses an erasure coding algorithm to correct variability-induced manufacture-time hard errors at the speed of conventional single-bit ECC. When combined with a small amount of row redundancy, the proposed scheme significantly improves memory access latency, power, and stability while maintaining ~100% yield and run-time reliability.

Finally, this thesis proposes RunFlat memory, a highly reliable, available, and serviceable (RAS) distributed shared-memory (DSM) system that survives large-scale run-time hard errors such as node failures. RunFlat memory applies 2D protection 'across' off-chip memory arrays by combining conventional block-level protection (e.g., ECC, 2D coding) with node-level memory RAID protection. Combined with a hardware-based on-line memory reconfiguration mechanism, RunFlat memory can detect and correct entire node failures, enable continued operation, and allow on-line repair service while preserving the system's original performance and protection. Full-system simulations of a 16-node DSM server show that RunFlat memory incurs negligible performance overhead in error-free operation and significantly reduced performance overheads when operating with a failed node.

This thesis proposes two-dimensional (2D) memory protection techniques for building highly reliable, available, and serviceable memory systems while maintaining low manufacturing costs and high yields. The key innovation of 2D memory protection is to take the reconstruction of large-scale information loss off the critical path of normal operation, separating it from the low-overhead small-scale error detection and correction mechanisms. 2D memory protection can be applied at various levels of the memory system, from on-chip memory arrays to off-chip memory modules and nodes. This thesis proposes and evaluates three distinct applications of 2D memory protection: 2D error coding, 2D erasure coding, and RunFlat memory, which combat multi-bit errors, variability errors, and node failures, respectively.
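The division of labor described in the abstract, with cheap in-line detection on every access and expensive reconstruction kept off the critical path, can be illustrated with a toy two-dimensional parity scheme. The sketch below is not the thesis's design. It assumes 32-bit rows, an 8-way interleaved parity as the horizontal code, and a single XOR row as the vertical code. All names (`WORD_BITS`, `WAYS`, `horizontal_code`, `vertical_code`, `encode`, `recover`) are hypothetical.

```python
from functools import reduce
from operator import xor

WORD_BITS = 32  # assumed row width
WAYS = 8        # assumed interleaving degree of the horizontal code


def horizontal_code(word: int) -> int:
    """WAYS-way interleaved parity: code bit k is the parity of word bits k, k+WAYS, ..."""
    code = 0
    for bit in range(WORD_BITS):
        code ^= ((word >> bit) & 1) << (bit % WAYS)
    return code


def vertical_code(rows):
    """Column-wise parity over all rows: a single XOR row."""
    return reduce(xor, rows, 0)


def encode(rows):
    """Return the per-row horizontal codes and the shared vertical code."""
    return [horizontal_code(r) for r in rows], vertical_code(rows)


def recover(rows, h_codes, v_code):
    """Detect a faulty row with the horizontal codes, rebuild it from the vertical code."""
    bad = [i for i, (r, h) in enumerate(zip(rows, h_codes))
           if horizontal_code(r) != h]
    if not bad:
        return list(rows)  # fast path: nothing to reconstruct
    if len(bad) > 1:
        raise ValueError("more than one faulty row; this sketch cannot correct it")
    i = bad[0]
    # Slow path: XOR the vertical code with every healthy row to rebuild the lost one.
    others = reduce(xor, (r for j, r in enumerate(rows) if j != i), 0)
    fixed = list(rows)
    fixed[i] = v_code ^ others
    return fixed


# Usage: a 5-bit burst in one row is detected in-line and rebuilt off-line.
data = [0xDEADBEEF, 0x12345678, 0x00000000, 0xFFFFFFFF]
h_codes, v_code = encode(data)
damaged = list(data)
damaged[1] ^= 0b11111 << 3
assert recover(damaged, h_codes, v_code) == data
```

In this sketch the horizontal check is the only work done on an error-free read, and the full-array XOR runs only after a mismatch. Rebuilding one lost row from the XOR of the others is the same parity idea used by RAID, which the abstract applies at node level. Correcting clusters that span several rows would need more vertical parity than the single row assumed here.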

Bibliographic record

  • Author: Kim, Jangwoo
  • Author affiliation: Carnegie Mellon University
  • Degree-granting institution: Carnegie Mellon University
  • Subject: Engineering, Electronics and Electrical; Computer Science
  • Degree: Ph.D.
  • Year: 2008
  • Pages: 132
  • Original format: PDF
  • Language: English
