
A Software-Managed Approach to Die-Stacked DRAM



Abstract

Advances in die-stacking (3D) technology have enabled the tight integration of significant quantities of DRAM with high-performance computation logic. How to integrate this technology into the overall architecture of a computing system is an open question. While much recent effort has focused on hardware-based techniques for using die-stacked memory (e.g., caching), in this paper we explore what it takes for a software-driven approach to be effective. First we consider exposing die-stacked DRAM directly to applications, relying on the static partitioning of allocations between fast on-chip and slow off-chip DRAM. We see only marginal benefits from this approach (9% speedup). Next, we explore OS-based page caches that dynamically partition application memory, but we find such approaches to be worse than not having stacked DRAM at all! We analyze the performance bottlenecks in OS page caches and propose two simple techniques that make the OS approach viable. The first is a hardware-assisted TLB shoot-down, a more general mechanism that is valuable beyond stacked DRAM, which enables OS-managed page caches to achieve a 27% speedup. The second is a software-implemented prefetcher that extends classic hardware prefetching algorithms to the page level, leading to a 39% speedup. With these simple and lightweight components, the OS page cache can provide 70% of the performance benefit that would be achievable with an ideal and unrealistic system where all of main memory is die-stacked. However, we also found that applications with poor locality (e.g., graph analyses) are not amenable to any page-caching scheme -- whether hardware or software -- and therefore we recommend that the system still provide APIs to the application layers to explicitly control die-stacked DRAM allocations.
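The software prefetcher described in the abstract extends classic hardware prefetching (e.g., sequential next-line prefetching) to page granularity: on a demand miss in the fast die-stacked DRAM, the OS also migrates the next few pages. A minimal toy simulation of that idea is sketched below; it is not the paper's implementation, and the cache capacity, prefetch degree, and LRU replacement policy are illustrative assumptions.

```python
from collections import OrderedDict

PAGE_CACHE_SLOTS = 4   # capacity of the fast (die-stacked) DRAM, in pages
PREFETCH_DEGREE = 2    # extra sequential pages migrated on each demand miss

def run(accesses, prefetch):
    """Simulate an LRU page cache over fast DRAM; return demand-miss count."""
    cache = OrderedDict()  # page -> None, ordered from LRU to MRU
    misses = 0

    def install(page):
        if page in cache:
            cache.move_to_end(page)
            return
        cache[page] = None
        if len(cache) > PAGE_CACHE_SLOTS:
            cache.popitem(last=False)   # evict the least-recently-used page

    for page in accesses:
        if page in cache:
            cache.move_to_end(page)     # hit in fast DRAM, no stall
            continue
        misses += 1                     # demand miss: access stalls on slow DRAM
        install(page)
        if prefetch:
            for d in range(1, PREFETCH_DEGREE + 1):
                install(page + d)       # sequential page-level prefetch
    return misses

# A streaming (sequential) access pattern benefits from page prefetching.
stream = list(range(16))
print(run(stream, prefetch=False))  # 16 (every page misses)
print(run(stream, prefetch=True))   # 6  (prefetched pages hit)
```

For a streaming workload the prefetcher converts most demand misses into hits, which is the behavior the abstract reports as a 39% speedup; a pattern with poor locality (e.g., random page accesses) would see little or no benefit from the same mechanism.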

