
An efficient design space exploration framework to optimize power-efficient heterogeneous many-core multi-threading embedded processor architectures.


Abstract

By the middle of this decade, uniprocessor performance had hit a roadblock due to a combination of factors: excessive power dissipation at high operating frequencies, growing memory access latencies, diminishing returns from deeper instruction pipelines, and saturation of the instruction-level parallelism available in applications. An attractive and viable alternative, embraced by all processor vendors, was the multi-core architecture, in which throughput is improved by integrating multiple processor cores, interconnects, and low-latency shared caches on a single chip. The individual cores are often simpler than their uniprocessor counterparts, use hardware multi-threading to exploit thread-level parallelism and hide latency, and typically achieve better performance-power figures. The overwhelming success of multi-core microprocessors in both high-performance and embedded computing platforms motivated chip architects to scale multi-core processors dramatically to many-core designs, which will include hundreds of cores on-chip to further improve throughput. With such complex, large-scale architectures, however, several key design issues need to be addressed. First, a wide range of micro-architectural parameters, such as L1 caches, load/store queues, shared cache structures, and interconnection topologies, together with the non-linear interactions between them, define a vast non-linear, multi-variate micro-architectural design space for many-core processors; the traditional method of exploring this space through extensive in-loop simulation is simply not practical. Second, to accurately evaluate the performance of a candidate design, measured in cycles per instruction (CPI), contention at the shared cache must be accounted for in addition to the cycle-by-cycle behavior of the large number of cores, which superlinearly increases the number of simulation cycles per iteration of the design exploration.
Third, single-thread performance does not scale linearly with the number of hardware threads per core and the number of cores, due to the memory wall effect. This means that at every step of the design process, designers must ensure that single-thread performance is not unacceptably slowed down while overall throughput increases. While all these factors affect design decisions for both high-performance and embedded many-core processors, the design of embedded processors for complex embedded applications such as networking, smart power grids, battlefield decision-making, consumer electronics, and biomedical devices, to name a few, is fundamentally different from that of their high-performance counterparts because of the need to consider (i) low power and (ii) real-time operation. This implies that the design objective for embedded many-core processors cannot be simply to maximize performance, but to improve it in such a way that overall power dissipation is minimized and all real-time constraints are met. This necessitates additional power estimation models at the design stage to accurately measure the cost and reliability of all candidate designs during the exploration phase.

In this dissertation, a statistical machine learning (SML) based design exploration framework is presented which employs an execution-driven, cycle-accurate simulator to accurately measure the power and performance of embedded many-core processors. The embedded many-core processor domain considered is Network Processors (NePs), which are used to process network IP packets. Future-generation NePs, required to operate at terabit-per-second network speeds, capture all the aspects of a complex embedded application: shared data structures, a large volume of compute-intensive and data-intensive real-time-bound tasks, and a high level of task (packet) level parallelism.
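To give a feel for how a terabit-per-second line rate turns into a real-time constraint, the following minimal sketch computes a per-packet cycle budget. The packet size, core count, and clock frequency are hypothetical values chosen for illustration and do not come from the dissertation; framing overhead is ignored and packets are assumed to be spread perfectly evenly across cores.

```python
def per_packet_budget(line_rate_bps, packet_bytes, n_cores, core_hz):
    """Return (packets per second, cycles available per packet on each core).

    Assumes back-to-back packets of a fixed size and ideal load balancing,
    so each core handles 1/n_cores of the arriving packets.
    """
    packets_per_s = line_rate_bps / (packet_bytes * 8)
    cycles_per_packet = core_hz * n_cores / packets_per_s
    return packets_per_s, cycles_per_packet


if __name__ == "__main__":
    # Hypothetical operating point: 1 Tb/s, 64-byte packets, 256 cores at 1 GHz.
    pps, cycles = per_packet_budget(1e12, 64, 256, 1e9)
    print(f"{pps:.3e} packets/s, {cycles:.1f} cycles per packet per core")
```

Under these assumed numbers, roughly 1.95 billion packets arrive per second and each core has about 131 cycles per packet, which is why throughput is treated as a hard constraint and not merely as a quantity to maximize.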
Statistical machine learning (SML) is used to efficiently model the performance and power of candidate designs in terms of wide ranges of micro-architectural parameters. The method inherently minimizes the number of in-loop simulations in the exploration framework and also efficiently captures the non-linear interactions between the micro-architectural design parameters. To ensure scalability, the design space is partitioned into (i) core-level micro-architectural parameters, used to optimize single-core architectures subject to the real-time constraints, and (ii) shared-memory-level micro-architectural parameters, used to explore the shared interconnection network and shared cache memory architectures and achieve overall optimality. The cost function of the exploration algorithm is the total power dissipation, which is minimized subject to the real-time throughput constraints (as determined from the terabit optical network router line speed) required in the IP packet processing embedded application.
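The general idea of surrogate-based exploration described above can be sketched as follows. This is a toy illustration under stated assumptions, not the dissertation's framework: the parameter names and ranges are hypothetical, `simulate` is a synthetic stand-in for a cycle-accurate simulator, and a simple inverse-distance-weighted nearest-neighbour regressor stands in for whatever SML model the author actually uses.

```python
import itertools
import math
import random

# Hypothetical core-level parameters; names and ranges are illustrative only.
SPACE = {
    "l1_kb": [8, 16, 32, 64],
    "threads": [1, 2, 4, 8],
    "lsq_entries": [8, 16, 32],
}


def all_designs():
    """Enumerate every point of the (small, toy) design space."""
    keys = list(SPACE)
    return [dict(zip(keys, vals)) for vals in itertools.product(*SPACE.values())]


def simulate(d):
    """Synthetic stand-in for a simulator run: returns (power, throughput)."""
    power = 0.5 + 0.02 * d["l1_kb"] + 0.3 * d["threads"] + 0.01 * d["lsq_entries"]
    # Throughput saturates with thread count, a crude memory-wall effect.
    throughput = (
        (1 - math.exp(-d["threads"] / 3))
        * (1 + 0.1 * math.log2(d["l1_kb"]))
        * (1 + 0.02 * math.log2(d["lsq_entries"]))
    )
    return power, throughput


def features(d):
    """Map a design to a numeric vector (log2 of each parameter)."""
    return [math.log2(d[k]) for k in SPACE]


def predict(train, d, k=4):
    """Inverse-distance-weighted average over the k nearest simulated designs."""
    x = features(d)
    nearest = sorted(train, key=lambda t: math.dist(x, t[0]))[:k]
    weights = [1.0 / (math.dist(x, fx) + 1e-9) for fx, _ in nearest]
    total = sum(weights)
    return tuple(
        sum(w * y[i] for w, (_, y) in zip(weights, nearest)) / total
        for i in range(2)
    )


def explore(min_throughput, n_samples=16, seed=0):
    """Simulate a sample, then pick the lowest predicted power that meets the constraint."""
    designs = all_designs()
    sampled = random.Random(seed).sample(designs, n_samples)
    train = [(features(d), simulate(d)) for d in sampled]
    feasible = [d for d in designs if predict(train, d)[1] >= min_throughput]
    if not feasible:
        return None, n_samples
    best = min(feasible, key=lambda d: predict(train, d)[0])
    return best, n_samples


if __name__ == "__main__":
    best, n_sims = explore(min_throughput=0.8)
    print(f"simulated {n_sims} of {len(all_designs())} designs; selected: {best}")
```

Only the sampled designs are "simulated"; every other point is scored by the surrogate, which is what keeps the number of in-loop simulations small. In practice the selected design would be confirmed with one further simulator run, since the surrogate's prediction may be off near the constraint boundary.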

Bibliographic record

  • Author

    Datta, Kushal.;

  • Author affiliation

    The University of North Carolina at Charlotte.;

  • Degree-granting institution The University of North Carolina at Charlotte.;
  • Subject Engineering Computer.;Engineering Electronics and Electrical.
  • Degree Ph.D.
  • Year 2011
  • Pages 130 p.
  • Total pages 130
  • Original format PDF
  • Language eng
  • Chinese Library Classification
  • Keywords
