Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive.