
Data Prefetching

The literature on data prefetching spans 1997 to 2022, totalling some 300 publications concentrated in automation, computer technology, radio electronics, telecommunications, systems science, and related fields; the indexed records comprise 72 journal articles, 18 conference papers, and 703,020 patent documents. The journal articles appear in 41 periodicals, including the Journal of National University of Defense Technology, Computer Engineering, and Computer Engineering & Science; the conference papers come from 17 meetings, including the East China Normal University "Data Science and Engineering" forum session on in-memory data management, the 2014 National Annual Conference on High Performance Computing, and the 2012 China National Computer Congress. The literature was contributed by 636 authors, including 约翰.M.吉尔, 不公告发明人 (undisclosed inventors), and 罗德尼.E.虎克.

Data Prefetching—Publication Counts

  • Journal articles: 72 (0.01%)
  • Conference papers: 18 (0.00%)
  • Patent documents: 703,020 (99.99%)
  • Total: 703,110

Data Prefetching—Publication Trend (chart not reproduced)

Data Prefetching—Leading Researchers

  • 约翰.M.吉尔
  • 不公告发明人 (undisclosed inventors)
  • 罗德尼.E.虎克
  • 刘天义
  • 沈海华
  • 章隆兵
  • 肖俊华
  • 刘鹏
  • 漆锋滨
  • 佟冬

Data Prefetching—Selected Papers (sorted by year)

    • 蔡雨; 孙成国; 杜朝晖; 刘子行; 康梦博; 李双双
    • Abstract: Improving the efficiency of heterogeneous HPL (high-performance Linpack) requires exploiting both the accelerators and the general-purpose CPUs: the accelerators integrate more compute cores and carry the bulk of the computation, while the CPUs handle task scheduling and also take part in the computation. Given a reasonable task partition and balanced load, optimizing CPU-side performance is especially important for overall efficiency, and tuning the BLAS (basic linear algebra subprograms) routines to the architectural characteristics of the target platform usually exploits the CPUs more fully and raises whole-system efficiency. BLIS (BLAS-like library instantiation software) is an open-source BLAS framework that is easy to develop with, easy to port, and modular. Based on the architecture of the heterogeneous platform and the characteristics of the HPL algorithm, the BLAS routines called on the CPU side were optimized using the L3 cache, vectorized instructions, and multithreaded parallelism, and auto-tuning was applied to the matrix blocking parameters, yielding the HygonBLIS library. Compared with MKL in the heterogeneous environment, overall HPL performance improved by 11.8% (a minimal cache-blocking sketch appears after this list).
    • Zhang Qianlong; Hou Rui; Yang Sibo; Zhang Lixin
    • Abstract: Existing prefetching methods for linked data structures (LDS) are analyzed in detail, together with the effect of prefetch depth on prefetching performance and the case in which a single producer memory instruction corresponds to several consumer memory instructions, and the shortcomings of existing LDS prefetchers are pointed out. A feedback prefetching mechanism for linked data structures is proposed: on top of the original prefetcher, the result of the prefetch command's lookup in the processor cache is fed back to the prefetch engine, which uses it to decide on further prefetching. If the prefetch lookup hits in the cache, the result is returned to the prefetcher, which then issues another prefetch command for the same producer instruction. The feedback mechanism can cooperate with other LDS prefetching mechanisms. Experimental results show that, compared with prefetching without feedback, the feedback mechanism at a prefetch depth of 1 raises instructions per cycle (IPC) by 8.14% on average and lowers the L1-D cache miss rate by 11.18% on average, with almost negligible extra hardware (a software sketch of depth-1 pointer prefetching appears after this list).
    • 裴颂文; 赵梦旖; 姬燕飞
    • Abstract: Because existing data prefetching algorithms cannot meet the demands of the novel heterogeneous memory systems that combine dynamic random access memory (DRAM) with non-volatile memory (NVM) in energy-efficient heterogeneous computing systems, a simulated annealing data prefetching algorithm (SADPA) is proposed. Built on heuristic simulated-annealing search, it introduces a random factor to escape local optima and thereby determines the globally optimal threshold and the number of NVM pages worth prefetching. The results show that the average access latency of SADPA is 4% lower, and its average instructions per cycle (IPC) 10.1% higher, than those of a static threshold adjustment algorithm; for cactusADM, system energy is reduced by 3.4% compared with a cooperative hardware/software dynamic threshold adjustment algorithm (a simulated-annealing sketch appears after this list).
    • 夏苑; 何映思
    • Abstract: To overcome the shortcomings of traditional data prefetching schemes, this paper proposes a push-based data prefetching scheme, driven by the storage servers, for a distributed file system. It reduces both the volume of data transferred and the network communication delay, and it frees the compute nodes running the client file system from tracking and predicting I/O operations, lightening the client's workload and improving the performance of the storage system.
    • 董文菁; 温东新; 张展
    • Abstract: Data is growing exponentially, and more organizations store it across multiple data centers and distributed systems; Alluxio, a memory-centric virtual distributed storage system, unifies the underlying big-data ecosystem. In remote deployments that combine Alluxio with an under-store, network latency makes I/O speed one of the key factors limiting the services offered. This work proposes CPR, a caching policy for Alluxio in such remote scenarios, which uses the associations between data blocks in the storage system to guide prefetching and replacement, groups blocks to improve the utilization of the association rules, and runs a background thread to update the rule set in real time; the policy is validated by simulation. The results show that I/O performance under CPR is better than with Alluxio's existing caching policies and with several other caching policies based on inter-block association rules (a rule-driven prefetch sketch appears after this list).
    • 刘天义; 肖俊华; 章隆兵; 沈海华
    • Abstract: The pointer-chasing pattern of processor memory operations is analyzed, and the low prefetching accuracy and long access latency of pointer-chasing operations in linked-data applications are pointed out. To improve pointer-chasing performance, an instruction-label-assisted memory prefetching (ILAMP) technique is proposed. ILAMP is a prefetching mechanism prompted by instruction labels: a new memory instruction is added to the instruction set architecture so that, at decode time, it generates a special label indicating that the value being loaded is a pointer. On a cache miss the label travels through the memory hierarchy to the memory controller, and when the requested pointer returns from DRAM the controller immediately issues a prefetch for it, hiding the latency of the upcoming memory request. Experimental results show that ILAMP lowers the average latency of DRAM read requests by about 15%, with prefetch accuracy above 77% for all programs, a bandwidth overhead of only about 10%, and a hardware cost of about 1 KB (a software analogue of this idea is sketched after this list).
    • 姚敏; 尹建伟; 唐彦; 罗智凌
    • Abstract: In big data scenarios, traditional deduplicating backup systems suffer from large backup storage space and insufficient data throughput. To address this, a distributed backup deduplication system based on data routing is designed. It deduplicates at the granularity of data chunks and provides two functions, data routing and data prefetching: routing uses a Bloom filter to query the chunks to be processed, while prefetching uses average sampling and Jaccard-distance-based neighbor sampling. Chunks are routed to the corresponding processing nodes; the chunk hash codes obtained by average sampling provide the routing information, and those obtained by neighbor sampling are used for the system's first round of deduplication. Experimental results show that, while maintaining the deduplication ratio, the system's throughput increases markedly compared with querying every node and with fixed routing (a Bloom-filter routing sketch appears after this list).
    • 张多利; 张宇; 宋宇鲲; 汪健
    • Abstract: For a heterogeneous multi-core SoC aimed at high-density computing, a hierarchical shared level-2 cache (L2-Cache) is proposed to bridge the speed gap between the system's processing rate and external memory. The hierarchical structure caches object data and uses a counting replacement policy to reduce L2 pollution and raise the hit rate of useful data; in the gaps between computations it performs accurate data prefetching and L2-to-main-memory synchronization, increasing the effective memory bandwidth. Test results show that the design accommodates the access patterns of applications with different compute-to-memory ratios: average memory performance improves by 31.1%, matrix computations of various sizes reach speedups of up to 1.573, and overall task time falls by 27.8% on average (a counting-replacement sketch appears after this list).
    • 裴颂文; 张俊格; 宁静
    • Abstract: For applications with irregular memory access, when an application's memory-access cost exceeds its computation cost, the memory-access cost of a traditional helper thread exceeds the main thread's computation cost and the helper thread falls behind the main thread. An improved, parameter-controlled helper-thread prefetching model is therefore proposed: a gradient descent algorithm solves for the optimal values of the control parameters, which effectively apportions the memory-access work between the helper thread and the main thread so that the helper thread stays ahead. Experimental results show that the parameter-controlled prefetching model achieves a system speedup of 1.1 to 1.5 times (a gradient-descent sketch appears after this list).
    • 张华亮; 黄启印; 吴少校
    • Abstract: Linpack is used to measure the floating-point performance of a computer system, with GotoBLAS as the math library; the library's performance has a large impact on the Linpack result. To raise GotoBLAS performance, its behavior on the Loongson 3A2000 processor was examined, the benchmark's execution flow and data handling were analyzed, and then, according to the processor's structural features, the matrix blocking parameters were configured, the implementation of the core loop was optimized, and software and hardware data prefetching were applied together with an optimized kernel TLB configuration. With these optimizations combined, the floating-point units reach over 90% efficiency in the core routine on the simulation platform, showing that the optimization scheme is effective in this experiment.
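
The entries above summarize their mechanisms at a high level; the short C sketches below illustrate some of them under stated assumptions. They are illustrative sketches, not the authors' implementations. First, for the HygonBLIS entry (蔡雨 et al.): cache blocking is the idea behind tuning the matrix blocking parameters. The kernel and the block size NB below are hypothetical; a real BLAS would also vectorize the inner loop and choose NB by auto-tuning.

#include <stddef.h>

#define NB 64  /* blocking parameter; a real library picks this by auto-tuning */

/* C += A * B for n x n row-major matrices, processed tile by tile so that
 * the working set of each tile fits in cache. */
void dgemm_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += NB)
        for (size_t kk = 0; kk < n; kk += NB)
            for (size_t jj = 0; jj < n; jj += NB)
                for (size_t i = ii; i < ii + NB && i < n; i++)
                    for (size_t k = kk; k < kk + NB && k < n; k++) {
                        double a = A[i * n + k];
                        /* contiguous inner loop: candidate for vectorization */
                        for (size_t j = jj; j < jj + NB && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}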
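
For the feedback prefetching entry (Zhang Qianlong et al.): the feedback path between the cache lookup and the prefetch engine is a hardware feature, but the depth-1 prefetch it controls has a simple software analogue, assuming GCC/Clang's __builtin_prefetch and a singly linked list. In the paper, a hit on the prefetch lookup is fed back so the prefetcher can issue a deeper prefetch for the same producer load; here the depth is fixed at 1.

#include <stddef.h>

struct node {
    long payload;
    struct node *next;
};

/* Walk a linked list with a prefetch depth of 1: the producer load p->next
 * yields the address consumed by the next iteration, so we request it early. */
long walk_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        if (p->next != NULL)
            __builtin_prefetch(p->next, 0, 1);  /* read, low temporal locality */
        sum += p->payload;
    }
    return sum;
}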
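
For the SADPA entry (裴颂文, 赵梦旖, 姬燕飞): a minimal simulated-annealing sketch for picking a prefetch threshold. The cost model, neighbourhood, and cooling schedule are hypothetical stand-ins for the measured average access latency of the DRAM/NVM system.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical cost model standing in for measured average access latency;
 * it is minimised near a threshold of 42. */
static double cost_model(int threshold)
{
    return (threshold - 42) * (threshold - 42) + 100.0;
}

int main(void)
{
    int cur = 1, best = 1;
    double temp = 100.0;

    srand(7);
    while (temp > 0.01) {
        int cand = cur + (rand() % 11) - 5;   /* random neighbour of the threshold */
        if (cand < 1)
            cand = 1;
        double delta = cost_model(cand) - cost_model(cur);
        /* Accept improvements always, and worse moves with probability
         * exp(-delta/temp): the random factor that lets the search escape
         * local optima. */
        if (delta < 0 || (double)rand() / RAND_MAX < exp(-delta / temp))
            cur = cand;
        if (cost_model(cur) < cost_model(best))
            best = cur;
        temp *= 0.98;                          /* cooling schedule */
    }
    printf("chosen prefetch threshold: %d\n", best);
    return 0;
}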
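
For the Alluxio CPR entry (董文菁 et al.): a sketch of association-rule-guided prefetching. The rule table is hard-coded and hypothetical; in CPR the rules would be mined from block access history, grouped, and refreshed by a background thread.

#include <stdio.h>

struct rule { int antecedent, consequent; double confidence; };

/* hypothetical mined rules: "after block A is read, block B usually follows" */
static const struct rule rules[] = {
    { 10, 11, 0.92 },
    { 11, 27, 0.40 },
    { 27, 30, 0.81 },
};
#define N_RULES (sizeof rules / sizeof rules[0])
#define MIN_CONF 0.6

static void prefetch_block(int id) { printf("  prefetch block %d\n", id); }

/* On every read, fire the rules whose antecedent matches and whose
 * confidence clears the threshold. */
static void on_block_read(int id)
{
    printf("read block %d\n", id);
    for (size_t i = 0; i < N_RULES; i++)
        if (rules[i].antecedent == id && rules[i].confidence >= MIN_CONF)
            prefetch_block(rules[i].consequent);
}

int main(void)
{
    const int trace[] = { 10, 11, 27 };
    for (int i = 0; i < 3; i++)
        on_block_read(trace[i]);
    return 0;
}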
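
For the ILAMP entry (刘天义 et al.): the instruction label and the memory-controller prefetch are hardware features, but the effect can be imitated in software for an array-of-pointers traversal by prefetching the object a just-loaded pointer names before it is consumed (again assuming __builtin_prefetch; this is an analogue, not the paper's mechanism).

#include <stddef.h>

struct record { long key; long value; };

/* Sum values reached through a table of pointers. As soon as the pointer
 * for the next iteration has been loaded, prefetch the record it points to,
 * so the dependent access overlaps with the current iteration's work. */
long sum_indirect(struct record **table, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch(table[i + 1], 0, 1);  /* pointee of the next pointer */
        sum += table[i]->value;
    }
    return sum;
}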
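
For the deduplication entry (姚敏 et al.): a minimal Bloom-filter routing sketch. Each node keeps a Bloom filter over the chunk fingerprints it stores; a chunk is routed to a node that probably already holds it, otherwise by plain hashing. The filter size, the two hash functions, and the fallback routing are illustrative assumptions; the paper's sampling-based routing information is not modelled.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BITS  1024          /* bits per filter (illustrative) */
#define NODES 4

struct bloom { uint8_t bits[BITS / 8]; };

static void set_bit(struct bloom *b, uint64_t h) { b->bits[(h % BITS) / 8] |= (uint8_t)(1u << (h % 8)); }
static int  get_bit(const struct bloom *b, uint64_t h) { return (b->bits[(h % BITS) / 8] >> (h % 8)) & 1; }

static void bloom_add(struct bloom *b, uint64_t fp)
{
    set_bit(b, fp);
    set_bit(b, fp * 0x9e3779b97f4a7c15ULL);   /* second hash of the fingerprint */
}

static int bloom_maybe_has(const struct bloom *b, uint64_t fp)
{
    return get_bit(b, fp) && get_bit(b, fp * 0x9e3779b97f4a7c15ULL);
}

/* Route a chunk fingerprint: prefer a node whose filter says "probably here"
 * (likely dedup hit); otherwise hash it to a node and remember it there. */
static int route_chunk(struct bloom nodes[], uint64_t fp)
{
    for (int i = 0; i < NODES; i++)
        if (bloom_maybe_has(&nodes[i], fp))
            return i;
    int target = (int)(fp % NODES);
    bloom_add(&nodes[target], fp);
    return target;
}

int main(void)
{
    struct bloom nodes[NODES];
    memset(nodes, 0, sizeof nodes);

    const uint64_t fps[] = { 111, 222, 111, 333 };   /* 111 repeats: dedup hit */
    for (int i = 0; i < 4; i++)
        printf("chunk %llu -> node %d\n", (unsigned long long)fps[i],
               route_chunk(nodes, fps[i]));
    return 0;
}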
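
For the hierarchical L2 entry (张多利 et al.): a sketch of a counting replacement policy on a tiny fully associative cache, the idea used there to limit L2 pollution. The cache size and the access trace are illustrative.

#include <stdio.h>

#define WAYS 4

struct line { int valid; long tag; unsigned count; };

static struct line cache[WAYS];

/* Look up a tag; on a miss, evict the line with the smallest reference count
 * so that streamed-through data cannot displace frequently reused lines. */
static int lookup(long tag)
{
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (cache[i].valid && cache[i].tag == tag) {
            cache[i].count++;                 /* hit: bump the counter */
            return 1;
        }
        if (!cache[i].valid || cache[i].count < cache[victim].count)
            victim = i;
    }
    cache[victim] = (struct line){ 1, tag, 1 };
    return 0;
}

int main(void)
{
    const long trace[] = { 1, 2, 1, 3, 4, 5, 1 };
    for (int i = 0; i < 7; i++)
        printf("tag %ld: %s\n", trace[i], lookup(trace[i]) ? "hit" : "miss");
    return 0;
}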
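
For the helper-thread entry (裴颂文, 张俊格, 宁静): a sketch of tuning a single control parameter by gradient descent. The lag model is a hypothetical differentiable stand-in for how far the helper thread trails the main thread as a function of the share k of memory accesses delegated to it; the paper's actual model and parameters are not reproduced.

#include <stdio.h>

/* Hypothetical lag model with its minimum at k = 0.7. */
static double lag(double k)
{
    return (k - 0.7) * (k - 0.7);
}

static double dlag(double k)                  /* central-difference derivative */
{
    const double h = 1e-6;
    return (lag(k + h) - lag(k - h)) / (2 * h);
}

int main(void)
{
    double k = 0.1, rate = 0.1;
    for (int step = 0; step < 200; step++)
        k -= rate * dlag(k);                  /* follow the negative gradient */
    printf("control parameter k = %.3f\n", k);
    return 0;
}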
