Journal: ACM Transactions on Mathematical Software

High-Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures



Abstract

This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. It extends the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, which is one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. The high performance of the BRD described in this article comes from the combination of four important features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that casts most of the computation during the first stage (reduction to band form) into calls to Level 3 BLAS and reduces the memory traffic during the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout into the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that the data dependencies are not violated. A detailed analysis examines the critical impact on the total execution time of the tile size, which also corresponds to the matrix bandwidth after the first-stage reduction. The performance results show a significant improvement over currently established alternatives. On a 16-core Intel Xeon machine with a 12000 × 12000 matrix, the new high-performance BRD achieves up to a 30-fold speedup over state-of-the-art open-source and commercial numerical software packages, namely LAPACK compiled with optimized, multithreaded BLAS from MKL, and Intel MKL version 10.2.
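To make concrete what a bidiagonal reduction computes, here is a minimal NumPy sketch of the classic one-stage Householder (Golub-Kahan) reduction. It is a textbook illustration only, not the two-stage tile algorithm the abstract describes; the function names `householder` and `bidiagonalize` are hypothetical, and the sketch assumes a real matrix with at least as many rows as columns.

```python
import numpy as np


def householder(x):
    """Return a unit vector v so that (I - 2 v v^T) x is a multiple of e1.

    For a zero input the returned v is zero, which makes the reflector
    the identity.
    """
    v = np.array(x, dtype=float)
    norm_x = np.linalg.norm(v)
    if norm_x == 0.0:
        return v
    # Add sign(x0) * ||x|| to the first entry to avoid cancellation.
    v[0] += np.copysign(norm_x, v[0])
    return v / np.linalg.norm(v)


def bidiagonalize(a):
    """Reduce a real m x n matrix (m >= n) to upper bidiagonal form.

    Orthogonal reflectors are applied from the left and the right, so the
    result has the same singular values as the input. The reflectors are
    not accumulated; only the reduced matrix is returned.
    """
    b = np.array(a, dtype=float)
    m, n = b.shape
    if m < n:
        raise ValueError("this sketch assumes m >= n")
    for k in range(n):
        # Left reflector: zero the entries below the diagonal in column k.
        v = householder(b[k:, k])
        b[k:, k:] -= 2.0 * np.outer(v, v @ b[k:, k:])
        if k < n - 2:
            # Right reflector: zero the entries right of the superdiagonal in row k.
            w = householder(b[k, k + 1:])
            b[k:, k + 1:] -= 2.0 * np.outer(b[k:, k + 1:] @ w, w)
    return b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((6, 4))
    b = bidiagonalize(a)
    # Only the diagonal and first superdiagonal should be non-negligible.
    print(np.round(b, 3))
    # The singular values of the input and the reduced matrix should agree.
    print(np.linalg.svd(a, compute_uv=False))
    print(np.linalg.svd(b, compute_uv=False))
```

Each step of this one-stage form updates the whole trailing submatrix with matrix-vector style operations, which is memory bound. That is the cost the two-stage approach in the abstract addresses by first reducing to band form with Level 3 BLAS calls.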
