High-Performance Bidiagonal Reduction using Tile Algorithms on Homogeneous Multicore Architectures

HATEM LTAIEF; PIOTR LUSZCZEK; JACK DONGARRA


Abstract

This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. It extends the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, which is one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. The high performance of the BRD described in this article comes from the combination of four important features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that casts most of the computation during the first stage (reduction to band form) into calls to Level 3 BLAS and reduces the memory traffic during the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout into the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that the data dependencies are not violated. A detailed analysis examines the critical impact on the total execution time of the tile size, which also corresponds to the matrix bandwidth after the first-stage reduction. The performance results show a significant improvement over currently established alternatives. On a 16-core Intel Xeon machine with a 12000 × 12000 matrix, the new high-performance BRD achieves up to a 30-fold speedup over state-of-the-art open source and commercial numerical software packages, namely LAPACK compiled with optimized and multithreaded BLAS from MKL, as well as Intel MKL version 10.2.
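
To make features (1) and (3) concrete, the following is a minimal sketch of how an element index in column-major storage could be translated to a tile data layout. It is not the authors' implementation: the abstract does not state the tile ordering, so the sketch assumes square `nb × nb` tiles, tiles ordered column by column, elements stored column-major inside each tile, and matrix dimensions divisible by `nb`. The function names `column_major_offset`, `tile_offset`, and `to_tile_layout` are hypothetical.

```python
def column_major_offset(i, j, m):
    """Offset of element (i, j) in an m-row matrix stored column by column."""
    return j * m + i


def tile_offset(i, j, m, nb):
    """Offset of element (i, j) when the matrix is stored as contiguous tiles.

    Assumptions (not taken from the article): nb x nb tiles, tiles ordered
    column by column, column-major order inside each tile, m divisible by nb.
    """
    mt = m // nb                  # number of tile rows
    ti, tj = i // nb, j // nb     # coordinates of the tile holding (i, j)
    li, lj = i % nb, j % nb       # coordinates of (i, j) inside that tile
    tile_index = tj * mt + ti     # position of the tile in the tile sequence
    return tile_index * nb * nb + lj * nb + li


def to_tile_layout(a, m, n, nb):
    """Copy a flat column-major m x n matrix into the assumed tile layout."""
    if m % nb or n % nb:
        raise ValueError("this sketch assumes dimensions divisible by nb")
    out = [None] * (m * n)
    for j in range(n):
        for i in range(m):
            out[tile_offset(i, j, m, nb)] = a[column_major_offset(i, j, m)]
    return out


if __name__ == "__main__":
    # 4 x 4 matrix whose entry (i, j) equals its column-major offset 4*j + i.
    a = list(range(16))
    print(to_tile_layout(a, 4, 4, 2))
```

With this layout each tile occupies one contiguous block of `nb * nb` entries, which is the property that lets a kernel operating on a single tile work on a compact region of memory.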

Bibliographic details

Source
ACM Transactions on Mathematical Software | 2013, Issue 3 | 16.1-16.22 | 22 pages
Authors
HATEM LTAIEF; PIOTR LUSZCZEK; JACK DONGARRA
Author affiliations

KAUST Supercomputing Laboratory, Thuwal, Saudi Arabia;

Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996;

Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996;

Original format: PDF
Language: English
Keywords
Bidiagonal reduction; tile algorithms; two-stage approach; bulge chasing; data translation layer; high performance kernels; dynamic scheduling

Similar literature

1. Parallel Two-Sided Matrix Reduction to Band Bidiagonal Form on Multicore Architectures [J]. Ltaief H., Kurzak J., Dongarra J. IEEE Transactions on Parallel and Distributed Systems. 2010, Issue 4
2. Analysis of dynamically scheduled tile algorithms for dense linear algebra on multicore architectures [J]. Azzam Haidar, Hatem Ltaief, Asim YarKhan. Concurrency and Computation: Practice and Experience. 2012, Issue 3
3. Scheduling Two-Sided Transformations Using Tile Algorithms on Multicore Architectures [J]. Hatem Ltaief, Jakub Kurzak, Jack Dongarra. Scientific Programming. 2010, Issue 1
4. Enhancing Parallelism of Tile Bidiagonal Transformation on Multicore Architectures Using Tree Reduction [C]. Hatem Ltaief, Piotr Luszczek, Jack Dongarra. International Conference on Parallel Processing and Applied Mathematics. 2012
5. Tiled algorithms for matrix computations on multicore architectures [D]. Bouwmeester, Henricus M. 2012
6. Spatial Division Multiplexed Microwave Signal processing by selective grating inscription in homogeneous multicore fibers [O]. Ivana Gasulla, David Barrera, Javier Hervás
7. High-performance bidiagonal reduction using tile algorithms on homogeneous multicore architectures [O]. Ltaief, Hatem; Luszczek, Piotr R.; Dongarra, Jack. 2013
