Batched one-sided factorizations of tiny matrices using GPUs: Challenges and countermeasures

Abdelfattah Ahmad; Haidar Azzam; Tomov Stanimire; Dongarra Jack

Abstract

The use of batched matrix computations recently gained a lot of interest for applications, where the same operation is applied to many small independent matrices. The batched computational pattern is frequently encountered in applications of data analytics, direct/iterative solvers and preconditioners, computer vision, astrophysics, and more, and often requires specific designs for vectorization and extreme parallelism to map well on today's high-end many-core architectures. This has led to the development of optimized software for batch computations, and to an ongoing community effort to develop standard interfaces for batched linear algebra software. Furthering these developments, we present GPU design and optimization techniques for high-performance batched one-sided factorizations of millions of tiny matrices (of size 32 and less). We quantify the effects and relevance of different techniques in order to select the best-performing LU, QR, and Cholesky factorization designs. While we adapt common optimization techniques, such as optimal memory traffic, register blocking, and concurrency control, we also show that a different mindset and techniques are needed when matrices are tiny, and in particular, sub-vector/warp in size. The proposed routines are part of the MAGMA library and deliver significant speedups compared to their counterparts in currently available vendor-optimized libraries. Notably, we tune the developments for the newest V100 GPU from NVIDIA to show speedups of up to 11.8x. (C) 2018 Elsevier B.V. All rights reserved.
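
The batched pattern described in the abstract (the same factorization applied to many small, independent matrices) can be illustrated with a minimal CPU sketch. This is not the MAGMA interface and not the authors' GPU design: the names `cholesky_in_place` and `batched_cholesky` are hypothetical, the matrices are plain Python lists, and the loop over the batch stands in for what a GPU implementation would map to independent threads or warps.

```python
import math


def cholesky_in_place(a):
    """Factor one small symmetric positive definite matrix as a = L * L^T.

    `a` is a square list of lists; on return it holds the lower-triangular
    factor L, with the strict upper triangle set to zero.
    """
    n = len(a)
    for j in range(n):
        # Diagonal entry: remove the contribution of the columns already factored.
        d = a[j][j] - sum(a[j][k] * a[j][k] for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        a[j][j] = math.sqrt(d)
        # Entries of column j below the diagonal.
        for i in range(j + 1, n):
            s = a[i][j] - sum(a[i][k] * a[j][k] for k in range(j))
            a[i][j] = s / a[j][j]
    # Clear the strict upper triangle so the result is exactly L.
    for i in range(n):
        for j in range(i + 1, n):
            a[i][j] = 0.0
    return a


def batched_cholesky(batch):
    """Apply the same factorization to every matrix in the batch.

    Each iteration is independent of the others, which is the property a
    batched GPU routine exploits by processing many matrices concurrently.
    """
    for a in batch:
        cholesky_in_place(a)
    return batch
```

Calling `batched_cholesky([[[4.0, 2.0], [2.0, 3.0]], [[9.0]]])` factors a 2x2 and a 1x1 matrix in one call; the matrices in a batch need not interact, so the outer loop carries no dependencies.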

Bibliographic details

Source
Journal of Computational Science | 2018, Issue 5 | pp. 226-236 | 11 pages
Authors
Abdelfattah Ahmad; Haidar Azzam; Tomov Stanimire; Dongarra Jack;
Author affiliations

Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996 USA;
Original format: PDF
Language: English
Keywords
GPU computing; Matrix factorization; Batch computation;

Similar literature

1. Factorization and Inversion of a Million Matrices using GPUs: Challenges and Countermeasures [J]. Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov. Procedia Computer Science, 2017, Issue 1.
2. Fast Cholesky factorization on GPUs for batch and native modes in MAGMA [J]. Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov. Journal of Computational Science, 2017, May issue.
3. Implementation and Tuning of Batched Cholesky Factorization and Solve for NVIDIA GPUs [J]. Jakub Kurzak, Hartwig Anzt, Mark Gates. IEEE Transactions on Parallel and Distributed Systems, 2016, Issue 7.
4. Batched Cholesky factorization for tiny matrices [C]. Florian Lemaitre, Lionel Lacassagne. Conference on Design and Architectures for Signal and Image Processing, 2016.
5. Fault Tolerant and Energy Efficient One-Sided Matrix Decompositions on Heterogeneous Systems with GPUs [D]. Jieyang Chen. 2019.
6. NMF-mGPU: non-negative matrix factorization on multi-GPU systems [O]. Edgardo Mejía-Roa, Daniel Tabas-Madrid, Javier Setoain. 2015.
7. Factorization and Inversion of a Million Matrices using GPUs: Challenges and Countermeasures [O]. Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov. 2017.
