Journal of Computational Science

Batched one-sided factorizations of tiny matrices using GPUs: Challenges and countermeasures

Abstract

The use of batched matrix computations recently gained a lot of interest for applications, where the same operation is applied to many small independent matrices. The batched computational pattern is frequently encountered in applications of data analytics, direct/iterative solvers and preconditioners, computer vision, astrophysics, and more, and often requires specific designs for vectorization and extreme parallelism to map well on today's high-end many-core architectures. This has led to the development of optimized software for batch computations, and to an ongoing community effort to develop standard interfaces for batched linear algebra software. Furthering these developments, we present GPU design and optimization techniques for high-performance batched one-sided factorizations of millions of tiny matrices (of size 32 and less). We quantify the effects and relevance of different techniques in order to select the best-performing LU, QR, and Cholesky factorization designs. While we adapt common optimization techniques, such as optimal memory traffic, register blocking, and concurrency control, we also show that a different mindset and techniques are needed when matrices are tiny, and in particular, sub-vector/warp in size. The proposed routines are part of the MAGMA library and deliver significant speedups compared to their counterparts in currently available vendor-optimized libraries. Notably, we tune the developments for the newest V100 GPU from NVIDIA to show speedups of up to 11.8x. (C) 2018 Elsevier B.V. All rights reserved.
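The batched pattern described in the abstract, one factorization applied independently to many tiny matrices, can be illustrated with a small CPU reference sketch. This is not the MAGMA implementation and does not reflect its API; the function names `cholesky_tiny` and `batched_cholesky` are hypothetical, and the code uses plain Python lists for clarity rather than performance.

```python
import math

def cholesky_tiny(a):
    """Unblocked Cholesky of one small symmetric positive definite matrix.

    `a` is a list of rows; returns the lower-triangular factor L with A = L * L^T.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: remove the contribution of columns already factored.
        d = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = s / l[j][j]
    return l

def batched_cholesky(batch):
    """Apply the same factorization to every matrix in the batch.

    Each iteration is independent of the others. Assumption for illustration:
    on a GPU, each matrix could be assigned to its own small thread group.
    """
    return [cholesky_tiny(a) for a in batch]

# Two independent 2x2 problems forming a (very small) batch.
batch = [
    [[4.0, 2.0], [2.0, 3.0]],
    [[9.0, 3.0], [3.0, 5.0]],
]
factors = batched_cholesky(batch)
```

Because no matrix depends on any other, the outer loop is the natural unit of parallelism. For matrices of size 32 or less, the work per matrix is too small to occupy a GPU on its own, which is why the batch dimension matters.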
