Journal: Electronics

A Scalable Matrix Computing Unit Architecture for FPGA, and SCUMO User Design Interface

Abstract

High-dimensional matrix algebra is essential to many signal processing and machine learning algorithms. This work describes a scalable square-matrix computing unit designed on the basis of circulant matrices. It optimizes data flow for any sequence of matrix operations, removing the need to move intermediate results, and it optimizes the performance of each individual operation in direct or transposed form (transposition only requires a change in data addressing). The supported operations are matrix-by-matrix addition, subtraction, dot product, and multiplication; matrix-by-vector multiplication; and matrix-by-scalar multiplication. The proposed architecture is fully scalable, with the maximum matrix dimension limited by the available resources. A design environment is also provided that guides the user, through a friendly interface, from customization of the hardware computing unit to generation of the final synthesizable IP core. For $N \times N$ matrices, the architecture requires $N$ ALU-RAM blocks and operates with $O(N^2)$ time complexity, requiring $N^2 + 7$ and $N + 7$ clock cycles for matrix-matrix and matrix-vector operations, respectively. On the tested Virtex-7 FPGA device, computation on $500 \times 500$ matrices allows a maximum clock frequency of 346 MHz, for an overall performance of 173 GOPS. The architecture achieves higher performance than other state-of-the-art matrix computing units.
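
The cycle counts and throughput quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the throughput model (each of the N ALU-RAM blocks completing one operation per clock cycle) is an assumption that happens to reproduce the reported 173 GOPS, not a model confirmed by the abstract.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# All function names are hypothetical; they are not part of SCUMO.

def matrix_matrix_cycles(n: int) -> int:
    """Clock cycles for an N x N matrix-matrix operation (N^2 + 7, per the abstract)."""
    return n * n + 7


def matrix_vector_cycles(n: int) -> int:
    """Clock cycles for an N x N matrix-by-vector operation (N + 7, per the abstract)."""
    return n + 7


def latency_us(cycles: int, clock_hz: float) -> float:
    """Latency in microseconds for a given cycle count and clock frequency."""
    return cycles / clock_hz * 1e6


def peak_gops(n: int, clock_hz: float) -> float:
    """ASSUMPTION: N ALUs, each finishing one operation per clock cycle."""
    return n * clock_hz / 1e9


N = 500            # matrix dimension used in the reported test
CLOCK_HZ = 346e6   # maximum clock frequency reported for the Virtex-7 device

mm_cycles = matrix_matrix_cycles(N)   # 250007
mv_cycles = matrix_vector_cycles(N)   # 507

print(f"matrix-matrix: {mm_cycles} cycles, {latency_us(mm_cycles, CLOCK_HZ):.2f} us")
print(f"matrix-vector: {mv_cycles} cycles, {latency_us(mv_cycles, CLOCK_HZ):.2f} us")
print(f"assumed peak throughput: {peak_gops(N, CLOCK_HZ):.1f} GOPS")
```

Under these assumptions the fixed 7-cycle term is negligible for large N, so a matrix-matrix operation costs roughly N cycles per output row while the number of ALU-RAM blocks grows only linearly with N.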
