International Conference on Field Programmable Logic and Applications

Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA



Abstract

Despite its popularity, deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to large data volume, intensive computation and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (i.e., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration. However, this requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. In this work, we present a scalable solution that integrates the flexibility of high-level synthesis and the finer-level optimization of an RTL implementation. The cornerstone is a compiler that analyzes the CNN structure and parameters, and automatically generates a set of modular and scalable computing primitives that can accelerate various deep learning algorithms. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under the FPGA resource constraints. The proposed methodology is demonstrated on an Altera Stratix-V GXA7 FPGA for the AlexNet and NIN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9× improvement in throughput when compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
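The design-space exploration the abstract alludes to (choosing parallelism to maximize throughput under a resource budget) can be illustrated with a minimal sketch. This is a hypothetical model, not the paper's actual compiler: the single shared unroll pair, the one-DSP-per-MAC cost, the 200 MHz clock, and the AlexNet-like layer shapes are all illustrative assumptions.

```python
def conv_macs(out_h, out_w, out_c, in_c, k):
    """Multiply-accumulate (MAC) count for one convolutional layer."""
    return out_h * out_w * out_c * in_c * k * k

def best_parallelism(layers, dsp_budget, freq_hz=200e6):
    """Pick one (P_out, P_in) unroll pair shared by all layers that fits
    the DSP budget and maximizes estimated end-to-end throughput (GOPS).
    Assumes one DSP per parallel MAC and ignores memory bandwidth."""
    best = None
    for p_out in range(1, 65):          # output-channel parallelism
        for p_in in range(1, 65):       # input-channel parallelism
            if p_out * p_in > dsp_budget:
                continue
            # Each layer finishes in ceil(MACs / (P_out * P_in)) cycles.
            cycles = sum(-(-conv_macs(*l) // (p_out * p_in)) for l in layers)
            total_ops = 2 * sum(conv_macs(*l) for l in layers)  # 1 MAC = 2 ops
            gops = total_ops / (cycles / freq_hz) / 1e9
            if best is None or gops > best[2]:
                best = (p_out, p_in, gops)
    return best
```

Under this simplified cost model the search saturates the DSP budget, which mirrors the abstract's point: the compiler's job is to pick the resource allocation that maximizes sustained throughput for a given CNN, rather than leaving that trade-off to generic HLS heuristics.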
