International Conference on Field Programmable Logic and Applications

Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA



Abstract

Despite its popularity, deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to large data volume, intensive computation and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (i.e., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources to maximize parallelism and throughput. A direct hardware-level design (i.e., RTL) can improve the efficiency and achieve greater acceleration. However, this requires an in-depth understanding of both the algorithm structure and the FPGA system architecture. In this work, we present a scalable solution that integrates the flexibility of high-level synthesis and the finer-level optimization of an RTL implementation. The cornerstone is a compiler that analyzes the CNN structure and parameters, and automatically generates a set of modular and scalable computing primitives that can accelerate various deep learning algorithms. Integrating these modules together for end-to-end CNN implementations, this work quantitatively analyzes the compiler's design strategy to optimize the throughput of a given CNN model under the FPGA resource constraints. The proposed methodology is demonstrated on an Altera Stratix-V GXA7 FPGA for the AlexNet and NIN CNN models, achieving 114.5 GOPS and 117.3 GOPS, respectively. This represents a 1.9× improvement in throughput when compared to the OpenCL-based design. The results illustrate the promise of the automatic compiler solution for modularized and scalable hardware acceleration of deep learning.
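The design-space exploration the abstract alludes to (choosing parallelism to maximize throughput under a resource budget) can be illustrated with a minimal sketch. This is a hypothetical model, not the paper's actual compiler: the single shared unroll pair, the one-DSP-per-MAC cost, the 200 MHz clock, and the AlexNet-like layer shapes are all illustrative assumptions.

```python
def conv_macs(out_h, out_w, out_c, in_c, k):
    """Multiply-accumulate (MAC) count for one convolutional layer."""
    return out_h * out_w * out_c * in_c * k * k

def best_parallelism(layers, dsp_budget, freq_hz=200e6):
    """Pick one (P_out, P_in) unroll pair shared by all layers that fits
    the DSP budget and maximizes estimated end-to-end throughput (GOPS).
    Assumes one DSP per parallel MAC and ignores memory bandwidth."""
    best = None
    for p_out in range(1, 65):          # output-channel parallelism
        for p_in in range(1, 65):       # input-channel parallelism
            if p_out * p_in > dsp_budget:
                continue
            # Each layer finishes in ceil(MACs / (P_out * P_in)) cycles.
            cycles = sum(-(-conv_macs(*l) // (p_out * p_in)) for l in layers)
            total_ops = 2 * sum(conv_macs(*l) for l in layers)  # 1 MAC = 2 ops
            gops = total_ops / (cycles / freq_hz) / 1e9
            if best is None or gops > best[2]:
                best = (p_out, p_in, gops)
    return best
```

Under this simplified cost model the search saturates the DSP budget, which mirrors the abstract's point: the compiler's job is to pick the resource allocation that maximizes sustained throughput for a given CNN, rather than leaving that trade-off to generic HLS heuristics.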
