Journal of Signal Processing Systems for Signal, Image, and Video Technology

FPGA-Based Inter-layer Pipelined Accelerators for Filter-Wise Weight-Balanced Sparse Fully Convolutional Networks with Overlapped Tiling



Abstract

Convolutional neural networks (CNNs) achieve state-of-the-art performance on computer-vision tasks. Many deployment scenarios, such as edge environments, demand high-speed, low-power, and high-accuracy CNN hardware. However, the number of weights is so large that embedded systems cannot store them in their limited on-chip memory. An alternative is to shrink the input image for real-time processing, but this causes a considerable drop in accuracy. Although pruned sparse CNNs and dedicated accelerators have been proposed, their need for random access to nonzero weights requires a large number of wide multiplexers to reach a high degree of parallelism, making the hardware complicated and ill-suited to FPGA implementation. To address this problem, we propose filter-wise pruning with distillation together with a block-RAM (BRAM)-based zero-weight-skipping accelerator. The pruning eliminates weights so that every filter retains the same number of nonzero weights, and retraining with distillation preserves comparable accuracy. This weight balance lets our accelerator exploit inter-filter parallelism, in which the processing block for a layer executes filters concurrently with a straightforward architecture. We also propose an overlapped tiling algorithm that extracts tiles with overlap, preventing both accuracy degradation at tile borders and high BRAM utilization for storing high-resolution images. In our evaluation on semantic-segmentation tasks, the FPGA design achieved a 1.8-times speedup and 18.0-times higher power efficiency than a desktop GPU. Compared with a conventional FPGA implementation, it delivered a 1.09-times speedup and a 6.6-point accuracy improvement. Our approach is therefore well suited to FPGA implementation and retains considerable accuracy for embedded-system applications.
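The filter-wise balance described above (every filter keeping exactly the same number of nonzero weights) can be sketched as a magnitude-based top-k prune per filter. This is a minimal illustrative sketch, not the paper's implementation: the function name, the list-of-lists weight representation, and the tie-breaking behavior are assumptions, and the distillation-based retraining step is omitted.

```python
def prune_filter_wise(filters, keep):
    """Zero out all but the `keep` largest-magnitude weights in each filter.

    Because every filter ends up with exactly `keep` nonzero weights, a
    hardware processing block can run all filters in lockstep (the
    inter-filter parallelism the accelerator exploits) without wide
    multiplexers for irregular sparsity patterns.
    """
    pruned = []
    for f in filters:
        # Indices of the `keep` largest-magnitude weights in this filter.
        top = sorted(range(len(f)), key=lambda i: abs(f[i]), reverse=True)[:keep]
        kept = set(top)
        # Keep the selected weights; zero everything else.
        pruned.append([w if i in kept else 0.0 for i, w in enumerate(f)])
    return pruned
```

In the paper's flow this pruning would be followed by retraining with distillation to recover accuracy; the sketch only shows the balancing step itself.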
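The overlapped tiling idea can likewise be sketched as extracting fixed-size tiles whose neighbors share a halo of pixels, so that border pixels of each tile still see enough context. This is a hedged sketch under assumed parameters (tile size, overlap width, and the coordinate convention are illustrative, not taken from the paper).

```python
def overlapped_tiles(height, width, tile, overlap):
    """Return (y0, y1, x0, x1) bounds of tiles covering a height x width
    image, where adjacent tiles share `overlap` rows/columns.

    The overlap supplies context at tile borders so per-tile inference
    does not degrade accuracy, while each tile stays small enough to fit
    in on-chip BRAM instead of buffering the full high-resolution image.
    """
    stride = tile - overlap
    assert stride > 0, "overlap must be smaller than the tile size"
    tiles = []
    for y in range(0, max(height - overlap, 1), stride):
        for x in range(0, max(width - overlap, 1), stride):
            # Clamp the last tile in each direction to the image border.
            tiles.append((y, min(y + tile, height), x, min(x + tile, width)))
    return tiles
```

For example, an 8x8 image with 4x4 tiles and an overlap of 2 yields a 3x3 grid of tiles whose neighbors share a 2-pixel band.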
