|Keywords||FPGA; convolutional neural network; accelerator; block convolution|
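1. Loop tiling is an important technique for relieving on-chip storage pressure in FPGA-based designs. However, the nature of convolution creates data dependencies at the boundaries of adjacent tiles. As a result, data cached off chip is physically non-contiguous and must undergo complex reorganization before being transferred back on chip, which greatly increases the complexity of DMA transfers and raises transfer latency. To address this problem, this article proposes an FPGA-oriented optimization method for large-scale deep convolutional neural networks: block convolution. Its goal is to replace the global convolution with convolutions over multiple local regions, thereby completely eliminating data dependencies between blocks. We apply block convolution to an existing VGG-16 accelerator. The results show that the block-convolution-based accelerator can fully utilize the transfer bandwidth, reduces the design complexity and configuration overhead of DMA data transfers, and lowers system latency by 35%. Moreover, block convolution incurs no loss in classification accuracy on the ILSVRC 12 dataset compared with the original model.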
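2. Frequent DRAM access is the main source of system power consumption and latency. Efficient computation requires minimizing off-chip transfers, but for large-scale networks such as VGG-16, off-chip transfers remain unavoidable whenever the absolute size of a layer's intermediate results exceeds the on-chip buffer capacity. We find that deep convolutional neural networks based on block convolution lend themselves well to multi-layer fusion: when multiple convolutional layers are fused and computed together, the results of intermediate layers do not need to be buffered, which greatly reduces on-chip buffer overhead. In addition, because block convolution eliminates inter-block dependencies, accelerator throughput can be further improved. On this basis, we design an accelerator based on block convolution and multi-layer fusion. We are the first to achieve efficient acceleration of the VGG-16 network on a low-to-mid-range, resource-constrained FPGA (Xilinx ZYNQ ZC706) with no off-chip transfer of intermediate results at all, reaching a performance of 394.75 GOP/s and a frame rate of 12.19 frames per second, the highest overall performance recorded on low-to-mid-range 28 nm FPGAs. We also ran experiments with a similar approach on the YOLOv2 detection model, where the accelerator reaches 263.33 GOP/s and meets real-time requirements.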
In recent years, deep Convolutional Neural Networks (CNNs) have been adopted in various applications such as image classification, object detection, and speech recognition. They deliver unprecedented accuracy but suffer from huge computational complexity. As networks grow in depth and width, CPU-based solutions can no longer meet the requirements of real-world applications. In addition, although GPUs have powerful computational capabilities, their size and energy consumption prevent efficient deployment on edge computing devices such as drones and wearable devices. As a result, dedicated hardware accelerators for CNNs have received increasing attention in the past few years. Among them, FPGA-based accelerators have attracted major efforts in both industry and academia due to their configurability, energy efficiency, and high performance-to-power ratio.
Deep learning algorithms usually contain a backbone network with strong feature representation ability. However, high accuracy comes with a huge volume of computation, parameters, and intermediate data in such large-scale networks. Therefore, how to effectively deploy this type of large-scale deep network on a resource-constrained FPGA platform is key to the overall system's ability to meet real-world requirements. In this article, we mainly focus on the optimization and deployment of very large-scale CNNs on resource-limited FPGAs. The concerns and contributions of this article are summarized as follows:
1. One of the biggest challenges in using FPGAs to accelerate large-scale deep networks is the limited on-chip resources for both computation and storage. A common practice is to partition each layer into several groups and compute them in batches. The partial results of a group are first transferred to off-chip memory and then fetched back for the computation of the next layer. However, because of the nature of convolution, data dependencies exist among the boundary pixels of adjacent blocks. As a result, the data cached in off-chip memory are physically non-contiguous and must be reorganized before being fetched back to on-chip memory. This complex data reorganization introduces a large DMA configuration overhead. In this article, we propose a novel, simple yet effective FPGA-oriented approach named block convolution. Its main purpose is to replace the original global convolution with several independent local convolutions, so as to completely eliminate the data dependencies between adjacent blocks as well as the design overhead of memory management. We optimize an existing accelerator for the VGG-16 network based on block convolution; the results show that DMA transfer overhead is reduced by about 35%. In addition, with block convolution, the classification accuracy of the VGG-16 network on the ILSVRC 12 dataset is comparable to or even higher than that of the original model.
2. Another issue in deploying large-scale CNNs on resource-constrained FPGAs is that frequent off-chip transfers of intermediate data introduce significant energy consumption and latency, since operations on DRAM are much more expensive than those on SRAM and register files. Although existing efforts try to minimize off-chip transfers in accelerator design, for very large networks such as VGG-16, off-chip transfers cannot be totally removed. We observe that block convolution is well suited to multi-layer fusion: multiple convolutional layers can be fused together for computation, with no need to buffer the data one layer after another. With block convolution, even a very large CNN can fit on chip without any off-chip transfers of intermediate data. Finally, we design an accelerator based on block convolution and are the first to accelerate VGG-16 on a memory-limited FPGA (Xilinx ZYNQ ZC706 board) with all intermediate data staying on chip during inference. It achieves a throughput of 394.75 GOP/s and a frame rate of 12.19 fps, the best result reported so far among resource-limited FPGAs built on 28 nm technology. We also conduct experiments on the YOLOv2 object detection model; the accelerator meets the real-time requirement with a throughput of 263.33 GOP/s.
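The two contributions above can be illustrated with a minimal single-channel sketch in plain Python. It is not the accelerator's implementation: the function names, the 4x4 block size, the 8x8 input, and the use of zero padding at block borders are all assumptions chosen for illustration. The sketch shows that each block is convolved without reading any pixel from a neighbouring block, and that a block can therefore pass through several layers back to back before the next block is touched.

```python
def conv2d_same(x, w):
    """'Same'-size 2D convolution (cross-correlation form) with zero padding."""
    h, wd, k = len(x), len(x[0]), len(w)
    p = k // 2
    out = [[0.0] * wd for _ in range(h)]
    for i in range(h):
        for j in range(wd):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    r, c = i + di - p, j + dj - p
                    # Positions outside the input count as zero (assumed padding).
                    if 0 <= r < h and 0 <= c < wd:
                        acc += x[r][c] * w[di][dj]
            out[i][j] = acc
    return out


def fused_block_conv2d(x, weights, block):
    """Run every layer in `weights` on one block before moving to the next.

    Each block is padded and convolved on its own, so no halo pixels from
    neighbouring blocks are needed. The local `tile` variable stands in for
    an on-chip buffer: intermediate layer outputs never leave it.
    """
    h, wd = len(x), len(x[0])
    out = [[0.0] * wd for _ in range(h)]
    for r0 in range(0, h, block):
        for c0 in range(0, wd, block):
            r1, c1 = min(r0 + block, h), min(c0 + block, wd)
            tile = [row[c0:c1] for row in x[r0:r1]]
            for w in weights:
                tile = conv2d_same(tile, w)
            for i, row in enumerate(tile):
                out[r0 + i][c0:c1] = row
    return out


def block_conv2d(x, w, block):
    """Single-layer block convolution: the fused routine with one layer."""
    return fused_block_conv2d(x, [w], block)


# Hypothetical 8x8 input and an all-ones 3x3 kernel, split into 4x4 blocks.
x = [[float(8 * i + j) for j in range(8)] for i in range(8)]
ones = [[1.0] * 3 for _ in range(3)]

glob = conv2d_same(x, ones)      # ordinary global convolution
blk = block_conv2d(x, ones, 4)   # block convolution, one layer

# Layer-by-layer over the whole map vs. fused per block: same result.
layerwise = block_conv2d(block_conv2d(x, ones, 4), ones, 4)
fused = fused_block_conv2d(x, [ones, ones], 4)
```

In this sketch, `blk` agrees with `glob` at pixels whose 3x3 window lies inside one block and differs at block borders, which is why the abstract reports accuracy on ILSVRC 12 rather than claiming numerical equivalence. `fused` equals `layerwise` because no block ever reads data produced by another block.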
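|Li Gang. Research on Efficient Computation and Acceleration of Deep Convolutional Neural Networks Based on FPGA [D]. Beijing: University of Chinese Academy of Sciences, 2018.|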