Research on Efficient Computation and Acceleration of Deep Convolutional Neural Networks Based on FPGA (基于FPGA的深度卷积神经网络高效计算与加速研究)
Author: 李钢
Degree type: Master of Engineering
Supervisor: 程健
Date: 2018-06
Degree-granting institution: University of Chinese Academy of Sciences
Degree-granting place: Beijing
Keywords: FPGA; convolutional neural network; accelerator; block convolution
Abstract
In recent years, deep convolutional neural network models have been widely used in applications such as image classification, object detection, and speech recognition, delivering unprecedented performance. However, along with high accuracy, deep models introduce enormous computational complexity. As networks grow wider and deeper, CPU-based computing systems can no longer meet practical requirements. GPUs, on the other hand, offer powerful compute but are constrained by size and power consumption, and cannot be efficiently deployed on edge computing devices such as drones and wearables. Dedicated deep-network accelerators have therefore attracted increasing attention. Among them, FPGA-based neural network accelerators have gradually become a research focus in both industry and academia owing to their flexibility and high performance-per-watt.
Deep neural networks involve enormous volumes of computation, intermediate results, and parameters. Deploying such large-scale networks efficiently on resource-constrained FPGA platforms is the key to whether the overall system can meet practical requirements. This thesis therefore centers on the efficient acceleration of large-scale deep convolutional neural networks on resource-constrained FPGA platforms.
The main content and contributions of this thesis are as follows:
1. Loop tiling is an important technique in FPGA-based designs for relieving on-chip memory pressure. However, the nature of convolution creates data dependencies at the boundaries between adjacent tiles. As a result, data cached off-chip is physically non-contiguous and must undergo complex reorganization before being transferred back on-chip, which greatly increases DMA complexity and transfer latency. To address this problem, this thesis proposes an FPGA-oriented optimization for large-scale deep convolutional neural networks: block convolution, which replaces a global convolution with several convolutions over local regions, completely eliminating data dependencies between tiles. We apply block convolution to an existing VGG-16 accelerator; the results show that the resulting accelerator fully utilizes transfer bandwidth, reduces the design complexity and configuration overhead of DMA transfers, and cuts system latency by 35%. Moreover, block convolution incurs no loss in classification accuracy on the ILSVRC 12 dataset compared with the original model.
2. Frequent DRAM accesses are the main source of system power consumption and latency. Efficient computation requires minimizing off-chip transfers, but for large-scale networks such as VGG-16, off-chip transfers remain unavoidable whenever the intermediate results of a layer exceed on-chip buffer capacity. We observe that deep convolutional neural networks based on block convolution lend themselves well to multi-layer fusion: when several convolutional layers are fused and computed together, intermediate-layer results need not be buffered, greatly reducing on-chip buffer overhead. Moreover, because block convolution eliminates inter-tile dependencies, it further improves accelerator throughput. On this basis, we design an accelerator built on block convolution and multi-layer fusion. We are the first to accelerate VGG-16 efficiently on a mid/low-end resource-constrained FPGA (Xilinx ZYNQ ZC706) with no off-chip transfer of intermediate results at all, achieving 394.75 GOP/s and a frame rate of 12.19 frames per second, the best overall performance reported on mid/low-end 28 nm FPGAs. Using a similar approach, we also run experiments on the YOLOv2 detection model, where the accelerator reaches 263.33 GOP/s, meeting real-time requirements.
Other Abstract (English)
In recent years, deep Convolutional Neural Networks (CNNs) have been adopted in various applications such as image classification, object detection, and speech recognition, showing unprecedented performance while suffering from huge computational complexity. As networks grow in depth and width, CPU-based solutions can no longer meet the requirements of real-world applications. GPUs, although computationally powerful, are constrained by size and energy consumption and cannot be efficiently deployed on edge computing devices such as drones and wearable devices. As a result, dedicated hardware accelerators for CNNs have received increasing attention in the past few years. Among them, FPGA-based accelerators have gradually attracted the main efforts of both industry and academia due to their configurability and high performance-to-power ratio.
Deep learning algorithms usually contain a backbone network with strong feature-expression ability. However, along with high accuracy come huge volumes of computation, parameters, and intermediate data in large-scale networks. How to deploy this type of large-scale deep network effectively on resource-constrained FPGA platforms is therefore the key to whether the overall system can meet real-world requirements. This thesis focuses on the optimization and deployment of very large-scale CNNs on resource-limited FPGAs. Its concerns and contributions are summarized as follows:
1. One of the biggest challenges in using FPGAs to accelerate large-scale deep networks is the limited amount of on-chip resources available for both computing and storage. It is common practice to partition each layer into several groups and compute them in batches: the partial results of one group are transferred to off-chip memory and later fetched back for the computation of the next layer. However, due to the nature of convolution, data dependencies exist among the boundary pixels of adjacent blocks. As a result, the data cached in off-chip memory is physically non-contiguous and must be reorganized before being fetched back on-chip. This complex data reorganization introduces a large DMA configuration overhead. In this thesis, we propose a novel, simple yet effective FPGA-oriented approach named block convolution. Its main purpose is to replace the original global convolution with several independent local convolutions, so as to completely eliminate the data dependencies between adjacent blocks along with the associated memory-management overhead. We optimize an existing accelerator for the VGG-16 network with block convolution; the results show that about 35% of the DMA transfer overhead is removed. In addition, with block convolution, the classification accuracy of VGG-16 on the ILSVRC 12 dataset is comparable to or even higher than that of the original model. (A minimal sketch of block convolution is given after this list.)
2. Another issue in deploying large-scale CNNs on resource-constrained FPGAs is that frequent off-chip transfers of intermediate data introduce significant energy consumption and latency, since operations on DRAM are far more expensive than those on SRAM and register files. Although existing efforts try to minimize off-chip transfers in accelerator design, for very large networks such as VGG-16 they cannot be removed entirely. We observe that block convolution is well suited to multi-layer fusion: multiple convolutional layers can be fused and computed together, with no need to buffer the data of one layer after another. With block convolution, the intermediate data of even a very large CNN can stay on chip without any off-chip transfers. Finally, we design an accelerator based on block convolution and multi-layer fusion. We are the first to accelerate VGG-16 on a memory-limited FPGA (Xilinx ZYNQ ZC706 board) with all intermediate data staying on chip during inference. It achieves a throughput of 394.75 GOP/s and a frame rate of 12.19 fps, the best record so far among resource-limited FPGAs with 28 nm technology. We also conduct experiments on the YOLOv2 object detection model, which meets real-time requirements with a throughput of 263.33 GOP/s. (A sketch of tile-wise layer fusion follows as well.)
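To make the approach concrete, here is a minimal sketch of block convolution in plain NumPy. It is an illustration under our own simplifying assumptions (a single channel, a square odd-sized kernel, "same" zero padding, and a tile size that divides the feature-map size evenly); the function names are ours, and this is not the thesis's actual hardware implementation.

```python
import numpy as np

# Illustrative sketch only; not the accelerator's actual dataflow.

def conv2d_same(x, w):
    """Naive single-channel 'same' convolution (cross-correlation)
    with zero padding; w is a square kernel with odd side length."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, p)                       # zero-pad all four borders
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * w)
    return out

def block_conv(x, w, tile):
    """Block convolution: the global convolution is replaced by
    independent 'same' convolutions over non-overlapping tiles.
    Each tile is zero-padded at its own borders, so no tile needs
    boundary pixels (a halo) from its neighbours; tiles can thus be
    moved between DRAM and on-chip buffers as plain contiguous
    bursts and computed in any order."""
    out = np.empty_like(x)
    for i in range(0, x.shape[0], tile):
        for j in range(0, x.shape[1], tile):
            out[i:i + tile, j:j + tile] = conv2d_same(x[i:i + tile, j:j + tile], w)
    return out

# Outputs differ from the global convolution only within k//2 pixels
# of each tile border; the abstract above reports no accuracy loss on
# ILSVRC 12 from this change.
x = np.random.randn(224, 224).astype(np.float32)
w = np.random.randn(3, 3).astype(np.float32)
print(np.max(np.abs(block_conv(x, w, 56) - conv2d_same(x, w))))
```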
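A companion sketch of the multi-layer fusion enabled by block convolution: because tiles are fully independent, one tile can be pushed through every fused layer before the next tile is loaded, so intermediate feature maps never leave the tile-sized buffer that stands in for on-chip memory. This again is our own single-channel illustration, reusing `conv2d_same`, `x`, and NumPy from the sketch above rather than reproducing the accelerator's design.

```python
def fused_block_forward(x, kernels, tile):
    """Tile-wise multi-layer fusion on top of block convolution.
    Each tile is read once, run through all layers back to back
    (here conv + ReLU), and written back once; the intermediate
    feature maps of every fused layer live only in the tile-sized
    buffer `buf`. Without block convolution this would not work:
    every extra fused layer would enlarge the halo each tile needs
    from its neighbours, reintroducing inter-tile dependencies."""
    out = np.empty_like(x)
    for i in range(0, x.shape[0], tile):
        for j in range(0, x.shape[1], tile):
            buf = x[i:i + tile, j:j + tile]             # load tile once
            for w in kernels:                           # fuse all layers
                buf = np.maximum(conv2d_same(buf, w), 0.0)
            out[i:i + tile, j:j + tile] = buf           # store result once
    return out

# Three fused 3x3 layers; no intermediate feature map is ever
# materialized at full 224x224 resolution.
kernels = [np.random.randn(3, 3).astype(np.float32) for _ in range(3)]
y = fused_block_forward(x, kernels, 56)
```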
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/21597
Collection: Graduates / Master's Theses
Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714):
李钢. 基于FPGA的深度卷积神经网络高效计算与加速研究[D]. 北京: 中国科学院大学, 2018.
Files in this item:
基于FPGA的深度卷积神经网络高效计算与 (4929 KB), document type: thesis, access: not yet open, license: CC BY-NC-SA