Research on Key Technologies for Efficient Computation of Convolutional Neural Networks
Guo Peng
2019-06
Pages: 128
Degree type: Doctoral

Abstract

Convolutional neural networks have achieved great success in many fields such as computer vision, natural language processing, and speech processing, driving a new wave of artificial intelligence and becoming an indispensable processing module in AI applications. At the same time, this powerful algorithmic performance places higher demands on the computing power of the underlying platform. Because the computing ability of mobile/embedded devices is strictly constrained by battery and cost, deploying convolutional neural networks on such platforms is especially challenging. Therefore, research on how to compute convolutional neural networks efficiently has great academic significance and practical value for their development and application.

This thesis studies the efficient computation of convolutional neural networks from three aspects: efficient computation on a VLIW digital signal processor, the design of a binarized neural network accelerator, and the design of a dynamic-precision accelerator for networks with low-bit weights. The specific research contents and contributions are summarized as follows:

1. Efficient computation of convolutional neural networks on a VLIW digital signal processor. First, the computational and memory requirements of convolutional neural networks are analyzed in detail, and the computational characteristics of convolutional and fully connected layers are discussed using the roofline model. Three serial scheduling schemes are proposed for different types of data reuse. Building on these schemes, the computation of convolutional neural networks on a VLIW digital signal processor is then studied, including how to maximize data reuse in parallel mode. Finally, a kernel-unrolling parallel convolution method is proposed that flexibly supports convolution kernels of different sizes. Experiments show that a real-time face detection system built with this scheme achieves significant energy-efficiency improvements over CPU/GPU implementations.
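The roofline analysis mentioned above hinges on each layer's arithmetic intensity (operations per byte of memory traffic). The following sketch is my illustration of that quantity, not code from the thesis; the traffic model is a simplified single-pass estimate.

```python
# Illustrative sketch (not from the thesis): arithmetic intensity of a
# convolutional layer vs. a fully connected layer, the quantity the roofline
# model uses to decide whether a layer is compute-bound or memory-bound.

def conv_intensity(c_in, c_out, h_out, w_out, k, bytes_per_val=4):
    """MACs per byte moved for a k x k convolution layer (one-pass estimate)."""
    macs = c_in * c_out * h_out * w_out * k * k
    # Traffic: input feature map + weights + output feature map, each read or
    # written once (a rough lower bound that ignores tiling effects).
    traffic = bytes_per_val * (c_in * h_out * w_out
                               + c_in * c_out * k * k
                               + c_out * h_out * w_out)
    return macs / traffic

def fc_intensity(n_in, n_out, bytes_per_val=4):
    """MACs per byte for a fully connected layer (weight traffic dominates)."""
    macs = n_in * n_out
    traffic = bytes_per_val * (n_in + n_in * n_out + n_out)
    return macs / traffic

# A conv layer reuses each weight across the whole output map, so its
# intensity is high (tends to be compute-bound); an FC layer touches every
# weight exactly once, so its intensity stays below one MAC per byte
# (memory-bound) -- which is why the two need different scheduling schemes.
print(conv_intensity(64, 64, 56, 56, 3))   # high: heavy weight reuse
print(fc_intensity(4096, 4096))            # low: no weight reuse
```

This contrast is what motivates treating data reuse, not raw MAC count, as the primary scheduling criterion.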

2. Binarized neural network accelerator. By quantizing both weights and activations to +1 and -1, a binarized neural network not only greatly reduces the parameter size but also converts complex fixed-point multiplications into simple XNOR operations, improving performance and energy efficiency. However, traditional binarized neural networks still contain many non-binary parts, which prevents them from fully exploiting the advantages of binarization and limits overall system performance. To address this problem, this thesis first binarizes the main operations of the network through binarization expansion and odd-even padding. On this basis, a fully binarized neural network accelerator is proposed. The accelerator adopts a shuffle-compute unit designed specifically for binary convolution, which supports highly parallel computation of convolutions of different sizes and, through operation decomposition, can also be reused for fully connected layers. Experiments show that, compared with the reference design, this design achieves a 3.1x performance improvement, a 5.4x resource-efficiency improvement, and a 4.9x energy-efficiency improvement.
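The multiplication-to-XNOR reduction described above can be sketched in a few lines. This is a generic illustration of binary dot products, not the accelerator's actual datapath: with the sign encoding 1 → +1, 0 → -1, the dot product of two packed n-bit words is 2·popcount(xnor(a, b)) − n.

```python
# Hedged sketch (my illustration, not the thesis hardware): a binarized dot
# product. Signs are packed into integers with the encoding 1 -> +1, 0 -> -1.

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two {+1, -1} vectors packed into n-bit integers."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)   # 1 wherever the signs agree
    agreements = bin(xnor).count("1")            # popcount
    # Each agreement contributes +1, each disagreement -1:
    return 2 * agreements - n

# Reference check against plain +/-1 arithmetic:
a = [+1, -1, -1, +1]
b = [+1, +1, -1, -1]
pack = lambda v: sum((x == 1) << i for i, x in enumerate(v))
assert binary_dot(pack(a), pack(b), 4) == sum(x * y for x, y in zip(a, b))
```

One XNOR plus one popcount thus replaces n fixed-point multiplies and an n-term accumulation, which is the source of the performance and energy gains cited above.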

3. Dynamic-precision accelerator for networks with low-bit weights. Compared with binarized neural networks, low-bit-weight networks quantize only the weights and can achieve better accuracy on large-scale datasets; supporting dynamic precision further improves the tradeoff between performance and accuracy. Based on an analysis of the computing operations of several typical low-bit-weight networks, a bit-level computing operation closer to the actual computing requirements is proposed, and, combined with a bit-serial mechanism, a basic computing structure that supports multiple precisions is designed. On this basis, a dynamic-precision accelerator for low-bit-weight networks is proposed, together with a series of hardware optimizations tailored to the computing characteristics of low-bit networks. Experimental results show that this design supports a wider variety of low-bit-weight networks than traditional dynamic-precision network accelerators.
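The bit-serial mechanism mentioned above can be illustrated with a minimal software model; this is an assumption-laden sketch of generic bit-serial arithmetic, not the thesis's actual computing structure. The key property is that consuming the weight one bit per cycle lets a single datapath serve any weight precision, with an n-bit weight simply taking n cycles.

```python
# Illustrative sketch (generic bit-serial MAC, not the thesis design):
# multiply an activation by a two's-complement weight one weight bit at a
# time, so the same loop handles 2-bit, 4-bit, or any other weight precision.

def bit_serial_mac(activation: int, weight: int, w_bits: int) -> int:
    """activation * weight, with weight processed serially over w_bits cycles."""
    acc = 0
    for i in range(w_bits):
        bit = (weight >> i) & 1
        term = activation << i        # a shift replaces the multiplication
        if i == w_bits - 1:           # MSB of two's complement has weight -2^i
            term = -term
        acc += bit * term
    return acc

# The same path handles different precisions -- only the cycle count changes:
assert bit_serial_mac(10, 0b01, 2) == 10     # 2-bit weight = +1
assert bit_serial_mac(10, 0b11, 2) == -10    # 2-bit weight = -1
assert bit_serial_mac(7, 0b0101, 4) == 35    # 4-bit weight = +5
```

Dynamic precision then becomes a runtime parameter (`w_bits`) rather than a property baked into the multiplier, which is what allows one accelerator to cover multiple low-bit-weight network families.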

Keywords: deep neural network; convolutional neural network; binary network; low-bit quantization; neural network accelerator
Language: Chinese
Document type: Doctoral thesis
Identifier: http://ir.ia.ac.cn/handle/173211/23875
Collection: National Engineering Research Center for Application Specific Integrated Circuit Design
Recommended citation (GB/T 7714):
Guo Peng. Research on Key Technologies for Efficient Computation of Convolutional Neural Networks [D]. Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences, 2019.
Files in this item:
Thesis.pdf (5436 KB), doctoral thesis, open access, license CC BY-NC-SA