Other Abstract 
In recent years, deep neural networks have developed rapidly, attracting broad attention from both academia and industry. The technology has achieved major breakthroughs in many fields, including computer vision, speech recognition, and natural language processing, significantly improving performance on a wide range of tasks. Deep neural networks are now widely deployed in industries such as e-commerce, video surveillance, autonomous driving, and medical assistance, and have gradually become an indispensable foundation of these intelligent applications.
Beyond new network models and more efficient training methods, two other key factors behind the rapid development of deep neural networks are massive data and powerful computing. Massive labeled data effectively alleviates overfitting during training, allowing researchers to design sufficiently large networks; meanwhile, the development of GPUs has made the training of large-scale deep neural networks possible. However, as the performance of deep neural networks keeps improving, network structures become increasingly complex, and the computation and storage they require grow accordingly. On the other hand, as deep neural network technology matures, the demand for deploying neural networks in various intelligent applications and systems grows ever stronger. For many resource-constrained embedded devices and latency-sensitive applications, such high computational and storage costs are the main obstacle to deploying neural networks. Therefore, studying acceleration and compression methods for deep neural networks has important theoretical and practical value for further improving their runtime efficiency and promoting the application of neural network technology across many fields.
Focusing on the acceleration and compression of deep neural networks, this dissertation conducts in-depth research on low-rank decomposition and low-bit fixed-point quantization of neural networks. The specific research content and contributions are summarized as follows:

A convolutional neural network (CNN) acceleration method based on tensor low-rank and group sparse decomposition is proposed. In a CNN, most of the computation is concentrated in the convolutional and fully connected layers, and can ultimately be converted into matrix multiplication. Therefore, methods for accelerating matrix multiplication, such as low-rank matrix decomposition and matrix sparsification, can in principle be used to accelerate CNNs. However, to preserve accuracy, low-rank decomposition methods usually require a relatively high rank, so the speedup is limited; sparsification methods, although they can greatly reduce the theoretical computation, produce irregular sparsity that is hardware-unfriendly, so the actual speedup is also very limited. To address these two problems, this dissertation proposes a CNN acceleration method based on tensor low-rank and group sparse decomposition. By decomposing the convolutional kernel tensor into a sum of multiple low-rank tensors, the resulting core tensor is block-sparse, which effectively reduces the computation even at a relatively high rank; and because the sparsity is regular and blockwise, a good actual speedup can be achieved.

A deep neural network weight quantization method based on fixed-point matrix decomposition is proposed. In a neural network, all weights are floating-point numbers, and floating-point operations consume substantial hardware resources. Quantizing network parameters to low-bit fixed-point values can greatly reduce this consumption; however, at very low bit widths the accuracy loss becomes large. To address this problem, this dissertation proposes a weight quantization method based on fixed-point matrix decomposition: given a pretrained floating-point network, its weight matrices are factorized into fixed-point form, quantizing the weights while preserving accuracy. In addition, a full-precision weight recovery method is proposed to alleviate the information loss incurred during the fixed-point decomposition of the weight matrices; furthermore, we theoretically analyze the gradient mismatch phenomenon common to matrix-decomposition-based network acceleration methods and propose a weight balancing method to resolve it.

A two-step low-bit quantization method is proposed to quantize both the weights and the activations of neural networks. At inference time a network involves two parts, namely its weights and its activations. If only the weights are quantized to fixed-point values, inference still requires many floating-point operations, so the activations must be quantized as well. Traditional fixed-point quantization methods quantize weights and activations simultaneously; because both introduce quantization error, the network becomes hard to train to convergence. To address this problem, this dissertation proposes a two-step low-bit quantization method that decouples weight quantization from activation quantization. In the first step, a sparse quantization method is proposed to learn low-bit feature representations of the hidden layers; at this stage the weights remain continuous, so the gradient computed at each iteration of gradient descent can be directly applied in the next iteration. In the second step, only the feature transformation function from the previous layer to the current layer is learned; this can be cast as a nonlinear least squares problem under fixed-point constraints, and an approximately optimal solution can be obtained by iterative optimization.
In recent years, deep neural networks (DNNs) have been evolving rapidly and have attracted widespread attention from both academia and industry. This technology has made great breakthroughs in many fields such as computer vision, speech recognition, and natural language processing, significantly improving the performance of multiple tasks. At present, deep neural networks have been widely used in industries including e-commerce, video surveillance, autonomous driving, and medical assistance, and have gradually become indispensable basic tools in these intelligent applications.
In addition to more powerful network architectures and more efficient training strategies, two other important factors behind the rapid development of DNNs are the massive growth of data and the rapid increase in computing power. Massive labeled data can effectively reduce overfitting, thus allowing developers to design sufficiently large networks; at the same time, the development of GPUs has made the training of large-scale deep neural networks possible. However, as the performance of deep neural networks continues to increase, network structures become more and more complex, and their computational and storage costs grow accordingly. On the other hand, with the maturity of deep neural network technology, the demand for deploying deep models in various intelligent applications is also increasing rapidly. For many resource-constrained devices and real-time applications, such high computational and storage costs become the major obstacle to the deployment of neural networks. Therefore, the study of network acceleration and compression has important theoretical and practical value: improving the computational efficiency of deep neural networks can promote the application of neural network technology in various fields.
Aiming at the problems of acceleration and compression of deep neural networks, this dissertation carries out a series of studies on low-rank decomposition and fixed-point quantization. The specific research content and contributions are summarized as follows:

A tensor low-rank and group sparse decomposition based method is proposed for the acceleration and compression of convolutional neural networks (CNNs). In a CNN, most of the computation resides in the convolutional layers and the fully connected layers, and the basic operation of these layers can be converted to matrix multiplication. Therefore, methods for accelerating matrix multiplication, such as low-rank decomposition and sparsification, can theoretically be used to accelerate convolutional neural networks. However, to ensure the accuracy of a low-rank decomposition, it is often necessary to select a relatively large rank, so the speedup is very limited. Although sparsity-based methods can greatly reduce the amount of computation, the actual speedup is also very limited because the irregular sparsity is hardware-unfriendly. To solve these two problems, this dissertation proposes a convolutional neural network acceleration method based on tensor low-rank and group sparse decomposition. By decomposing the convolutional kernel tensor into the sum of multiple low-rank tensors, the core tensor becomes block-sparse, which can effectively reduce the computational budget even with a large rank. At the same time, due to the structured sparsity, a high actual speedup can be obtained.
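The sum-of-low-rank-terms idea can be illustrated at the matrix level: splitting a full SVD into groups of singular triplets yields an exact sum of low-rank terms whose stacked core is block-diagonal, i.e., block-sparse. The NumPy sketch below is a simplified stand-in for the kernel-tensor decomposition, not the dissertation's actual algorithm; the function names are our own.

```python
import numpy as np

def group_lowrank_decompose(W, ranks):
    """Split the SVD of W into groups of singular triplets.

    Returns a list of factors (U_g, s_g, Vt_g). Stacking the diagonal
    cores diag(s_g) yields a block-diagonal (block-sparse) core, so W
    equals the sum of the per-group low-rank terms when sum(ranks)
    covers all singular values.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    factors, start = [], 0
    for r in ranks:
        factors.append((U[:, start:start + r],
                        s[start:start + r],
                        Vt[start:start + r, :]))
        start += r
    return factors

def reconstruct(factors):
    """Sum of low-rank terms: each term is U_g diag(s_g) V_g^T."""
    return sum(Ug * sg @ Vtg for Ug, sg, Vtg in factors)
```

Each term in the sum has rank at most `r`, so the multiplication `W @ x` can be computed group by group with far fewer operations when the retained ranks are small relative to the matrix dimensions.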

A fixed-point matrix decomposition based method is proposed to quantize the weights of deep neural networks into ternary values. Deep neural networks usually use floating-point representations, but floating-point operations consume a large amount of hardware resources. Low-bit quantization of network parameters can greatly reduce this resource consumption. However, when extremely low bit representations are used, the accuracy of the quantized network usually drops considerably compared with its full-precision counterpart. To solve this problem, this dissertation proposes a unified framework called Fixed-point Factorized Network (FFN) to quantize network weights into ternary values. Given a pretrained full-precision network, a fixed-point decomposition is performed on its weight matrices. At the same time, a full-precision weight recovery method is proposed to alleviate the information loss during the fixed-point decomposition of the weight matrices. In addition, we theoretically analyze the gradient mismatch phenomenon, which is ubiquitous among matrix/tensor decomposition based acceleration methods, and propose an effective weight balancing technique to alleviate this problem.
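As a rough illustration of factorizing a weight matrix into ternary ({-1, 0, +1}) factors, the sketch below runs a greedy semidiscrete-style decomposition: each step ternarizes the leading singular vectors of the current residual and fits an optimal scalar scale. The threshold rule and function names are illustrative assumptions of ours, not the FFN procedure itself.

```python
import numpy as np

def ternarize(v, thresh=0.5):
    """Map entries to {-1, 0, +1}: keep the sign where the magnitude
    exceeds thresh times the mean magnitude (a heuristic rule)."""
    t = thresh * np.abs(v).mean()
    return np.sign(v) * (np.abs(v) > t)

def sdd_ternary(W, rank, thresh=0.5):
    """Greedy decomposition W ~= sum_k d_k * x_k y_k^T with ternary
    x_k, y_k and real scales d_k. Returns (X, d, Y) so that the
    approximation is X @ diag(d) @ Y."""
    R = W.copy()
    xs, ys, ds = [], [], []
    for _ in range(rank):
        U, s, Vt = np.linalg.svd(R, full_matrices=False)
        x, y = ternarize(U[:, 0], thresh), ternarize(Vt[0, :], thresh)
        denom = (x @ x) * (y @ y)
        if denom == 0:
            break
        d = x @ R @ y / denom        # optimal scale for fixed x, y
        R = R - d * np.outer(x, y)   # peel off this rank-1 ternary term
        xs.append(x); ys.append(y); ds.append(d)
    return np.array(xs).T, np.array(ds), np.array(ys)
```

Because the scale `d` is chosen optimally for each fixed ternary pair, the residual norm is non-increasing over iterations; in a deployment, the ternary factors replace floating-point multiplications with additions and sign flips.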

A two-step quantization method is proposed to quantize the weights and activations of deep neural networks. At inference time, a neural network involves two parts, i.e., the weights and the activations. If we only quantize the weights, many floating-point operations still remain, so activation quantization is also needed. Traditional fixed-point quantization methods try to quantize the weights and activations at the same time; because both introduce quantization error, the network is difficult to converge. To solve this problem, this dissertation proposes a Two-Step Quantization (TSQ) framework for learning low-bit neural networks, which decouples weight quantization from activation quantization. In the first step, a sparse quantization method is proposed to learn the low-bit feature representation of the hidden layers; at this stage the weights of the network remain continuous, so the gradient computed at each iteration can be directly applied in the next. In the second step, only the feature transformation function from the previous layer to the current layer is learned. This problem can be converted into a nonlinear least squares problem with fixed-point constraints, and an iterative optimization method is proposed to find an approximately optimal solution.
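A minimal sketch of the two-step idea, under our own simplifying assumptions (a single shared scale, a plain linear layer, heuristic updates; this is not the TSQ algorithm itself): step one sparsely quantizes activations to a few levels, and step two alternates between a discrete ternary code and a continuous scale to fit the layer transformation in the least-squares sense.

```python
import numpy as np

def sparse_quantize_acts(A, sparsity=0.5, bits=2):
    """Step 1 sketch: zero the smallest-magnitude activations, then
    uniformly quantize the survivors to 2**bits - 1 nonzero levels
    (an assumed variant, applied to nonnegative activations)."""
    t = np.quantile(np.abs(A), sparsity)
    A = np.where(np.abs(A) > t, A, 0.0)
    if A.max() <= 0:
        return A
    step = A.max() / (2 ** bits - 1)
    return np.round(A / step) * step

def fit_ternary_layer(X, Y, iters=20):
    """Step 2 sketch: alternately fit a shared scale alpha and a
    ternary code T to minimize ||Y - alpha * X @ T||_F."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # continuous target weights
    alpha = np.abs(W).mean()
    for _ in range(iters):
        T = np.clip(np.round(W / alpha), -1, 1)  # discrete step
        Z = X @ T
        denom = (Z * Z).sum()
        if denom == 0:
            break
        alpha = (Y * Z).sum() / denom            # continuous (scale) step
    return alpha, T
```

The alternation mirrors the dissertation's decoupling: the activation targets are fixed by step one, so step two only has to solve a constrained fitting problem per layer instead of training everything jointly.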
