Research on Quantization-Based Neural Network Acceleration and Compression Algorithms
陈维汉
2023-05-20
Pages: 156
Degree Type: Doctoral
Abstract (Chinese)

Fueled by abundant data and computing resources, deep neural networks have in recent years achieved remarkable success across artificial-intelligence fields such as computer vision, speech recognition, and natural language processing. Unfortunately, model sizes have in turn grown at a rate far exceeding the improvement of hardware capability, so that both cloud and edge devices increasingly struggle to bear the resulting computation, storage, and energy burdens. With hardware development unable to keep pace with ever-increasing model complexity, academia and industry have turned to algorithm-level strategies for efficient neural network computation. Among the many acceleration and compression strategies, quantization has become one of the most widely studied, owing to its hardware-friendly implementation, broad applicability across networks and tasks, and its ability to deliver acceleration and compression simultaneously. However, quantization reduces the representation cardinality of a model and therefore inevitably degrades network performance. This thesis accordingly analyzes existing quantization work from different perspectives, identifies its remaining problems, and proposes corresponding improvements, so as to achieve better performance under the same compression ratio, acceleration ratio, and resource constraints.

The main research achievements and contributions of this thesis are summarized as follows:

(1) Neural network compression based on global product quantization. Existing two-stage product quantization methods for neural networks suffer from inherent drawbacks such as time-consuming hyperparameter tuning and poor performance. To address this, this thesis introduces the global and progressive product quantization algorithm G&P PQ which, building on a proposed weight update strategy, merges the two previously independent stages of weight quantization and codebook fine-tuning into a unified network training framework with automatic, progressive switching between them. The method thereby better captures the complex dependencies between layers and resolves the above problems simply and efficiently. Extensive experiments on the ImageNet classification dataset with different network architectures show that it achieves a markedly better trade-off between model size and accuracy than existing network compression work.
Taking the ResNet-50 network as an example, G&P PQ keeps the Top-1 classification accuracy loss below 1% even at a compression ratio close to 20×.

(2) Mixed-precision quantization of neural networks based on constrained optimization.
To automatically, efficiently, and accurately assign optimal quantization bit-widths to the different layers and quantization targets of a network, and thereby reach a better balance between efficiency and performance, this thesis proposes a novel and principled mixed-precision quantization framework. First, mixed-precision quantization of weights is formulated as a discrete constrained optimization problem. Then, to make this problem computationally tractable, an efficient greedy solver, free of extra hyperparameters and based on a Hessian approximation and a reformulation as a knapsack problem, is proposed. Finally, activation quantization noise is converted into equivalent weight perturbations, extending the framework to mixed-precision quantization of activations. The proposed method is computationally more efficient than prior work, and it is derived in a principled manner rather than from heuristics. Its performance advantages are comprehensively verified on image classification and object detection tasks, and hardware experiments demonstrate the inference efficiency of the resulting mixed-precision networks in practical deployment.
Taking the ResNet-50 network as an example, the mixed-precision framework performs slightly better than the full-precision model even with an average bit-width of 4 bits for both weights and activations.

(3) Post-training quantization of neural networks based on loss perturbation optimization.
To recover as much of the performance lost to quantization as possible in scenarios with limited data and computational resources, this thesis proposes an efficient post-training quantization framework based on loss perturbation optimization. Specifically, it first shows that the quantization loss at the final output layer is bounded by the activation reconstruction errors of the intermediate layers. On this basis, it takes layer-wise minimization of the activation reconstruction error as the optimization objective, jointly optimizing the quantization step size and the integer weight vector. In addition, it proposes an error-compensation-based activation quantization strategy that attains the performance of channel-wise quantization while keeping the computational efficiency of layer-wise quantization.
The entire framework is hyperparameter-free and can easily be implemented and integrated into existing quantization methods. Experimental results show that, without any fine-tuning, it quantizes full-precision models to 3 bits while achieving performance close to the original models.

(4) Neural network acceleration and compression fusing quantization and pruning.
Since different types of acceleration and compression strategies reduce network redundancy from different angles, this thesis further extends the proposed mixed-precision quantization framework, both methodologically and empirically, to exploit this complementarity in a multi-method fusion framework for neural network acceleration and compression. The work covers three aspects: first, combining quantization and sparsity to further improve network performance at the same acceleration and compression ratio; second, proposing dynamic batch normalization, which greatly improves the correlation of compressed-network performance prediction at a small computational cost; third, extending the above framework to the natural language processing model BERT and the GLUE benchmark.
Taking the ResNet-50 network as an example, the proposed quantization-pruning fusion strategy maintains performance close to the full-precision model even at an overall compression ratio of 36×.

Abstract (English)

With abundant data and computing resources, deep neural networks have in recent years made remarkable achievements in artificial-intelligence fields such as computer vision, speech recognition, and natural language processing. Unfortunately, the size of models has been increasing at a rate far surpassing the improvement of hardware capability, making it increasingly difficult for both cloud and edge devices to handle the resulting computational, storage, and energy burdens. Since hardware development cannot keep up with the growing complexity of models, the academic and industrial communities have focused more on making neural network computation efficient through algorithm-level strategies. Among the many acceleration and compression strategies, quantization has become one of the most popular methods due to its hardware-friendly implementation, wide applicability to different networks and tasks, and its delivery of acceleration and compression at once. However, quantization reduces the representation cardinality of the model, inevitably degrading network performance. This thesis therefore analyzes existing quantization methods from different perspectives and proposes corresponding improvement strategies to achieve higher model performance under the same compression ratio, acceleration ratio, and resource constraints.
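
To make the notion of representation cardinality concrete, the sketch below shows generic uniform affine quantization: a k-bit quantizer maps a float tensor onto at most 2^k integer levels, and the gap between the tensor and its dequantized approximation is the error the thesis's methods aim to control. This is a textbook scheme shown for illustration only, not any of the specific algorithms proposed here.

```python
# A minimal sketch of uniform (affine) quantization, illustrating how k-bit
# quantization shrinks a tensor's representation cardinality to at most 2^k
# values. Generic textbook scheme, not the thesis's specific algorithm.
import numpy as np

def uniform_quantize(x: np.ndarray, num_bits: int = 8):
    """Quantize x to num_bits; return (integer codes, dequantized values)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)        # quantization step size
    zero_point = np.round(qmin - x.min() / scale)      # integer offset for x.min()
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    x_hat = (q - zero_point) * scale                   # dequantized approximation
    return q.astype(np.int32), x_hat

w = np.random.randn(64, 64).astype(np.float32)
q, w_hat = uniform_quantize(w, num_bits=4)
print("distinct levels:", np.unique(q).size, "max abs error:", np.abs(w - w_hat).max())
```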

The main research achievements and contributions of this thesis are summarized as follows:

(1) Towards Deep Neural Network Compression via Global & Progressive Product Quantization. Current two-stage neural network product quantization methods have inherent drawbacks such as time-consuming hyperparameter tuning and poor performance. To address this, this thesis proposes the G&P PQ algorithm, which, based on the proposed weight update strategy, integrates the two independent stages of weight quantization and codebook fine-tuning into a unified network training framework and switches between them automatically and progressively. The approach better captures the complex dependencies between different layers and solves the aforementioned issues simply and efficiently. Extensive experiments on the ImageNet classification dataset with different network structures demonstrate that it achieves a significantly better trade-off between model size and accuracy than existing network compression methods.
Taking the ResNet-50 network as an example, G&P PQ maintains a Top-1 classification accuracy loss of less than 1% even under a compression ratio close to 20×.
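
The following is a minimal sketch of plain product quantization for a weight matrix, the two-stage baseline that G&P PQ builds on: rows are cut into low-dimensional sub-vectors, a shared codebook is learned with k-means, and each sub-vector is stored as a one-byte code. The global/progressive training and codebook fine-tuning that distinguish G&P PQ are not reproduced; the function name and block layout are illustrative assumptions.

```python
# A minimal sketch of product quantization (PQ) for a weight matrix: every
# contiguous sub_dim-sized sub-vector is replaced by its nearest codeword
# from a k-means codebook. The G&P PQ training procedure is not shown.
import numpy as np
from sklearn.cluster import KMeans

def pq_compress(W: np.ndarray, sub_dim: int = 4, n_codewords: int = 256):
    """Compress W (out_features x in_features) with a shared PQ codebook."""
    out_f, in_f = W.shape
    assert in_f % sub_dim == 0
    blocks = W.reshape(out_f * (in_f // sub_dim), sub_dim)    # all sub-vectors
    km = KMeans(n_clusters=n_codewords, n_init=4).fit(blocks)
    codes = km.predict(blocks)                                 # 1 byte each (256 codewords)
    W_hat = km.cluster_centers_[codes].reshape(out_f, in_f)    # reconstruction
    return codes, km.cluster_centers_, W_hat

W = np.random.randn(256, 512).astype(np.float32)
codes, codebook, W_hat = pq_compress(W)
ratio = W.nbytes / (codes.size * 1 + codebook.nbytes)          # fp32 bytes vs codes+codebook
print(f"compression ratio ~ {ratio:.1f}x")
```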

(2) Constrained Optimization-based Mixed-Precision Quantization of Deep Neural Networks. To achieve a better balance between efficiency and performance by efficiently and accurately allocating optimal quantization bit-widths to the different layers and quantization targets of a network, this thesis proposes a novel and principled mixed-precision quantization framework. First, mixed-precision quantization of weights is formulated as a discrete constrained optimization problem. Then, to make the problem computationally feasible, an efficient and hyperparameter-free greedy algorithm based on a Hessian approximation and a reformulation as a knapsack problem is proposed. Finally, activation quantization noise is transformed into corresponding weight perturbations, thereby realizing mixed-precision quantization of activations on the basis of the above framework. The proposed method is computationally more efficient and is derived in a principled manner rather than from heuristic strategies. Its performance advantages are comprehensively verified on image classification and object detection tasks, and hardware experiments demonstrate the inference efficiency of the resulting mixed-precision networks in practical deployment.

Taking the ResNet-50 network as an example, the proposed mixed-precision quantization framework still achieves slightly better performance than the full-precision model when the average quantization bit-width of both weights and activations is 4 bits.
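
As a rough illustration of the knapsack-style view, the sketch below greedily lowers per-layer bit-widths under a model-size budget, always taking the reduction with the smallest estimated loss increase per bit saved. The per-layer sensitivity numbers are made-up placeholders standing in for the Hessian-based estimates of the thesis, and the greedy rule is a generic one rather than the exact solver proposed.

```python
# A minimal sketch of greedy bit-width allocation under a model-size budget,
# in the spirit of casting mixed-precision quantization as a knapsack-style
# constrained problem. The `sens` values (bit-width -> expected loss increase)
# are assumed given; the thesis derives them from a Hessian approximation.
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    params: int                  # number of weights
    sens: dict                   # bit-width -> expected loss increase

def allocate_bits(layers, budget_bits, choices=(2, 4, 8)):
    """Start every layer at max precision, then greedily drop the bit-width
    whose reduction costs the least loss per bit saved, until within budget."""
    bits = {l.name: max(choices) for l in layers}
    total = sum(l.params * bits[l.name] for l in layers)
    while total > budget_bits:
        best = None
        for l in layers:
            lower = [c for c in choices if c < bits[l.name]]
            if not lower:
                continue
            nb = max(lower)      # next lower bit-width for this layer
            cost = (l.sens[nb] - l.sens[bits[l.name]]) / (l.params * (bits[l.name] - nb))
            if best is None or cost < best[0]:
                best = (cost, l, nb)
        if best is None:
            break                # budget infeasible at the lowest precision
        _, l, nb = best
        total -= l.params * (bits[l.name] - nb)
        bits[l.name] = nb
    return bits

layers = [Layer("conv1", 9408, {2: 3.0, 4: 0.5, 8: 0.05}),
          Layer("fc", 2048000, {2: 1.2, 4: 0.2, 8: 0.02})]
print(allocate_bits(layers, budget_bits=9_000_000))   # -> {'conv1': 8, 'fc': 4}
```
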
(3) Loss Perturbation Optimization-based Post-training Quantization of Deep Neural Networks. To recover as much of the performance lost to quantization as possible in scenarios with limited data and computational resources, this thesis proposes an efficient post-training quantization framework based on loss perturbation optimization. Specifically, it first points out that the quantization loss at the final output layer is bounded by the activation reconstruction errors of the intermediate layers. Based on this, it proposes layer-wise minimization of the activation reconstruction error as the optimization objective, thus optimizing the quantization step size and the integer weight vector simultaneously. In addition, it proposes an error-compensation-based activation quantization strategy, which maintains the same computational efficiency as layer-wise quantization while achieving channel-wise quantization performance.

The entire framework is hyperparameter-free and can be easily implemented and integrated into existing quantization methods. Experimental results show that the proposed framework achieves performance close to the original model without fine-tuning when quantizing the full-precision model to 3 bits.
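
A minimal sketch of the layer-wise reconstruction idea, under simplifying assumptions: instead of minimizing the rounding error of the weights themselves, the step size is chosen to minimize the error of the layer's output on a small calibration batch. The thesis additionally optimizes the integer weights jointly and compensates activation quantization error, neither of which is shown; the grid search below is a stand-in.

```python
# A minimal sketch of post-training quantization by layer-wise reconstruction:
# pick the weight step size minimizing the error of the layer's *output* on a
# calibration batch. The joint step-size / integer-weight optimization and the
# error-compensated activation quantization of the thesis are not reproduced.
import numpy as np

def quantize_weights(W, step, num_bits=3):
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(W / step), -qmax - 1, qmax) * step

def calibrate_step(W, X, num_bits=3, n_grid=80):
    """Grid-search the step size minimizing ||X @ W - X @ Q(W)||_F^2."""
    ref = X @ W                                        # full-precision layer output
    best_step, best_err = None, np.inf
    max_abs = np.abs(W).max()
    for r in np.linspace(0.2, 1.0, n_grid):            # candidate clipping ranges
        step = r * max_abs / (2 ** (num_bits - 1) - 1)
        err = np.sum((ref - X @ quantize_weights(W, step, num_bits)) ** 2)
        if err < best_err:
            best_step, best_err = step, err
    return best_step

W = np.random.randn(512, 256).astype(np.float32)
X = np.random.randn(128, 512).astype(np.float32)       # calibration activations
print("calibrated step size:", calibrate_step(W, X))
```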

(4) Towards Deep Neural Network Acceleration and Compression via Quantization and Pruning. Because different types of acceleration and compression strategies often reduce network redundancy from different aspects, this thesis further extends and empirically explores the proposed mixed-precision quantization framework in order to fully exploit this complementarity in a multi-method fusion framework for neural network acceleration and compression. Specifically, the work covers three aspects: first, combining the two model compression techniques of quantization and sparsity to further improve network performance under the same acceleration and compression ratio; second, proposing dynamic batch normalization to significantly improve the correlation of compressed-network performance prediction at a small computational cost; third, extending the above framework to natural language processing models and tasks (e.g., the BERT model and the GLUE benchmark).

Taking the ResNet-50 network as an example, the proposed quantization-pruning fusion strategy maintains performance close to the full-precision model even under an overall compression ratio of 36×.
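
To show in miniature why the two techniques compose, the sketch below magnitude-prunes a weight tensor to a target sparsity and then uniformly quantizes the survivors; storage falls both from fewer nonzeros and from fewer bits per value. It only illustrates the complementarity and does not reproduce the joint sparsity/bit-width allocation or the dynamic batch normalization studied in the thesis.

```python
# A minimal sketch of combining pruning and quantization: magnitude-prune a
# weight tensor to a target sparsity, then uniformly quantize the survivors.
# Illustrates complementarity only; not the thesis's joint allocation scheme.
import numpy as np

def prune_then_quantize(W, sparsity=0.5, num_bits=4):
    thresh = np.quantile(np.abs(W), sparsity)       # magnitude threshold
    mask = np.abs(W) > thresh                       # surviving weights
    qmax = 2 ** (num_bits - 1) - 1
    step = np.abs(W[mask]).max() / qmax if mask.any() else 1.0
    W_q = np.clip(np.round(W / step), -qmax - 1, qmax) * step
    return W_q * mask, mask

W = np.random.randn(256, 256).astype(np.float32)
W_c, mask = prune_then_quantize(W)
print(f"sparsity: {1 - mask.mean():.2f}, distinct nonzero levels: "
      f"{np.unique(W_c[W_c != 0]).size}")
```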

Keywords: Deep Neural Networks; Model Acceleration and Compression; Quantization; Pruning
Language: Chinese
Sub-direction Classification (Seven Major Directions): AI Chips and Intelligent Computing
State Key Laboratory Planned Research Direction: Intelligent Computing and Learning
Associated Dataset Requiring Deposit:
Document Type: Doctoral Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/52044
Collection: Graduates_Doctoral Dissertations
Corresponding Author: 陈维汉
Recommended Citation (GB/T 7714):
陈维汉. 基于量化的神经网络加速压缩算法研究[D], 2023.
Files in This Item:
File Name / Size: MyThesis-final.pdf (10338 KB); Document Type: Doctoral Thesis; Access: Restricted; License: CC BY-NC-SA