Research on Quantization-based Neural Network Acceleration and Compression Algorithms
陈维汉
2023-05-20
Pages | 156
Degree Type | Doctoral
Chinese Abstract | With abundant data and computing resources, deep neural networks have in recent years achieved remarkable success across artificial-intelligence fields such as computer vision, speech recognition, and natural language processing. Unfortunately, model sizes have grown at a rate far exceeding improvements in hardware capability, so both cloud and edge devices increasingly struggle to bear the resulting computational, storage, and energy burdens. Since hardware development cannot keep pace with ever-growing model complexity, academia and industry have turned to algorithm-level strategies for efficient neural-network computation. Among the many acceleration and compression strategies, quantization has become one of the most widely studied thanks to its hardware-friendly implementation, broad applicability across networks and tasks, and its ability to deliver acceleration and compression simultaneously. However, quantization reduces the representational cardinality of a model and therefore inevitably degrades network performance. This thesis analyzes existing quantization work from several perspectives, identifies remaining problems, and proposes corresponding improvements, achieving better performance under the same compression ratio, acceleration ratio, and resource constraints. The main contributions are summarized as follows: (1) Neural network compression via global product quantization. Existing two-stage product-quantization methods suffer from inherent drawbacks such as time-consuming hyperparameter tuning and poor performance. This thesis introduces the global product quantization algorithm G&P PQ, which, building on a proposed weight-update strategy, merges the two independent stages of weight quantization and codebook fine-tuning into a unified network-training framework with automatic progressive switching between them, thereby better capturing the complex inter-layer dependencies and resolving the above problems simply and efficiently. Extensive experiments on the ImageNet classification dataset across network architectures show a markedly better trade-off between model size and accuracy than existing compression work. (2) Mixed-precision neural network quantization via constrained optimization. (3) Post-training neural network quantization via loss-perturbation optimization. (4) Neural network acceleration and compression via combined quantization and pruning. |
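Contribution (1) builds on product quantization of weight matrices. As background, here is a minimal NumPy sketch of plain product quantization, not the G&P PQ algorithm itself; the block length `d`, codebook size `k`, and the Lloyd k-means routine are illustrative assumptions:

```python
import numpy as np

def product_quantize(W, d=4, k=16, iters=10, seed=0):
    """Plain product quantization of a weight matrix W.

    Each row of W is split into subvectors of length d; for every
    column block a k-entry codebook is learned with Lloyd's k-means,
    and each subvector is replaced by its nearest centroid."""
    rows, cols = W.shape
    assert cols % d == 0 and rows >= k
    rng = np.random.default_rng(seed)
    W_hat = np.empty_like(W)
    for j in range(0, cols, d):
        X = W[:, j:j + d]                                    # subvectors of this block
        C = X[rng.choice(rows, size=k, replace=False)].copy()  # init centroids
        for _ in range(iters):                               # Lloyd iterations
            dist = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
            assign = dist.argmin(axis=1)                     # nearest-centroid index
            for c in range(k):
                members = X[assign == c]
                if len(members):
                    C[c] = members.mean(axis=0)              # recenter non-empty clusters
        W_hat[:, j:j + d] = C[assign]                        # reconstruct from codebook
    return W_hat
```

The compression comes from storage: each length-d subvector is replaced by a log2(k)-bit index into a shared codebook, so with d = 4 and k = 16 a float32 block of 128 bits shrinks to a 4-bit index, roughly 32× before codebook overhead.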
English Abstract | With abundant data and computing resources, deep neural networks have made remarkable achievements in recent years in artificial-intelligence fields such as computer vision, speech recognition, and natural language processing. Unfortunately, the size of models has been increasing at a rate far surpassing the improvement of hardware capability, making it increasingly difficult for both cloud and edge devices to handle the computational, storage, and energy burdens they bring. Since hardware development cannot keep up with the increasing complexity of models, academia and industry have focused more on achieving efficient neural-network computation through algorithm-level strategies. Among the many acceleration and compression strategies, quantization has become one of the most popular methods due to its hardware-friendly implementation, wide applicability to different networks and tasks, and its combination of acceleration and compression. However, quantization reduces the representation cardinality of the model, inevitably degrading network performance. This thesis therefore analyzes existing quantization methods from different perspectives, identifies their remaining problems, and proposes corresponding improvement strategies to achieve higher model performance under the same compression ratio, acceleration ratio, and resource constraints. The main research achievements and contributions are summarized as follows: (1) Towards Deep Neural Network Compression via Global&Progressive Product Quantization. Current two-stage neural network product-quantization methods have inherent drawbacks such as time-consuming hyperparameter tuning and poor performance.
To address this, this thesis proposes the G&P PQ algorithm, which, based on the proposed weight-update strategy, integrates the two independent stages of weight quantization and codebook fine-tuning into a unified network-training framework and achieves automatic progressive switching between the two stages. This approach better captures the complex dependencies between layers and solves the aforementioned issues in a simple and efficient manner. Extensive experiments on the ImageNet classification dataset across different network architectures demonstrate a significantly better trade-off between model size and accuracy than existing network compression methods. (2) Constrained Optimization-based Mixed-Precision Quantization of Deep Neural Networks. To achieve a better balance between efficiency and performance by efficiently and accurately allocating optimal quantization bit-widths to the different layers and objects in a network, this thesis proposes a novel and principled mixed-precision quantization framework. First, weight mixed-precision quantization is formulated as a discrete constrained optimization problem. Then, to make the problem computationally feasible, an efficient, hyperparameter-free greedy algorithm based on Hessian-matrix approximation and a knapsack-problem reformulation is proposed. Finally, activation quantization noise is transformed into corresponding weight perturbations, thereby realizing activation mixed-precision quantization within the same framework. The proposed method is computationally efficient and derived in a principled manner rather than from heuristic strategies.
The performance advantages of this framework are comprehensively verified on image classification and object detection tasks, and hardware experiments demonstrate the inference efficiency of the mixed-precision quantized network in practical deployment. Taking the ResNet-50 network as an example, the proposed mixed-precision quantization framework achieves slightly better performance than the full-precision model even when the average quantization bit-width of weights and activations is 4 bits. (3) Loss Perturbation Optimization-based Post-Training Quantization of Deep Neural Networks. (4) Towards Deep Neural Network Acceleration and Compression via Quantization and Pruning. Because different types of acceleration and compression strategies often reduce network redundancy from different aspects, this thesis extends the proposed mixed-precision quantization framework to fully exploit this complementarity and build a multi-method neural network acceleration and compression framework. Specifically, this covers several aspects: first, combining the two model-compression techniques of quantization and sparsity to further improve network performance under the same acceleration and compression ratio; second, proposing dynamic batch normalization to significantly improve the correlation of compressed-network performance prediction at small computational cost; third, extending the above acceleration and compression framework to natural language processing models and tasks (e.g., the BERT model and the GLUE benchmark). Taking the ResNet-50 network as an example, the proposed quantization-pruning strategy maintains performance close to the full-precision model even at a total compression ratio of 36×. |
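Contribution (2) casts bit-width allocation as a discrete constrained optimization problem solved by a greedy, knapsack-style algorithm. The sketch below illustrates only the general shape of such an allocation; the error model (sensitivity-weighted uniform quantization noise shrinking 4× per extra bit), the `sens` proxy standing in for a Hessian-trace estimate, and the candidate bit-widths are simplified assumptions, not the thesis's actual formulation:

```python
def allocate_bitwidths(n_params, sens, budget_bits, choices=(2, 4, 8)):
    """Greedy knapsack-style bit-width allocation (simplified sketch).

    n_params[i]: parameter count of layer i
    sens[i]:     sensitivity proxy for layer i (e.g. a Hessian-trace estimate)
    budget_bits: total bit budget for all weights

    Quantizing layer i at b bits is scored as sens[i] * n_params[i] * 4**-b.
    Every layer starts at the lowest bit-width; we then repeatedly upgrade
    the layer with the best error reduction per extra bit spent, until no
    upgrade fits in the budget."""
    bits = [choices[0]] * len(n_params)
    err = lambda i, b: sens[i] * n_params[i] * 4.0 ** (-b)
    spent = sum(n * b for n, b in zip(n_params, bits))
    while True:
        best, best_gain = None, 0.0
        for i, b in enumerate(bits):
            nxt = next((c for c in choices if c > b), None)  # next bigger width
            if nxt is None:
                continue
            cost = n_params[i] * (nxt - b)                   # extra bits needed
            if spent + cost > budget_bits:
                continue
            gain = (err(i, b) - err(i, nxt)) / cost          # benefit per bit
            if gain > best_gain:
                best, best_gain, best_cost, best_next = i, gain, cost, nxt
        if best is None:
            break
        bits[best], spent = best_next, spent + best_cost
    return bits
```

Under this toy model, a layer with 10× the sensitivity of another soaks up the budget first, e.g. `allocate_bitwidths([100, 100], [10.0, 1.0], 1200)` yields `[8, 4]`.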
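Contribution (4) combines quantization with pruning to exploit their complementary redundancy reduction. A minimal sketch of one common combination follows (magnitude pruning followed by symmetric uniform quantization of the survivors); the sparsity level, bit-width, and ordering are illustrative assumptions, not the thesis's method:

```python
import numpy as np

def prune_then_quantize(W, sparsity=0.75, bits=4):
    """Zero out the smallest-magnitude weights, then quantize the
    survivors with symmetric uniform quantization at `bits` bits."""
    mask = np.abs(W) > np.quantile(np.abs(W), sparsity)   # keep top (1 - sparsity)
    kept = W * mask
    qmax = 2 ** (bits - 1) - 1                            # e.g. 7 for 4-bit
    scale = np.abs(kept).max() / qmax if mask.any() else 1.0
    q = np.clip(np.round(kept / scale), -qmax - 1, qmax)  # integer codes
    return q * scale, mask                                # dequantized weights, mask
```

The complementarity the abstract refers to is visible in the storage arithmetic: at 75% sparsity with 4-bit codes, only a quarter of the weights carry a 4-bit value instead of 32 float bits, a combined reduction on the order of the 36× total compression ratio cited for the proposed quantization-pruning strategy (before sparse-index overhead).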
Keywords | Deep Neural Networks; Model Acceleration and Compression; Quantization; Pruning
Language | Chinese
Sub-direction Classification | AI Chips and Intelligent Computing
State Key Laboratory Research Direction | Intelligent Computing and Learning
Associated Dataset Deposit Required | No
Document Type | Thesis
Identifier | http://ir.ia.ac.cn/handle/173211/52044
Collection | Graduates_Doctoral Theses
Corresponding Author | 陈维汉
Recommended Citation (GB/T 7714) | 陈维汉. 基于量化的神经网络加速压缩算法研究[D], 2023.
Files in This Item:
File Name/Size | Document Type | Version | Access | License
MyThesis-final.pdf (10338 KB) | Thesis | Restricted Access | CC BY-NC-SA