English Abstract

In recent years, deep neural networks (DNNs) have shown remarkable performance in a wide range of applications. However, as their performance continues to improve, network structures become increasingly complex, and the computational and storage requirements grow accordingly. The high computational cost of training and inference is the main obstacle to deploying DNNs. Research on acceleration and compression methods is therefore of great significance for further improving the efficiency of deep neural networks.
In this thesis, we investigate the acceleration and compression of DNNs from the perspectives of fixed-point quantization and sparsity, and propose the following approaches:
To address the optimization difficulties of ternary neural networks during gradient backpropagation, we propose a soft-threshold ternary network method. Previous ternary methods use a manually designed hard threshold to map floating-point weights to ternary weights. We first show that this hard threshold introduces unnecessary constraints that limit the expressive power of the ternary network. Based on this analysis, we propose a soft-threshold ternary quantization method that removes the dependence on the hard threshold and significantly improves the accuracy of ternary networks on large-scale datasets.
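A minimal PyTorch sketch of the contrast, not the thesis implementation: the hard-threshold baseline follows the common TWN-style rule (the 0.7 scale factor is that baseline's convention), while the soft-threshold variant realizes a ternary weight as the sum of two binary weights, so the zero region emerges from two learned latent tensors rather than a hand-designed threshold. Function names are illustrative.

```python
import torch

def hard_threshold_ternarize(w: torch.Tensor, delta_scale: float = 0.7) -> torch.Tensor:
    """Hard-threshold baseline (TWN-style): a hand-crafted threshold
    delta = 0.7 * E|w| maps each weight to {-1, 0, +1}."""
    delta = delta_scale * w.abs().mean()
    return torch.sign(w) * (w.abs() > delta).float()

def soft_threshold_ternarize(w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    """Soft-threshold idea (one possible realization): a ternary weight is
    the sum of two binary weights, so where the zeros fall is decided by the
    signs of two independently learned latent tensors, not a fixed threshold."""
    return 0.5 * (torch.sign(w1) + torch.sign(w2))  # values in {-1, 0, +1}
```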
To enhance the representation ability of binary neural networks, we propose latent-variable-enhanced binary neural networks. In previous binary training schemes, the full-precision weights serve only as latent variables that accumulate gradient information, while their capacity as full-precision feature extractors is ignored. We first restore the representational power of the full-precision latent variables by recalculating the batch-normalization (BN) statistics and replacing the binary activation functions, and add the resulting full-precision branch to the computation graph. In addition, we design a feature approximation loss that incorporates label information: it not only drives the full-precision and binary features toward similar distributions, but also clusters high-level semantic features sharing the same classification label more tightly, thereby narrowing the performance gap between binary and full-precision models.
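The exact loss is defined in the thesis; the sketch below is only an assumed form that captures the two stated goals: a distribution-alignment term between binary and full-precision features, and an intra-class compactness term driven by the labels. The function name and the `beta` weighting are hypothetical.

```python
import torch
import torch.nn.functional as F

def feature_approximation_loss(f_bin, f_fp, labels, beta: float = 1.0):
    """Label-aware feature approximation loss (assumed form, not the
    thesis's exact definition)."""
    # Align the distributions of binary and full-precision features.
    align = F.mse_loss(F.normalize(f_bin, dim=1), F.normalize(f_fp, dim=1))
    # Pull binary features with the same label toward their class mean.
    compact = f_bin.new_zeros(())
    classes = labels.unique()
    for c in classes:
        fc = f_bin[labels == c]
        compact = compact + ((fc - fc.mean(0, keepdim=True)) ** 2).mean()
    return align + beta * compact / len(classes)
```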
To achieve quantization-aware training for arbitrary bit widths, we propose soft-threshold fixed-point quantization. We first analyze the limitations of the rounding function in existing quantizers: it fixes the discrete quantization intervals and restricts the feasible solution space of fixed-point quantization. Based on this analysis, we extend the soft-threshold idea from ternary to arbitrary bit widths, allowing the discrete values to be adaptively determined during training without relying on a fixed segmentation function. Finally, we design a dedicated quantization accelerator on FPGA to validate the accuracy and speed of the quantization scheme on large-scale classification and detection tasks.
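To make the rounding limitation concrete, the first function below is the standard rounding-based uniform quantizer, whose decision boundaries sit rigidly halfway between adjacent levels. The second is an assumed sketch of the soft-threshold extension: the quantized value is a learned sum of binary bases, so the effective boundaries are trained rather than fixed by `round()`. Both names and signatures are illustrative.

```python
import torch

def round_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Rounding-based uniform quantizer: round() pins every decision
    boundary halfway between adjacent levels -- the rigidity analyzed above."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(torch.clamp(w / scale, -qmax - 1, qmax)) * scale

def soft_threshold_quantize(latents, alphas):
    """Assumed soft-threshold extension to higher bit widths: a learned sum
    of binary bases, so the discrete levels and their boundaries adapt
    during training instead of following a fixed segmentation function."""
    return sum(a * torch.sign(w) for a, w in zip(alphas, latents))
```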
To address the high cost and long duration of DNN training, we propose the Fully Sparse Training (FST) method. We first conduct a sparsity-sensitivity analysis of the training process on NVIDIA Ampere-architecture GPUs and select sparsification targets that are robust to structured pruning. Based on this analysis, we design targeted sparsification schemes for forward propagation, backpropagation, and weight-gradient computation, ensuring efficient online sparsity while minimizing information loss. Experiments on classification, detection, and segmentation tasks show that the method achieves 2$\times$ training acceleration with almost no accuracy loss.
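The structured pattern accelerated by Ampere sparse tensor cores is 2:4 sparsity (two nonzeros in every contiguous group of four). The sketch below shows only this basic magnitude-based masking step, not FST's per-pass schemes; the helper name is illustrative, and the tensor's element count is assumed divisible by 4.

```python
import torch

def prune_2_of_4(w: torch.Tensor) -> torch.Tensor:
    """2:4 structured pruning, the pattern Ampere sparse tensor cores
    accelerate: in each contiguous group of 4 weights, zero out the 2
    smallest-magnitude entries (assumes w.numel() % 4 == 0)."""
    groups = w.reshape(-1, 4)
    _, drop = groups.abs().topk(2, dim=1, largest=False)  # 2 weakest per group
    mask = torch.ones_like(groups).scatter_(1, drop, 0.0)
    return (groups * mask).reshape(w.shape)
```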