English Abstract

In recent years, deep neural networks (DNNs) have shown remarkable performance in a wide range of applications. However, as their performance continues to improve, network structures become increasingly complex, and the computational and storage requirements grow accordingly. The high computational cost of training and inference is the main obstacle to deploying DNNs. Research on acceleration and compression methods is therefore of great significance for further improving the efficiency of deep neural networks.
In this thesis, we investigate the acceleration and compression of DNNs from the perspectives of fixed-point quantization and sparsity, and propose the following approaches:
To address the optimization difficulties of ternary neural networks during gradient backpropagation, we propose a soft-threshold ternary network method. Previous ternary methods use a manually designed hard threshold to map floating-point weights to ternary weights. We first show that this hard threshold introduces unnecessary constraints that limit the expressive power of the ternary network. Based on this analysis, we propose a soft-threshold ternary quantization method that removes the dependence on the hard threshold and significantly improves the accuracy of ternary networks on large-scale datasets.
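A minimal PyTorch sketch of the contrast, not the thesis implementation: the hard-threshold baseline follows the common TWN-style rule (the 0.7 scale factor is that baseline's convention), while the soft-threshold variant realizes a ternary weight as the sum of two binary weights, so the zero region emerges from two learned latent tensors rather than a hand-designed threshold. Function names are illustrative.

```python
import torch

def hard_threshold_ternarize(w: torch.Tensor, delta_scale: float = 0.7) -> torch.Tensor:
    """Hard-threshold baseline (TWN-style): a hand-crafted threshold
    delta = 0.7 * E|w| maps each weight to {-1, 0, +1}."""
    delta = delta_scale * w.abs().mean()
    return torch.sign(w) * (w.abs() > delta).float()

def soft_threshold_ternarize(w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    """Soft-threshold idea (one possible realization): a ternary weight is
    the sum of two binary weights, so where the zeros fall is decided by the
    signs of two independently learned latent tensors, not a fixed threshold."""
    return 0.5 * (torch.sign(w1) + torch.sign(w2))  # values in {-1, 0, +1}
```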
To enhance the representation ability of binary neural networks, we propose latent-variable-enhanced binary neural networks. In previous binary training schemes, the full-precision weights serve only as latent variables that accumulate gradient information, while their capacity as full-precision feature extractors is ignored. We first restore the representational power of the full-precision latent variables by recalculating the batch-normalization (BN) statistics and replacing the binary activation functions, and add the resulting full-precision branch to the computation graph. In addition, we design a feature approximation loss that incorporates label information: it not only drives the full-precision and binary features toward similar distributions, but also clusters high-level semantic features sharing the same classification label more tightly, thereby narrowing the performance gap between binary and full-precision models.
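The exact loss is defined in the thesis; the sketch below is only an assumed form that captures the two stated goals: a distribution-alignment term between binary and full-precision features, and an intra-class compactness term driven by the labels. The function name and the `beta` weighting are hypothetical.

```python
import torch
import torch.nn.functional as F

def feature_approximation_loss(f_bin, f_fp, labels, beta: float = 1.0):
    """Label-aware feature approximation loss (assumed form, not the
    thesis's exact definition)."""
    # Align the distributions of binary and full-precision features.
    align = F.mse_loss(F.normalize(f_bin, dim=1), F.normalize(f_fp, dim=1))
    # Pull binary features with the same label toward their class mean.
    compact = f_bin.new_zeros(())
    classes = labels.unique()
    for c in classes:
        fc = f_bin[labels == c]
        compact = compact + ((fc - fc.mean(0, keepdim=True)) ** 2).mean()
    return align + beta * compact / len(classes)
```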
To achieve quantization-aware training for arbitrary bit widths, we propose soft-threshold fixed-point quantization. We first analyze the limitations of the rounding function in existing quantizers: it fixes the discrete quantization intervals and restricts the feasible solution space of fixed-point quantization. Based on this analysis, we extend the soft-threshold idea from ternary to arbitrary bit widths, allowing the discrete values to be adaptively determined during training without relying on a fixed segmentation function. Finally, we design a dedicated quantization accelerator on FPGA to validate the accuracy and speed of the quantization scheme on large-scale classification and detection tasks.
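To make the rounding limitation concrete, the first function below is the standard rounding-based uniform quantizer, whose decision boundaries sit rigidly halfway between adjacent levels. The second is an assumed sketch of the soft-threshold extension: the quantized value is a learned sum of binary bases, so the effective boundaries are trained rather than fixed by `round()`. Both names and signatures are illustrative.

```python
import torch

def round_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Rounding-based uniform quantizer: round() pins every decision
    boundary halfway between adjacent levels -- the rigidity analyzed above."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(torch.clamp(w / scale, -qmax - 1, qmax)) * scale

def soft_threshold_quantize(latents, alphas):
    """Assumed soft-threshold extension to higher bit widths: a learned sum
    of binary bases, so the discrete levels and their boundaries adapt
    during training instead of following a fixed segmentation function."""
    return sum(a * torch.sign(w) for a, w in zip(alphas, latents))
```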
To address the high cost and long duration of DNN training, we propose the Fully Sparse Training (FST) method. We first conduct a sparsity-sensitivity analysis of the training process on NVIDIA Ampere-architecture GPUs and select sparsification targets that are robust to structured pruning. Based on this analysis, we design targeted sparsification schemes for forward propagation, backpropagation, and weight-gradient computation, ensuring efficient online sparsity while minimizing information loss. Experiments on classification, detection, and segmentation tasks show that the method achieves 2$\times$ training acceleration with almost no accuracy loss.
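The structured pattern accelerated by Ampere sparse tensor cores is 2:4 sparsity (two nonzeros in every contiguous group of four). The sketch below shows only this basic magnitude-based masking step, not FST's per-pass schemes; the helper name is illustrative, and the tensor's element count is assumed divisible by 4.

```python
import torch

def prune_2_of_4(w: torch.Tensor) -> torch.Tensor:
    """2:4 structured pruning, the pattern Ampere sparse tensor cores
    accelerate: in each contiguous group of 4 weights, zero out the 2
    smallest-magnitude entries (assumes w.numel() % 4 == 0)."""
    groups = w.reshape(-1, 4)
    _, drop = groups.abs().topk(2, dim=1, largest=False)  # 2 weakest per group
    mask = torch.ones_like(groups).scatter_(1, drop, 0.0)
    return (groups * mask).reshape(w.shape)
```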