In recent years, deep neural networks (DNNs) have evolved rapidly and attracted widespread attention from researchers and developers in both academia and industry. The technology has achieved major breakthroughs in fields such as computer vision, speech recognition, and natural language processing, significantly improving performance on a wide range of tasks. Deep neural networks are now widely used in industries including e-commerce, video surveillance, autonomous driving, and computer-aided diagnosis, and have gradually become indispensable building blocks of these intelligent applications.
In addition to more powerful network architectures and more efficient training strategies, two other important drivers of the rapid development of DNNs are the massive growth of data and the rapid increase in computing power. Large amounts of labeled data effectively reduce overfitting, allowing developers to design sufficiently large networks, while advances in GPUs have made the training of large-scale deep neural networks feasible. As a result, the performance of deep neural networks has continued to improve, but their architectures have also become increasingly complex. Meanwhile, as deep neural network technology matures, the demand for deploying deep models in intelligent applications is growing rapidly. For many resource-constrained devices and real-time applications, however, the high computational and storage costs are the major obstacle to deployment. The study of network acceleration and compression therefore has important theoretical and practical value: improving the computational efficiency of deep neural networks can promote the adoption of neural network technology across many fields.
To address the problems of accelerating and compressing deep neural networks, this dissertation conducts a series of studies on low-rank decomposition and fixed-point quantization. The specific research content and contributions are summarized as follows:
A method based on tensor low-rank and group sparse decomposition is proposed for the acceleration and compression of convolutional neural networks (CNNs). In a CNN, most of the computation resides in the convolutional and fully connected layers, whose basic operations can be converted into matrix multiplications. Methods for accelerating matrix multiplication, such as low-rank decomposition and sparsification, can therefore in principle be used to accelerate CNNs. However, to keep the approximation error of a low-rank decomposition small, a relatively large rank is usually required, so the achievable acceleration is very limited. Sparsity-based methods can greatly reduce the amount of computation, but because the sparsity pattern is random and unstructured, the actual speedup is also very limited. To address both problems, this dissertation proposes a CNN acceleration method based on tensor low-rank and group sparse decomposition. The convolutional kernel tensor is decomposed into a sum of low-rank tensors whose core tensor is group-sparse, which effectively reduces the computational cost even when a large rank is used. At the same time, because the sparsity is structured, a high actual speedup can be obtained in practice, as illustrated by the sketch below.
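To make the contrast concrete, the following NumPy sketch compares a single truncated SVD of a flattened convolutional kernel with a group-structured variant in which each output-channel group gets its own small-rank factor. The shapes, group count, and ranks are illustrative assumptions, not the dissertation's actual algorithm or hyperparameters.

```python
# Illustrative sketch: plain low-rank vs. group-structured low-rank
# approximation of a convolutional kernel (assumed toy shapes; not the
# dissertation's exact decomposition).
import numpy as np

rng = np.random.default_rng(0)

# Toy convolutional kernel: n output channels, c input channels, k x k spatial.
n, c, k = 64, 32, 3
W = rng.standard_normal((n, c, k, k))

# Flatten to 2-D so the convolution becomes a matrix multiply (im2col view):
# rows index output channels, columns index input-channel/spatial positions.
W2d = W.reshape(n, c * k * k)

# Plain truncated SVD: a single low-rank factor needs a fairly large rank r
# to keep the approximation error small.
U, s, Vt = np.linalg.svd(W2d, full_matrices=False)
r = 16
W_lowrank = (U[:, :r] * s[:r]) @ Vt[:r]

# Group-sparse variant (sketch): split the output channels into groups and
# give each group its own small-rank factor. Each term is low-rank *and*
# structured-sparse (it touches only one channel group), so the per-term
# rank stays tiny while the sum remains accurate and hardware-friendly.
groups = np.array_split(np.arange(n), 4)
r_g = 4
W_group = np.zeros_like(W2d)
for g in groups:
    Ug, sg, Vgt = np.linalg.svd(W2d[g], full_matrices=False)
    W_group[g] = (Ug[:, :r_g] * sg[:r_g]) @ Vgt[:r_g]

err = lambda A: np.linalg.norm(W2d - A) / np.linalg.norm(W2d)
print(f"rank-{r} SVD error:              {err(W_lowrank):.3f}")
print(f"group-sparse 4 x rank-{r_g} error:  {err(W_group):.3f}")
```

Because each group-sparse term only reads and writes one channel group, its computation maps to dense block matrix multiplies, which is why structured sparsity translates into actual speedup where random sparsity does not.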
A fixed-point matrix decomposition based method is proposed to quantize the weights of deep neural networks into ternary values. Deep neural networks usually use floating-point representations, but floating-point operations consume a large amount of hardware resources. Low-bit quantization of network parameters can greatly reduce this resource consumption. However, with extremely low-bit representations, the accuracy of the quantized network usually drops considerably compared with its full-precision counterpart. To address this problem, this dissertation proposes a unified framework called Fixed-point Factorized Network (FFN), which quantizes network weights into ternary values. Given a pre-trained full-precision network, a fixed-point decomposition is performed on each weight matrix. A full-precision weight recovery method is further proposed to alleviate the information loss incurred during the fixed-point decomposition. In addition, we theoretically analyze the gradient mismatch phenomenon that is ubiquitous among matrix/tensor decomposition based acceleration methods, and propose an effective weight balancing technique to alleviate this problem.
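The sketch below conveys the core idea of a ternary fixed-point factorization: approximate a full-precision weight matrix W by X diag(d) Y^T with the entries of X and Y restricted to {-1, 0, +1}, so that the dominant multiplications degenerate into additions and subtractions. The greedy rank-one loop and the ternarization threshold are illustrative stand-ins, not the dissertation's exact algorithm.

```python
# Minimal sketch of ternary fixed-point factorization: W ~ X @ diag(d) @ Y.T
# with X, Y ternary. Greedy semidiscrete-style heuristic for illustration
# only; the thresholding rule and rank are assumptions.
import numpy as np

def ternarize(v, sparsity=0.3):
    """Map a real vector to {-1, 0, +1}, zeroing its smallest entries."""
    t = np.sign(v)
    cutoff = np.quantile(np.abs(v), sparsity)
    t[np.abs(v) < cutoff] = 0
    return t

def ternary_factorize(W, rank):
    m, n = W.shape
    X = np.zeros((m, rank))
    Y = np.zeros((n, rank))
    d = np.zeros(rank)
    R = W.copy()                              # residual left to explain
    for i in range(rank):
        # Seed each ternary rank-one term with the leading singular pair.
        U, s, Vt = np.linalg.svd(R, full_matrices=False)
        x, y = ternarize(U[:, 0]), ternarize(Vt[0])
        # Optimal full-precision scale for the fixed ternary pair (x, y).
        denom = (x @ x) * (y @ y)
        d[i] = (x @ R @ y) / denom if denom > 0 else 0.0
        X[:, i], Y[:, i] = x, y
        R -= d[i] * np.outer(x, y)
    return X, d, Y

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
X, d, Y = ternary_factorize(W, rank=32)
W_hat = X @ np.diag(d) @ Y.T
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

The residual error left by such a decomposition is what the full-precision weight recovery step is designed to absorb before fine-tuning.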
A two-step quantization method is proposed to quantize both the weights and the activations of deep neural networks. At inference time, a network involves two kinds of quantities: weights and activations. If only the weights are quantized, a large number of floating-point operations remain, so activation quantization is also needed. Traditional fixed-point quantization methods try to quantize weights and activations at the same time, but the combined quantization error makes the network difficult to train to convergence. To address this problem, this dissertation proposes a Two-Step Quantization (TSQ) framework for learning low-bit neural networks, which decouples weight quantization from activation quantization. In the first step, a sparse quantization method is proposed to learn low-bit feature representations of the hidden layers while the network weights remain continuous. In the second step, only the feature transformation from one layer to the next is learned; this can be cast as a nonlinear least-squares problem with fixed-point constraints, for which an iterative optimization method is proposed, sketched below.
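The following sketch illustrates the shape of the second-step problem under heavy simplifications: given fixed low-bit input activations X and target activations Z from the first step, fit a layer scale alpha and a ternary code T so that relu(X (alpha T)) approximates Z. The continuous least-squares warm start and the alternating scale/code updates are illustrative heuristics, not the dissertation's exact iterative solver.

```python
# Simplified sketch of TSQ's second step: nonlinear least squares with
# ternary weight constraints, solved here by a crude two-phase heuristic
# (assumed shapes, bit-widths, and update rules).
import numpy as np

relu = lambda a: np.maximum(a, 0.0)

rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(256, 32)).astype(float)   # 2-bit inputs
Z = relu(X @ rng.standard_normal((32, 16)))            # first-step targets

# Phase 1: continuous least-squares fit of the transformation
# (the nonlinearity is ignored here for simplicity; the actual method
# handles it inside the optimization).
W_ls, *_ = np.linalg.lstsq(X, Z, rcond=None)

# Phase 2: alternating minimization of ||W_ls - alpha * T||_F
# over the scale alpha and the ternary code T in {-1, 0, +1}.
alpha = np.abs(W_ls).mean()
for _ in range(30):
    T = np.sign(W_ls) * (np.abs(W_ls) > alpha / 2)     # fix alpha, update T
    nnz = np.abs(T).sum()
    if nnz == 0:
        break
    alpha = np.abs(W_ls * T).sum() / nnz               # fix T, update alpha

err = np.linalg.norm(relu(alpha * (X @ T)) - Z) / np.linalg.norm(Z)
print(f"relative activation error after ternarization: {err:.3f}")
```

Because the inputs X are already quantized and the targets Z are fixed, each layer can be fitted independently in this second step, which is precisely what makes decoupling the two quantization problems tractable.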