Deep Learning Model Compression and Acceleration Based on Hardware/Software Co-Design
刘泽健
2023-05
Pages: 122
Degree type: Doctoral

Abstract

In recent years, deep learning, with deep neural networks (DNNs) at its core, has reached or even surpassed human-level performance on many artificial intelligence tasks. However, because most DNNs have extremely high computation and storage requirements, deploying them, whether on cloud servers or on edge devices such as smartphones, suffers from high latency and high energy consumption. To make DNN training and inference more efficient, researchers have proposed solutions from both the algorithmic and the hardware perspective. On the algorithmic side, a series of model compression methods aim to reduce the computation and storage a model requires, such as model quantization, which stores the model in low-bit data types; model pruning, which removes redundant computation; and lightweight model design, which builds more efficient computing modules. On the hardware side, researchers design domain-specific architecture chips better suited to running DNNs, namely DNN accelerators. Early DNN accelerator research focused mainly on architectural optimization, improving computational efficiency through parallel computing and data reuse. As the gains from architectural optimization approached their upper limit, researchers began to combine algorithm design with accelerator design, proposing, for example, accelerators for quantized or sparse models. Combining the benefits of model compression (e.g., reduced computational cost) with the efficiency of dedicated architectures further improves accelerator performance. This methodology, which considers algorithm design and accelerator design jointly, is hardware/software co-design.

 

Taking hardware/software co-design as its main methodology and improving model execution efficiency as its main goal, this thesis studies three problems: hardware-friendly model compression, the architecture design of dedicated accelerators, and the automatic joint optimization of model structures and accelerator architectures. The design space of these three problems expands successively, and so do the attainable performance improvements. The main research content and contributions of this thesis are as follows:

1) To address the problem that statically pruned models struggle to be both hardware-friendly and highly accurate, this thesis proposes a dynamic structured pruning method. Specifically, observing that the contribution of a given computational part to model performance changes with the input, this thesis implements a dynamic pruning model that adaptively identifies and prunes the computation that is redundant for a given input; its core mechanism is a set of additional predictors inserted into the model that judge whether a computation is redundant. Further, to better control the model's sparsity and improve its performance, this thesis investigates the structural design of the predictors and the model's training procedure in depth and proposes several improvement strategies, enabling a better trade-off between computational cost and accuracy.
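A minimal PyTorch sketch of input-dependent structured pruning follows. It is an illustration under assumptions, not the thesis's actual design: the predictor here is a hypothetical global-pooling-plus-linear module that scores each output channel of a convolution and zeroes the channels judged redundant for the current input.

```python
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    """Convolution whose output channels are pruned per input by a small
    predictor. Illustrative sketch only; the thesis's actual predictor
    design and training strategy are not specified in this abstract."""

    def __init__(self, in_ch, out_ch, sparsity=0.5):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # Hypothetical predictor: global average pooling + one linear layer
        # producing a score per output channel.
        self.predictor = nn.Linear(in_ch, out_ch)
        self.sparsity = sparsity  # fraction of channels pruned per input

    def forward(self, x):
        summary = x.mean(dim=(2, 3))            # (N, C_in) input summary
        scores = self.predictor(summary)        # (N, C_out) channel scores
        k = max(1, int(self.sparsity * scores.shape[1]))
        # Prune the k lowest-scoring channels for each input in the batch.
        threshold = scores.kthvalue(k, dim=1, keepdim=True).values
        mask = (scores > threshold).float()     # hard 0/1 gate per channel
        return self.conv(x) * mask[:, :, None, None]

layer = GatedConv(16, 32, sparsity=0.5)
out = layer(torch.randn(2, 16, 8, 8))
print(out.shape)  # torch.Size([2, 32, 8, 8])
```

Because entire channels are gated rather than individual weights, the resulting sparsity is structured and can actually be skipped by hardware, which is what makes this style of dynamic pruning hardware-friendly.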

2) To address the high inference latency of the BERT model on general-purpose processors and the inability of previous convolutional neural network accelerators to handle BERT efficiently, this thesis proposes a solution based on the hardware/software co-design methodology. Specifically, it first proposes a quantization method for BERT that reduces the model's storage requirements. Then, by analyzing the model's execution and exploiting the available parallelism and data reuse, it designs a dedicated accelerator architecture for BERT. Experiments show that the accelerator significantly surpasses general-purpose processors in speed and energy efficiency, again demonstrating the effectiveness of hardware/software co-design.
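The abstract does not detail the thesis's BERT quantization scheme, so as a generic stand-in the sketch below shows per-tensor symmetric linear quantization of a weight matrix to int8; the function names are hypothetical. It illustrates the storage reduction that motivates this step: int8 weights occupy a quarter of the space of float32 weights.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Per-tensor symmetric linear quantization to int8 (a generic scheme,
    not necessarily the one used in the thesis)."""
    scale = w.abs().max() / 127.0              # largest magnitude maps to 127
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction, used here only to measure the error."""
    return q.float() * scale

w = torch.randn(768, 768)                      # e.g. one BERT weight matrix
q, scale = quantize_int8(w)
err = (dequantize(q, scale) - w).abs().max().item()
print(f"int8 stores 4x less than float32; max abs error {err:.4f}")
```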

3) To address the low efficiency of previous methods for automatically optimizing model structures and accelerator architectures, this thesis proposes a more efficient optimization method and implements an optimization framework based on reinforcement learning. Specifically, most previous automatic optimization work does not explicitly model the interaction between the model structure and the accelerator architecture, so the search lacks guidance and takes a long time. This thesis instead proposes an optimization method that explicitly exploits the interrelationship between models and accelerators, and builds on it an automatic optimization framework that uses reinforcement learning as the optimization algorithm. Experiments on multiple datasets show that the proposed method significantly shortens optimization time and improves the performance of the resulting designs.
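As a toy illustration of the framework's central idea, the sketch below runs NumPy REINFORCE over a joint design space; the design choices, reward function, and the coupling between model width and accelerator array size are all invented for illustration and are not the thesis's actual search space. The point is that the reward explicitly couples the two sides, so the policy receives direct guidance about which model/accelerator combinations match.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint design space: model channel width x accelerator PE-array size.
widths = [32, 64, 128]
pe_sizes = [8, 16, 32]

def reward(width, pe):
    """Invented reward coupling the two sides: wider models score higher on
    an accuracy proxy, but overloading a small PE array costs latency."""
    accuracy = np.log2(width)
    latency = width / pe
    return accuracy - 0.25 * latency

# One logit per (width, pe) pair; the policy is a softmax over the joint space.
logits = np.zeros((len(widths), len(pe_sizes)))
baseline, lr = 0.0, 0.1

for step in range(500):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    flat = rng.choice(probs.size, p=probs.ravel())
    i, j = np.unravel_index(flat, probs.shape)
    r = reward(widths[i], pe_sizes[j])
    # REINFORCE with a running baseline: raise the log-probability of the
    # sampled pair in proportion to its advantage.
    grad = -probs
    grad[i, j] += 1.0
    logits += lr * (r - baseline) * grad
    baseline += 0.1 * (r - baseline)

i, j = np.unravel_index(logits.argmax(), logits.shape)
print(f"best joint design: width={widths[i]}, PE array={pe_sizes[j]}")
```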

Keywords: Hardware/Software Co-Design, Model Compression, DNN Accelerator, Automated Optimization
Subject area: Computer Architecture
Language: Chinese
Sub-direction classification (seven major directions): AI Chips and Intelligent Computing
State Key Laboratory planning direction: Other
Thesis-associated dataset to be deposited:
Document type: Doctoral thesis
Identifier: http://ir.ia.ac.cn/handle/173211/52031
Collection: Graduates - Doctoral Theses
Laboratory of Cognition and Decision Intelligence for Complex Systems - Efficient Intelligent Computing and Learning
Recommended citation (GB/T 7714):
刘泽健. 基于软硬件协同设计的深度学习模型压缩与加速[D], 2023.
Files in this item:
File name/size: 毕业论文-2023-06-03-08-2 (10064 KB) · Document type: Thesis · Access: Restricted · License: CC BY-NC-SA