Title: Research on Transformer Acceleration Techniques Based on CUDA Operator Fusion and Gradient Sparsification (基于CUDA算子融合和梯度稀疏化的Transformer加速技术研究)
Author: 李光耀 (Li Guangyao)
Date: 2023-05-19
Pages: 72
Degree type: Master's
Abstract (Chinese)

In recent years, Transformer-based neural network models have achieved remarkable success in computer vision, natural language processing, multimodal recognition, and other fields. However, the enormous parameter scale of Transformer models severely constrains their training and inference performance. To reduce the resource requirements of Transformer training and to accelerate both training and inference, a large number of acceleration algorithms have been proposed. These methods fall into two broad categories, single-machine acceleration and distributed acceleration, where distributed acceleration is further based on data-parallel and model-parallel strategies. Based on the characteristics of the Transformer model, this thesis explores optimizations of its training process from two aspects, single-machine computation and distributed data parallelism, and effectively accelerates Transformer training. The main contributions of this thesis are as follows:

(1) For single-machine computation, to address the complex computation logic and low computational efficiency of the Transformer architecture, operator fusion is used to merge multiple I/O-bound operators, which markedly reduces data movement on the GPU and thus accelerates operator execution. Starting from the low-level implementation of the deep learning framework, the method uses CUDA programming to implement the computation flow of several operators in a single CUDA kernel and keeps intermediate results in fast on-chip memory to reduce accesses to GPU global memory. GPU memory management is handled manually, and multiple kernel versions are implemented so that good performance is obtained for arbitrary input tensors. The fused operators are evaluated both in terms of standalone operator execution efficiency and their acceleration effect within the BERT model. The experimental results show that the fused operators achieve good speedups when the input vector length is small and can accelerate BERT training to a certain extent.

(2) For distributed data parallelism, to address the very large communication volume of gradient exchange, which limits the speedup of the training system, gradient sparsification is used to markedly reduce the communication volume. First, based on the characteristics of the gradient distribution during training, the classic gradient sparsification algorithm is improved to resolve its inability to operate at the granularity of individual neural network layers. Second, a new sparse vector representation is proposed; compared with the classic key-value representation, it is better suited to the sparse gradient vectors that arise in neural network training and allows further compression. Finally, a sparse vector communication algorithm adapted to this representation is proposed, which effectively balances the communication load across nodes and thus reduces the overall communication time. The distributed communication optimizations are evaluated on several models, and the experimental results show that each algorithm yields a measurable performance improvement.

Based on the characteristics of the Transformer model, this thesis carries out targeted optimizations and achieves measurable speedups. Nevertheless, many open problems remain in operator fusion and gradient sparsification. In future work, directions such as automated operator fusion, more efficient gradient sparsification algorithms, and communication optimization for model parallelism deserve further in-depth study.

Abstract (English)

In recent years, Transformer-based neural networks have achieved great success in many fields, such as computer vision, natural language processing, and multimodal recognition. However, Transformer models are very large, which restricts their training and inference performance. To reduce the resource requirements of Transformer training and to accelerate training and inference, many acceleration algorithms have been proposed. These methods are mainly divided into two categories, single-machine acceleration and distributed acceleration, where distributed acceleration is based on either data parallelism or model parallelism. In this paper, based on the characteristics of Transformer models, we explore optimizations for single-machine computing and distributed data parallelism, which effectively speed up the training of Transformer models. The main contributions of this paper are as follows:

(1) In terms of single-machine computing optimization, we address the complex computation logic and low computational efficiency of the Transformer by using operator fusion. We fuse multiple I/O-intensive operators, significantly reducing data accesses on the GPU and speeding up operator execution. Starting from the low-level implementation of the deep learning framework, this method uses CUDA programming to implement the computation of multiple operators in a single CUDA kernel, and it keeps intermediate results in fast on-chip memory to reduce accesses to GPU global memory. We manage GPU memory manually and implement multiple kernel versions to ensure good performance for any input tensor. The fused operators are evaluated both in terms of execution efficiency and their acceleration effect in the BERT model. The experimental results show that the fused operators achieve good speedups when the input vector length is small and can accelerate BERT training to a certain extent.
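To make the fusion idea concrete, below is a minimal sketch, not the thesis implementation, of a fused scale-plus-softmax CUDA kernel for attention score rows. An unfused pipeline would launch separate kernels for scaling, the max reduction, exponentiation, and normalization, each reading and writing the full tensor in global memory; the fused kernel keeps the per-row intermediates in shared memory and touches global memory only once for the input and once for the output. The kernel name, launch configuration, and layout assumptions (row-major rows, blockDim.x a power of two no smaller than the row length) are illustrative, not taken from the thesis.

// Fused scale + softmax: one thread block per row, intermediates kept in shared memory.
#include <cuda_runtime.h>
#include <float.h>

__global__ void fused_scale_softmax(const float* __restrict__ in,
                                    float* __restrict__ out,
                                    int row_len, float scale) {
    extern __shared__ float row[];              // per-row scratch in shared memory
    int r = blockIdx.x;                         // row handled by this block
    int t = threadIdx.x;

    // Single read from global memory, fused with the scaling step.
    float v = (t < row_len) ? in[r * row_len + t] * scale : -FLT_MAX;
    row[t] = v;
    __syncthreads();

    // Block-wide max reduction (numerical stability); assumes blockDim.x is a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s) row[t] = fmaxf(row[t], row[t + s]);
        __syncthreads();
    }
    float row_max = row[0];
    __syncthreads();

    // Exponentiation and sum reduction, still entirely in shared memory.
    float e = (t < row_len) ? __expf(v - row_max) : 0.0f;
    row[t] = e;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s) row[t] += row[t + s];
        __syncthreads();
    }
    float row_sum = row[0];

    // Single write back to global memory.
    if (t < row_len) out[r * row_len + t] = e / row_sum;
}

// Example launch for rows of length 128:
// fused_scale_softmax<<<num_rows, 128, 128 * sizeof(float)>>>(d_in, d_out, 128, scale);

Selecting among several kernel versions at launch time (different block sizes, or register-only variants for very short rows) is one way to keep such a fused kernel efficient across input shapes, in the spirit of the multi-version kernel strategy described above.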

(2) In terms of distributed data parallelism, we address the large communication volume of gradient exchange, which limits the speedup of the training system, by using a gradient sparsification algorithm. First, based on the characteristics of the gradient distribution during training, we improve the classic gradient sparsification algorithm and resolve its inability to operate at the granularity of individual neural network layers. Then, we propose a new sparse representation that is better suited to the sparse gradient vectors arising in neural network training; compared with the classic key-value representation, it compresses the sparse gradient vectors further. Finally, we propose a sparse communication algorithm adapted to this representation, which effectively balances the communication load across nodes and reduces the overall communication time. The distributed communication optimizations are evaluated on multiple models, and the experimental results show that each algorithm brings a certain performance improvement.
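For context on the representation being improved, below is a minimal sketch, not the thesis code, of the classic key-value (index, value) sparsification that the abstract takes as its baseline: gradient elements whose magnitude exceeds a threshold are compacted on the GPU before communication, and the remaining elements are kept locally as residuals and folded into the next iteration. The kernel name, buffer names, and the simple threshold rule are illustrative assumptions; the thesis's contribution replaces this pair layout with a more compact sparse representation and adds a load-balanced exchange of the resulting sparse vectors, neither of which is reproduced here.

// Classic key-value gradient sparsification with local residual accumulation.
#include <cuda_runtime.h>

__global__ void sparsify_kv(const float* __restrict__ grad,
                            float* __restrict__ residual,   // error feedback kept on this worker
                            int*   __restrict__ out_idx,    // indices of selected elements
                            float* __restrict__ out_val,    // values of selected elements
                            int*   __restrict__ d_count,    // number of (index, value) pairs produced
                            int n, float threshold) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float g = grad[i] + residual[i];            // add residual left over from earlier steps
    if (fabsf(g) >= threshold) {
        int pos = atomicAdd(d_count, 1);        // reserve a slot; output order is not deterministic
        out_idx[pos] = i;
        out_val[pos] = g;
        residual[i] = 0.0f;                     // this value will be communicated
    } else {
        residual[i] = g;                        // keep locally and retry in the next step
    }
}

// d_count must be zeroed before each launch; the (out_idx, out_val) pairs are then
// exchanged between workers, e.g. via an all-gather of variable-length buffers.

Each selected element costs one 32-bit index plus one 32-bit value in this layout, which is the per-element overhead that a more compact sparse representation, such as the one proposed above, aims to reduce.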

In this paper, we have carried out optimizations tailored to the characteristics of Transformer models and achieved certain speedups. However, many problems in operator fusion and gradient sparsification remain to be solved. In future work, automated operator fusion, more efficient gradient sparsification algorithms, and communication optimization for model parallelism deserve further exploration.

Keywords: Attention Model; CUDA; Operator Fusion; Distributed Training; Gradient Sparsification
Language: Chinese
Sub-direction classification (seven major directions): Other
State Key Laboratory planning direction classification: Other
Associated dataset to be deposited:
Document type: Degree thesis
Identifier: http://ir.ia.ac.cn/handle/173211/51977
Collection: Graduates - Master's Degree Theses
Recommended citation (GB/T 7714):
李光耀. 基于CUDA算子融合和梯度稀疏化的Transformer加速技术研究[D], 2023.
Files in this item:
File name / size: 毕业论文-李光耀.pdf (8011 KB)
Document type: Degree thesis
Version type:
Access: Restricted
License: CC BY-NC-SA