面向异构集群的分布式训练与优化算法 (Distributed Training and Optimization Algorithms for Heterogeneous Clusters)
晁永越
2023-05-19
Pages: 96
Subtype: Master's thesis
Abstract

With the rapid development of deep learning in natural language processing, image processing, and other fields, datasets and network models have become increasingly complex, making it difficult for a single machine to complete training tasks efficiently. Distributed parallel training on computer clusters has emerged in response. However, heterogeneous clusters whose nodes differ in performance suffer from the straggler problem: mainstream distributed deep learning algorithms force every node to wait for the slowest one during distributed training, degrading training performance. To alleviate the straggler problem of heterogeneous clusters in distributed training, this thesis explores distributed training algorithms for heterogeneous clusters from two directions: task allocation and distributed SGD algorithms.

(1) An adaptive centralized task allocation algorithm based on global node information: so that each node in a heterogeneous cluster receives a workload matched to its computing power, this thesis improves equal task allocation into an adaptive task allocation algorithm driven by global node information. The algorithm builds a mathematical model relating global node information to each node's task ratio in the next round, obtains each node's share of the total workload per round, distributes the workload, and adjusts the minibatch size accordingly, thereby minimizing the time nodes spend waiting for one another and improving distributed training performance.

(2) An adaptive distributed task allocation algorithm based on local node information: this thesis further improves centralized task allocation into distributed task allocation, proposing an adaptive distributed algorithm in which each node relies only on its own gradient computation time, gradually decreasing or increasing its per-round task ratio to balance the global workload. Based on the gap between a preset time threshold and the node's own gradient computation time, the algorithm sets different adjustment step sizes and proposes different distributed allocation strategies, reducing the computing resources consumed and increasing training speed.

(3) An asynchronous SGD algorithm based on dynamic Partial-Reduce: because synchronous SGD depends on a global ring All-Reduce, this thesis proposes a distributed asynchronous SGD algorithm based on dynamic Partial-Reduce to cut communication overhead. At each parameter fusion, a randomly designated special node collects the nodes in the synchronization waiting area and builds a Partial-Reduce communication group, reducing the time wasted on parameter fusion. The thesis further partitions real-world heterogeneous clusters and, on that basis, proposes a strategic dynamic Partial-Reduce in which the special nodes are chosen strategically, achieving efficient dynamic partial ring reduction. In addition, the thesis gives a mathematical analysis of the convergence of the network model under the dynamic Partial-Reduce algorithm.

On heterogeneous clusters, these methods effectively increase the distributed training speed of distributed SGD algorithms and raise network training throughput, providing a broader perspective and foundation for current distributed parallel training.

Other Abstract

With the rapid development of deep learning in natural language processing and image processing, datasets and network models have become increasingly complex, which makes it difficult for a single computer to complete training tasks efficiently. Distributed parallel training methods on computer clusters have emerged in response. However, in heterogeneous clusters whose nodes differ in performance, the straggler problem arises: mainstream distributed deep learning algorithms force all nodes to wait for the slowest node during distributed training, reducing training performance. To mitigate the straggler problem of heterogeneous clusters in distributed training, this paper explores distributed training algorithms for heterogeneous clusters from two directions: task allocation and distributed SGD algorithms.

(1) Adaptive centralized task allocation algorithm based on global node information: to let each worker in a heterogeneous cluster obtain a workload that matches its computing power, this paper improves the equal task allocation of the original distributed training algorithm into an adaptive task allocation algorithm based on global node information. The algorithm builds a mathematical model that maps the global information of all workers to each worker's task ratio for the new epoch, obtaining each worker's share of the total task volume per epoch; it then distributes the tasks and adjusts the minibatch size, thereby minimizing the waiting time between workers and improving distributed training performance.
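The thesis derives the ratios from its own model of global node information; as a rough, hypothetical sketch (not the thesis's exact model), one natural rule is to weight each worker by the throughput it demonstrated in the previous epoch. The function name and arguments below are illustrative assumptions:

```python
def allocate_tasks(total_samples, grad_times, prev_batches):
    """Split the next epoch's workload in proportion to measured throughput.

    grad_times[i]   -- worker i's gradient-computation time last epoch (s)
    prev_batches[i] -- number of samples worker i processed last epoch
    Returns per-worker sample counts summing to total_samples.
    """
    # Estimated throughput of each worker: samples processed per second.
    speeds = [n / t for n, t in zip(prev_batches, grad_times)]
    total_speed = sum(speeds)
    # Each worker's task ratio is its share of the cluster's total speed.
    shares = [int(total_samples * s / total_speed) for s in speeds]
    # Give any rounding remainder to the fastest worker.
    shares[speeds.index(max(speeds))] += total_samples - sum(shares)
    return shares

# Example: three heterogeneous workers, the first twice as fast as the others.
print(allocate_tasks(1024, grad_times=[1.0, 2.0, 2.0],
                     prev_batches=[100, 100, 100]))
# -> [512, 256, 256]
```

Because the ratios are recomputed every epoch from fresh timings, a node whose effective speed drifts (for example, under co-located load) is rebalanced automatically.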

(2) Adaptive distributed task allocation algorithm based on local node information: this paper improves centralized task allocation into distributed task allocation and proposes an adaptive distributed task allocation algorithm based on local information. Each worker relies only on its own gradient computation time, gradually decreasing or increasing its per-epoch task ratio to balance the global task volume. Based on the difference between a preset time threshold and the worker's own gradient computation time, the algorithm sets different adjustment step sizes and proposes different distributed allocation strategies, reducing the computing resources consumed and improving training speed.
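A minimal sketch of such a local update rule, assuming one time threshold and two step sizes; the names, constants, and tolerance band below are illustrative assumptions, not the thesis's exact parameters:

```python
def adjust_ratio(ratio, grad_time, threshold,
                 coarse_step=0.10, fine_step=0.02, tolerance=0.05):
    """Locally adjust one worker's task ratio from its own gradient time.

    A worker far from the threshold corrects quickly (coarse step); one
    only slightly off corrects gently (fine step). No global information
    is exchanged -- each worker runs this independently every epoch.
    """
    gap = grad_time - threshold            # >0: too slow, <0: spare capacity
    if abs(gap) <= tolerance * threshold:
        return ratio                       # close enough: keep current share
    step = coarse_step if abs(gap) > 0.5 * threshold else fine_step
    if gap > 0:
        return max(0.0, ratio - step)      # slow worker: shed some work
    return ratio + step                    # fast worker: take on more work

# A straggler well over the threshold sheds work with the coarse step.
print(adjust_ratio(ratio=0.25, grad_time=3.2, threshold=2.0))  # -> 0.15
```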

(3) Asynchronous SGD algorithm based on dynamic Partial-Reduce: because synchronous SGD relies on a global All-Reduce, this paper proposes a distributed asynchronous SGD algorithm based on dynamic Partial-Reduce to reduce communication overhead. At each parameter fusion, a randomly designated special worker collects the workers in the synchronization waiting area, and a Partial-Reduce communication group is constructed to reduce the time wasted in parameter fusion. This paper also partitions real-world heterogeneous clusters and, on that basis, proposes a strategic dynamic Partial-Reduce: by choosing the special workers strategically, an efficient dynamic Partial-Reduce is achieved. In addition, this paper presents a mathematical analysis of the convergence of the network model under the dynamic Partial-Reduce algorithm.
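In a real deployment the fusion inside each group would run as a ring all-reduce over the network; the standalone sketch below only mimics the group formation and in-group averaging on a single coordinator, and names such as waiting_area and min_group are assumptions made for illustration:

```python
import random

def partial_reduce_step(waiting_area, params, min_group=2):
    """One dynamic Partial-Reduce fusion step (coordinator-side sketch).

    waiting_area -- ids of workers that have finished their local step
    params       -- dict: worker id -> current parameter vector (list)
    Only workers already waiting are fused; stragglers keep computing.
    """
    if len(waiting_area) < min_group:
        return None                          # not enough ready workers yet
    special = random.choice(waiting_area)    # randomly designated special node
    group = list(waiting_area)               # the Partial-Reduce group
    # Average parameters over the group only (a ring all-reduce in practice).
    dim = len(params[group[0]])
    avg = [sum(params[w][k] for w in group) / len(group) for k in range(dim)]
    for w in group:
        params[w] = list(avg)
    return special, group

# Two fast workers are ready; slow worker 2 is left undisturbed.
params = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [9.0, 9.0]}
partial_reduce_step([0, 1], params)
print(params)   # {0: [2.0, 3.0], 1: [2.0, 3.0], 2: [9.0, 9.0]}
```

The key design point the sketch mirrors is that group membership is decided per fusion step from whoever is waiting, so no worker ever blocks on the slowest node in the cluster.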

The above methods effectively improve the distributed training speed of distributed SGD algorithms in heterogeneous clusters, increase the throughput during network training, and accelerate the convergence of the network model, providing a broader perspective and foundation for current distributed parallel training.

Keywords: distributed deep learning; heterogeneous clusters; task allocation; asynchronous SGD algorithm
Subject Area: Computer Science and Technology; Other Disciplines of Artificial Intelligence; Parallel Processing
MOST Discipline Catalogue: Engineering; Engineering::Computer Science and Technology (degrees awarded in Engineering or Science)
Language: Chinese
Sub-direction Classification: Intelligent Computing Systems
Planning Direction of the National Key Laboratory: Physical Artificial Intelligence Systems (Software and Hardware)
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/52041
Collection: Graduates_Master's Theses
Corresponding Author: 晁永越
Recommended Citation
GB/T 7714
晁永越. 面向异构集群的分布式训练与优化算法[D], 2023.
Files in This Item:
File Name/Size: 面向异构集群的分布式训练与优化算法.pdf (6482 KB)
DocType: 学位论文 (Thesis)
Access License: 限制开放 (Restricted Access), CC BY-NC-SA