Abstract

In today's era of rapid development of unmanned technology, multi-agent multi-target coverage tasks are receiving significant attention in areas such as communication and military operations. Traditional optimization methods model objectives and constraints for specific coverage scenarios, which requires extensive expert knowledge and is ill-suited to dynamic sequential decision-making tasks. Deep Reinforcement Learning (DRL), a class of methods with significant impact in the decision-making field, is characterized by its capacity to learn through interaction with the environment, flexible design of reward functions, and adaptability to high-dimensional inputs. It has therefore also attracted widespread attention for multi-target coverage tasks. However, the high training cost and weak transfer generalization of DRL methods make them less agile in complex and variable multi-target coverage scenarios, posing new challenges for the field.
In recent years, DRL has achieved remarkable results in the decision and control domain. Nevertheless, DRL algorithms exhibit low flexibility and adaptability when facing complex and variable real-world applications, which hinders their deployment in practice. In many application scenarios, multi-agent multi-target coverage tasks involve flexible and variable scenes, continuous and highly dynamic behaviors, and the coupling of multiple elements, making them typical of complex real-world tasks. This paper therefore takes this task as the scenario for researching the transfer generalization problem of DRL.
Accordingly, this paper focuses on the multi-agent multi-target coverage task as the research scenario and primarily studies the transfer generalization capabilities of DRL methods in such complex decision-making tasks. The main content and innovations of this paper are as follows:
(1) To address the difficulty of exploring the policy state space as scenario elements increase, a twin-observation curriculum learning algorithm is proposed for transferring multi-agent policies to new environments. Because a larger number of targets complicates the processing of observational information, clustering methods are used to extract key information from observations, which is combined with the original information as input features for the policy front end. To counteract the state-space explosion caused by an increased number of agents, a curriculum learning method is employed to learn gradually from simple to complex environments. Experiments demonstrate that the twin-observation module effectively handles the growth of observation dimensionality as the number of targets increases, while the curriculum learning mechanism improves training efficiency and policy performance as the number of agents grows.
(2) To keep the policy from overfitting to the training scenarios, a policy learning framework with stochastic environment generation and shared feature extraction is proposed to enhance the zero-shot generalization capability of policies in multi-target coverage tasks. To address the single source of data during training, a series of environment parameter distributions is designed to randomize the generation of training environments, improving policy generalization from a data-augmentation perspective. To address catastrophic forgetting under the curriculum learning mechanism, a domain-adaptation-based common feature extraction module is devised, which extracts prior knowledge from the environment to assist downstream policy learning. Experiments verify that domain randomization effectively counters overfitting to a single scenario, and that domain adaptation improves the algorithm's transfer generalization in new environments.
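The environment-randomization step can be sketched as below. The parameter names and sampling ranges are illustrative assumptions; the thesis designs its own distributions for the coverage scenario.

```python
import random
from dataclasses import dataclass

@dataclass
class CoverageEnvParams:
    """Hypothetical parameters of one coverage environment;
    fields and ranges are illustrative, not the thesis's actual ones."""
    num_agents: int
    num_targets: int
    arena_size: float
    sensor_range: float

def sample_env_params(rng: random.Random) -> CoverageEnvParams:
    """Domain randomization: draw each training episode's environment
    from parameter distributions instead of a fixed configuration,
    so the policy never sees a single canonical scenario."""
    return CoverageEnvParams(
        num_agents=rng.randint(2, 8),
        num_targets=rng.randint(5, 30),
        arena_size=rng.uniform(50.0, 200.0),
        sensor_range=rng.uniform(5.0, 20.0),
    )
```

Training then resamples these parameters every episode, which is the data-augmentation view of generalization described above.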
(3) To tackle the problem of quantitatively assessing the transfer generalization ability of reinforcement learning policies, a policy transfer generalization evaluation framework oriented towards environmental differences is proposed, and two methods are introduced under it. From the perspective of policy performance, statistics built from reward values, the most direct reflection of policy performance, are used to measure how a policy's rewards change in new environments; three different ways of computing reward values with the source environment's reward function help pinpoint which specific modules are affected by environmental shifts. From the perspective of policy behavior, statistics over policy trajectories in the source and target environments measure how trajectories change in new environments. Experiments validate the effectiveness and flexibility of the two proposed methods compared with other baseline algorithms.
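A minimal sketch of the reward-based view: the statistic below (relative drop in mean episode return between source and target rollouts) is one illustrative choice of measure, not the thesis's exact metric.

```python
import statistics

def generalization_gap(source_returns, target_returns):
    """Relative drop in mean episode return when the policy is moved
    from the source environment to the target environment
    (illustrative statistic; assumes nonzero mean source return)."""
    mu_src = statistics.mean(source_returns)
    mu_tgt = statistics.mean(target_returns)
    return (mu_src - mu_tgt) / abs(mu_src)
```

A gap near zero suggests the policy transfers well; a large positive gap quantifies how much performance degrades under the environmental shift.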
To sum up, this paper takes the application of deep reinforcement learning in multi-agent multi-target coverage tasks as its task scenario and investigates the transfer generalization of DRL methods in such complex decision-making scenarios. The findings provide algorithmic support for the transfer generalization and practical deployment of multi-target coverage tasks, and offer solutions for the transfer generalization problems of reinforcement learning.