Vision-Language Navigation Inspired by Reasoning Mechanisms (推理机制启发的视觉语言导航)
安东
2024-05
Pages: 118
Subtype: Doctoral
Abstract

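The Vision-Language Navigation (VLN) task aims to give robots the ability to navigate unfamiliar environments by following natural language instructions. With rising labor costs, accelerating population aging, and a growing variety of diseases, VLN has attracted wide attention as a key foundational task in embodied intelligence. Data-driven implicit learning algorithms have made notable progress on VLN, but because of data scarcity and task complexity, they still face challenges in the generalization and controllability of navigation policies. This research aims to explicitly model the complex cross-modal reasoning mechanisms in VLN, reducing dependence on data volume and achieving interpretable, planning-aware robot navigation that generalizes better to unfamiliar environments.

For instruction-following navigation in complex environments, a robot first needs the ability to reason about complex entity relationships, establishing diverse semantic associations between objects, between objects and the environment, and between objects and the robot itself. Second, the robot should be capable of spatial cognitive reasoning: understanding and reasoning about indoor layouts and forming a memory and cognition of the environment's space that supports behaviors such as path planning. Finally, the robot should coordinate planning and control, establishing an efficient mapping between observations and executed actions to improve the controllability and learning efficiency of navigation. To address the shortcomings of existing methods, this research draws on theories from cognitive science, including modular reasoning, hierarchical cognitive maps, and subgoal reasoning, and carries out the following three pieces of work, each building on the previous one.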

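  • In VLN, the combinations of entities across instructions and environments are complex and diverse. Existing methods tend to learn cross-modal entity associations implicitly, which limits the robot's ability to generalize to new environments or new entity combinations. To address this, this work draws on the modular theory of human cognition, which holds that the brain processes different types of information in different functional modules and accomplishes complex cognitive tasks through their cooperation. Based on this theory, this work proposes a modular Neighbor-view Enhanced Model. The method categorizes the phrase types found in instructions, decouples a long instruction into several phrase representations, and sends each to a different semantic module. The module outputs are then passed to a fusion network that adaptively integrates their reasoning results into an overall navigation decision. With this modular approach, the robot understands and handles complex entity relationships more effectively and generalizes better to new environments.
  • The modular reasoning method above strengthens the robot's understanding of complex entity combinations, but the robot still falls short in spatial cognition, especially when instructions require reasoning about spatial relationships across views. To address this, this work draws on cognitive map theory, which identifies the brain's place cells and grid cells as the neuronal basis of cognitive maps; these cells interact hierarchically to support spatial understanding and path planning. Based on this theory, this work proposes a hybrid map-based pre-training algorithm. The hybrid map contains a local metric map and a global topological map: the metric map is analogous to place cells and handles local spatial reasoning, while the topological map is analogous to grid cells and handles global navigation planning. This work further designs a map-based pre-training scheme to learn a unified multimodal map representation. Through the internal cooperation of the hybrid map, the robot achieves an effective balance between long-range planning and short-term spatial reasoning.
  • The methods above effectively improve the robot's entity relationship modeling and spatial reasoning, but in practical navigation in continuous environments there is a significant semantic gap between instructions and low-level control actions, which makes the navigation policy hard to learn. To address this, this work draws on the subgoal reasoning mechanism, which decomposes a complex task into a series of manageable subgoals and completes the task by achieving them step by step, improving the flexibility and learning efficiency of task solving. Based on this theory, this work proposes an Evolving Topological Planning algorithm. The algorithm uses a topological map to abstract the high-dimensional continuous environment into a low-dimensional graph representation and explicitly decouples navigation into high-level path planning and low-level control execution. In the planning phase, the robot reasons about subgoals on the topological map in light of the instruction and plans a path through them; in the control phase, a flexible heuristic algorithm drives the robot to each subgoal. The method markedly improves performance on VLN in continuous environments and increases the controllability and learning efficiency of navigation.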
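The adaptive fusion step described in the first item can be illustrated with a minimal sketch. This is not the thesis's model: the module names (`object`, `direction`), the per-candidate scores, and the gate logits are hypothetical placeholders, and a plain softmax-weighted sum is assumed as the fusion rule in place of a learned fusion network.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_module_scores(module_scores, gate_logits):
    """Combine per-module candidate scores with softmax gate weights.

    module_scores: dict mapping a (hypothetical) module name to a list of
                   scores, one per candidate view.
    gate_logits:   dict mapping the same names to an unnormalized weight,
                   standing in for the output of a learned fusion network.
    """
    names = sorted(module_scores)
    weights = softmax([gate_logits[name] for name in names])
    fused = [0.0] * len(module_scores[names[0]])
    for weight, name in zip(weights, names):
        for i, score in enumerate(module_scores[name]):
            fused[i] += weight * score
    return fused


def choose_action(module_scores, gate_logits):
    """Return the index of the candidate view with the highest fused score."""
    fused = fuse_module_scores(module_scores, gate_logits)
    return max(range(len(fused)), key=fused.__getitem__)
```

With equal gate logits the modules contribute equally; raising one module's logit shifts the decision toward that module's preferred candidate, which is the behavior an adaptive fusion step is meant to provide.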

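This research conducts extensive experiments and analyses on multiple VLN datasets. The results demonstrate the effectiveness of the above methods in improving data efficiency, generalization, and the controllability of navigation behavior, and offer new ideas and methods for VLN research inspired by reasoning mechanisms.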

Other Abstract

Vision-Language Navigation (VLN) aims to enable robots to navigate autonomously in new environments by following natural language instructions. With rising labor costs, accelerating population aging, and a growing variety of diseases, VLN has drawn widespread attention as a foundational task in Embodied AI. Existing data-driven implicit learning algorithms have made significant progress in VLN. However, because of data scarcity and task complexity, these methods still face challenges in the generalization and controllability of navigation policies. This research aims to model the complex cross-modal reasoning mechanisms in VLN explicitly, reducing the dependence on data and achieving interpretable, controllable navigation behaviors that generalize better to new environments.

For instruction-following navigation in complex environments, robots first need to reason about complex entity relationships, establishing diverse semantic associations between objects, between objects and environments, and between objects and the robot's body. Second, robots should be able to understand and reason about indoor environment layouts, forming spatial memory and cognition that support behaviors such as path planning. Finally, robots should coordinate planning and control, establishing an efficient mapping between observations and action execution, which enhances the controllability and learning efficiency of navigation behaviors. To address the drawbacks of existing methods, this research draws on theories of modular reasoning, hierarchical cognitive maps, and subgoal reasoning from cognitive science, and carries out the following three pieces of work, each building on the previous one.

  • In VLN, the combinations of entities in instructions and environments are complex and diverse. Most existing methods learn cross-modal entity associations implicitly, which limits generalization to new environments or new entity combinations. To address this issue, this work draws on modular cognitive theory, which posits that the brain processes different types of information through different functional modules and completes complex cognitive tasks through the synergy of these modules. Based on this theory, this work proposes a modular Neighbor-view Enhanced Model. By categorizing the types of phrases in instructions, the method decouples long instructions into several phrase representations and sends them to different semantic modules for processing. The outputs of these modules are then sent to a fusion network that adaptively integrates their reasoning results into an overall navigation decision. Through this modular approach, robots can understand and process complex entity relationships more effectively, improving generalization in new environments.
  • The modular reasoning method above enhances the robot's ability to understand complex entity combinations, but robots still fall short in spatial cognition, especially when instructions involve reasoning about spatial relationships across views. To address this issue, this work draws on cognitive map theory, which posits that the brain's place cells and grid cells are the neuronal basis of cognitive maps and interact hierarchically to promote spatial understanding and path planning. Based on this theory, this work proposes a hybrid map-based pre-training algorithm. The hybrid map consists of a local metric map and a global topological map: the metric map is akin to place cells and is responsible for local spatial reasoning, while the topological map is akin to grid cells and is responsible for global navigation planning. A map-based pre-training scheme is further designed to learn a unified multimodal map representation. Through the internal synergy of the hybrid map, the robot achieves an effective balance between long-range planning and short-term spatial reasoning.
  • The above methods effectively enhance the robot's entity relationship modeling and spatial reasoning capabilities. However, in practical continuous environments there is a significant semantic gap between instructions and low-level control actions, which makes navigation policies hard to learn. To address this issue, this work draws on the subgoal reasoning mechanism, which decomposes a complex task into a series of manageable subgoals and completes the task by achieving them one by one, leading to flexible and efficient task solving. Based on this theory, this work proposes an Evolving Topological Planning algorithm. The algorithm builds a topological map to abstract the high-dimensional continuous environment into a low-dimensional graph representation, explicitly decoupling navigation into high-level planning and low-level control. In the planning phase, the algorithm conducts cross-modal reasoning on the map to plan a path of subgoals; in the control phase, a flexible heuristic controller drives the robot to each subgoal. This method significantly improves performance on the VLN in Continuous Environments task and increases the controllability and learning efficiency of navigation behaviors.
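The decoupling of high-level planning from low-level control in the third item can be sketched as follows. This is an illustrative sketch under stated assumptions, not the Evolving Topological Planning algorithm itself: the node scores are assumed to come from some cross-modal reasoning step that is not shown, the graph is fixed rather than growing during exploration, Dijkstra's algorithm stands in for the planner, and the function names, action names, 15-degree turn, and 0.25 m step are hypothetical choices.

```python
import heapq
import math


def shortest_path(graph, start, goal):
    """Dijkstra over a dict of the form node -> {neighbor: edge_length}."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == goal:
            break
        for neighbor, length in graph[node].items():
            candidate = d + length
            if candidate < dist.get(neighbor, math.inf):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if goal not in dist:
        return None  # goal unreachable from start
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


def plan_subgoals(graph, node_scores, current):
    """High level: pick the best-scoring node and return the subgoal path to it."""
    goal = max(node_scores, key=node_scores.get)
    return shortest_path(graph, current, goal)


def low_level_actions(position, heading_deg, waypoint, turn_deg=15, step_m=0.25):
    """Low level: turn toward the waypoint, then step forward until it is reached."""
    dx = waypoint[0] - position[0]
    dy = waypoint[1] - position[1]
    target = math.degrees(math.atan2(dy, dx))
    # Wrap the heading difference into [-180, 180) so the shorter turn is used.
    diff = (target - heading_deg + 180) % 360 - 180
    turns = round(abs(diff) / turn_deg)
    turn_name = "TURN_LEFT" if diff > 0 else "TURN_RIGHT"
    steps = round(math.hypot(dx, dy) / step_m)
    return [turn_name] * turns + ["FORWARD"] * steps
```

In this arrangement the planner only ever reasons over graph nodes, and the controller only ever sees the next waypoint, so either side can be replaced without retraining or rewriting the other.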

Through extensive experiments and analyses on multiple VLN datasets, this research demonstrates the effectiveness of the above methods in improving data efficiency, generalization, and the controllability of navigation behaviors. It provides new insights and techniques for VLN research inspired by reasoning mechanisms.

Keyword: Vision-Language Navigation; Modular Reasoning; Cognitive Map; Subgoal Navigation
Language: Chinese
Sub direction classification: Robot Perception and Decision-Making
Planning direction of the national key laboratory: Multimodal Collaborative Cognition
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/56503
Collection: Graduates_Doctoral Dissertations
Recommended Citation
GB/T 7714
安东. 推理机制启发的视觉语言导航[D],2024.
Files in This Item:
File Name/Size | DocType | Version | Access | License
学位论文-安东.pdf (10930KB) | Thesis | | Restricted access | CC BY-NC-SA
 
