面向救援场景的视触融合物体识别 (Visual-Tactile Fusion Object Recognition for Rescue Scenarios)
张源培
2024-05-13
Pages: 96
Degree Type: Master's
Chinese Abstract

In disaster relief, the complexity and variability of environmental conditions, together with the accompanying high risks, make it essential to carry out rescue operations quickly and efficiently. In this context, intelligent rescue systems offer significant advantages: they can not only replace humans in executing high-risk rescue tasks and reduce the risk of casualties, but also provide key information in real time, such as the location of trapped people and the extent of a fire, giving on-site rescue personnel precise decision support. Realizing an intelligent rescue system depends on excellent environmental perception. In structurally complex rescue scenes with frequent physical interaction, fusing vision and touch is an important route toward higher levels of autonomy and intelligence. By combining the global scene understanding of visual sensing with the fine-grained local manipulation characteristics of tactile sensing, visual-tactile fusion can provide a more comprehensive and robust state perception mechanism for rescue tasks.

A rescue system's task execution can generally be divided into two stages: detecting the position of the target object and identifying its type. This thesis focuses on the visual-tactile perception challenges in these two stages, covering the construction of visual-tactile datasets for rescue scenarios, visual perception of target object positions, tactile perception of object attributes, and visual-tactile recognition of object types, specifically including the following:

1. Constructing the first visual-tactile dataset for rescue scenarios. There is currently no dedicated visual-tactile dataset for rescue scenes, which greatly limits the development of visual-tactile joint perception for rescue. To address this, the thesis analyzes the difficulties an intelligent rescue system may face during task execution and, targeting these difficulties, builds a rescue-scene depth dataset and a rescue-scene visual-tactile joint recognition dataset. In addition, to train robust tactile attribute recognition algorithms and thereby design an efficient, robust, and generalizable tactile feature extraction backbone, the thesis also creates a material attribute recognition dataset.

2. Proposing a depth completion method based on a diffusion model and detecting the target object's position from the completed depth map. When performing a rescue task, the system must first obtain the target object's position and gradually approach it. To this end, the thesis uses a depth camera to capture foreground and background depth, refines the low-quality depth map with a depth completion algorithm, and finally performs object detection on the completed depth map. A latent-space diffusion module is introduced into an existing encoder-decoder framework, using the strong refinement capability of diffusion models to improve depth completion accuracy and, in turn, object localization. Experiments on multiple depth completion datasets show that the latent diffusion module significantly improves depth completion metrics.

3. Proposing an attention-based tactile feature extraction backbone built around the attribute recognition task. In visual-tactile joint recognition, touch is an indispensable submodule. To design a robust, real-time tactile feature extraction backbone, the thesis uses attribute recognition to design and evaluate the model. The model consists of a spatial encoding module and a temporal encoding module: the former captures long-range dependencies in tactile sequences with self-attention, while the latter explicitly encodes their sequential order with a long short-term memory network or one-dimensional convolution. The proposed method achieves competitive results on public attribute recognition datasets.

4. Proposing several visual-tactile fusion schemes for object type recognition and building a visual-tactile joint perception system. In complex rescue scenes, visual-tactile fusion helps improve the accuracy of the rescue system's perception. The thesis proposes two fusion schemes, feature-level fusion and decision-level fusion: feature-level fusion includes feature concatenation and feature addition, while decision-level fusion explores different weights for the independent visual and tactile decisions in the final decision. Both multimodal fusion approaches outperform single-modality recognition. The final visual-tactile joint perception system outputs, in real time, the type and attributes of a target object from the visual images and tactile sequences read from the sensors.

English Abstract

In the field of disaster relief, the complexity and variability of environmental conditions, coupled with the associated high risks, make it crucial to carry out rescue operations rapidly and efficiently. In this context, intelligent rescue systems offer significant advantages: they can not only replace humans in executing high-risk rescue tasks, reducing the risk of casualties, but also provide critical information in real time, such as the location of trapped individuals and the extent of a fire, enabling on-site rescue personnel to make precise decisions. The realization of intelligent rescue systems depends heavily on outstanding environmental perception. In structurally complex rescue scenarios with frequent physical interaction, the fusion of visual and tactile sensing is an important approach to advancing rescue systems towards higher levels of autonomy and intelligence. By combining the global scene understanding of visual sensing with the fine-grained manipulation characteristics of tactile sensing, visual-tactile fusion technology can provide a more comprehensive and robust state perception mechanism for rescue tasks.

Typically, a rescue system's task execution can be roughly divided into two stages: detecting the position of the target object and identifying its type. This thesis focuses on the visual-tactile perception challenges in these two stages, covering the construction of visual-tactile datasets for rescue scenarios, visual detection of target object positions, tactile perception of object attributes, and visual-tactile recognition of object types. Specifically, it includes the following:

1. Constructing the first visual-tactile dataset for rescue scenarios. Currently, there is no dedicated visual-tactile dataset for rescue scenes, which greatly limits the development of visual-tactile joint perception technologies in rescue scenarios. To address this, this thesis analyzes the difficulties that intelligent rescue systems may face during task execution and, targeting these difficulties, constructs the Rescue Depth Dataset and the Rescue Visual-Tactile Fusion Dataset. Additionally, to train robust tactile attribute recognition algorithms and design an efficient, robust, and highly generalizable tactile feature extraction backbone network, this thesis also creates the Haptic Adjective Recognition Dataset.

2. Proposing a depth completion method based on a diffusion model and detecting the target object's position from the completed depth map. When performing rescue tasks, the first step is to obtain the position of the target object and gradually approach it. To this end, this thesis uses a depth camera to capture the depth of the foreground and background, refines the low-quality depth map with a depth completion algorithm, and finally performs object detection on the completed depth map. A latent-space diffusion module is introduced into the existing encoder-decoder depth completion architecture, as sketched after this list, leveraging the strong refinement capability of diffusion models to improve depth completion accuracy and thereby the effectiveness of target object localization. Experimental results on multiple depth completion datasets show that the latent diffusion module significantly improves depth completion metrics.

3. Introducing an attention-based tactile feature extraction backbone network built around the attribute recognition task. In visual-tactile joint recognition, tactile sensing is an indispensable submodule. To design a tactile feature extraction backbone with strong robustness and real-time performance, this thesis uses attribute recognition tasks to design and validate the model. The model consists of a spatial encoding module and a temporal encoding module, as sketched after this list: the former captures long-range dependencies in tactile sequences using self-attention, while the latter explicitly encodes their sequential order using a long short-term memory network or one-dimensional convolution. The proposed method achieves competitive performance on publicly available adjective recognition datasets.

4. Proposing several visual-tactile fusion schemes for object type recognition and constructing a visual-tactile joint perception system. In complex rescue scenarios, visual-tactile fusion helps improve the accuracy of the rescue system's perception. This thesis proposes two fusion schemes, feature-level fusion and decision-level fusion, as sketched after this list: feature-level fusion includes feature concatenation and feature addition, while decision-level fusion explores different weights for the independent visual and tactile decisions in the final decision. Both multimodal fusion approaches outperform single-modality recognition. The constructed visual-tactile joint perception system outputs, in real time, the type and attributes of a target object based on the visual images and tactile sequences read from the sensors.
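
To make the latent-diffusion refinement idea from item 2 concrete, here is a minimal PyTorch sketch of how a small denoising module could sit between a depth-completion encoder and decoder. The module names, channel sizes, input layout (RGB plus a sparse depth channel), and the simplified iterative denoising loop are illustrative assumptions, not the architecture implemented in the thesis.

```python
# Hedged sketch of latent-space refinement inside an encoder-decoder
# depth completion network. All names and shapes are illustrative.
import torch
import torch.nn as nn

class LatentDiffusionRefiner(nn.Module):
    """Refines the encoder's latent feature with a few denoising steps."""
    def __init__(self, channels: int = 256, steps: int = 4):
        super().__init__()
        self.steps = steps
        # A U-Net would normally predict the noise residual; a small conv
        # block stands in for it here.
        self.denoiser = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # Start from a noised latent and iteratively remove predicted noise.
        z_t = z + torch.randn_like(z)
        for _ in range(self.steps):
            z_t = z_t - self.denoiser(z_t)
        return z_t

class DepthCompletionNet(nn.Module):
    """Encoder-decoder completion network with the latent refiner in between."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(4, 256, 3, stride=2, padding=1)  # RGB + sparse depth
        self.refiner = LatentDiffusionRefiner(256)
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 1, 3, padding=1),                       # dense depth map
        )

    def forward(self, rgb: torch.Tensor, sparse_depth: torch.Tensor) -> torch.Tensor:
        z = self.encoder(torch.cat([rgb, sparse_depth], dim=1))
        return self.decoder(self.refiner(z))
```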
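The spatial-plus-temporal split described in item 3 could look roughly like the following sketch, again assuming PyTorch; the input dimensionality, layer sizes, and the class name TactileBackbone are hypothetical placeholders rather than the thesis's actual design.

```python
# Hedged sketch: self-attention over the tactile sequence, then an LSTM
# (or 1-D convolution) that explicitly encodes temporal order.
import torch
import torch.nn as nn

class TactileBackbone(nn.Module):
    def __init__(self, input_dim: int = 19, model_dim: int = 128,
                 temporal: str = "lstm"):
        super().__init__()
        self.proj = nn.Linear(input_dim, model_dim)
        # Spatial encoding: self-attention captures long-range dependencies
        # across the whole tactile sequence.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=4, batch_first=True)
        self.attention = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Temporal encoding: explicit ordering via LSTM or 1-D convolution.
        if temporal == "lstm":
            self.temporal = nn.LSTM(model_dim, model_dim, batch_first=True)
        else:
            self.temporal = nn.Conv1d(model_dim, model_dim, kernel_size=5, padding=2)
        self.temporal_kind = temporal

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim) raw tactile readings
        h = self.attention(self.proj(x))
        if self.temporal_kind == "lstm":
            h, _ = self.temporal(h)
        else:
            h = self.temporal(h.transpose(1, 2)).transpose(1, 2)
        return h.mean(dim=1)  # pooled feature for attribute classification
```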
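Item 4's three fusion variants (feature concatenation, feature addition, and a weighted decision-level combination) are illustrated by the sketch below, under assumed feature dimensions, class count, and fusion weight alpha; it is not the system built in the thesis.

```python
# Hedged sketch of feature-level and decision-level visual-tactile fusion.
import torch
import torch.nn as nn

class VisualTactileFusion(nn.Module):
    def __init__(self, feat_dim: int = 128, num_classes: int = 10,
                 mode: str = "concat", alpha: float = 0.5):
        super().__init__()
        self.mode, self.alpha = mode, alpha
        in_dim = 2 * feat_dim if mode == "concat" else feat_dim
        self.fused_head = nn.Linear(in_dim, num_classes)
        # Separate heads for decision-level fusion.
        self.visual_head = nn.Linear(feat_dim, num_classes)
        self.tactile_head = nn.Linear(feat_dim, num_classes)

    def forward(self, v_feat: torch.Tensor, t_feat: torch.Tensor) -> torch.Tensor:
        if self.mode == "concat":   # feature-level: concatenation
            return self.fused_head(torch.cat([v_feat, t_feat], dim=-1))
        if self.mode == "add":      # feature-level: element-wise addition
            return self.fused_head(v_feat + t_feat)
        # decision-level: weighted sum of per-modality class probabilities
        p_v = self.visual_head(v_feat).softmax(dim=-1)
        p_t = self.tactile_head(t_feat).softmax(dim=-1)
        return self.alpha * p_v + (1.0 - self.alpha) * p_t
```

In the decision-level branch, alpha expresses how much trust is placed in the visual prediction relative to the tactile one; the thesis reports exploring different weightings, and alpha here is simply an assumed parameter for that trade-off.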

Keywords: Visual-Tactile Fusion, Rescue Scenario, Depth Completion, Tactile Perception
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/56510
Collection: Graduates - Master's Theses
Recommended Citation (GB/T 7714):
张源培. 面向救援场景的视触融合物体识别[D], 2024.
Files in This Item:
File Name/Size: 张源培__面向救援场景的视触融合物体识别 (21948 KB) | Document Type: Thesis | Access: Restricted | License: CC BY-NC-SA

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.