基于事件相机的视觉感知技术研究 (Research on Visual Perception Technology Based on Event Cameras)
Author: 谢楚云
Date: 2024-05
Pages: 82
Degree Type: Master

Abstract

    The emergence of autonomous driving systems has opened up a new mode of transportation. In recent years, with the rapid development of sensors and deep learning algorithms, the accuracy and reliability of autonomous driving technology have improved significantly, making it feasible to deploy autonomous driving systems on urban roads, on highways, and in complex environments. Visual perception tasks such as semantic segmentation, object detection, and depth estimation form a fundamental module of autonomous driving systems and are critical for vehicles to correctly understand and respond to their surroundings.

    Currently, most autonomous driving systems use traditional RGB cameras as visual sensors. Although traditional cameras provide rich color and texture information about the environment, their inherent characteristics lead to motion blur or loss of visual information in high-speed motion or extreme lighting scenarios (e.g., a pedestrian suddenly appearing on the road, or overexposure as a vehicle exits a tunnel), which in turn degrades the performance of perception tasks. In recent years, event cameras, a new type of sensor, have attracted widespread attention because their working principle differs from that of traditional cameras and gives them unique advantages. Whereas traditional cameras can only output frames synchronously at a fixed rate, event cameras detect changes in light intensity at each pixel in real time, responding rapidly and precisely to dynamic changes in the scene with microsecond temporal resolution. In addition, the sensor offers a high dynamic range, low power consumption, and low transmission bandwidth. In perception scenarios, event cameras can therefore detect and track moving objects within the field of view in real time and remain robust to extreme lighting changes, further improving the reliability of autonomous driving systems in complex scenes.
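    The pixel-level behaviour described above follows the standard event generation model: a pixel fires an event with polarity p in {+1, -1} whenever the change in log intensity since its last event exceeds a contrast threshold. As a minimal, self-contained illustration (not taken from the thesis), the sketch below accumulates a list of (t, x, y, p) events into a two-channel polarity histogram, one common dense representation that frame-based networks can consume; the function name and tensor layout are assumptions made here for clarity.

```python
import numpy as np

def events_to_histogram(events, height, width):
    """Accumulate raw events (t, x, y, p) into a 2-channel count image.

    events: iterable of (t, x, y, p) tuples with polarity p in {-1, +1}.
    Returns an array of shape (2, height, width): channel 0 counts negative
    (brightness-decrease) events, channel 1 counts positive events.
    """
    hist = np.zeros((2, height, width), dtype=np.float32)
    for _, x, y, p in events:
        channel = 1 if p > 0 else 0
        hist[channel, int(y), int(x)] += 1.0
    return hist

# Example: three synthetic events on a 4x4 sensor.
frame = events_to_histogram([(0.001, 1, 2, +1), (0.002, 1, 2, +1), (0.003, 3, 0, -1)], 4, 4)
```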

    Because traditional images and event data differ significantly in form, image-based perception algorithms are difficult to apply directly to events. Research on event-based perception algorithms is still at an early stage and faces several challenges: event data are sparse and lack detailed scene information, event data collection and annotation are costly, and data noise degrades algorithm performance. To address these issues, this thesis studies visual perception techniques based on event cameras, with the following main contributions:

     1. To address the lack of scene detail information in event data, this thesis proposes a moving object detection method that fuses events and images through attention-based orthogonal fusion. Built on a multi-level feature encoder, the method introduces an attention-based orthogonal fusion module that adaptively learns the features in the event data that are complementary to the image information and integrates them into the image features. The fused multimodal features are then fed into a decoder network to produce moving object predictions, and the network is trained by computing a loss on these predictions. Experimental results demonstrate that exploiting the complementary strengths of traditional images and events compensates for and reinforces both dynamic and static visual information in the scene, significantly improving detection performance for moving objects while preserving generalization across different lighting conditions.
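    A minimal PyTorch sketch of one plausible reading of the attention-based orthogonal fusion idea, assuming image and event feature maps of identical shape: the event features are split into a component parallel to the image features and an orthogonal residual (the complementary part), and a channel-attention gate decides how much of that residual is injected back into the image features. The module name, the per-channel projection, and the gating scheme are illustrative assumptions, not the thesis's exact design.

```python
import torch
import torch.nn as nn

class OrthogonalAttentionFusion(nn.Module):
    """Hypothetical fusion block: keep the part of the event features that is
    orthogonal to (complementary with) the image features, gate it with
    channel attention, and add it to the image features."""

    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_img, f_evt):
        b, c, h, w = f_img.shape
        img = f_img.flatten(2)                                   # (B, C, H*W)
        evt = f_evt.flatten(2)
        # Per-channel projection of event features onto image features ...
        scale = (evt * img).sum(-1, keepdim=True) / (img.pow(2).sum(-1, keepdim=True) + 1e-6)
        # ... and the orthogonal residual carries the complementary information.
        evt_orth = (evt - scale * img).view(b, c, h, w)
        return f_img + self.gate(evt_orth) * evt_orth
```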

    2. To address the difficulty of annotating event data, this thesis proposes an event self-labeling method based on bidirectional domain conversion. The method automatically generates self-supervised signals for network training when per-pixel paired images are missing or corrupted, thereby removing the dependence of existing event labeling methods on per-pixel paired images. It consists of two stages, Source-Path Initialization and Target-Path Labeling, which involve bidirectional conversion between the image and event domains; a subsequent weighted-inference step lets the predictions of the two conversion paths compensate for each other, improving the quality of the generated "pseudo-labels". Experimental results show that bidirectional domain conversion exploits the data characteristics and information of both domains and effectively improves the accuracy and reliability of event self-labeling.
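    The weighted-inference step can be pictured with the rough, hypothetical helper below, which assumes both conversion paths end in a per-pixel classifier: the two paths' softmax outputs are blended and only confident pixels are kept as pseudo-labels. The blending weight, confidence threshold, and function name are assumptions for illustration, not the thesis's actual procedure.

```python
import torch

def fuse_pseudo_labels(prob_path_a, prob_path_b, alpha=0.5, conf_thresh=0.8, ignore_index=255):
    """Blend per-pixel class probabilities from the two conversion paths and
    keep only confident pixels as pseudo-labels.

    prob_path_a, prob_path_b: (B, num_classes, H, W) softmax outputs.
    Returns: (B, H, W) long tensor of pseudo-labels, with unreliable pixels
    set to ignore_index so they do not contribute to the training loss.
    """
    fused = alpha * prob_path_a + (1.0 - alpha) * prob_path_b
    conf, label = fused.max(dim=1)
    label[conf < conf_thresh] = ignore_index
    return label
```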

    3. To address the sparsity of event data and the noise mixed into it, this thesis introduces a semantic segmentation method based on attention-driven soft alignment and cross-modal learning. The method designs a three-branch, weight-sharing cross-modal learning network built on the attention mechanism, in which a cross-attention module softly aligns features from the image and event domains. To construct the network's input, a pseudo-pair construction method that matches events with images is designed, removing the previous reliance on per-pixel paired image frames. Furthermore, knowledge distillation is introduced into the three-branch network: fine-grained knowledge from the image domain is transferred to the event domain through similarity weighting and serves as soft labels that guide learning in the event domain. Experimental results indicate that soft alignment suppresses the negative impact of noise on model performance and that cross-modal learning effectively expands the learnable space for events, significantly improving both accuracy and robustness on event-based semantic segmentation.
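    The two ingredients named above can be sketched compactly, under the assumption that feature maps have already been flattened into token sequences: a cross-attention block in which event tokens query image tokens (soft alignment rather than strict per-pixel correspondence), and a temperature-scaled distillation loss that passes the image branch's predictions to the event branch as soft labels. Class and function names are illustrative, and the similarity weighting used in the thesis is simplified away here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionSoftAlign(nn.Module):
    """Hypothetical soft-alignment block: each event token attends over all
    image tokens and aggregates the features it is most similar to."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, evt_tokens, img_tokens):
        # evt_tokens, img_tokens: (B, N, dim) flattened feature maps.
        aligned, _ = self.attn(query=evt_tokens, key=img_tokens, value=img_tokens)
        return self.norm(evt_tokens + aligned)

def soft_label_distillation(student_logits, teacher_logits, temperature=2.0):
    """KL-based distillation from the image branch (teacher) to the event branch."""
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```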

Keywords: Event Camera, Visual Perception, Unsupervised Domain Adaptation, Multimodal Fusion, Cross-modal Learning
Subject Area: Computer Science and Technology
Discipline: Engineering
Indexing Category: Other
Language: Chinese
Sub-direction Classification (Seven Research Directions): Multimodal Intelligence
State Key Laboratory Planned Research Direction: Multimodal Collaborative Cognition
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/56673
Collection: Master's Theses (Graduates)
Recommended Citation (GB/T 7714):
谢楚云. 基于事件相机的视觉感知技术研究[D], 2024.
Files in This Item:
File Name/Size: 硕士论文-谢楚云-5.21.pdf (10360 KB)
Document Type: Thesis
Access: Restricted
License: CC BY-NC-SA