Research on Visual Object Detection for Multiple Semantics and Modalities
Yang Li (杨力)
2023-05-24
Pages: 138
Degree Type: Doctoral
Abstract

Object detection is one of the most fundamental and important research areas in computer vision; its main task is to recognize and locate objects of interest in input images. Object detection algorithms serve as a core method and foundation for many computer vision tasks and are widely applied in fields such as autonomous driving, security monitoring, face recognition, and medical imaging, and have therefore attracted extensive research attention. Recent advances in deep learning have led to significant progress in object detection, and its application scenarios and functional boundaries continue to expand. In real-world scenarios, an object can often be represented by multiple semantics and modalities: it can be defined and described by multi-level semantic information such as category, attributes, and environmental context, and it can be presented in data forms of different dimensions and granularities such as text, images, and 3D point clouds. Studying object detection under multi-semantic and multi-modal representations broadens the semantic space of object perception and, at the same time, adapts detection to different data scenarios, enriching its application methods and scope; it is therefore of significant research value. This thesis accordingly focuses on object detection with multiple semantics and modalities, conducting method research along three lines: “decoupled optimization of classification and localization in object detection”, “understanding and perception of objects’ open semantic information”, and “explicit learning of objects’ 3D contextual information”. The main contributions of this thesis are summarized as follows:

(1) We propose a one-stage object detection method based on prediction decoupling.

We first analyze conventional one-stage detection methods and find that the most suitable locations for classifying and localizing an object are generally different; predicting both from the same grid location therefore tends to yield suboptimal results. To address this problem, we propose a novel object detection method based on a prediction decoupling mechanism. Our method decomposes the prediction targets into the object category and the four sides of the bounding box, which the network encodes at different locations. To obtain the final detection results, we devise a learnable prediction collection module that flexibly collects and aggregates classification and localization predictions from different locations, thereby decoupling the inference process for the different targets (i.e., object categories and boundaries). Moreover, we propose a two-step generation strategy that learns two sets of dynamic points, i.e., dynamic boundary points and dynamic semantic points, to model the positions that best perceive object boundaries or semantic information; these dynamic points guide the collection and aggregation of better classification and localization predictions. The proposed prediction decoupling mechanism incurs negligible computational overhead and significantly improves detection performance while maintaining high inference efficiency.
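To make the mechanism concrete, the following is a minimal PyTorch sketch of the prediction-decoupling idea: each grid cell generates two separate sets of dynamic points, and classification and localization features are collected from them independently. All names, dimensions, and the averaging-based aggregation (DecoupledHead, num_points, etc.) are illustrative assumptions, not the thesis's actual implementation.

```python
# Illustrative sketch only: dynamic points for decoupled classification /
# localization collection; not the thesis's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledHead(nn.Module):
    def __init__(self, in_ch=256, num_classes=80, num_points=4):
        super().__init__()
        self.num_points = num_points
        # Two-step generation: predict offsets for dynamic semantic points
        # (classification) and dynamic boundary points (localization).
        self.sem_offsets = nn.Conv2d(in_ch, 2 * num_points, 3, padding=1)
        self.bnd_offsets = nn.Conv2d(in_ch, 2 * num_points, 3, padding=1)
        self.cls_head = nn.Conv2d(in_ch, num_classes, 1)
        self.reg_head = nn.Conv2d(in_ch, 4, 1)  # distances to four box sides

    def collect(self, feat, offsets):
        # Sample features at each grid cell's dynamic points, then average.
        b, _, h, w = feat.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=feat.device),
            torch.linspace(-1, 1, w, device=feat.device), indexing="ij")
        base = torch.stack((xs, ys), dim=-1)              # (h, w, 2)
        off = offsets.permute(0, 2, 3, 1)                 # (b, h, w, 2P)
        off = off.reshape(b, h, w, self.num_points, 2)
        grid = (base[None, :, :, None] + off).reshape(b, h, w * self.num_points, 2)
        sampled = F.grid_sample(feat, grid, align_corners=False)
        return sampled.reshape(b, -1, h, w, self.num_points).mean(-1)

    def forward(self, feat):
        cls_feat = self.collect(feat, self.sem_offsets(feat))  # semantic points
        reg_feat = self.collect(feat, self.bnd_offsets(feat))  # boundary points
        return self.cls_head(cls_feat), self.reg_head(reg_feat)
```

Because the extra cost is only two light offset convolutions and a sampling step, a design of this kind keeps the overhead of decoupling small, which is consistent with the efficiency claim above.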

(2) We propose a visual grounding method based on visual-linguistic verification and iterative reasoning.

For the understanding and perception of objects’ open semantic information, we present a novel visual grounding framework based on visual-linguistic verification and iterative reasoning. The framework comprehensively models the semantic correlation between vision and text during feature modeling and reasoning in order to detect the object referred to by the text. Specifically, we devise a visual-linguistic verification module that computes the fine-grained correlation between visual and textual features, so that the visual features focus on regions related to the semantics of the textual description while distractions from irrelevant objects or regions are suppressed. Moreover, we introduce a language-guided context encoder that gathers context features for the referred object under the guidance of the textual description, improving the object’s distinctiveness. Finally, we develop a multi-stage cross-modal decoder that iteratively queries and reasons over the textual and visual features to accurately recognize and localize the target object. By fully exploiting the semantic correlation of the two modalities during feature modeling and target inference, our method significantly improves visual grounding performance and achieves state-of-the-art results on multiple public datasets.
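The verification step can be illustrated with a minimal sketch: an attention-style correlation between every visual location and every word yields a per-location relevance score that reweights the visual features. The module name, projections, and max-over-words scoring are assumptions for illustration; the thesis's actual design may differ.

```python
# Illustrative sketch only: fine-grained visual-text verification that
# suppresses regions unrelated to the textual description.
import torch
import torch.nn as nn

class VisualLinguisticVerification(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.v_proj = nn.Linear(dim, dim)
        self.t_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, Nv, C) flattened image features
        # txt_tokens: (B, Nt, C) word-level text features
        v = self.v_proj(vis_tokens)
        t = self.t_proj(txt_tokens)
        # Fine-grained correlation between every visual location and word.
        corr = torch.einsum("bvc,btc->bvt", v, t) * self.scale
        # A location's relevance = its best-matching word's score.
        relevance = corr.max(dim=-1).values.sigmoid()     # (B, Nv)
        # Reweight visual features toward text-related regions.
        return vis_tokens * relevance.unsqueeze(-1)
```

In such a design the reweighted visual tokens would then feed the context encoder and the multi-stage decoder, which repeats query-and-reason steps over both modalities until the referred object is localized.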

(3) We propose a 3D visual grounding method based on context object and relation learning.

For open-semantic object detection in 3D scenes, we propose a 3D visual grounding framework based on context object and relation learning. Our framework aligns the object and context information between vision and text to ensure accurate inference of the target object. To this end, we first devise a text-conditioned object detection network and propose a pseudo-label self-generation and learning strategy that enables the network to learn to detect both target and context objects from textual information. From the detected objects we then construct a variety of spatial relation features and feed them into a context relation matching network, which uses weak supervision to match and learn the context relation features related to the text semantics. Finally, we develop a context-based target inference network that augments the detected objects with context relation features and aligns them, at a fine-grained level, with the textual descriptions of the target and its context, so as to accurately infer the referred object. Extensive experiments validate the effectiveness of our context-learning method and demonstrate leading performance on multiple public datasets.
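As a concrete illustration of the relation-learning stage, the sketch below builds pairwise spatial-relation features from detected 3D boxes and scores them against a sentence embedding. The chosen relation set (center offsets, distance, log size ratio) and all module names are assumptions, not the thesis's actual feature design.

```python
# Illustrative sketch only: pairwise spatial relations between detected 3D
# objects, matched against text with only sentence-level (weak) supervision.
import torch
import torch.nn as nn

def pairwise_spatial_relations(centers, sizes):
    # centers: (N, 3) box centers; sizes: (N, 3) box extents
    offset = centers[:, None, :] - centers[None, :, :]        # (N, N, 3)
    dist = offset.norm(dim=-1, keepdim=True)                  # (N, N, 1)
    vol = sizes.prod(dim=-1)                                  # (N,)
    size_ratio = (vol[:, None] / vol[None, :].clamp(min=1e-6)).unsqueeze(-1).log()
    return torch.cat([offset, dist, size_ratio], dim=-1)      # (N, N, 5)

class ContextRelationMatcher(nn.Module):
    """Scores each object pair's relation against a sentence embedding;
    no pair-level labels are needed, only the sentence-level signal."""
    def __init__(self, rel_dim=5, dim=256):
        super().__init__()
        self.rel_enc = nn.Sequential(nn.Linear(rel_dim, dim), nn.ReLU(),
                                     nn.Linear(dim, dim))
        self.txt_proj = nn.Linear(dim, dim)

    def forward(self, rel_feats, sent_embed):
        # rel_feats: (N, N, rel_dim); sent_embed: (dim,) sentence embedding
        r = self.rel_enc(rel_feats)                           # (N, N, dim)
        t = self.txt_proj(sent_embed)                         # (dim,)
        return torch.einsum("ijc,c->ij", r, t)                # pairwise match scores
```

A target inference network of the kind described above would then attach the highest-scoring relation features to each candidate object before the fine-grained alignment with the text.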

Keywords: Object Detection, Visual Grounding, Multiple Semantics, Multiple Modalities, Vision-Language
Language: Chinese
Sub-direction Classification (Seven Major Directions): Object Detection, Tracking and Recognition
State Key Laboratory Planned Direction: Multi-modal Collaborative Cognition
Associated Dataset Requiring Deposit:
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/52122
Collection: Graduates / Doctoral Theses
Recommended Citation (GB/T 7714):
杨力. 面向多语义和多模态的视觉目标检测研究[D], 2023.
Files in This Item:
File Name/Size: 杨力_面向多语义和多模态的视觉目标检测研(19168KB) | Document Type: Thesis | Access: Restricted | License: CC BY-NC-SA