Cross-Modal Data-Guided Visual Scene Segmentation (跨模态数据引导的视觉场景分割)
甘睿彤
2023-05-20
Pages: 98
Degree type: Master's
Abstract (Chinese)

Visual scene segmentation, a perception task within computer vision, is of considerable research importance. It requires pixel-level parsing and classification of the objects or semantics contained in an image, so that a machine can grasp the category, shape, boundary, and position of the targets in the image; it has significant research and application value in fields such as autonomous driving and remote sensing image analysis. In real scenes, multimodal information beyond the visual image is often available, such as sound, distance, natural-language text, and human-computer interaction data. By exploiting such cross-modal data, researchers can combine and cross-validate information from multiple modalities, training models to accept inputs beyond the visual modality and to understand the associations between modalities, thereby achieving an understanding of complex scenes that no single modality can convey. In recent years, researchers have therefore explored the use of cross-modal data in visual scene segmentation, supplying modal information beyond the image as an additional input branch during training and testing. Compared with traditional single-modal scene segmentation, research on cross-modal data-guided visual scene segmentation focuses mainly on how each modality is represented and on the data communication and feature fusion between modalities. This thesis investigates the application of two different kinds of cross-modal information in visual scene segmentation, explores how to obtain more complete cross-modal representations and how to fuse information across modalities, and attempts to address the generalization, application, and deployment problems of cross-modal models in open real-world scenes. The main contributions are:

1. Visual scene segmentation based on cross-modal point interaction information: user-provided point (click) coordinates serve as human-computer interaction data and provide prior guidance for target segmentation, yielding a high-precision instance segmentation model. Building on this task, and after analyzing the limitations of existing methods, a lifelong learning framework is proposed that automatically collects the point interactions supplied by users. By mining the implicit features contained in these interactions, the model evolves iteratively while removing its dependence on pixel-level fine-grained labels, offering a feasible scheme for deploying the model in real open environments.

2. Visual scene segmentation guided by cross-modal text descriptions: textual descriptions serve as the cross-modal data that guide and complete the scene target segmentation task, so that the model understands and aligns features from the two modalities of natural language and scene images. Through a framework that uses image-feature context to assist localization and refine the text features, and by exploiting the mutual constraints among multiple text descriptions of the same and of different targets in the dataset, the key problem of target mismatches caused by low-quality cross-modal text guidance is resolved, achieving performance at the forefront of the field.

In summary, for cross-modal data-guided visual scene segmentation, this thesis explores the influence and practical application of two different kinds of cross-modal data in visual segmentation tasks. The first kind is user point interaction information, in which the user intervenes at the image level to provide foreground/background click guidance; this modality mainly carries the user-judged foreground/background positions of the target object and the perceived extent of that object in the image, and can be fused with visual features in a relatively direct way. The second kind is text description information, which is closer to the cross-modal information found in the real world; it guides the model to localize targets in the visual modality through descriptions of the target's position, appearance, behavior, and other details. Because its feature representation differs greatly from that of the visual modality, it is more challenging for the model to understand and closer to real-world applications. Experiments show that, compared with contemporaneous work, the methods proposed in this thesis for both kinds of cross-modal guidance achieve clearly better performance, reaching leading results on the benchmark datasets in the field; they remedy shortcomings of existing work and attempt to address problems encountered when deploying models in open real-world scenes, representing a practical step toward putting this research into use, with both academic novelty and practical application value.
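The following is a minimal, hypothetical PyTorch sketch of one common way point interactions are fed to a segmentation network: foreground and background clicks are rasterized into extra input channels and concatenated with the image. It illustrates the general idea only; the click encoding, network architecture, and lifelong-learning machinery described in the thesis are not reproduced here, and all names and dimensions are illustrative assumptions.

```python
# Sketch only: click-guided segmentation via extra input channels (not the thesis's model).
import torch
import torch.nn as nn


def clicks_to_maps(clicks, height, width):
    """Rasterize (y, x, is_foreground) click triples into two binary maps."""
    fg = torch.zeros(1, height, width)
    bg = torch.zeros(1, height, width)
    for y, x, is_fg in clicks:
        (fg if is_fg else bg)[0, y, x] = 1.0
    return torch.cat([fg, bg], dim=0)  # shape: (2, H, W)


class ClickGuidedSegmenter(nn.Module):
    """Toy encoder that consumes RGB plus two click channels and predicts a mask."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3 + 2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(32, 1, 1)  # per-pixel foreground logit

    def forward(self, image, click_maps):
        x = torch.cat([image, click_maps], dim=1)
        return self.head(self.encoder(x))


# Usage: one 256x256 image with one foreground click and one background click.
image = torch.rand(1, 3, 256, 256)
clicks = [(120, 130, True), (20, 30, False)]
click_maps = clicks_to_maps(clicks, 256, 256).unsqueeze(0)
mask_logits = ClickGuidedSegmenter()(image, click_maps)
print(mask_logits.shape)  # torch.Size([1, 1, 256, 256])
```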

Abstract (English)

The visual scene segmentation task, a branch of perception tasks in computer vision, is of great research importance. It requires pixel-level analysis and classification of the objects or semantics contained in an image, so that a machine can grasp the category, shape, boundary, position, and other properties of the objects in the image; it has significant research and application value in related fields such as autonomous driving and remote sensing image analysis. In real scenes, multimodal information beyond the visual image is often available, such as sound, distance, natural-language text, and human-computer interaction data. Using such cross-modal data, researchers can integrate and cross-validate information from multiple modalities, training models to accept inputs beyond the visual modality and to understand the associations between modalities, and thereby to understand complex scenes in ways that a single modality cannot convey. In recent years, researchers have explored the application of cross-modal data to visual scene segmentation, providing modal information beyond the image as an additional input branch during model training and testing. Compared with traditional single-modal scene segmentation, research on cross-modal data-guided visual scene segmentation focuses mainly on how each modality is represented and on the data communication and feature fusion between modalities. This thesis explores the application of two different kinds of cross-modal information to visual scene segmentation, investigates how to obtain more complete cross-modal representations and how to fuse information across modalities, and attempts to solve the generalization, application, and deployment problems of cross-modal models in open real-world scenes. The main contributions are:

1. Visual scene segmentation based on cross-modal point interaction information: the point coordinates provided by the user serve as human-computer interaction data and give prior guidance for target segmentation, yielding a high-precision instance segmentation model. On this basis, after analyzing the limitations of existing methods, a lifelong learning framework is proposed that automatically collects the point interactions provided by users. By mining the implicit features contained in these interactions, the model evolves iteratively while removing its dependence on pixel-level fine-grained labels, providing a feasible scheme for deploying the model in a real open environment.

2. Visual scene segmentation guided by cross-modal text descriptions: textual descriptions serve as the cross-modal data that guide and complete scene object segmentation, so that the model understands and aligns the features of the natural-language and scene-image modalities. Through a framework that uses image-feature context to assist localization and refine the text features, and by exploiting the mutual constraints among multiple text descriptions of the same and of different targets in the dataset, the key problem of target mismatches caused by low-quality cross-modal text guidance is solved, achieving performance at the forefront of the field.

Overall, this thesis explores the influence and practical application of two different kinds of cross-modal data in visual scene segmentation. The first is user point interaction information: the user interacts at the image level to provide foreground/background click guidance. This modality mainly carries the user-judged foreground/background positions of the target object and the perceived extent of the object in the image, and can be fused with visual features fairly directly. The second is text description information, which is closer to the cross-modal information encountered in the real world; it guides the model to localize targets in the visual modality through descriptions of the target object's position, appearance, behavior, and other details. Because its feature representation differs greatly from that of the visual modality, it is more challenging for the model to understand and closer to real-world applications. Experiments show that, compared with contemporaneous work, the proposed methods for both kinds of cross-modal guidance achieve clear performance improvements and reach leading results on the benchmark datasets in the field; they remedy the shortcomings of existing work, attempt to solve certain problems in deploying models in open real-world scenes, and make a down-to-earth effort at putting the research into practice, showing good academic novelty and practical application value.
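As a companion illustration for the text-guided setting, the sketch below shows, again as a hypothetical PyTorch example only, how flattened image features can attend to word-level language features through cross-attention before a per-location mask head. The module name, dimensions, and the use of a single MultiheadAttention layer are assumptions made for illustration and do not reproduce the thesis's actual framework.

```python
# Sketch only: text-to-image cross-attention for referring segmentation (illustrative, not the thesis's model).
import torch
import torch.nn as nn


class TextGuidedSegHead(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # Image locations act as queries; word embeddings act as keys and values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mask_head = nn.Linear(dim, 1)  # per-location foreground logit

    def forward(self, image_feats, text_feats):
        # image_feats: (B, H*W, dim) flattened visual features
        # text_feats:  (B, L, dim)   word-level language features
        fused, _ = self.cross_attn(query=image_feats, key=text_feats, value=text_feats)
        return self.mask_head(fused)  # (B, H*W, 1); reshape to (B, 1, H, W) downstream


# Usage with dummy features: a 16x16 visual grid and a 12-token referring expression.
B, H, W, L, dim = 1, 16, 16, 12, 256
image_feats = torch.rand(B, H * W, dim)
text_feats = torch.rand(B, L, dim)
logits = TextGuidedSegHead(dim)(image_feats, text_feats)
print(logits.shape)  # torch.Size([1, 256, 1])
```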

Keywords: Visual scene segmentation; Cross-modal data guidance; Cross-modal feature fusion; Semantic segmentation
Language: Chinese
Sub-direction classification (seven major directions): Object detection, tracking and recognition
State Key Laboratory planned research direction: Brain-inspired multimodal intelligent models and algorithms
Associated dataset requiring deposit:
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/51694
Collection: Graduates / Master's Theses
Corresponding author: 甘睿彤
Recommended citation:
GB/T 7714
甘睿彤. 跨模态数据引导的视觉场景分割[D],2023.
Files in this item:
File name / Size | Document type | Version | Access | License
甘睿彤终版.pdf (4856 KB) | Thesis | | Restricted access | CC BY-NC-SA