Research on Visual Scene Captioning Methods Based on Images and Point Clouds
于强
2021-09-16
Pages: 140
Degree type: Doctoral
Abstract (Chinese)

With the rapid development of the Internet and the growing availability of mobile smart devices and 3D scanning devices, visual data, consisting mainly of images and point clouds, is growing at an explosive rate. How to mine useful information from such visual data has become a pressing problem. As an important approach to visual scene understanding, visual scene captioning aims to generate, for given visual data such as images and point clouds, a natural-language sentence describing the content of interest in the visual scene. However, the volume of visual data is huge and the relationships among visual contents are intricate; moreover, there is a cross-modal semantic gap between vision and text. Visual scene captioning therefore still involves many technical problems that require further study.

Focusing on image and point cloud data, this dissertation studies visual scene captioning methods under single-modality and multi-modality fusion conditions, providing technical approaches for the corresponding applications. To address the difficult problems in visual scene captioning, it combines natural language models with three perspectives, namely the extraction of entity and abstract-concept attributes from image scenes, the construction of point cloud convolution operations, and image-point cloud semantic fusion, and carries out the construction, training, and experimental validation of deep learning models for visual scene captioning. Specifically, the main contributions of this dissertation are the following three points:

1. An image attribute extraction model based on feature refinement is proposed. The model contains three improved modules: an attribute refinement module, a word tree module, and a feature enhancement module. First, the attribute refinement module nonlinearly maps existing "noun" (visual entity) attribute features together with convolutional visual features into "non-noun" (abstract concept) attribute features. Second, the word tree module uses a tree structure to map the attribute features of synonymous nouns to similar attribute probabilities, eliminating semantic ambiguity at the natural-language level. Third, the feature enhancement module detects visual attributes in image features at different scales and supplies more accurate attribute values to the caption generation model. The joint application of these three modules improves the accuracy of the image captioning model, and comparative experiments verify the effectiveness of the proposed model.

2. A point cloud captioning model based on dense point convolution and multi-task learning is proposed. The model first introduces a point cloud convolutional network to extract high-level abstract visual features from the point cloud; a Transformer encoder-decoder architecture is then constructed to map the visual features into descriptive sentences. To reduce the difficulty of visual feature learning, a multi-task parameter sharing mechanism is introduced, and the point cloud semantic segmentation task and the caption generation task are jointly optimized within a multi-task learning framework. The segmentation task strengthens the feature learning ability of the captioning model, accelerates convergence, and suppresses overfitting. Finally, since large-scale scene point cloud captioning datasets are rarely publicly available, two such datasets are constructed. The effectiveness of the proposed model is verified on both a public dataset and the self-annotated datasets.

3. A multi-modal vision fusion captioning model based on region correlation and attention is proposed. For image and point cloud data, a region proposal generation module, a proposal fusion module, and a pooling module are first introduced on top of the corresponding backbone networks to obtain target region proposals and their fixed-length features for both modalities. Next, region correlation rules and an attention mechanism are constructed to perform multi-layer deep fusion of the image and point cloud region features. Finally, a caption generation module based on the Transformer attention mechanism maps the deeply fused visual feature sequence into a sequence of words or phrases. The proposed model extracts features directly from the point cloud, avoiding the loss of important data, and the region correlation rules make the fusion process more interpretable. Experiments verify the effectiveness of the proposed model, and the quality of the generated captions reaches the current state of the art.

Abstract (English)

With the rapid development of the Internet and the gradual popularization of mobile smart devices and three-dimensional scanning devices, visual data, mainly composed of two-dimensional images and three-dimensional point clouds, is growing at an explosive speed. How to mine the effective information in visual data has become an urgent problem. As an important visual scene understanding method, visual scene captioning aims to generate natural sentences describing the content of interest in the visual scene for given visual data (such as images and point clouds). However, the amount of visual data is huge and the relationships among visual contents are complex. At the same time, there are cross-modal semantic differences between vision and text. Therefore, many technical problems in visual scene captioning still need to be solved.

For images and point clouds, this dissertation studies visual scene captioning methods under single-modal and multi-modal vision fusion conditions, and provides technical ideas for related applications under the corresponding conditions. Aiming at the difficult problems in visual scene captioning, the construction, training, and experimental verification of deep learning models for visual scene captioning are carried out from the perspectives of image scene entity and abstract concept attribute extraction, point cloud convolution operation construction, and "image - point cloud" semantic fusion, combined with natural language models. Specifically, the main contributions of this dissertation include the following three points:

1. An image attribute extraction model based on feature refinement is proposed. This model consists of three improved modules: an attribute refinement module, a word tree module, and a feature enhancement module. First, the attribute refinement module maps the existing "noun" (visual entity) attribute features and convolutional visual features into "non-noun" (abstract concept) attribute features in a nonlinear way (a minimal sketch of this step is given after this list). Second, the word tree module maps the attribute features of synonymous nouns into similar attribute probabilities through a tree structure, so as to eliminate semantic ambiguity at the natural language level. Third, the feature enhancement module detects visual attributes in image features of different scales and provides more accurate attribute features for the caption generation model. The joint application of the above three modules improves the accuracy of image captioning models. Comparative experiments verify the effectiveness of the proposed model.

2. A point cloud captioning model based on dense point convolution and multi-task learning is proposed. First, a point cloud convolutional network is introduced to extract high-level abstract visual features of the point cloud. Then, a Transformer encoder-decoder architecture is constructed to map the visual features into caption sentences. At the same time, in order to reduce the difficulty of visual feature learning, a multi-task parameter sharing mechanism is introduced to jointly optimize the point cloud semantic segmentation task and the caption generation task under the framework of multi-task learning (see the second sketch after this list). The introduction of point cloud semantic segmentation improves the feature learning ability of the constructed captioning model, accelerates convergence, and suppresses overfitting. Finally, since few large-scale point cloud captioning datasets are publicly available, two large-scale point cloud captioning datasets are constructed. The effectiveness of the proposed model is verified on public and self-labeled datasets.

3. A multi-modal vision fusion captioning model based on region correlation and attention is proposed. For images and point clouds, a region proposal generation module, a proposal fusion module, and a region pooling module are first introduced on top of the corresponding backbone networks to obtain the target region proposals and their fixed-length features. Second, region correlation rules and an attention mechanism are constructed to deeply fuse the region features of images and point clouds across multiple layers (see the third sketch after this list). Finally, a caption generation module based on the Transformer attention mechanism is constructed to map the deeply fused visual feature sequences into word or phrase sequences. The proposed model can directly extract features from point clouds, avoiding the loss of important data. At the same time, the region correlation rules increase the interpretability of the fusion process. Experiments verify the effectiveness of the proposed model, and the quality of the generated caption sentences reaches the state of the art.
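As a concrete illustration of the attribute refinement step in contribution 1, the following PyTorch sketch maps noun attribute features and convolutional image features to abstract-concept attribute probabilities through a nonlinear mapping. It is a minimal sketch, not the dissertation's implementation; the module name, feature dimensions, and the sigmoid multi-label output are assumptions.

```python
import torch
import torch.nn as nn

class AttributeRefinement(nn.Module):
    """Illustrative sketch: fuse "noun" (visual entity) attribute features
    with convolutional image features and map them nonlinearly to
    "non-noun" (abstract concept) attribute probabilities.
    All dimensions are assumptions, not the dissertation's settings."""

    def __init__(self, noun_dim=512, conv_dim=2048, hidden_dim=1024, num_abstract_attrs=256):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Linear(noun_dim + conv_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(hidden_dim, num_abstract_attrs),
        )

    def forward(self, noun_attr_feat, conv_feat):
        # noun_attr_feat: (B, noun_dim) detected visual-entity attributes
        # conv_feat:      (B, conv_dim) pooled CNN features of the image
        x = torch.cat([noun_attr_feat, conv_feat], dim=-1)
        # Sigmoid treats each abstract-concept attribute as an independent
        # multi-label prediction.
        return torch.sigmoid(self.refine(x))


if __name__ == "__main__":
    model = AttributeRefinement()
    probs = model(torch.randn(2, 512), torch.randn(2, 2048))
    print(probs.shape)  # torch.Size([2, 256])
```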
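The multi-task design in contribution 2 can be sketched as a shared point feature extractor feeding both a segmentation head and a Transformer caption decoder, trained with a weighted joint loss. In the sketch below, the per-point MLP stands in for the dense point convolution backbone; the dimensions, vocabulary size, and 0.5 loss weight are illustrative assumptions rather than the dissertation's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPointEncoder(nn.Module):
    """Stand-in for the dense point-convolution backbone: a per-point MLP
    whose output features are shared by both tasks (illustrative only)."""
    def __init__(self, in_dim=6, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, feat_dim), nn.ReLU())

    def forward(self, points):                 # points: (B, N, in_dim)
        return self.mlp(points)                # (B, N, feat_dim)

class MultiTaskPointCaptioner(nn.Module):
    """Shared encoder + segmentation head + Transformer caption decoder."""
    def __init__(self, feat_dim=256, num_classes=20, vocab_size=5000):
        super().__init__()
        self.backbone = SharedPointEncoder(feat_dim=feat_dim)
        self.seg_head = nn.Linear(feat_dim, num_classes)      # semantic segmentation head
        self.embed = nn.Embedding(vocab_size, feat_dim)
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.word_head = nn.Linear(feat_dim, vocab_size)

    def forward(self, points, caption_in):
        feats = self.backbone(points)                         # shared point features
        seg_logits = self.seg_head(feats)                     # (B, N, num_classes)
        T = caption_in.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        dec = self.decoder(self.embed(caption_in), feats, tgt_mask=causal)
        return seg_logits, self.word_head(dec)                # (B, T, vocab_size)

if __name__ == "__main__":
    model = MultiTaskPointCaptioner()
    points = torch.randn(2, 1024, 6)                          # xyz + rgb per point
    caption = torch.randint(0, 5000, (2, 31))                 # token ids, teacher forcing
    seg_label = torch.randint(0, 20, (2, 1024))
    seg_logits, word_logits = model(points, caption[:, :-1])
    # Joint optimization: caption loss plus a weighted segmentation loss.
    loss = (F.cross_entropy(word_logits.reshape(-1, 5000), caption[:, 1:].reshape(-1))
            + 0.5 * F.cross_entropy(seg_logits.reshape(-1, 20), seg_label.reshape(-1)))
    loss.backward()
```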
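For the region correlation and attention fusion in contribution 3, one possible formulation is masked cross-attention, in which a boolean correlation matrix encodes which point cloud regions each image region is associated with. The sketch below is only an illustration under that assumption, not the actual model; the mask construction, layer count, and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class RegionCorrelationFusion(nn.Module):
    """Illustrative multi-layer fusion of image and point-cloud region
    features via masked cross-attention; the boolean `correlation` matrix
    plays the role of a region correlation rule."""
    def __init__(self, dim=256, num_heads=8, num_layers=2):
        super().__init__()
        self.num_heads = num_heads
        self.attn_layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True) for _ in range(num_layers)])
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_layers)])

    def forward(self, img_regions, pc_regions, correlation):
        # img_regions: (B, Ri, dim), pc_regions: (B, Rp, dim)
        # correlation: (B, Ri, Rp) bool, True where an image region and a
        # point-cloud region are associated; each row needs at least one True,
        # otherwise the attention weights for that region are undefined.
        mask = (~correlation).repeat_interleave(self.num_heads, dim=0)  # True = blocked
        fused = img_regions
        for attn, norm in zip(self.attn_layers, self.norms):
            ctx, _ = attn(fused, pc_regions, pc_regions, attn_mask=mask)
            fused = norm(fused + ctx)        # residual update, layer-by-layer deep fusion
        return fused                         # (B, Ri, dim) sequence for the caption decoder

if __name__ == "__main__":
    B, Ri, Rp, D = 2, 6, 10, 256
    corr = torch.zeros(B, Ri, Rp, dtype=torch.bool)
    corr[:, :, 0] = True                     # toy rule: every image region relates to region 0
    out = RegionCorrelationFusion()(torch.randn(B, Ri, D), torch.randn(B, Rp, D), corr)
    print(out.shape)                         # torch.Size([2, 6, 256])
```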

Keywords: visual scene captioning; attribute feature extraction; dense point convolution; region correlation; multi-modal vision fusion
Language: Chinese
Sub-direction classification: 3D Vision
Document type: Doctoral thesis
Identifier: http://ir.ia.ac.cn/handle/173211/46622
Collection: 多模态人工智能系统全国重点实验室_先进时空数据分析与学习
Corresponding author: 于强
Recommended citation (GB/T 7714):
于强. 基于图像与点云的视觉场景语句描述方法研究[D]. 中国科学院大学, 2021.
Files in this item:
File name/size: Thesis-YuQiang-20211 (8236 KB)
Document type: Doctoral thesis
Access: Open access
License: CC BY-NC-SA