Research on Language-Visual Object Localization Methods for Natural Human-Robot Interaction
李钱钟
2022-05-25
Pages: 156
Degree Type: Doctoral
Chinese Abstract

In recent years, the rapid development of artificial intelligence has driven continuous change in human-computer interaction, moving from traditional interaction assisted by devices such as mice and keyboards toward natural human-robot interaction that is closer to human-to-human communication, such as natural language, gestures, and facial expressions. Owing to the convenience and diversity of expression, service robots often adopt natural language as the interaction modality to deliver a natural and comfortable interaction experience. Under this interaction paradigm, how a robot understands the relationship between speech or text and the scene, determines the objects involved in the interaction, and completes the process of serving people is the key to improving the quality of natural human-robot interaction. Language-visual object localization aims to locate the object referred to by a natural-language expression in a visual image. Accordingly, this thesis studies language-visual object localization methods for natural human-robot interaction to improve the interaction capability of service robots.

This thesis addresses three problems that human-robot interaction systems face: generalizing to new object categories as interaction scenes change, extending expression forms from explicit referring descriptions to implicit ones, and extending object localization from static to dynamic settings. It studies, step by step, zero-shot object detection, natural language-image referring expression comprehension, and natural language-video referring expression comprehension, and builds a natural human-robot interaction system as the application. The main contents and contributions are as follows:

1. For detecting untrained objects in interaction scenes, a zero-shot object detection method based on a softplus margin focal loss is proposed. The method designs a category semantic-visual mapping mechanism based on one-dimensional convolution, which effectively alleviates the hubness problem in zero-shot learning and reduces model parameters, and constrains the encoder through the reconstruction loss of a decoder. To align visual features with the mapped semantic features in the visual space of the classification branch, a softplus margin focal loss function is proposed; it retains the ability of the focal loss to handle class imbalance while enhancing the separation between features mapped to positive and negative categories, distinguishing image foreground from background. On this basis, a localization branch that fuses semantic information is further proposed, with a trainable matrix designed for feature alignment. Experiments on four public datasets verify the effectiveness of the method.

2. For natural language-image referring expression comprehension and segmentation, a cross-modality synergy network is proposed. The network adopts an attention-aware representation learning module to learn modality features of images and language expressions: a language self-attention submodule builds the intrinsic relationships within an expression and learns language features, and a language-guided channel-spatial attention submodule highlights expression-related image regions and suppresses background interference to obtain language-aware visual features. For modality feature fusion, a cross-modality synergy module is designed to build synergistic relationships between the two modalities in the semantic and spatial dimensions. On this basis, a multi-scale feature fusion module based on a feature selection strategy is proposed to aggregate referent-related information from multi-scale features and generate the referring prediction. Experiments on four public datasets verify the effectiveness of the model.

3. For natural language-video referring expression comprehension, a multi-stage image-language cross-modality generation-fusion network is proposed. The method designs a frame-dense feature aggregation module, which uses temporally adjacent video frames to assist keyframe feature learning and keeps referent localization consistent across preceding and subsequent frames. For feature fusion, an image-language cross-modality generation-fusion module is proposed as the backbone of the multi-stage learning; it generates cross-modal features from image-language similarity and finely fuses the resulting image and language features. To strengthen the model's cross-modal feature generation, a consistency loss over referent localization and language expression features is designed to constrain the image-language and language-image similarity matrices used in feature generation. Experiments on three public datasets verify the effectiveness of the method.

4. For natural human-robot interaction, a natural human-robot interaction system based on language-visual object localization is designed, and a service robot platform consisting of hardware and software systems is built to implement it. To handle interaction scenes that shift from fixed object categories to new categories, a human-robot interaction system based on object category name matching is built with the zero-shot object detection method, transferring model performance from "seen" to "unseen" categories. To overcome the limitation of determining interactive objects by category name matching, a human-robot interaction system based on natural language-image referring expression comprehension is built by directly modeling the relationship between objects and expressions, so that the system can handle implicit referring expressions that do not mention object categories. To localize referents in motion, a human-robot interaction system based on natural language-video referring expression comprehension is built, improving accuracy and robustness. Human-robot interaction experiments verify the feasibility and effectiveness of the system.

English Abstract

In recent years, the development of artificial intelligence technology has contributed to significant changes in human-robot interaction, from traditional interaction assisted by devices such as mice and keyboards to natural human-robot interaction that is closer to human-to-human communication, such as natural language, gestures, and facial expressions. Owing to the convenience and diversity of expression, service robots often use natural language as the interaction modality to achieve natural and comfortable human-robot interaction. Under this interaction paradigm, how the robot understands the relationship between speech, text, or other content and the scene, determines the objects referred to in the expressions, and completes the process of serving people is the key to improving the quality of natural human-robot interaction. Language-visual object localization aims to locate the referred object in an image based on a natural-language expression. Therefore, this thesis explores this technology for natural human-robot interaction to improve the interaction ability of service robots.

This thesis addresses three problems of human-robot interaction systems: generalizing to new object categories as interaction scenes change, extending expression forms from explicit referring expressions to implicit ones, and extending object localization from static to dynamic states. The tasks of zero-shot object detection, natural language-image referring expression comprehension, and natural language-video referring expression comprehension are explored step by step, and a natural human-robot interaction system is built as the application. The main contents and contributions of this thesis are as follows.

1. For the problem of detecting untrained objects in interactive scenes, a zero-shot object detection method based on a softplus margin focal loss is proposed. The method designs a category semantic-visual mapping mechanism based on one-dimensional convolution, which effectively alleviates the hubness problem in zero-shot learning, reduces model parameters, and constrains the encoder through the reconstruction loss of a decoder. A softplus margin focal loss function is proposed to align the visual features and the mapped semantic features in the visual space of the classification branch; it retains the ability of the focal loss to handle class imbalance while improving the separation between features mapped to positive and negative categories, distinguishing image foreground from background. Furthermore, a localization branch that fuses semantic information is proposed, with a trainable matrix designed for feature alignment. The proposed method is tested on four public datasets, and the results demonstrate its effectiveness.
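To make the loss concrete, below is a minimal PyTorch sketch of one plausible reading of a softplus margin focal loss, in which a softplus-smoothed margin pushes positive and negative logits apart before the standard focal weighting is applied. The function name, the placement of the margin, and all hyperparameter values are illustrative assumptions; the thesis's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def softplus_margin_focal_loss(logits, targets, gamma=2.0, margin=0.5, alpha=0.25):
    # Assumed formulation: shift logits away from the decision boundary by a
    # softplus-smoothed margin, so positives must score above +m and negatives
    # below -m, then apply the usual sigmoid focal loss on the shifted logits.
    m = F.softplus(torch.tensor(margin, dtype=logits.dtype, device=logits.device))
    shifted = torch.where(targets > 0.5, logits - m, logits + m)
    p = torch.sigmoid(shifted)
    p_t = torch.where(targets > 0.5, p, 1.0 - p)           # prob. of the true class
    alpha_t = torch.where(targets > 0.5,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1.0 - alpha)) # class-imbalance weight
    focal = -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp_min(1e-8))
    return focal.mean()
```

With margin set to zero this reduces to the standard sigmoid focal loss, which is consistent with the claim above that the class-imbalance behavior of the focal loss is preserved.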

2. For the natural language-image referring expression comprehension and segmentation problem, a cross-modality synergy network is proposed. An attention-aware representation learning module is designed to learn modal representations of images and expressions. It introduces a language self-attention submodule to establish the intrinsic relationships within expressions and obtain language features, and a language-guided channel-spatial attention submodule to highlight expression-related regions in the images and suppress background interference, generating language-aware visual representations. To fuse the visual and language features, a cross-modality synergy module is designed to explore the synergistic relationship between the two modalities in the semantic and spatial dimensions. Based on a feature selection strategy, a multi-scale feature fusion module is proposed to aggregate referent-related information from multi-scale features and generate the referred-object prediction. The proposed method is tested on four public datasets, and the results verify its effectiveness.
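As a rough illustration of the language-guided channel-spatial attention idea, the sketch below gates visual channels with a sentence embedding and then predicts a spatial saliency map from the language-conditioned features. All module names, dimensions, and the exact gating scheme are assumptions for illustration, not the thesis's implementation.

```python
import torch
import torch.nn as nn

class LanguageGuidedAttention(nn.Module):
    """Illustrative sketch: channel attention driven by a sentence embedding,
    followed by a language-conditioned spatial attention map."""
    def __init__(self, vis_dim=256, lang_dim=512):
        super().__init__()
        self.channel_gate = nn.Sequential(nn.Linear(lang_dim, vis_dim), nn.Sigmoid())
        self.spatial_head = nn.Conv2d(vis_dim + lang_dim, 1, kernel_size=1)

    def forward(self, vis, lang):
        # vis: (B, C, H, W) visual features; lang: (B, L) sentence embedding
        b, c, h, w = vis.shape
        vis = vis * self.channel_gate(lang).view(b, c, 1, 1)    # channel attention
        lang_map = lang.view(b, -1, 1, 1).expand(-1, -1, h, w)  # broadcast language
        sal = torch.sigmoid(self.spatial_head(torch.cat([vis, lang_map], dim=1)))
        return vis * sal                                        # spatial attention
```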

3. For the natural language-video referring expression comprehension problem, a multi-stage image-language cross-modality generation-fusion network is proposed. It designs a frame-dense feature aggregation module that learns keyframe features from temporally adjacent frames, ensuring consistent object localization across the frames before and after each keyframe. For the feature fusion problem, an image-language cross-modality generation-fusion module is proposed as the main module of the multi-stage learning; it applies image-language similarity to generate cross-modality features and finely fuses the obtained image and language features. To enhance the cross-modal generation capability of the module, a consistency loss function over referent localization and language expression features is designed to constrain the image-language and language-image similarity matrices used in feature generation. Experimental results on three public datasets demonstrate the effectiveness of the proposed method.
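The consistency constraint on the two similarity matrices can be pictured with the short sketch below, which penalizes disagreement between the image-to-language and language-to-image attention directions (a cycle-consistency style objective). The loss form and the use of softmax normalization are assumptions; the thesis may constrain the matrices differently.

```python
import torch
import torch.nn.functional as F

def similarity_consistency_loss(img_feat, lang_feat):
    # img_feat: (B, N, D) region/frame features; lang_feat: (B, T, D) word features
    img = F.normalize(img_feat, dim=-1)
    lang = F.normalize(lang_feat, dim=-1)
    sim_il = torch.softmax(img @ lang.transpose(1, 2), dim=-1)  # image->language, (B, N, T)
    sim_li = torch.softmax(lang @ img.transpose(1, 2), dim=-1)  # language->image, (B, T, N)
    # Penalize disagreement between the two attention directions.
    return F.mse_loss(sim_il, sim_li.transpose(1, 2))
```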

4. For the natural human-robot interaction problem, a natural human-robot interaction system based on language-visual object localization is designed, and a service robot interaction platform composed of hardware and software systems is built to realize it. To transfer the interaction system from fixed object categories to new categories as scenes change, a human-robot interaction system based on category name matching is constructed with the zero-shot object detection method, which transfers model performance from seen to unseen categories. To alleviate the limitation of localizing interactive objects by matching object category names, a human-robot interaction system based on the natural language-image referring expression comprehension model is built by directly establishing the relationships between objects and expressions, so that the system can handle implicit referring expressions that do not mention object categories. To localize referents in motion, a human-robot interaction system is further constructed with the natural language-video referring expression comprehension model, improving accuracy and robustness. The results of the human-robot interaction experiments show the feasibility and effectiveness of the system.
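At the system level, the three models can be viewed as interchangeable grounding backends behind a single interface. The sketch below shows one way such dispatch could look; the class names, the `locate` signature, and the dispatch rules (category-name match, single frame vs. frame sequence) are hypothetical simplifications of the system described above, not the thesis's implementation.

```python
from dataclasses import dataclass
from typing import Any, List, Protocol

class Grounder(Protocol):
    def locate(self, frames: List[Any], expression: str) -> Any: ...

@dataclass
class InteractionSystem:
    zsd: Grounder        # zero-shot detector (category-name matching)
    image_rec: Grounder  # image referring expression comprehension
    video_rec: Grounder  # video referring expression comprehension

    def locate(self, frames: List[Any], expression: str,
               known_categories: List[str]) -> Any:
        if len(frames) > 1:                                 # moving referent
            return self.video_rec.locate(frames, expression)
        if any(cat in expression for cat in known_categories):
            return self.zsd.locate(frames, expression)      # explicit category name
        return self.image_rec.locate(frames, expression)    # implicit description
```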

Keywords: Natural Human-Robot Interaction; Zero-Shot Object Detection; Natural Language-Image Referring Expression Comprehension; Natural Language-Video Referring Expression Comprehension
Subject Areas: Robot Control; Artificial Intelligence Theory; Natural Language Processing; Pattern Recognition
Discipline: Engineering::Control Science and Engineering; Engineering::Computer Science and Technology (degree awardable in engineering or science)
Language: Chinese
Funding Project: National Natural Science Foundation of China [61333016]
Document Type: Doctoral Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/48487
Collection: Graduates - Doctoral Dissertations
Recommended Citation (GB/T 7714):
李钱钟. 面向自然人机交互的语言-视觉物体定位方法研究[D]. 中国科学院自动化研究所, 2022.
Files in This Item:
File Name/Size: 面向自然人机交互的语言-视觉物体定位方法 (42933 KB); Document Type: Thesis; Access: Restricted; License: CC BY-NC-SA