Research on Multimodal Perception-based Dialogue Technology
陈飞龙
2023-05-22
Pages: 154
Degree type: Doctoral
Chinese Abstract

Language is an important hallmark of human intelligence and a key factor in the rapid development and continuity of human civilization. Dialogue is an essential form in which humans use and develop their language ability. Looking back at how human dialogue ability emerged, it is not hard to see that it is acquired through multi-channel, multi-level interaction with information from the external physical environment. From infancy onward, humans build cognition and learn through interaction within the physical environment. Information in the physical environment exists in multiple modalities, including images, video, audio, and text, and this multimodal information stimulates multiple human sensory organs such as sight and hearing. The dialogue ability acquired through interaction with the environment promotes cultural exchange and the continuity of human civilization.

Since the Turing test was proposed in 1950, artificial intelligence researchers have devoted themselves to the study of dialogue systems. As artificial intelligence technology develops, making dialogue systems multimodal is a necessary step toward more natural interaction between machines and humans. From a research perspective, dialogue systems with multimodal perception will be the future paradigm of human-machine collaboration and interaction. How to perceive multimodal information, how to integrate the perceived multimodal information into a dialogue model, and what effects the introduction of multimodal information will have are all questions that urgently need answers. From an application perspective, dialogue is a service: it is the most natural form for humans and intelligent systems to interact and collaborate around the multimodal physical environment, so dialogue systems with multimodal perception have broad application prospects and great commercial value.

Starting from multimodal dialogue tasks and taking dialogue as the core, this thesis follows the research paradigm of "multimodal perception" + "dialogue" and studies dialogue response methods adapted to multiple forms of multimodal perception. Addressing visual perception, joint visual-language perception, and mixed multimodal perception, and targeting the respective challenges of multimodal information regulation, multimodal concept alignment and acquisition, and the mixture of multimodal multi-task settings in multimodal dialogue tasks, it proposes dialogue response methods based on multi-step reasoning and fusion, multi-granularity visual-language alignment, and prompt guidance.

The research content and main innovations of this thesis cover the following three aspects:

(1) A dialogue generation method with multi-step reasoning and fusion, based on visual perception. For visually perceived scenarios, this method proposes a dual-channel multi-step fusion and modulation mechanism between visual information and language information (the current question and the dialogue history), filling the gap in existing work, which lacks dialogue response methods with dual-channel multi-step reasoning regulated by visual information. While answering multi-turn questions, the framework shifts attention over multiple steps across the two channels to locate question-relevant clues in the dialogue history and the image, and integrates the visual information into response generation, improving the accuracy and informativeness of the responses of visually perceptive dialogue systems.

(2) A dialogue retrieval method with multi-granularity visual-language alignment, based on joint visual-language perception. For scenarios in which the visual and language modalities are perceived jointly, this method uses the system's visual and language perception abilities to align visual concepts with language entities at multiple granularities, so that the dialogue system holds a unified understanding of the different modal representations of the same concept. This remedies the lack, in existing methods, of alignment and acquisition of shared concepts across the image and text modalities, and thereby improves the accuracy and semantic consistency of the responses produced by dialogue systems with joint visual-language perception.

(3) A prompt-guided multi-task dialogue response method, based on mixed multimodal perception. For mixed multimodal perception scenarios, and building on the two studies above, this method first extends the set of modalities by introducing speech. Then, based on prompt learning and quasi-linguistic structures, and through large-scale pre-training on multimodal dialogues of multiple types, it transfers dialogue understanding and generation abilities between single-modal and multimodal dialogue tasks, and between chitchat and task-oriented dialogues, yielding a dialogue model unified across modalities and tasks. This fills the gap left by existing methods, none of which can handle mixed modalities and mixed dialogue task types at the same time, and improves the generality and interactivity of the dialogue model.

English Abstract

Language is an important symbol of human intelligence and a key factor in the rapid development and continuity of human civilization. Intellectual activity and language are inseparable. Dialogue is an important form of human language use and development. Looking back at the emergence of human dialogue ability, it is not difficult to find that this ability is acquired through multimodal, multi-level interaction with information from the external physical environment. From infancy, humans engage in cognition and learning through interaction with the physical environment. Physical environment information exists in various modalities, including images, videos, audio, and text. Multimodal information stimulates human visual, auditory, and other sensory organs, and is closely related to the generation and understanding of dialogue. The dialogue ability acquired through interaction with the environment promotes the exchange of human culture and the continuity of human civilization.

Since the Turing test was proposed in 1950, artificial intelligence researchers have been committed to the study of dialogue systems. With the development of artificial intelligence, making dialogue systems multimodal is a necessary direction for more natural interaction between machines and humans. From a research perspective, dialogue systems with multimodal perception will be the future paradigm of human-machine collaboration and interaction. How to perceive multimodal information, how to integrate the perceived multimodal information into the dialogue model, and what impact the introduction of multimodal information will have are urgent questions to be answered. From an application perspective, dialogue is a service: it is the most natural form of interaction and collaboration between humans and intelligent systems with respect to the multimodal physical environment. Dialogue systems with multimodal perception therefore have broad application prospects and huge commercial value.

Starting from multimodal dialogue tasks, this thesis focuses on the paradigm of "multimodal perception" + "dialogue" and studies dialogue response methods adapted to various forms of multimodal perception. For visual perception, joint visual-language perception, and mixed multimodal perception, it proposes dialogue response methods based on multi-step reasoning, multi-granularity visual-language alignment, and prompt guidance, respectively, to address the challenges of multimodal information regulation, multimodal concept alignment, and multimodal multi-task mixing in multimodal dialogue tasks.

The research content and main innovations of this thesis include the following three aspects:

(1) Research on a dialogue generation method based on multi-step reasoning with visual perception. This method targets visual perception scenarios and proposes a dual-channel multi-step reasoning mechanism between visual and language information (the current question and the dialogue history), addressing the lack, in existing approaches, of dialogue response methods with dual-channel multi-step reasoning regulated by visual information. By shifting attention between the dialogue history and the image over multiple dual-channel reasoning steps, the framework locates the clues relevant to the question, and then integrates the visual information into the response generation process, enhancing the accuracy and informativeness of the responses of the visually perceptive dialogue system.
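
The dual-channel mechanism can be illustrated with a minimal, hypothetical PyTorch sketch, assuming precomputed image-region features and already-encoded question and history vectors; the module names, head count, number of steps, and dimensions are illustrative assumptions rather than the thesis implementation.

```python
import torch
import torch.nn as nn

class DualChannelStep(nn.Module):
    """One reasoning step: the query attends over the dialogue history
    (textual channel) and the image regions (visual channel) in parallel,
    and the two context summaries refine the query for the next step."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.hist_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, query, history, regions):
        # query: (B, 1, D); history: (B, T, D); regions: (B, R, D)
        h_ctx, _ = self.hist_attn(query, history, history)
        v_ctx, _ = self.img_attn(query, regions, regions)
        return query + self.fuse(torch.cat([h_ctx, v_ctx], dim=-1))

class MultiStepReasoner(nn.Module):
    """Stacks several dual-channel steps so attention can shift across
    turns of the history and regions of the image before decoding."""
    def __init__(self, dim: int, steps: int = 3):
        super().__init__()
        self.steps = nn.ModuleList(DualChannelStep(dim) for _ in range(steps))

    def forward(self, question, history, regions):
        q = question
        for step in self.steps:
            q = step(q, history, regions)
        return q  # passed on to the response decoder
```

The residual update of the query is one simple way to realize the "modulation" described above; the thesis may use a different fusion operator.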

(2) Research on a dialogue retrieval method based on multi-granularity visual-language alignment with joint visual-language perception. This method targets scenarios where the visual and language modalities are perceived jointly. Using the system's visual and language perception abilities, it aligns visual concepts with language entities at multiple granularities, giving the dialogue system a unified understanding of the different modal representations of the same concept. It addresses the lack of alignment and acquisition of shared concepts between the two modalities in existing methods, and improves the accuracy and semantic consistency of response retrieval in the dialogue system with joint visual-language perception.
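
One plausible way to realize alignment at more than one granularity is a pair of contrastive objectives, sketched below: a coarse loss between whole-image and whole-utterance embeddings, and a fine loss between region and entity embeddings. The assumption that matched region-entity pairs are given simplifies away the concept-acquisition machinery described above; the loss weights and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def multi_granularity_loss(img_glob, txt_glob, regions, entities,
                           w_coarse: float = 1.0, w_fine: float = 0.5):
    # Coarse granularity: whole image vs. whole utterance.
    coarse = info_nce(img_glob, txt_glob)
    # Fine granularity: visual concepts (regions) vs. language entities,
    # flattened into one batch of matched pairs.
    fine = info_nce(regions, entities)
    return w_coarse * coarse + w_fine * fine
```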

(3) Research on a prompt-guided multi-task dialogue method based on mixed multimodal perception. This method targets mixed multimodal perception scenarios. Building on the first two studies, it extends the set of modalities by introducing speech. Then, through large-scale multimodal, multi-task dialogue pre-training based on prompt learning and quasi-linguistic structures, the dialogue understanding and generation abilities are transferred between single-modal and multimodal dialogue tasks, and between chitchat and task-oriented dialogues. This realizes a dialogue model unified across modalities and tasks, filling the gap left by existing methods, which cannot simultaneously handle mixed modalities and mixed dialogue task types, and enhancing the generality and interactivity of the dialogue model.
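
To make the prompt-guided, modality-unified setup concrete, here is a hypothetical sketch of how heterogeneous inputs could be serialized into one quasi-linguistic sequence for a single text-to-text backbone; the prompt strings and special tokens are invented for illustration and do not reflect the thesis's actual pre-training format.

```python
from typing import List, Optional

def build_model_input(task: str, context: str,
                      image_tokens: Optional[List[str]] = None,
                      speech_tokens: Optional[List[str]] = None) -> str:
    """Serialize a dialogue example into one flat sequence: a task prompt,
    optional discretized visual/speech features, then the text context."""
    parts = [f"[task: {task}]"]           # e.g. "chitchat" or "task-oriented"
    if image_tokens:
        parts.append("[image] " + " ".join(image_tokens))
    if speech_tokens:
        parts.append("[speech] " + " ".join(speech_tokens))
    parts.append("[context] " + context)
    return " ".join(parts)

# The same backbone sees a uniform sequence whichever modalities are present,
# which is what lets understanding and generation transfer across tasks.
print(build_model_input("chitchat",
                        "A: Nice photo! B: Where was it taken?",
                        image_tokens=["<v_12>", "<v_87>", "<v_3>"]))
```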

Keywords: natural language processing, dialogue systems, multimodal perception, multimodal fusion, dialogue reasoning
Language: Chinese
Representative paper:
Sub-direction classification (seven major directions): natural language processing
State Key Laboratory planning direction: human-machine hybrid intelligence
Paper-associated dataset to be deposited:
Document type: dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/51887
Collection: Graduates / Doctoral Dissertations
Recommended citation (GB/T 7714):
陈飞龙. 多模态感知的对话技术研究[D], 2023.
Files in this item:
201818020428006陈飞龙.p (34661 KB) | document type: dissertation | access: restricted | license: CC BY-NC-SA