Research on Emotion Recognition for Interactive Scenarios (面向交互场景的情感识别研究)
Zheng Lian (连政)
Year: 2021
Pages: 106
Degree type: Doctoral
Abstract

Emotion recognition is a technology that identifies emotional states by analyzing the physiological and behavioral responses produced during emotional expression. As an important branch of artificial intelligence, it is widely applied in interaction, education, security, finance, and other fields. With the large-scale deployment of human-computer interaction systems and social networking platforms, using smart devices for human-computer and human-to-human interaction has become part of daily life. However, existing intelligent interaction systems focus mainly on understanding verbal content and fail to take emotional information fully into account, which limits their naturalness and user-friendliness. Emotion recognition for interactive scenarios has therefore attracted wide attention from researchers at home and abroad. This thesis studies emotion recognition methods for interactive scenarios from three aspects: emotion feature extraction, multimodal information fusion, and individual information modeling. The main contributions are as follows:
1. At the level of emotion feature extraction, this thesis improves the performance of emotion recognition systems by learning discriminative emotional features. First, to address the poor separability of different emotional states, it proposes an emotional feature learning method based on discriminative loss functions. The method jointly optimizes the model parameters with a contrastive loss and a supervised cross-entropy loss: the contrastive loss reduces intra-class distances and enlarges inter-class distances, yielding discriminative emotional features, while the supervised cross-entropy loss uses emotion labels to guide the learning of emotion-oriented representations (a sketch of this joint objective is given after this list). Second, to address the scarcity of labeled emotional data, the thesis proposes a transfer-learning-based feature extraction method that transfers knowledge learned on an unsupervised task to emotion recognition, effectively improving system performance when labeled emotional data are limited.
2. At the level of multimodal information fusion, this thesis proposes an emotion recognition method based on cross-modal interaction modeling. The method takes word-level textual features and segment-level acoustic features as inputs; the two feature sequences differ in length and are therefore naturally "unaligned". A Transformer architecture is used to align the two modality streams automatically and then learn cross-modal interactions, improving emotion recognition performance (see the cross-modal attention sketch after this list).
3. At the level of individual information modeling, this thesis proposes a modeling strategy based on graph neural networks. Psychological studies show that the emotional state of each individual in an interactive scenario is driven mainly by two factors: self-dependency, i.e., the continuity of a person's own emotional state over time, and inter-personal dependency, i.e., the influence of other participants on that state. The proposed method represents each utterance as a graph node. To model self-dependency, an utterance node of the current speaker is connected by an edge to that speaker's immediately preceding utterance node; to model inter-personal dependency, it is connected to the immediately preceding utterance nodes of the other speakers. Distinct edge types are used to distinguish the two kinds of dependency (a graph-construction sketch follows this list). The method is applied to correct samples misclassified by a pre-trained emotion recognition system. Experimental results show that it effectively improves emotion recognition performance with few trainable parameters and low computational complexity.
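The first contribution combines a contrastive loss with a supervised cross-entropy loss. Below is a minimal PyTorch sketch of such a joint objective, assuming a generic encoder that produces an embedding and class logits per utterance; the pairwise form of the contrastive term, the margin, and the weight alpha are illustrative assumptions, not taken from the thesis.

```python
# Minimal sketch of a joint contrastive + cross-entropy objective (contribution 1).
# The pairwise formulation, margin, and alpha are assumptions for illustration.
import torch
import torch.nn.functional as F

def contrastive_loss(emb, labels, margin=1.0):
    """Pull same-emotion embeddings together; push different-emotion
    embeddings at least `margin` apart (reduces intra-class distance,
    enlarges inter-class distance)."""
    dist = torch.cdist(emb, emb)                          # (B, B) Euclidean distances
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    pos = same * dist.pow(2)                              # intra-class term
    neg = (1.0 - same) * F.relu(margin - dist).pow(2)     # inter-class margin term
    return (pos + neg).mean()

def joint_loss(logits, emb, labels, alpha=0.5):
    # Cross-entropy keeps the features emotion-oriented; the contrastive
    # term makes them discriminative. alpha balances the two objectives.
    return F.cross_entropy(logits, labels) + alpha * contrastive_loss(emb, labels)

# Usage with random stand-ins for encoder outputs:
logits, emb = torch.randn(8, 4), torch.randn(8, 64)
labels = torch.randint(0, 4, (8,))
loss = joint_loss(logits, emb, labels)
```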
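The second contribution aligns word-level text features with segment-level acoustic features via a Transformer. The following cross-modal attention sketch illustrates the idea; the single-block design, feature dimensions, and head count are assumptions, since the abstract only states that a Transformer aligns the two unaligned streams.

```python
# Minimal sketch of cross-modal attention over unaligned text/audio streams
# (contribution 2). Layer sizes and the single-block design are assumptions.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, text, audio):
        # Each word-level query attends over all acoustic segments, so the two
        # streams are aligned automatically, without forced alignment.
        aligned, _ = self.attn(query=text, key=audio, value=audio)
        x = self.norm1(text + aligned)
        return self.norm2(x + self.ff(x))

# Word and segment sequences of different lengths (12 words vs. 40 segments):
fused = CrossModalBlock()(torch.randn(2, 12, 128), torch.randn(2, 40, 128))
```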
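The third contribution builds a conversation graph with two edge types. The sketch below constructs only the edge list from a sequence of speaker turns; the node features and the relational GNN layer on top (e.g., an R-GCN-style convolution) are left open, as the abstract does not name a specific architecture.

```python
# Minimal sketch of the two-relation conversation graph (contribution 3).
# Utterances are nodes; edge type 0 encodes self-dependency (the same
# speaker's previous utterance), edge type 1 encodes inter-personal
# dependency (the other speakers' most recent utterances).
def build_edges(speakers):
    """speakers: speaker ids in utterance order, e.g. ['A', 'B', 'A']."""
    edges, etypes = [], []
    last_seen = {}                        # speaker id -> last utterance index
    for i, spk in enumerate(speakers):
        for other, j in last_seen.items():
            edges.append((j, i))
            etypes.append(0 if other == spk else 1)
        last_seen[spk] = i
    return edges, etypes

edges, etypes = build_edges(['A', 'B', 'A', 'B'])
# edges:  [(0, 1), (0, 2), (1, 2), (2, 3), (1, 3)]
# etypes: [1, 0, 1, 1, 0]
```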


Keywords: interactive scenarios; emotion recognition; emotion feature extraction; multimodal fusion; individual information modeling
Language: Chinese
Research sub-direction: Multimodal Intelligence
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/44747
Collection: 多模态人工智能系统全国重点实验室_智能交互 (State Key Laboratory of Multimodal Artificial Intelligence Systems, Intelligent Interaction)
Recommended citation (GB/T 7714):
连政. 面向交互场景的情感识别研究[D]. 中国科学院自动化研究所, 2021.
Files in this item:
Thesis-Zheng Lian.pdf (4140 KB) · thesis · open access · CC BY-NC-SA