Research on Emotion Recognition for Interactive Scenarios (面向交互场景的情感识别研究)
Zheng Lian (连政)
Year: 2021
Pages: 106
Degree type: Doctoral
Abstract

Emotion recognition is a technology that identifies emotional states by analyzing the physiological and behavioral responses produced during emotional expression. As an important branch of artificial intelligence, it is widely applied in interaction, education, security, finance, and other fields. With the large-scale deployment of human-computer interaction systems and social networking platforms, using smart devices for human-computer and human-to-human interaction has become part of daily life. However, existing intelligent interaction systems focus mainly on understanding verbal content and fail to take emotional information fully into account, which limits their naturalness and user-friendliness. Emotion recognition for interactive scenarios has therefore attracted wide attention from researchers at home and abroad. This thesis studies emotion recognition methods for interactive scenarios from three aspects: emotion feature extraction, multimodal information fusion, and individual information modeling. The main contributions are as follows:
1. At the level of emotion feature extraction, this thesis improves the performance of emotion recognition systems by learning discriminative emotional features. First, to address the poor separability of different emotional states, it proposes an emotional feature learning method based on discriminative loss functions. The method jointly optimizes the model parameters with a contrastive loss and a supervised cross-entropy loss: the contrastive loss reduces intra-class distances and enlarges inter-class distances, yielding discriminative emotional features, while the supervised cross-entropy loss uses emotion labels to guide the learning of emotion-oriented representations (a sketch of this joint objective is given after this list). Second, to address the scarcity of labeled emotional data, the thesis proposes a transfer-learning-based feature extraction method that transfers knowledge learned on an unsupervised task to emotion recognition, effectively improving system performance when labeled emotional data are limited.
2. At the level of multimodal information fusion, this thesis proposes an emotion recognition method based on cross-modal interaction modeling. The method takes word-level textual features and segment-level acoustic features as inputs; the two feature sequences differ in length and are therefore naturally "unaligned". A Transformer architecture is used to align the two modality streams automatically and then learn cross-modal interactions, improving emotion recognition performance (see the cross-modal attention sketch after this list).
3. At the level of individual information modeling, this thesis proposes a modeling strategy based on graph neural networks. Psychological studies show that the emotional state of each individual in an interactive scenario is driven mainly by two factors: self-dependency, i.e., the continuity of a person's own emotional state over time, and inter-personal dependency, i.e., the influence of other participants on that state. The proposed method represents each utterance as a graph node. To model self-dependency, an utterance node of the current speaker is connected by an edge to that speaker's immediately preceding utterance node; to model inter-personal dependency, it is connected to the immediately preceding utterance nodes of the other speakers. Distinct edge types are used to distinguish the two kinds of dependency (a graph-construction sketch follows this list). The method is applied to correct samples misclassified by a pre-trained emotion recognition system. Experimental results show that it effectively improves emotion recognition performance with few trainable parameters and low computational complexity.
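The first contribution combines a contrastive loss with a supervised cross-entropy loss. Below is a minimal PyTorch sketch of such a joint objective, assuming a generic encoder that produces an embedding and class logits per utterance; the pairwise form of the contrastive term, the margin, and the weight alpha are illustrative assumptions, not taken from the thesis.

```python
# Minimal sketch of a joint contrastive + cross-entropy objective (contribution 1).
# The pairwise formulation, margin, and alpha are assumptions for illustration.
import torch
import torch.nn.functional as F

def contrastive_loss(emb, labels, margin=1.0):
    """Pull same-emotion embeddings together; push different-emotion
    embeddings at least `margin` apart (reduces intra-class distance,
    enlarges inter-class distance)."""
    dist = torch.cdist(emb, emb)                          # (B, B) Euclidean distances
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    pos = same * dist.pow(2)                              # intra-class term
    neg = (1.0 - same) * F.relu(margin - dist).pow(2)     # inter-class margin term
    return (pos + neg).mean()

def joint_loss(logits, emb, labels, alpha=0.5):
    # Cross-entropy keeps the features emotion-oriented; the contrastive
    # term makes them discriminative. alpha balances the two objectives.
    return F.cross_entropy(logits, labels) + alpha * contrastive_loss(emb, labels)

# Usage with random stand-ins for encoder outputs:
logits, emb = torch.randn(8, 4), torch.randn(8, 64)
labels = torch.randint(0, 4, (8,))
loss = joint_loss(logits, emb, labels)
```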
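The second contribution aligns word-level text features with segment-level acoustic features via a Transformer. The following cross-modal attention sketch illustrates the idea; the single-block design, feature dimensions, and head count are assumptions, since the abstract only states that a Transformer aligns the two unaligned streams.

```python
# Minimal sketch of cross-modal attention over unaligned text/audio streams
# (contribution 2). Layer sizes and the single-block design are assumptions.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, text, audio):
        # Each word-level query attends over all acoustic segments, so the two
        # streams are aligned automatically, without forced alignment.
        aligned, _ = self.attn(query=text, key=audio, value=audio)
        x = self.norm1(text + aligned)
        return self.norm2(x + self.ff(x))

# Word and segment sequences of different lengths (12 words vs. 40 segments):
fused = CrossModalBlock()(torch.randn(2, 12, 128), torch.randn(2, 40, 128))
```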
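The third contribution builds a conversation graph with two edge types. The sketch below constructs only the edge list from a sequence of speaker turns; the node features and the relational GNN layer on top (e.g., an R-GCN-style convolution) are left open, as the abstract does not name a specific architecture.

```python
# Minimal sketch of the two-relation conversation graph (contribution 3).
# Utterances are nodes; edge type 0 encodes self-dependency (the same
# speaker's previous utterance), edge type 1 encodes inter-personal
# dependency (the other speakers' most recent utterances).
def build_edges(speakers):
    """speakers: speaker ids in utterance order, e.g. ['A', 'B', 'A']."""
    edges, etypes = [], []
    last_seen = {}                        # speaker id -> last utterance index
    for i, spk in enumerate(speakers):
        for other, j in last_seen.items():
            edges.append((j, i))
            etypes.append(0 if other == spk else 1)
        last_seen[spk] = i
    return edges, etypes

edges, etypes = build_edges(['A', 'B', 'A', 'B'])
# edges:  [(0, 1), (0, 2), (1, 2), (2, 3), (1, 3)]
# etypes: [1, 0, 1, 1, 0]
```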


Keywords: interactive scenarios; emotion recognition; emotion feature extraction; multimodal fusion; individual information modeling
Language: Chinese
Research sub-direction: Multimodal Intelligence
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/44747
Collection: 多模态人工智能系统全国重点实验室_智能交互 (State Key Laboratory of Multimodal Artificial Intelligence Systems, Intelligent Interaction)
Recommended citation (GB/T 7714):
连政. 面向交互场景的情感识别研究[D]. 中国科学院自动化研究所, 2021.
Files in this item:
Thesis-Zheng Lian.pdf (4140 KB) · thesis · open access · CC BY-NC-SA