English Abstract

Emotion recognition aims to quantify, describe, and recognize different emotional states through the behavioral and physiological responses generated by emotional expression. Based on emotion representation models, emotion recognition systems are built from multimodal information such as audio and video. Recently, deep learning has greatly improved emotion recognition, which has been successfully applied in fields such as human-computer interaction, education, security, and finance. Emotion is dynamic over time, and this dynamic strongly influences the two key modules of an emotion recognition system: emotional feature extraction and the emotion recognition model. Moreover, people express their emotions in a multimodal way, so temporal information plays an important role not only in emotion recognition based on a single modality but also in multimodal emotion fusion. Exploiting emotional temporal context effectively is therefore both essential and challenging for emotion recognition systems. This paper explores emotion recognition from three aspects: emotional feature extraction, emotion recognition models, and multimodal emotion fusion. The main innovations are as follows:
In the aspect of emotional feature extraction, this paper focuses on extracting features based on temporal information to improve the expressiveness and discriminability of emotional features. Since emotional databases are small and features learned by purely unsupervised methods are not emotion oriented, this paper proposes a semi-supervised ladder network to extract emotional features: its unsupervised task integrates the temporal information of the emotional data itself, while emotion labels supervise the generation of effective emotional representations. The paper further utilizes the NetVLAD method to encode emotional information, where the emotional feature is the concatenation of the weighted residuals between the input signal and each emotion cluster. The proposed methods improve the accuracy of speech emotion recognition.
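The sketch below illustrates the NetVLAD-style encoding described above, assuming frame-level features of dimension D and K learnable emotion cluster centers; the class and parameter names are illustrative, not the paper's implementation.

```python
# Minimal NetVLAD-style encoder sketch (PyTorch); hypothetical names and sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionNetVLAD(nn.Module):
    def __init__(self, feat_dim=128, num_clusters=8):
        super().__init__()
        # Soft-assignment scores: one score per cluster for each frame.
        self.assign = nn.Linear(feat_dim, num_clusters)
        # Learnable cluster centers (the assumed "emotion clusters").
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim))

    def forward(self, x):                       # x: (batch, time, feat_dim)
        a = F.softmax(self.assign(x), dim=-1)   # (batch, time, K) soft assignments
        # Residuals between each frame and each cluster center.
        resid = x.unsqueeze(2) - self.centers   # (batch, time, K, feat_dim)
        # Weight residuals by assignments and aggregate over time.
        vlad = (a.unsqueeze(-1) * resid).sum(dim=1)  # (batch, K, feat_dim)
        vlad = F.normalize(vlad, dim=-1)             # per-cluster normalization
        return vlad.flatten(1)                       # concatenated residual vectors
```

The returned vector concatenates one weighted-residual descriptor per emotion cluster, matching the encoding described in the paragraph above.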
In the aspect of emotion recognition models, this paper focuses on continuous emotion recognition based on temporal information and improves the ability of emotional temporal modeling. The paper first explores the effectiveness of temporal convolutional networks, long short-term memory networks, and multi-head attention for emotional modeling, and builds continuous emotion recognition models based on the contextual information of emotional features. The multi-head attention model uses self-attention to capture long-span global information, which suits continuous emotion recognition, and the paper further combines different temporal models to improve performance. To make better use of the raw emotional data, this paper proposes an end-to-end continuous emotion recognition framework that merges emotional feature extraction and the recognition model into a unified system, avoiding the difficulty of hand-selecting emotional features. The system uses a 3D CNN to learn the spatio-temporal context of video data and automatically generate emotional representations, and introduces ConvLSTM to model dynamic emotional dependence. The proposed methods improve the performance of continuous emotion recognition.
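As an illustration of the self-attention temporal model, the sketch below regresses per-frame continuous emotion values (e.g. valence and arousal) from a sequence of emotional features; layer sizes and the two-dimensional output are assumptions for the example, not the paper's configuration.

```python
# Minimal self-attention regressor sketch for continuous emotion recognition.
import torch
import torch.nn as nn

class AttentionEmotionRegressor(nn.Module):
    def __init__(self, feat_dim=128, num_heads=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(feat_dim, 2)  # assumed valence/arousal outputs

    def forward(self, x):            # x: (batch, time, feat_dim)
        h = self.encoder(x)          # self-attention over the whole sequence
        return self.head(h)          # (batch, time, 2) per-frame predictions
```

Because every frame attends to every other frame, the model can use long-span global context, which is the property the paragraph above attributes to multi-head attention.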
In the aspect of multimodal emotion fusion, this paper focuses on effectively coordinating multiple modalities for multimodal continuous emotion recognition. Based on the AVEC 2017 multimodal continuous emotion recognition challenge, the paper proposes methods to handle annotation delay, annotation deficiency, and interlocutor influence in continuous emotional data, explores decision-level fusion and feature-level fusion, and wins second place in the challenge. On this basis, the paper uses the Transformer model to explore multimodal model-level fusion: multi-head attention first learns the emotional temporal dependence of the audio and visual modalities separately, and then fuses their high-level outputs through the interaction of audio and visual information to generate effective multimodal emotional representations. Model-level fusion based on multi-head attention outperforms decision-level and feature-level fusion and improves the performance of continuous emotion recognition.
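A minimal sketch of this model-level fusion is given below, assuming time-aligned audio and visual feature sequences; the choice of audio as the query stream and the single fused regression head are assumptions made for illustration.

```python
# Minimal model-level fusion sketch: per-modality self-attention encoders,
# then cross-modal multi-head attention over their high-level outputs.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, dim=128, heads=4, layers=2):
        super().__init__()
        def enc_layer():
            return nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                              batch_first=True)
        self.audio_enc = nn.TransformerEncoder(enc_layer(), num_layers=layers)
        self.video_enc = nn.TransformerEncoder(enc_layer(), num_layers=layers)
        # Cross-modal attention: audio queries attend to visual keys/values.
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 1)   # assumed per-frame emotion value

    def forward(self, audio, video):        # both: (batch, time, dim)
        a = self.audio_enc(audio)           # modality-specific temporal modeling
        v = self.video_enc(video)
        fused, _ = self.cross(a, v, v)      # audio attends over the visual stream
        h = torch.cat([a, fused], dim=-1)   # modality-specific + cross-modal cues
        return self.head(h)                 # (batch, time, 1)
```

Encoding each modality before fusing, rather than concatenating raw features, matches the model-level fusion strategy the paragraph above contrasts with feature-level and decision-level fusion.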