English Abstract

Emotion recognition aims to quantify, describe, and recognize different emotional states through the behavioral and physiological responses generated by emotional expression. Based on emotion representation models, emotion recognition systems are built from multimodal information such as audio and video. Recently, deep learning has greatly improved emotion recognition, which has been successfully applied in fields such as human-computer interaction, education, security, and finance. Emotion is dynamic over time, and this dynamic strongly influences the two key modules of an emotion recognition system: emotional feature extraction and the emotion recognition model. Moreover, people express their emotions in a multimodal way, so temporal information plays an important role not only in emotion recognition based on a single modality but also in multimodal emotion fusion. Exploiting emotional temporal context effectively is therefore both essential and challenging for emotion recognition systems. This paper explores emotion recognition from three aspects: emotional feature extraction, emotion recognition models, and multimodal emotion fusion. The main innovations are as follows:
In the aspect of emotional feature extraction, this paper focuses on extracting features based on temporal information to improve the expressiveness and discriminability of emotional features. Since emotional databases are small and features learned by purely unsupervised methods are not emotion oriented, this paper proposes a semi-supervised ladder network to extract emotional features: its unsupervised task integrates the temporal information of the emotional data itself, while emotion labels supervise the generation of effective emotional representations. The paper further utilizes the NetVLAD method to encode emotional information, where the emotional feature is the concatenation of the weighted residuals between the input signal and each emotion cluster. The proposed methods improve the accuracy of speech emotion recognition.
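The sketch below illustrates the NetVLAD-style encoding described above, assuming frame-level features of dimension D and K learnable emotion cluster centers; the class and parameter names are illustrative, not the paper's implementation.

```python
# Minimal NetVLAD-style encoder sketch (PyTorch); hypothetical names and sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionNetVLAD(nn.Module):
    def __init__(self, feat_dim=128, num_clusters=8):
        super().__init__()
        # Soft-assignment scores: one score per cluster for each frame.
        self.assign = nn.Linear(feat_dim, num_clusters)
        # Learnable cluster centers (the assumed "emotion clusters").
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim))

    def forward(self, x):                       # x: (batch, time, feat_dim)
        a = F.softmax(self.assign(x), dim=-1)   # (batch, time, K) soft assignments
        # Residuals between each frame and each cluster center.
        resid = x.unsqueeze(2) - self.centers   # (batch, time, K, feat_dim)
        # Weight residuals by assignments and aggregate over time.
        vlad = (a.unsqueeze(-1) * resid).sum(dim=1)  # (batch, K, feat_dim)
        vlad = F.normalize(vlad, dim=-1)             # per-cluster normalization
        return vlad.flatten(1)                       # concatenated residual vectors
```

The returned vector concatenates one weighted-residual descriptor per emotion cluster, matching the encoding described in the paragraph above.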
In the aspect of emotion recognition models, this paper focuses on continuous emotion recognition based on temporal information and improves the ability of emotional temporal modeling. The paper first explores the effectiveness of temporal convolutional networks, long short-term memory networks, and multi-head attention for emotional modeling, and builds continuous emotion recognition models based on the contextual information of emotional features. The multi-head attention model uses self-attention to capture long-span global information, which suits continuous emotion recognition, and the paper further combines different temporal models to improve performance. To make better use of the raw emotional data, this paper proposes an end-to-end continuous emotion recognition framework that merges emotional feature extraction and the recognition model into a unified system, avoiding the difficulty of hand-selecting emotional features. The system uses a 3D CNN to learn the spatio-temporal context of video data and automatically generate emotional representations, and introduces ConvLSTM to model dynamic emotional dependence. The proposed methods improve the performance of continuous emotion recognition.
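As an illustration of the self-attention temporal model, the sketch below regresses per-frame continuous emotion values (e.g. valence and arousal) from a sequence of emotional features; layer sizes and the two-dimensional output are assumptions for the example, not the paper's configuration.

```python
# Minimal self-attention regressor sketch for continuous emotion recognition.
import torch
import torch.nn as nn

class AttentionEmotionRegressor(nn.Module):
    def __init__(self, feat_dim=128, num_heads=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(feat_dim, 2)  # assumed valence/arousal outputs

    def forward(self, x):            # x: (batch, time, feat_dim)
        h = self.encoder(x)          # self-attention over the whole sequence
        return self.head(h)          # (batch, time, 2) per-frame predictions
```

Because every frame attends to every other frame, the model can use long-span global context, which is the property the paragraph above attributes to multi-head attention.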
In the aspect of multimodal emotion fusion, this paper focuses on effectively coordinating multiple modalities for multimodal continuous emotion recognition. Based on the AVEC 2017 multimodal continuous emotion recognition challenge, the paper proposes methods to handle annotation delay, annotation deficiency, and interlocutor influence in continuous emotional data, explores decision-level fusion and feature-level fusion, and wins second place in the challenge. On this basis, the paper uses the Transformer model to explore multimodal model-level fusion: multi-head attention first learns the emotional temporal dependence of the audio and visual modalities separately, and then fuses their high-level outputs through the interaction of audio and visual information to generate effective multimodal emotional representations. Model-level fusion based on multi-head attention outperforms decision-level and feature-level fusion and improves the performance of continuous emotion recognition.
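A minimal sketch of this model-level fusion is given below, assuming time-aligned audio and visual feature sequences; the choice of audio as the query stream and the single fused regression head are assumptions made for illustration.

```python
# Minimal model-level fusion sketch: per-modality self-attention encoders,
# then cross-modal multi-head attention over their high-level outputs.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, dim=128, heads=4, layers=2):
        super().__init__()
        def enc_layer():
            return nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                              batch_first=True)
        self.audio_enc = nn.TransformerEncoder(enc_layer(), num_layers=layers)
        self.video_enc = nn.TransformerEncoder(enc_layer(), num_layers=layers)
        # Cross-modal attention: audio queries attend to visual keys/values.
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 1)   # assumed per-frame emotion value

    def forward(self, audio, video):        # both: (batch, time, dim)
        a = self.audio_enc(audio)           # modality-specific temporal modeling
        v = self.video_enc(video)
        fused, _ = self.cross(a, v, v)      # audio attends over the visual stream
        h = torch.cat([a, fused], dim=-1)   # modality-specific + cross-modal cues
        return self.head(h)                 # (batch, time, 1)
```

Encoding each modality before fusing, rather than concatenating raw features, matches the model-level fusion strategy the paragraph above contrasts with feature-level and decision-level fusion.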