Research on Automatic Depression Detection Based on Audio and Video
牛明月 (Niu Mingyue)
2021-05-29
Pages: 126
Degree type: Doctoral
Abstract

Depression is a mental disorder that traps sufferers in prolonged negative emotions, preventing them from participating normally in productive activities and imposing a heavy burden on society and families. As with other diseases, early diagnosis and treatment play a vital role in reducing the harm caused by depression; however, owing to the severe imbalance between the numbers of doctors and patients and the complexity of the diagnostic process, many patients fail to recognize their condition in time. Researching and developing an automatic depression detection system is therefore of great practical significance for improving current medical conditions and the working efficiency of doctors. Over the years, physiological and psychological studies have shown that patients with depression and healthy individuals differ in their speech and facial activity. Given these findings, and considering that audio-visual signals are easy to collect and non-contact, detecting an individual's depression level from audio-visual information has attracted growing attention from researchers at home and abroad. This thesis takes the audio and video modalities as its research objects and applies machine learning techniques to extract representations that characterize depression cues, thereby predicting an individual's depression level. The main research content falls into the following three aspects:

To address the problem that the facial features used in current work struggle to capture fine-grained facial changes, and inspired by the ability of high-order gradients to capture detail, this thesis designs an image feature named the local second-order gradient cross pattern to characterize the fine texture structure of the face. Further, to extract facial dynamics at different scales, video clips are partitioned at multiple scales and dynamic features of facial details are extracted within the three-orthogonal-planes dynamic description framework. Finally, considering that depression scores have both a categorical attribute and a degree attribute, a between-group classification followed by within-group regression scheme is adopted to predict the individual depression level. Experimental results on the AVEC 2013 and AVEC 2014 depression datasets demonstrate the effectiveness of the proposed method.
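The exact encoding of the local second-order gradient cross pattern is defined in the thesis itself; the sketch below is only a minimal illustration of the underlying idea, assuming a Laplacian-style second-order response and a four-neighbour cross comparison. The function names, neighbourhood layout, and histogram size are illustrative assumptions, not the thesis's definition.

```python
import numpy as np

def second_order_response(img):
    # Apply first differences twice: a simple second-order gradient map
    # (an assumption here; the thesis defines its own second-order operator).
    g = img.astype(np.float64)
    gxx = np.gradient(np.gradient(g, axis=1), axis=1)
    gyy = np.gradient(np.gradient(g, axis=0), axis=0)
    return gxx + gyy

def cross_pattern_codes(resp):
    # Compare the four cross neighbours (up, down, left, right) of every
    # interior pixel against the centre and pack the results into a 4-bit code.
    c = resp[1:-1, 1:-1]
    neighbours = (resp[:-2, 1:-1], resp[2:, 1:-1],
                  resp[1:-1, :-2], resp[1:-1, 2:])
    code = np.zeros(c.shape, dtype=np.uint8)
    for bit, n in enumerate(neighbours):
        code |= (n >= c).astype(np.uint8) << bit
    return code  # values in [0, 15]

def texture_histogram(img):
    # Histogram of cross-pattern codes: a per-image texture descriptor.
    codes = cross_pattern_codes(second_order_response(img))
    hist = np.bincount(codes.ravel(), minlength=16).astype(np.float64)
    return hist / max(hist.sum(), 1.0)
```

In the thesis this kind of descriptor is extracted on the three orthogonal planes (XY, XT, YT) of multi-scale video volumes, so the resulting histograms capture facial dynamics rather than a single frame as shown here.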

To address the problems that current methods neither examine the spatio-temporal information of speech MFCC features comprehensively nor obtain a norm type suited to pooling for the depression detection task, this thesis proposes an automatic depression detection method based on a hybrid neural network and lp-norm pooling. Specifically, the hybrid network extracts spatial features, temporal features, and a discriminative representation from MFCC segments, and their concatenation is taken as the segment-level feature. To extract the spatial features of MFCC segments, the hybrid network inserts a channel attention module into the convolutional neural network to emphasize channels relevant to depression detection and suppress irrelevant ones. To extract the temporal features, a global information embedding module is designed so that the output sequence of the LSTM preserves the global information of the input sequence and depression-related information is not lost. In addition, the concatenation of the spatial and temporal features is processed by a feed-forward network to obtain the discriminative representation. Furthermore, lp-norm pooling is combined with the LASSO framework to optimize the norm type p used to aggregate segment-level features into a long-term feature. Experiments on the AVEC 2013 and AVEC 2014 datasets verify the rationality and effectiveness of the proposed method.
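As a minimal sketch of the aggregation step, assuming segment-level features have already been extracted, the following implements lp-norm pooling of an (N, D) matrix of segment features into one long-term feature. The thesis optimizes the norm type p jointly within a LASSO framework; the hypothetical select_p helper below stands in for that with a plain grid search over candidate values.

```python
import numpy as np

def lp_norm_pool(segments, p):
    # Generalized lp-norm pooling over N segment-level features of dimension D:
    # p = 1 is average pooling of magnitudes, large p approaches max pooling.
    return np.mean(np.abs(segments) ** p, axis=0) ** (1.0 / p)

def select_p(segment_sets, labels, val_loss, candidates=(1.0, 2.0, 3.0, 4.0)):
    # Hypothetical stand-in for the LASSO-coupled optimization of p described
    # above: pick the candidate whose pooled features minimize a validation loss.
    best_p, best = None, np.inf
    for p in candidates:
        feats = np.stack([lp_norm_pool(s, p) for s in segment_sets])
        loss = val_loss(feats, labels)
        if loss < best:
            best_p, best = p, loss
    return best_p
```

With p = 1 the pooled feature is the mean absolute response of each dimension, while p = 2 gives the root mean square, which already weights strongly activated segments more heavily; learning p lets the pooling sit wherever between these extremes best serves depression detection.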

To address the problems that current methods struggle to capture the key frames in audio and video that aid depression detection, and that complementary information between modalities is lost during fusion, this thesis proposes an automatic depression detection method based on multimodal spatio-temporal representations. Specifically, a spatio-temporal attention network is constructed to extract segment-level audio and video features. On the one hand, the network uses a convolutional neural network to generate a spatial representation of each audio or video segment; on the other hand, it uses an LSTM to generate the temporal sequence representation. An attention mechanism is then applied between the spatial representation and the temporal sequence to fuse spatio-temporal information and emphasize the frames in the sequence that are relevant to depression detection. Furthermore, eigen evolution pooling is used to summarize the dynamic changes of the segment-level features along each dimension and to generate the corresponding audio-level and video-level features. Finally, the proposed multimodal attention feature fusion strategy captures the complementary information between modalities, improving the quality of the multimodal representation and thereby the accuracy of depression detection. Experiments on the AVEC 2013 and AVEC 2014 depression datasets show that each module and the overall framework are effective for the depression detection task.
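The two pooling and fusion steps lend themselves to short sketches. Below, eigen_evolution_pool summarizes a (T, D) sequence of segment-level features by projecting it onto its leading left singular vectors, one plausible reading of eigen evolution pooling rather than the thesis's exact formulation, and attention_fuse is a hypothetical reduction of the multimodal attention fusion in which simple softmax weights replace the learned attention network.

```python
import numpy as np

def eigen_evolution_pool(X, k=1):
    # X: (T, D) sequence of segment-level features. The top-k left singular
    # vectors of the centred sequence act as temporal weighting profiles;
    # projecting X onto them summarizes the per-dimension dynamics over time.
    Xc = X - X.mean(axis=0, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return np.concatenate([U[:, i] @ X for i in range(k)])  # shape (k * D,)

def attention_fuse(modal_feats):
    # Hypothetical multimodal attention fusion over same-dimensional audio and
    # video features: softmax weights from a simple norm-based score replace
    # the learned scoring network used in the thesis.
    scores = np.array([np.linalg.norm(f) for f in modal_feats])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return sum(wi * fi for wi, fi in zip(w, modal_feats))
```

A learned fusion network would produce the scores from the features themselves so that whichever modality carries stronger depression cues for a given subject dominates the fused representation; the norm-based score here merely keeps the sketch self-contained.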

Keywords: multimodal depression detection; facial texture details; spatio-temporal properties of MFCC; norm type for pooling; key audio-visual frames; attention mechanism; modal complementary information
Language: Chinese
Sub-direction classification: Multimodal Intelligence
Document type: Doctoral thesis
Identifier: http://ir.ia.ac.cn/handle/173211/44390
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems_Intelligent Interaction
Recommended citation (GB/T 7714):
牛明月. 基于音视频的自动抑郁检测研究[D]. 北京: 中国科学院自动化研究所, 2021.
Files in this item:
牛明月博士毕业论文.pdf (3264 KB) | Document type: Doctoral thesis | Access: Open Access | License: CC BY-NC-SA