Research on Human Action Detection and Recognition Methods Based on Deep Cognitive Neural Networks


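	Research on Human Action Detection and Recognition Methods Based on Deep Cognitive Neural Networks
	Li Lin (李林)
	2018-05-25
Degree type	Master of Engineering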
Abstract	In recent years, deep-learning-based human action recognition and detection have been increasingly deployed in real-world scenarios. Human action analysis and deep learning theory are active research topics in intelligent video analysis and underpin fields such as video understanding, video surveillance, and human-computer interaction; both have received wide attention from academia and industry. Human action analysis currently comprises two tasks: detecting where in a video a person performs an action, and classifying which action is being performed. The former addresses the "where" problem and the latter the "what" problem. For long videos of natural real-life scenes and for security surveillance footage, both problems must be solved together to meet the practical demands of human action analysis. This thesis first addresses action detection in long videos: localizing the start and end times of each action along the temporal dimension and assigning the correct category. It then turns to short videos, where the task is to assign an action label to the whole clip. The main work and contributions are summarized as follows:
1) A video action detection method that fuses convolutional and recurrent neural networks. Commonly used action detection methods are based on 3D convolutional networks, which can generally handle only short video clips; for actions that span long time scales, they struggle to capture temporal information well. We therefore propose a detection method that fuses a 3D convolutional network with a recurrent neural network, so that temporal information at both short and long time scales is captured. We evaluate the method on the public THUMOS14 dataset using mAP (mean average precision), the standard metric for detection tasks. Experimental results show that the fused method outperforms a 3D convolutional network used alone.
2) An action recognition method based on a hierarchical attention mechanism. Current action recognition on RGB videos behaves more like scene classification than action recognition. Extracting the regions of interest in each frame therefore helps the network recognize actions better; besides object detection, attention mechanisms can serve this purpose. The original attention mechanism models only the highest-level semantic features, whereas the proposed hierarchical attention model takes into account features at different levels and granularities. Experimental results show that the hierarchical attention model outperforms the original attention model.
3) An action recognition method based on temporal feature encoding of video. Two-stream networks are the current mainstream approach to action recognition; they decompose a video into two modalities, spatial and temporal, and analyze each separately. However, because these networks are trained on single frames, the features they learn are frame-level, and they are typically limited to processing short sequences. As a result, categories with similar backgrounds are easily confused when only part of an action is observed. We therefore propose a video feature representation called Deep Temporal Feature Encoding (DTE). It first extracts local features of a video at different temporal scales and then aggregates them into a robust, global video-level representation that can still distinguish the easily confused categories. Comprehensive experiments on two public datasets, UCF101 and HMDB51, show that DTE achieves performance competitive with the state of the art on both.
Keywords: Video Action Recognition, Action Detection, Two-stream Network, 3D Convolutional Neural Network, Recurrent Neural Network
Keywords	Action Recognition; Action Detection; Two-Stream Network
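Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/21475
Collection	Graduates_Master's Theses
Author affiliation	Institute of Automation, Chinese Academy of Sciences
First author affiliation	Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	Li Lin. Research on Human Action Detection and Recognition Methods Based on Deep Cognitive Neural Networks[D]. Beijing. Graduate University of Chinese Academy of Sciences, 2018.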

Files in this item
File name/size	Document type	Version type	Access type	License
毕业论文-人体行为分析.pdf (2872KB)	Thesis		Restricted access	CC BY-NC-SA