Research on Human Action Detection and Recognition Methods Based on Deep Cognitive Neural Networks
李林
2018-05-25
Degree type: Master of Engineering
Abstract
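In recent years, deep-learning-based action recognition and detection have been
applied in more and more real-world scenarios. Human action analysis and deep
learning theory are research hotspots in intelligent video analysis and form the
theoretical basis of many fields, including intelligent video analysis and
understanding, video surveillance, and human-computer interaction; both have
drawn wide attention from academia and industry. Human action analysis currently
comprises two tasks: detecting where in a video a person performs an action, and
deciding which category of action is being performed. The former addresses the
"where" question and the latter the "what" question. For long videos of natural,
real-life scenes, or for security surveillance footage, only by solving both of
these fundamental problems together can the requirements of real-world human
action analysis truly be met. This thesis starts from human action detection and
proposes a method for long videos: localize, along the temporal dimension of a
long video, the start time and end time of each action and assign it the correct
category. It then moves on to short-video analysis, that is, assigning a category
label to the action performed in a short video. The main work and contributions
are summarized as follows: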
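1) A video action detection method that fuses a convolutional neural network
with a recurrent neural network
Commonly used human action detection methods are currently based on 3D
convolutional networks. A 3D convolutional network, however, can generally handle
only short video clips; for action segments that span a long time scale, it
struggles to capture good temporal information. We therefore propose a video
action detection method that fuses a 3D convolutional network with a recurrent
neural network. The method can capture the temporal information of videos at both
long and short time scales. We use the public THUMOS14 dataset, and the
evaluation metric is mAP (mean average precision), the usual metric for
detection. Experimental results show that, on action detection, the method based
on a 3D convolutional network and a recurrent neural network outperforms the
method that uses a 3D convolutional network alone.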
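
The mAP metric mentioned above can be illustrated with a minimal sketch. It assumes each detection is a (score, start, end) tuple in seconds, each ground-truth instance is a (start, end) tuple, and a detection counts as correct when its temporal IoU with a not-yet-matched ground-truth instance reaches a threshold. The function names, the 0.5 threshold, and the non-interpolated form of average precision are illustrative assumptions; this is not the official THUMOS14 evaluation code and may differ from the protocol used in the thesis.

```python
def temporal_iou(a, b):
    """Intersection over union of two time intervals given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_threshold=0.5):
    """Non-interpolated AP for one action class (illustrative assumption).

    detections:   list of (score, start, end)
    ground_truth: list of (start, end)
    """
    if not ground_truth:
        return 0.0
    matched = [False] * len(ground_truth)
    hits, precision_sum = 0, 0.0
    # Visit detections from the most to the least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    for rank, (_, start, end) in enumerate(ranked, start=1):
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(ground_truth):
            iou = temporal_iou((start, end), gt)
            # Each ground-truth instance can be matched at most once.
            if not matched[idx] and iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx >= 0 and best_iou >= iou_threshold:
            matched[best_idx] = True
            hits += 1
            precision_sum += hits / rank  # precision at this true positive
    return precision_sum / len(ground_truth)


def mean_average_precision(per_class):
    """per_class maps a class name to a (detections, ground_truth) pair."""
    aps = [average_precision(d, g) for d, g in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0
```

Matching each ground-truth instance at most once means that a second detection of the same action is counted as a false positive, which penalizes duplicate segments.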
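2) An action recognition method based on a hierarchical attention mechanism
Action recognition from RGB video, as currently practiced, behaves less like
action recognition and more like scene classification. Extracting the regions of
interest in each frame should therefore help the network recognize actions
better. Besides object detection, an attention mechanism can be used for this
purpose. The original attention mechanism models only the highest-level semantic
features; we propose a hierarchical attention model that takes into account
features at different levels and different granularities. Experimental results
show that the hierarchical attention model outperforms the original attention
model.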
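
One way to picture attention applied at several feature levels is the sketch below. It assumes each level provides a list of region feature vectors and a query vector of the same dimension (a stand-in for learned parameters), applies dot-product soft attention within each level, and concatenates the pooled vectors. The names `attend` and `hierarchical_attention`, the dot-product scoring, and the concatenation step are all assumptions made for illustration; the abstract does not specify the thesis's actual architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(regions, query):
    """Soft attention: weight each region by softmax(region . query)."""
    scores = [sum(r * q for r, q in zip(region, query)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    pooled = [sum(w * region[d] for w, region in zip(weights, regions))
              for d in range(dim)]
    return pooled, weights


def hierarchical_attention(levels, queries):
    """Attend within every feature level, then concatenate the results.

    levels:  one list of region feature vectors per network level
    queries: one query vector per level (hypothetical learned parameters)
    """
    fused = []
    for regions, query in zip(levels, queries):
        pooled, _ = attend(regions, query)
        fused.extend(pooled)
    return fused
```

With a single level this reduces to ordinary soft attention over the top-level features, which corresponds to the baseline the paragraph above compares against.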
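3) An action recognition method based on temporal feature encoding of video
Two-stream networks are a mainstream approach to action recognition; they
typically decompose a video into two modalities, spatial and temporal, and
analyze each separately. Current two-stream methods have a drawback, however:
because the network is trained on single frames, the features it learns are
frame-level features, so categories with similar backgrounds are easily confused.
We therefore propose an action recognition method based on temporal feature
encoding of video. The method first extracts local features of a video at
different scales along the temporal dimension and then integrates all the local
features into a single global feature. This robust representation can still
distinguish the easily confused action categories. We use the UCF101 and HMDB51
datasets, and the experimental results show that the method achieves the best
recognition performance at the time of writing.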
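
The idea of extracting local features at several temporal scales and merging them into one global feature can be sketched with temporal pyramid pooling. The sketch assumes frame-level feature vectors are already available, splits the frame sequence into 1, 2, and 4 segments, average-pools each segment, and concatenates the results. Temporal pyramid average pooling is used here only as a stand-in: the abstract does not describe the thesis's actual encoding, and the function names and scales are hypothetical.

```python
def mean_pool(frames):
    """Average a non-empty list of equal-length feature vectors."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def temporal_pyramid_encode(frame_features, scales=(1, 2, 4)):
    """Concatenate segment-level averages taken at several temporal scales.

    frame_features: list of per-frame feature vectors, in temporal order
    scales:         number of equal segments per pyramid level (assumed)
    """
    n = len(frame_features)
    if n == 0:
        raise ValueError("at least one frame feature is required")
    encoded = []
    for parts in scales:
        for k in range(parts):
            start = k * n // parts
            # Guarantee a non-empty segment when there are fewer frames
            # than segments at this scale.
            end = max((k + 1) * n // parts, start + 1)
            encoded.extend(mean_pool(frame_features[start:end]))
    return encoded
```

The output length is the feature dimension times the sum of the scales, independent of the number of frames, so videos of different lengths map to vectors of the same size.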
Keywords: action recognition, action detection, two-stream network, 3D
convolutional network, recurrent neural network
In recent years, deep learning methods for human action recognition and detection
have been widely used in everyday life. Human action analysis and deep learning
theory are research hotspots in intelligent video analysis. They underpin video
surveillance, human-computer interaction, and many other fields, and have
recently received extensive attention from academia and industry. Human action
analysis consists of detecting where in a video a person performs an action and
classifying that action. The former addresses the question of "where", while the
latter addresses the question of "what". For natural real-life scenes or security
surveillance videos, only solving both problems together can truly meet the
demands of human action analysis. We start with human action detection and
propose our own method, which focuses on precisely locating the start time and
end time of each action while classifying it correctly. We then move on to the
analysis of short videos, where the task is only to assign an action label to
each video. The main work and contributions are summarized as follows:
1) We propose a novel video action detection method that fuses a convolutional
neural network with a recurrent neural network.
At present, the commonly used methods for video action detection are based on
3D convolutional neural networks. Their drawback is that a 3D convolutional
network can handle only short video clips; for clips that span a long time scale,
it generally struggles to capture good temporal information. We therefore propose
a video action detection method that fuses a 3D convolutional network with a
recurrent neural network. The method can capture temporal information at
different time scales. We evaluate it on the public THUMOS14 dataset using the
commonly used mAP (mean average precision) metric. Experimental results show that
our method outperforms the method that uses a 3D convolutional neural network
alone.
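
To make the fusion idea concrete, the sketch below runs a plain tanh recurrent layer over a sequence of per-clip feature vectors, so that the state for each clip also depends on the clips before it. It assumes the clip features were already produced by a 3D convolutional network; the function name, the weight layout, and the choice of a simple Elman-style layer are assumptions for illustration, since the abstract does not say which recurrent unit or fusion scheme the thesis uses.

```python
import math


def rnn_over_clips(clip_features, w_in, w_rec, bias):
    """Run a tanh recurrent layer over clip features in temporal order.

    clip_features: list of feature vectors, one per short clip (assumed to
                   come from a 3D convolutional network)
    w_in:  hidden_size x input_size weights   (hypothetical parameters)
    w_rec: hidden_size x hidden_size weights  (hypothetical parameters)
    bias:  hidden_size biases                 (hypothetical parameters)
    Returns one hidden state per clip.
    """
    hidden = [0.0] * len(bias)
    states = []
    for x in clip_features:
        new_hidden = []
        for j in range(len(bias)):
            total = bias[j]
            total += sum(w_in[j][i] * x[i] for i in range(len(x)))
            total += sum(w_rec[j][k] * hidden[k] for k in range(len(hidden)))
            new_hidden.append(math.tanh(total))
        hidden = new_hidden
        states.append(hidden)
    return states
```

Because each state feeds into the next, a clip with an uninformative feature vector still receives context from earlier clips, which is the property a short-clip 3D convolutional network lacks on its own.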
2) We propose a hierarchical attention method for video action recognition.
Current work on action recognition mostly relies on RGB videos, where the task
behaves more like scene classification than action recognition. Extracting the
regions of interest should therefore improve performance, because those regions
help the network learn what it needs to learn. Besides object detection,
attention mechanisms can also be used for this purpose. The original attention
mechanism considers only the highest level of semantic information. We instead
propose a hierarchical attention model that takes into account features at
different levels and granularities. Experimental results demonstrate that our
method outperforms the original attention model.
3) We propose a new human action analysis method based on temporal feature
encoding.
Two-stream methods are the current mainstream for action recognition from RGB
videos. They usually split a video into two modalities, spatial and temporal.
However, the original two-stream network has a drawback: its sampling strategy
relies entirely on single frames. Because of this sampling strategy, such
networks are typically limited to processing short sequences and can be confused
by partial observation. We therefore propose a novel video feature representation
method called Deep Temporal Feature Encoding (DTE), which aggregates frame-level
features into a robust, global video-level representation. This representation
can still distinguish the easily confused categories. Comprehensive experiments
are conducted on two public datasets, HMDB51 and UCF101. Experimental results
demonstrate that DTE achieves performance competitive with the state of the art
on both datasets.
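
For context, the two-stream baseline referred to above is commonly combined by late fusion: each stream produces class scores, and the scores are averaged. The sketch below assumes each stream outputs one logit per class and mixes the softmax scores with a weight; the function name and the default weight of 0.5 are illustrative assumptions and are not taken from the thesis.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_streams(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Weighted average of per-class scores from the two streams.

    spatial_logits:  class logits from the RGB (appearance) stream
    temporal_logits: class logits from the motion stream
    temporal_weight: share given to the motion stream (assumed value)
    Returns the predicted class index and the fused scores.
    """
    spatial = softmax(spatial_logits)
    temporal = softmax(temporal_logits)
    fused = [(1.0 - temporal_weight) * s + temporal_weight * t
             for s, t in zip(spatial, temporal)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused
```

Both streams here score individual frames or short stacks of frames, which is the frame-level limitation that a video-level encoding is meant to address.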
Keywords: Video Action Recognition, Action Detection, Two-stream Network, 3D
Convolutional Neural Network, Recurrent Neural Network

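Keywords: action recognition; action detection; two-stream network
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/21475
Collection: Graduates / Master's theses
Author affiliation: Institute of Automation, Chinese Academy of Sciences
First author affiliation: Institute of Automation, Chinese Academy of Sciences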
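Recommended citation (GB/T 7714):
李林 (Li Lin). 基于深度认知神经网络的人体行为检测与识别方法研究 (Research on Human Action Detection and Recognition Methods Based on Deep Cognitive Neural Networks) [D]. Beijing. Graduate School of the Chinese Academy of Sciences, 2018.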
Files in this item
File name/size: 毕业论文-人体行为分析.pdf (2872KB)
Document type: Thesis
Access type: Restricted access
License: CC BY-NC-SA
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.