Thesis Advisors: 王亮; 张兆翔
Degree Grantor: 中国科学院研究生院
Place of Conferral: Beijing
Keywords: action recognition, action detection, two-stream network
Other Abstract
Human behavior analysis involves two problems: detecting where an action takes place, and classifying which action is being performed. The former mainly addresses the "where" question and the latter the "what" question; for long videos of natural scenes in daily life or security surveillance footage, both must be solved together.
1) We propose a video action detection method that fuses a convolutional neural network with a recurrent neural network.
The commonly used human action detection methods are based on 3D convolutional networks, but a 3D convolutional network alone generally struggles to capture good temporal information over long spans. We therefore propose a video action detection method that fuses a 3D convolutional network with a recurrent neural network, which can fully exploit temporal information. We evaluate on the THUMOS14 dataset, using the mAP (mean average precision) value that is standard for detection problems. Experimental results show that the fused method outperforms a method that uses only a 3D convolutional network.
2) We propose an action recognition method based on a hierarchical attention mechanism.
Current action recognition on RGB videos is arguably closer to a scene classification task than a true action recognition task.
3) We propose an action recognition method based on temporal feature encoding of videos.
On the UCF101 and HMDB51 datasets, experimental results show that we achieve state-of-the-art performance.
Keywords: action recognition, action detection, two-stream network, 3D convolutional network, recurrent neural network
In recent years, deep learning methods for human action recognition and detection have found wide use in daily life. Human behavior analysis and deep learning theory are research hotspots in intelligent video analysis: they underpin video surveillance, human-computer interaction, and many other fields, and have recently received extensive attention from both academia and industry. The analysis of human behavior consists of video pedestrian detection and human behavior classification. The former mainly solves the problem of "where", while the latter mainly focuses on the problem of "what". For long videos of natural scenes in real life or security surveillance footage, only by solving both essential problems at the same time can we truly meet the demands of human behavior analysis. We start from the detection of human actions and put forward our own methods, focusing on how to precisely locate the start and end times of a behavior while classifying it correctly. We then transition to the analysis of short, trimmed videos, which carry only their action labels. In summary, the contributions and main work can be summarized as follows:
1) We propose a novel video action detection method that fuses convolutional and recurrent neural networks.
At present, the commonly used video action detection methods are based on 3D convolutional neural networks. However, these have a drawback: a 3D convolutional network can only process short video clips, and on large-scale (long) videos it generally struggles to capture good temporal information. We therefore propose a video action detection method that fuses a 3D convolutional network with a recurrent neural network. This method can fully capture temporal information at different time scales. We evaluate on the public THUMOS14 dataset, using the commonly used mAP (mean average precision) value as the evaluation criterion.
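For a concrete sense of this metric, here is a minimal sketch of (non-interpolated) average precision and its mean over classes; the toy confidences and labels below are invented for illustration, and matching of detections to ground truth is assumed to have already happened:

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: scores are detection confidences, labels are 1 for
    true positives and 0 for false positives (matching already done)."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    precision = tp / (np.arange(len(labels)) + 1)
    recall = tp / max(labels.sum(), 1)
    # Accumulate precision at each recall step (area under the PR curve).
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

def mean_average_precision(per_class):
    return float(np.mean([average_precision(s, l) for s, l in per_class]))

# Toy example: two classes with (confidence, is-correct) detection lists.
cls_a = ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1])
cls_b = ([0.95, 0.5], [1, 1])
print(round(mean_average_precision([cls_a, cls_b]), 3))  # → 0.903
```

Benchmark implementations sometimes use interpolated precision instead; the non-interpolated form above is the simplest variant.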
Experimental results show that our algorithm is superior to the method that uses only a 3D convolutional neural network.
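As a hedged illustration of this fusion, not the thesis's exact architecture, the sketch below runs a small GRU, implemented in NumPy, over pre-extracted clip-level features standing in for 3D-CNN outputs; the layer sizes and the `detect_over_clips` helper are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def gru_step(x, h, p):
    """One GRU step over a clip feature x given hidden state h."""
    z = 1 / (1 + np.exp(-(p["Wz"] @ x + p["Uz"] @ h)))   # update gate
    r = 1 / (1 + np.exp(-(p["Wr"] @ x + p["Ur"] @ h)))   # reset gate
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde

def detect_over_clips(clip_feats, params):
    """Run the recurrent head over a sequence of clip-level features and
    return per-clip class probabilities (softmax over classes)."""
    h = np.zeros(params["Uz"].shape[0])
    scores = []
    for x in clip_feats:
        h = gru_step(x, h, params)
        logits = params["Wo"] @ h
        e = np.exp(logits - logits.max())
        scores.append(e / e.sum())
    return np.stack(scores)

# Toy setup: 10 clips, 64-d "3D-CNN" clip features, 5 classes.
d_in, d_h, n_cls, T = 64, 32, 5, 10
params = {k: rng.normal(scale=0.1, size=(d_h, d_in)) for k in ("Wz", "Wr", "Wh")}
params |= {k: rng.normal(scale=0.1, size=(d_h, d_h)) for k in ("Uz", "Ur", "Uh")}
params["Wo"] = rng.normal(scale=0.1, size=(n_cls, d_h))
clip_feats = rng.normal(size=(T, d_in))
scores = detect_over_clips(clip_feats, params)
print(scores.shape)  # → (10, 5): one class distribution per clip
```

The recurrence is what lets information flow across clip boundaries, which is exactly what a fixed-length 3D convolution lacks.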
2) We propose a hierarchical attention method for video action recognition.
Action recognition based on RGB videos, as currently practiced, is more like scene classification than true action recognition. Consequently, extracting the regions of interest that an action actually occupies yields an improvement, since those regions help the network learn what it needs to learn. Besides object detection, methods based on attention mechanisms can also supply such regions. The original attention mechanism considers only the highest level of semantic information; we instead propose a hierarchical attention model that takes into account features at different levels of granularity. Experimental results demonstrate that our method outperforms the original attention model.
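A minimal sketch of the idea, assuming dot-product attention applied over region features from two layers of different granularity; the 7x7 and 14x14 map sizes and the concatenation fusion are illustrative assumptions, not the thesis's exact model:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(feat_map, query):
    """Spatial attention: feat_map is (H*W, D) region features, query is (D,).
    Returns the attention-weighted feature vector and the weights."""
    weights = softmax(feat_map @ query)   # one score per spatial region
    return weights @ feat_map, weights

# Hypothetical two-level features: a coarse map (7x7) from a deep layer and
# a finer map (14x14) from a shallower layer, both projected to D dims.
D = 16
coarse = rng.normal(size=(7 * 7, D))
fine = rng.normal(size=(14 * 14, D))
query = rng.normal(size=D)

coarse_vec, _ = attend(coarse, query)
fine_vec, _ = attend(fine, query)
video_vec = np.concatenate([coarse_vec, fine_vec])  # fuse the two levels
print(video_vec.shape)  # → (32,)
```

Attending at the finer level lets the model pick out small action-relevant regions that are already blurred away in the coarsest feature map.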
3) We propose a new human behavior analysis method based on temporal feature encoding.
Two-stream methods are the current mainstream for action recognition on RGB videos; they typically split a video into two modalities. However, the original two-stream network has a drawback: its sampling strategy relies entirely on single frames. Because of this, such networks are typically limited to processing short sequences, which can cause problems such as confusion from partial observation. We therefore propose a novel video feature representation method, called Deep Temporal Feature Encoding (DTE), which aggregates frame-level features into a robust, global video-level representation. Such a representation can still distinguish between confusing categories. Comprehensive experiments are conducted on two public datasets, HMDB51 and UCF101. Experimental results demonstrate that DTE achieves competitive state-of-the-art performance on both.
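The exact DTE formulation is not spelled out in this abstract, so the following is only a sketch of the general idea, aggregating frame-level features into one global video-level vector via learned attention-weighted pooling; the scoring vector `w` and all sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

def temporal_encode(frame_feats, w):
    """Aggregate frame-level features (T, D) into one video-level vector
    by attention-weighted pooling; w is a learned scoring vector (D,)."""
    scores = frame_feats @ w
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()                    # per-frame importance weights
    video = alpha @ frame_feats            # global video representation
    return video / (np.linalg.norm(video) + 1e-8)  # L2-normalize

T, D = 25, 128   # e.g. 25 sampled frames, 128-d per-frame features
frame_feats = rng.normal(size=(T, D))
w = rng.normal(size=D)
video_vec = temporal_encode(frame_feats, w)
print(video_vec.shape)  # → (128,)
```

Unlike single-frame sampling, every frame contributes to the encoded vector, so a momentarily ambiguous frame cannot dominate the prediction on its own.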
Keywords: Video Action Recognition, Action Detection, Two-stream Network, 3D Convolutional Neural Network, Recurrent Neural Network

Document Type: 学位论文 (degree thesis)
First Author Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended Citation (GB/T 7714):
李林. 基于深度认知神经网络的人体行为检测与识别方法研究[D]. 北京: 中国科学院研究生院, 2018.
Files in This Item:
毕业论文-人体行为分析.pdf (2872 KB), DocType: 学位论文 (thesis), Access: 暂不开放 (not yet open), License: CC BY-NC-SA