CASIA OpenIR
基于深度学习的人体行为识别研究 (Research on Human Action Recognition Based on Deep Learning)
杨浩
Subtype: Doctoral dissertation
Thesis Advisor: 胡卫明; 原春峰
2019-05-25
Degree Grantor: 中国科学院大学 (University of Chinese Academy of Sciences)
Place of Conferral: 中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences)
Degree Name: Doctor of Engineering
Degree Discipline: Pattern Recognition and Intelligent Systems
Keyword: Deep Learning; Convolutional Neural Network; Recurrent Neural Network; Action Recognition
Abstract

Human action recognition refers to classifying the human actions contained in video sequences. It has broad application prospects in human-computer interaction, intelligent surveillance, living assistance, and virtual reality, and has long been a popular research topic in pattern recognition and computer vision. Traditional methods divide action recognition into two stages: hand-crafted feature design and extraction, followed by action classification. Because the two stages are independent of each other, the recognition pipeline is cumbersome and cannot extract discriminative spatial-temporal features from videos. In recent years, deep learning has achieved breakthroughs in many fields, including speech recognition, machine translation, and computer vision. This thesis exploits the complementary strengths of deep learning methods such as convolutional neural networks and recurrent neural networks, and proposes several end-to-end, efficient, and robust deep models for action recognition. The main work and contributions of this thesis are summarized as follows:

1. A spatial-temporal attention based convolutional neural network for action recognition

To suppress distracting information in videos, this thesis proposes a Spatial-Temporal Attention Convolutional Neural Network (STA-CNN). The STA-CNN model contains a spatial attention mechanism and a temporal attention mechanism. The spatial attention mechanism lets the network focus on motion-salient spatial regions as well as discriminative non-moving regions, while the temporal attention mechanism automatically mines discriminative temporal segments from long and complex videos. STA-CNN integrates the two mechanisms into a unified convolutional network framework that is trained end to end. The proposed STA-CNN model achieves state-of-the-art classification performance on two of the most challenging action recognition datasets.

2. A diversity-encouraging ensemble method for convolutional neural networks applied to action recognition

The subnetworks trained by traditional ensemble methods share the same architecture and are trained on the same dataset, so they exhibit little diversity and complementarity, and the ensemble yields only limited gains in recognition accuracy. This thesis proposes a Diversity Encouraging Ensemble (DEE) method for convolutional networks. On the one hand, the structural parameters of each subnetwork are adjusted during training to enlarge the diversity and complementarity among subnetworks. On the other hand, reusing the intermediate states of the network together with a monotonically decreasing learning rate greatly reduces the training time of the ensemble. The proposed DEE method achieved the best classification performance of its time on two of the most challenging action recognition datasets.

3. A sequential convolutional neural network based action recognition method

Convolutional neural networks excel at abstracting spatial appearance features, while recurrent neural networks excel at modeling temporal dynamics. Building on this, this thesis proposes a Sequential Convolutional Neural Network (SCNN) that combines the respective strengths of the two architectures in appearance abstraction and temporal modeling. Traditional recurrent networks can only process vectorized feature representations; the SCNN replaces the full connections in the recurrent structure with convolutional connections, so it can directly process two-dimensional images or feature maps. Moreover, the local weight sharing of the convolution operation effectively reduces the number of parameters and the computational cost of the SCNN. The proposed SCNN achieves the highest recognition accuracy among comparable action recognition methods.

4. An asymmetric 3D convolutional neural network based action recognition method

To overcome the large parameter count, high computational complexity, and training difficulty of 3D convolutions, this thesis proposes an efficient asymmetric 3D convolution that approximates a traditional 3D convolution with three one-directional asymmetric 3D convolutions. Multiple 3D convolution branches of different scales are then fused into a local asymmetric 3D convolutional network, and several such local networks are stacked to build a deep 3D convolutional network for action recognition. Experimental results show that the proposed asymmetric 3D convolutional neural network surpasses action recognition methods based on traditional 3D convolutional networks in both speed and accuracy.

In summary, aiming at the practical difficulties of action recognition in natural scenes, this thesis applies current deep learning methods to propose effective solutions to the problems in existing action recognition approaches. It also proposes several ways to reduce parameters and accelerate different neural network models, narrowing the gap between action recognition research and real-world applications.

Other Abstract

Human action recognition is the task of classifying videos, each of which contains a certain action. It has great potential in many applications, such as human-computer interaction, intelligent surveillance, living assistance, and virtual reality. Action recognition has been a hot topic in pattern recognition and computer vision for decades. Traditional action recognition methods divide the task into two stages, i.e., hand-crafted feature extraction and action classification. The two stages are separate from each other, which makes the recognition process more complex and prevents discriminative spatial-temporal features from being extracted from videos. In recent years, deep learning has achieved great success in speech recognition, machine translation, and computer vision. This thesis exploits the strengths of deep learning methods (convolutional neural networks and recurrent neural networks) and proposes multiple end-to-end trainable, effective, and robust deep models for action recognition. The main work and contributions of this thesis are summarized as follows:

1. Spatial-Temporal Attentive Convolutional Neural Network

A Spatial-Temporal Attentive Convolutional Neural Network (STA-CNN) is proposed to eliminate interference information in videos. The STA-CNN model includes a spatial attention mechanism and a temporal attention mechanism. The spatial attention mechanism locates motion-salient spatial regions as well as discriminative non-moving spatial regions, while the unsupervised temporal attention mechanism automatically mines discriminative temporal segments from long and noisy videos. STA-CNN incorporates both mechanisms into a unified convolutional network that is end-to-end trainable, and it has achieved state-of-the-art performance on two of the most challenging action recognition datasets, UCF-101 and HMDB-51.
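
As a sketch of how a temporal attention mechanism of this kind can pool per-segment features into one video descriptor, here is a minimal NumPy illustration; the scoring vector `w`, the feature shapes, and the function names are hypothetical stand-ins for the learned quantities, not the thesis implementation:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention_pool(segment_feats, w):
    """Weight T per-segment features by attention scores and pool them.

    segment_feats: (T, D) features, one row per temporal segment
    w: (D,) scoring vector (stand-in for the learned attention parameters)
    Returns the (D,) attention-weighted video feature.
    """
    scores = segment_feats @ w      # (T,) one scalar score per segment
    alpha = softmax(scores)         # attention weights, sum to 1
    return alpha @ segment_feats    # convex combination of segment features
```

Segments whose features align with `w` receive large weights, so the pooled descriptor is dominated by the discriminative segments rather than by background frames.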

2. Diversity Encouraging Ensemble of Convolutional Networks

The subnetworks of traditional ensemble methods share the same architecture and are trained on the same dataset, so they exhibit little diversity and complementarity. In this work, an efficient ensemble method, the Diversity Encouraging Ensemble (DEE), is proposed. On the one hand, the DEE method modifies the structural parameters of the subnetworks during training to enlarge their diversity. On the other hand, it reuses the intermediate states of the network and exploits a monotonically decreasing learning rate schedule to significantly decrease the training time of the subnetworks. The proposed DEE method achieved state-of-the-art performance on the UCF-101 and HMDB-51 datasets.
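
The two ingredients described above — subnetwork snapshots taken along one run under a monotonically decreasing learning rate, and prediction averaging over those snapshots — can be sketched as follows. The linear decay and both function names are assumptions for illustration, not the DEE code:

```python
import numpy as np

def snapshot_lrs(base_lr, total_epochs):
    """A monotonically decreasing (here: linear) learning-rate schedule.
    Hypothetical stand-in for the schedule along which subnetwork
    snapshots are saved, so one training run yields several members."""
    return [base_lr * (1 - e / total_epochs) for e in range(total_epochs)]

def ensemble_predict(snapshot_logits):
    """Average the softmax outputs of K subnetwork snapshots.
    snapshot_logits: (K, C) array of per-snapshot class logits."""
    e = np.exp(snapshot_logits - snapshot_logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)  # per-snapshot softmax
    return probs.mean(axis=0)                 # (C,) ensemble probabilities
```

Averaging the probabilities (rather than the logits) is one common choice; any per-member normalization that makes the outputs comparable would do.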

3. Sequential Convolutional Neural Network

The CNN has strong abilities in abstracting spatial information, while the RNN is good at modeling temporal dependencies in videos. In this work, a Sequential Convolutional Neural Network (SCNN) is proposed. The SCNN model combines the strengths of convolutional and recurrent operations to extract effective spatial-temporal features directly from videos. It extends the RNN to process video frames or feature maps directly, rather than vectors flattened from them, preserving the spatial structure of the input videos. It replaces the full connections of the RNN with convolutional connections to reduce the number of parameters, the computational cost, and the risk of over-fitting. The proposed SCNN model has achieved the best performance among recurrent-network-based action recognition methods.
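
The core idea — a recurrent step whose full connections are replaced by convolutions, so 2-D maps pass through unflattened — can be illustrated in plain NumPy. Single-channel maps, a `tanh` activation, and the absence of bias terms are simplifying assumptions for this sketch:

```python
import numpy as np

def conv2d_same(x, k):
    """2-D cross-correlation with zero 'same' padding.
    x: (H, W) feature map, k: (kh, kw) kernel with odd side lengths."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def conv_rnn_step(x_t, h_prev, Wx, Wh):
    """One recurrent update h_t = tanh(Wx * x_t + Wh * h_{t-1}) where
    '*' is convolution instead of a full matrix product, so the hidden
    state keeps the spatial layout of the input frame."""
    return np.tanh(conv2d_same(x_t, Wx) + conv2d_same(h_prev, Wh))
```

Because the same small kernels `Wx` and `Wh` are shared across all spatial positions, the parameter count is independent of the frame resolution, unlike the flattened fully connected variant.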

4. Asymmetric 3D Convolutional Neural Networks

The 3D convolution is much more expensive in computation, costly in storage, and difficult to learn. This work proposes an efficient asymmetric 3D convolution that exploits three one-directional asymmetric 3D convolutions to approximate a traditional 3D convolution. To improve the feature-learning capacity of asymmetric 3D convolutions, a set of local 3D convolutional networks, called MicroNets, is proposed by incorporating multi-scale 3D convolution branches. An asymmetric 3D-CNN deep model is then constructed by cascading several MicroNets for the action recognition task. The asymmetric 3D-CNN model outperforms traditional 3D-CNN models in both effectiveness and efficiency on the UCF-101 and HMDB-51 datasets.
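
The parameter saving from replacing one t×h×w kernel with three one-directional kernels can be checked with a small count. Holding the channel width fixed across the three stages is a simplifying assumption; the exact channel plan in the thesis may differ:

```python
def conv3d_params(kernel_shape, c_in, c_out):
    """Weight count of a dense 3-D convolution layer (no bias)."""
    t, h, w = kernel_shape
    return t * h * w * c_in * c_out

def asymmetric_params(t, h, w, c_in, c_out):
    """Weight count when a t*h*w kernel is factorized into three
    one-directional convolutions (1x1xw, then 1xhx1, then tx1x1)
    applied in sequence, with the channel width held at c_out."""
    return (conv3d_params((1, 1, w), c_in, c_out)
            + conv3d_params((1, h, 1), c_out, c_out)
            + conv3d_params((t, 1, 1), c_out, c_out))
```

For a 3×3×3 kernel the factorized form uses 3+3+3 = 9 weights per channel pair instead of 27, a 3x reduction, which is where the speed and storage gains come from.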

In all, this thesis focuses on human action recognition in real scenes. It exploits deep learning methods and proposes several deep models to address the problems in current action recognition approaches. Moreover, it proposes several methods to reduce the computational cost of these deep models, narrowing the gap between research and application.

Pages: 152
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/23871
Collection: 模式识别国家重点实验室_视频内容安全
Corresponding Author: 杨浩
Recommended Citation (GB/T 7714):
杨浩. 基于深度学习的人体行为识别研究[D]. 中国科学院自动化研究所. 中国科学院大学, 2019.
Files in This Item:
File Name/Size | DocType | Access | License
基于深度学习的人体行为识别研究.pdf (16833KB) | 学位论文 (Thesis) | 开放获取 (Open Access) | CC BY-NC-SA
