Research on Action Recognition Based on Deep Representation Learning


	Research on Action Recognition Based on Deep Representation Learning (基于深度表示学习的行为识别研究)
Alternative title	Deep Representation Learning for Action Recognition
	Du Yong (杜勇)
	2016-05-29
Degree type	Doctor of Engineering
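Abstract (translated from Chinese)	Action recognition is an important branch of computer vision with broad application prospects in robot vision, intelligent video surveillance, human-computer interaction, medical care, virtual reality, and game control. Its main research goal is to enable a computer to understand, through cameras, what the people in a visual scene are doing. To avoid the cumbersome feature extraction, selection, and encoding steps of traditional action recognition methods, this thesis combines the respective strengths of convolutional neural networks in extracting spatial structure and of recurrent neural networks in extracting time-varying dynamics. It addresses two problems, skeleton-based action recognition and RGB-video-based action recognition, by building deep learning models that adaptively extract spatio-temporal representations from sequences and by using those representations to solve both problems. The main contributions are as follows: 1. A human skeleton sequence is converted into a corresponding image representation, and a convolutional neural network extracts the spatial structure of that image, thereby indirectly obtaining a spatio-temporal representation of the original skeleton sequence, on which action recognition is then performed. The model is an end-to-end, simple, efficient, and accurate model for skeleton-based action recognition. 2. The physical correlation constraints of the human body are combined with the design of the recurrent network architecture to propose a hierarchical recurrent neural network, which obtains the spatio-temporal representation of a skeleton sequence through local feature extraction and hierarchical feature fusion, and so solves skeleton-based action recognition in single-view scenes. Building on the characteristics of this model, random rotation and scale transformations are then introduced during training, so that the network analyzes the time-varying dynamics of human motion from arbitrary viewpoints within a certain range and adaptively learns motion patterns of each action class that are independent of viewpoint. This addresses skeleton-based action recognition in multi-view scenes. Overall, the model is an end-to-end, accurate, and efficient model for skeleton-based action recognition, and it is highly robust to input noise, partial occlusion, and changes in camera viewpoint. 3. Drawing on the respective strengths of convolutional and recurrent neural networks in extracting static spatial and time-varying dynamic representations, and adopting the gating idea to address the vanishing-gradient and error blow-up problems in recurrent network training, a convolutional recurrent neural network is proposed that jointly and adaptively extracts more discriminative spatio-temporal representations from video, to better solve RGB-video-based action recognition.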
English abstract	As an important branch of computer vision, action recognition has a wide range of applications, e.g., robot vision, intelligent video surveillance, human-computer interaction, medical care, virtual reality, and game control. The research objective of action recognition is to make the computer understand what the people in front of the camera are doing. To avoid the cumbersome feature extraction and encoding of traditional action recognition approaches, we propose three deep learning models that combine the strength of the Convolutional Neural Network (CNN) in extracting spatial information from structured data with that of the Recurrent Neural Network (RNN) in modelling temporal dynamics in sequences. All three models adaptively extract spatial-temporal representations from sequences, which benefits action recognition. The thesis is summarized as follows: 1. After a given human skeleton sequence is represented as a corresponding image, a CNN-based model extracts the spatial structure information, thereby indirectly obtaining the spatial-temporal representation of the original sequence, and recognizes actions. This model is an end-to-end, simple, efficient, and accurate solution for skeleton-based action recognition. 2. Inspired by the physical structure and motion characteristics of the human body, we propose a hierarchical recurrent neural network that incorporates the physical correlations and constraints of the human body. It extracts local features and fuses them hierarchically to recognize actions from a single viewpoint. Then, building on the characteristics of the proposed model, we introduce random scale and rotation transformations during training so that the model adaptively learns the inherent motion patterns of actions from viewpoints that vary within a certain range. These patterns are independent of viewpoint and benefit multi-view skeleton-based action recognition. Overall, this is an end-to-end, accurate, and efficient model for skeleton-based action recognition that is also very robust to input noise, partial occlusion, and viewpoint variation. 3. Exploiting the strength of CNNs in spatial information extraction and that of RNNs in temporal dynamic representation learning, and adopting the gated structure derived from Long Short-Term Memory (LSTM) to overcome the vanishing-gradient and error blow-up problems in training traditional RNNs, we propose a convolutional recurrent neural network that simultaneously extracts spatial and temporal representations from videos. Experimental results demonstrate that this model obtains more discriminative representations for RGB-video-based action recognition.
Keywords	Deep learning; action recognition; CNN; RNN; LSTM
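Subject area	Computer Vision and Pattern Recognition
Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/11693
Collection	Graduates_Doctoral Dissertations
Author affiliation	National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
First author affiliation	National Laboratory of Pattern Recognition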
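Recommended citation (GB/T 7714)	Du Yong (杜勇). 基于深度表示学习的行为识别研究 [Research on Action Recognition Based on Deep Representation Learning][D]. Beijing: Graduate University of Chinese Academy of Sciences, 2016.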

Files in this item
File name/size	Document type	Version type	Access type	License
基于深度表示学习的行为识别研究.pdf (14575 KB)	Thesis		Restricted access	CC BY-NC-SA
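
Illustrative note on item 2 of the abstract: the abstract states that random rotation and scale transformations are applied to skeleton sequences during training, but gives no implementation details. The following is only a minimal sketch of what such an augmentation could look like, not the thesis's actual procedure. The function name `augment_sequence`, the angle limit of 30 degrees, the scale range of 0.9 to 1.1, and the choice to draw a single rotation and scale per sequence (rather than per frame) are all assumptions made for illustration.

```python
import math
import random


def rot_x(a):
    """Rotation matrix about the x axis by angle a (radians)."""
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_y(b):
    """Rotation matrix about the y axis by angle b (radians)."""
    c, s = math.cos(b), math.sin(b)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def matmul3(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transform(R, scale, point):
    """Rotate a 3D point by R, then scale it uniformly."""
    return tuple(scale * sum(R[i][k] * point[k] for k in range(3))
                 for i in range(3))


def augment_sequence(seq, max_angle_deg=30.0, scale_range=(0.9, 1.1), rng=None):
    """Apply one random rotation and one random scale to a whole sequence.

    seq is a list of frames; each frame is a list of (x, y, z) joints.
    The same transform is used for every frame (an assumption), so the
    motion itself is unchanged and only the apparent viewpoint and body
    size vary.
    """
    rng = rng or random.Random()
    max_angle = math.radians(max_angle_deg)
    R = matmul3(rot_y(rng.uniform(-max_angle, max_angle)),
                rot_x(rng.uniform(-max_angle, max_angle)))
    scale = rng.uniform(*scale_range)
    return [[transform(R, scale, joint) for joint in frame] for frame in seq]


if __name__ == "__main__":
    # Toy one-frame "skeleton" with three joints (hypothetical data).
    toy = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]]
    print(augment_sequence(toy, rng=random.Random(0)))
```

Because a rotation followed by a uniform scale preserves the ratios between bone lengths, the augmented skeleton is still a plausible body seen from a different viewpoint, which fits the abstract's stated goal of learning viewpoint-independent motion patterns.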