Research on Human Action Recognition Based on Action Evolution

	Research on Human Action Recognition Based on Action Evolution
	王洪松 (Wang Hongsong)
	2018
Degree type	Doctor of Engineering
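Abstract (Chinese)	[...] has applications in fields such as motion simulation and synthesis. With the widespread use of digital cameras and the internet, action recognition research has moved from controlled settings to realistic scenes. Methods based on encoding local dynamic features treat all local features in a video as a single set and describe the whole video with one code, so they cannot capture the global motion of an action or its evolution over time. Methods based on Convolutional Neural Networks (CNN) are trained on single frames or very short sequences, so they cannot learn the global motion features in a video. With advances in depth sensors and human pose estimation algorithms, the coordinates of human joints can be estimated accurately and in real time from depth images, and skeleton-based action recognition has therefore become popular. Some end-to-end methods based on Recurrent Neural Networks (RNN) learn action representations from raw joint data and predict the action class directly, but they consider only how isolated joint coordinates change over time and ignore the spatial positions and geometric structure that relate the joints to one another. To address these problems in prior work, this thesis studies human action recognition based on action evolution. The specific contributions are as follows:
To learn the global motion features in a video, we propose an action recognition model based on hierarchical feature evolution. The model has two main steps: frame-wise feature representation and a hierarchical feature evolution model. The frame-wise representation considers two kinds of features: local dynamic features and scene features. The frame-wise local dynamic representation extracts local dynamic features with improved dense trajectories and encodes every frame of the video. The frame-wise scene representation first trains a CNN model on a large-scale image scene recognition database and then uses the trained model to extract scene features from the video frames. The hierarchical feature evolution model first divides the video into several clips, fits a ranking model to each clip to learn the order of its frames, and takes the parameters of that ranking model as the local action descriptor of the clip. We then fit a second ranking model to this temporal series of local actions to learn the order of the clips, and take its parameters as the representation of the global motion in the whole video.
To exploit the spatial relations among human joints, we propose an action recognition model based on a two-stream RNN that captures both the temporal evolution and the spatial relations of the joints. The model has two streams: a temporal stream and a spatial stream. The temporal stream uses an RNN to learn how the joints change over time, and we compare two RNN structures: a stacked (multi-layer) RNN and a hierarchical RNN. The spatial stream learns the spatial dependencies between different joints. Because the joints form a graph through their physical connections, we propose two ways of converting the joint graph into a sequence of joints, a chain sequence and a traversal sequence, and feed the resulting sequence to the spatial RNN. To prevent overfitting during training and to improve generalization, we propose three data augmentation methods based on 3D coordinate transformations: rotation, scaling, and shear.
To exploit the geometric relations between adjacent joints, we propose a model for action recognition and detection based on the geometric evolution of joints. We treat an action as a sequence of graphs formed by the human joints, and design three kinds of geometric input according to the physical structure of the body: joints, edges, and planes. A joint is the coordinates of an isolated body joint. An edge is the bone connecting two physically adjacent joints and can be represented by the relative coordinates of those two joints. A plane is formed by the lines through two physically adjacent bones; it represents a body part and is described by its normal vector. We derive the 3D rotation matrices for joints, edges, and planes separately and find that, for the same sequence, the three rotation matrices are identical. We apply the model to two tasks, action recognition and action detection, and propose an RNN-based architecture that includes a viewpoint transformation layer. For action detection, we first use this RNN-based architecture for frame-wise action classification, and then propose a new multi-scale sliding window search algorithm that uses the frame-wise predicted probabilities to determine which action classes a sequence contains and the start and end frame of each. The algorithm can detect actions of arbitrary duration.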
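The three geometric inputs named in the abstract above (joints, edges, and planes) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the 4-joint chain in `BONES`, the helper names, the use of an unnormalized cross product as the plane normal, and the choice of a rotation about the z axis. The sketch also checks the stated property that rotating the joints rotates the edges and the plane normals by the same matrix.

```python
import math

# Hypothetical 4-joint chain: (parent, child) index pairs for physically
# adjacent joints. The real skeleton layout is not given in the abstract.
BONES = [(0, 1), (1, 2), (2, 3)]

def sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rotate_z(p, theta):
    """Rotate a 3D point or vector by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1], p[2])

def edges(joints):
    """Edge input: relative coordinates (child - parent) of each bone."""
    return [sub(joints[c], joints[p]) for p, c in BONES]

def planes(joints):
    """Plane input: normal of the plane spanned by two bones sharing a joint."""
    e = edges(joints)
    normals = []
    for i in range(len(BONES)):
        for j in range(i + 1, len(BONES)):
            if set(BONES[i]) & set(BONES[j]):  # the two bones are adjacent
                normals.append(cross(e[i], e[j]))
    return normals

def same(vs, ws, tol=1e-9):
    """True if two lists of 3D vectors agree within tol."""
    return all(math.isclose(x, y, abs_tol=tol)
               for v, w in zip(vs, ws) for x, y in zip(v, w))

joints = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.5), (1.0, 2.0, 1.0)]
theta = 0.7
rotated = [rotate_z(p, theta) for p in joints]

# Rotating the joints first, or rotating the derived geometry, gives the same result.
edges_ok = same(edges(rotated), [rotate_z(e, theta) for e in edges(joints)])
planes_ok = same(planes(rotated), [rotate_z(n, theta) for n in planes(joints)])
```

Both flags come out true because a rotation is linear (so it commutes with taking differences) and preserves cross products, which is one way to see why a single rotation matrix can serve all three input types.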
Abstract (English)	Action recognition in videos has been an active research area due to its potential applications in video surveillance, video indexing, human-computer interaction, etc. Due to the explosive growth of internet videos, the attention of research has shifted from simple actions in controlled environments to complex actions in realistic scenarios. The bag-of-words framework based on local spatio-temporal features and the end-to-end Convolutional Neural Network (CNN) based approaches are very popular for action recognition in realistic videos. However, the methods based on local features fail to describe the long-term temporal evolution of features, since a single representation is used to encode the bag of local features. The CNN based approaches also cannot capture global temporal motion, as they only compute motion features in short time windows. With the recent advent of cost-effective depth sensors coupled with real-time skeleton estimation algorithms, skeleton based action recognition has gained popularity. Recently, there has been a growing trend of using end-to-end Recurrent Neural Networks (RNN) for skeleton based action recognition. However, these RNN based methods only model the contextual information in the temporal domain by concatenating the skeletons for each frame, and neglect the spatial configurations and geometric relations of joints. To address the above limitations, we aim to model global motion evolution for action recognition from the following three aspects:
We propose a novel hierarchical scheme to learn a better video representation. The method consists of two steps: frame-wise feature representation and hierarchical encoding. For the frame-wise feature representation, we introduce two kinds of features: local spatio-temporal features and CNN based features. For each frame, we use the bag-of-words framework to encode the local features, and employ a CNN model trained on a scene-centric database to predict scene responses. For the hierarchical encoding, we first use different ranking machines to learn motion descriptors of local video clips. Then, in order to model motion evolution, we encode the features obtained in the previous layer again using a ranking machine.
We propose a novel two-stream RNN architecture to model both temporal dynamics and spatial configurations for skeleton based action recognition. We explore two different structures for the temporal stream: stacked RNN and hierarchical RNN. The hierarchical RNN is designed according to human body kinematics. We also propose two effective methods to model the spatial structure by converting the spatial graph into a sequence of joints. To improve the generalization of our model, we further exploit 3D transformation based data augmentation techniques, including rotation and scaling transformations, to transform the 3D coordinates of skeletons during training.
We propose a novel model to learn representations from primitive geometries for skeleton based action recognition by leveraging the geometric relations among joints. We first introduce three primitive geometries: joints, edges and surfaces. Then, a generic end-to-end RNN based network is designed to accommodate the three inputs. For action recognition, a novel viewpoint transformation layer and temporal dropout layers are utilized in the RNN based network to learn robust representations. For action detection, we first perform frame-wise action classification, then exploit a novel multi-scale sliding window algorithm.
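The idea of converting the spatial graph into a sequence of joints, mentioned in the abstract above, can be sketched as a depth-first tour of the skeleton tree. This is only one plausible reading: the 6-joint skeleton, the joint names, and the function `traversal_sequence` are hypothetical, and the abstract does not specify how the thesis orders the joints.

```python
# Hypothetical toy skeleton as a tree: joint name -> list of child joints.
# Names and topology are illustrative only, not the layout used in the thesis.
SKELETON = {
    "spine": ["head", "l_shoulder", "r_shoulder"],
    "head": [],
    "l_shoulder": ["l_hand"],
    "l_hand": [],
    "r_shoulder": ["r_hand"],
    "r_hand": [],
}

def traversal_sequence(tree, root):
    """Depth-first tour that steps back through each parent on the way out,
    so that neighbouring entries in the sequence are physically connected."""
    seq = [root]
    for child in tree[root]:
        seq.extend(traversal_sequence(tree, child))
        seq.append(root)  # return to the parent before visiting the next child
    return seq

sequence = traversal_sequence(SKELETON, "spine")
```

With this toy tree the tour has 11 entries: each of the 5 bones is walked once in each direction, so a recurrent model reading the sequence always moves between adjacent joints. A simple flat ordering of joints would be shorter but would place unconnected joints next to each other.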
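Keywords	action recognition; global motion features; temporal evolution; spatial relations; geometric inputs
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/21040
Collection	Graduates_Doctoral Dissertations
Author affiliation	Institute of Automation, Chinese Academy of Sciences
First author affiliation	Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714)	王洪松 (Wang Hongsong). Research on Human Action Recognition Based on Action Evolution[D]. Beijing: University of Chinese Academy of Sciences, 2018.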

Files in this item
File name/size	Document type	Version type	Access type	License
基于动作演变的人体行为识别研究.pdf (9179KB)	Thesis		Restricted access	CC BY-NC-SA