Research on Graph-based 3D Skeletal Human Action Recognition
Wang Pei (王沛)
Degree type: Master of Engineering
Supervisors: Hu Weiming (胡卫明); Yuan Chunfeng (原春锋)
2017-05
Degree-granting institution: Graduate University of Chinese Academy of Sciences
Degree conferred in: Beijing
Keywords: 3D human action recognition; skeletal motion; action representation; graph kernel
Abstract

Human action recognition has great application value in fields such as human-computer interaction, intelligent surveillance, and video retrieval, and has attracted many researchers to this direction. In recent years, great progress has been made in human action recognition from RGB videos, but many problems remain to be solved, in particular the low recognition accuracy in real-time scenarios. With the emergence of inexpensive depth sensors, more and more researchers have turned to human action recognition from 3D skeleton sequences. Although existing algorithms have achieved some success on this task, many problems remain in terms of robustness, recognition accuracy, and semantic interpretability.

This thesis focuses on human action recognition from 3D skeleton sequences and proposes novel graph-based representations and similarity measures for skeletal actions. We first propose a new graph kernel that not only preserves the local topology of graphs more effectively but also offers a degree of extensibility. We then propose a trajectory-segmentation-based undirected graph representation of skeleton sequences and a multiview two-graph model, and use the proposed graph kernel to measure the similarity between two action graphs, thereby completing the recognition task. Extensive and diverse experiments demonstrate the effectiveness of the proposed algorithms. Specifically, the main contributions of the thesis are as follows:

1) We propose a new graph kernel for human action recognition, the subgraph-pattern graph kernel (SPGK). Specifically, a graph is decomposed into a series of substructures called subgraph-pattern sets (SPSs). An SPS is the set of all subgraph patterns sharing a common center node; owing to its relatively complex structure, it effectively mines the local topological information of a graph. Based on the SPSs extracted from two graphs, we compute the final graph kernel by merging all pairwise SPS similarities with a dynamic-programming algorithm. SPGK is an extensible graph kernel that can easily be turned into other traditional graph kernels by setting different base kernels. We use a graph to model the action in a video: vertices correspond to local descriptors, and edges measure the similarity between them. An SPS extracted from the graph, a combination of vertices and edges, can therefore be regarded as an action segment, and its high-order topological structure fully captures the local spatio-temporal information of that segment. Experiments on several public datasets show that the proposed method outperforms other graph-based methods and achieves recognition rates comparable to the state of the art at the time.

2) Existing skeletal action representations neither capture the spatio-temporal information of joint motion effectively nor are robust enough to the noise produced by depth sensors and joint-localization algorithms. Motivated by these problems, we track joint trajectories and segment them to obtain a new, semantically meaningful low-level representation of skeletal joint motion, the motionlet. During this process, trajectory smoothing, sampling, and segmentation reduce the interference of noise. We then combine these motionlets, preserving their spatio-temporal relations in edge attributes, to build an undirected complete labeled graph representing a video, and use the subgraph-pattern graph kernel to measure the similarity between two graphs. Both our graph representation and graph kernel are strongly semantic: each vertex corresponds to an action segment, and vertex similarity corresponds to action-segment similarity; each subgraph corresponds to a sub-action, and subgraph comparison corresponds to sub-action comparison. Finally, SPGK is used directly as the kernel of an SVM classifier to recognize actions. To evaluate the proposed method, we conduct a series of experiments on several public datasets and achieve the highest recognition accuracy.

3) We propose a new graph-based representation of human skeletal action sequences, the multiview two-graph model, together with a graph kernel, the hierarchical tree-pattern graph kernel (HTPGK), to measure the similarity between two skeletal action sequences. Specifically, we project a skeleton sequence onto different 2D planes; the skeletal actions on different projection planes record the different motion information of an action. For each projection plane we extract the corresponding low-level motionlet features and merge them into two types of graphs, a temporal graph and a spatial graph, which capture the local topological information within an action from the perspectives of temporal causality and spatial layout, respectively. To compare two graphs effectively, we propose HTPGK based on substructure decompositions at different levels, which measures the similarity between two actions at different semantic levels. Finally, we fuse the information from the different projection angles, the temporal and spatial graphs, and the different semantic levels with an efficient Bayesian multiple-kernel-learning algorithm. Feeding HTPGK into an SVM classifier completes the 3D skeletal action recognition task. Extensive experiments on the UTKinect-Action3D dataset show that the proposed algorithm not only outperforms state-of-the-art methods but also yields several valuable conclusions.

Other Abstract

Human action recognition has many valuable industrial applications such as human-computer interaction, intelligent surveillance, and video retrieval, and it has drawn a great deal of interest in the research community. In recent years researchers have made great progress on RGB-video-based action recognition, but many challenging problems remain to be solved, especially in real-time action recognition. With the spread of cost-effective depth sensors, more and more researchers have focused on skeleton-based human action recognition. Although existing methods achieve good performance on this task, their limitations in robustness, recognition accuracy, and semantic interpretability are obvious.

This thesis focuses on skeleton-based human action recognition and proposes novel graph-based methods for representing skeletal motion and measuring action similarity. First, we propose a novel graph kernel that not only captures the local topology of graphs but also generalizes to a certain extent. We then propose a trajectory-segmentation-based undirected graph representation and a multiview two-graph model. By employing the proposed graph kernel, we measure the similarity between two actions and thus complete the recognition task. In each part, extensive experiments demonstrate the effectiveness of our approaches. More specifically, the main contributions of this thesis are summarized as follows:

(1) We propose a novel graph kernel, the subgraph-pattern graph kernel (SPGK), for human action recognition. Specifically, graphs are decomposed into a series of substructures called subgraph-pattern sets (SPSs). An SPS is the set of different subgraph patterns sharing the same center node; due to its relatively complex structure, it exploits the local topology of graphs effectively. Based on the SPSs obtained from the two graphs, we compute the final graph kernel by incorporating all pairwise SPS similarities through a dynamic-programming algorithm. SPGK is a generalized graph kernel that can be reduced to other traditional graph kernels simply by setting different base kernels. We use graphs to model action videos, with nodes corresponding to local interest points and edges to the relationships between them; the SPSs of these graphs, combinations of nodes and edges, can thus be viewed as action segments, and their high-order topological structures capture the local spatio-temporal information of those segments. Our experimental results demonstrate that the proposed method outperforms other graph-based methods and achieves performance comparable to the state-of-the-art approaches on several public datasets.
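To illustrate the general shape of such a kernel, the sketch below compares two labeled graphs by summing a base-kernel similarity over all pairs of center-node patterns. It is a minimal, hypothetical stand-in rather than the thesis's exact SPGK: a real SPS contains subgraph patterns of varying depth merged by dynamic programming, whereas here a "pattern" is just the label multiset of a node and its neighbors, and `base_kernel` is a simple set-overlap score.

```python
from itertools import product

def node_pattern(graph, labels, center):
    """Crude stand-in for an SPS: the labels of `center` and its
    direct neighbors, as a sorted multiset."""
    return sorted([labels[center]] + [labels[n] for n in graph[center]])

def base_kernel(p, q):
    """Base kernel on two label multisets: normalized label overlap.
    Swapping this function is the extensibility point of the kernel."""
    overlap = len(set(p) & set(q))
    return overlap / max(len(set(p) | set(q)), 1)

def spgk(g1, l1, g2, l2):
    """Schematic subgraph-pattern graph kernel: sum of base-kernel
    similarities over all pairs of center-node patterns."""
    ps1 = [node_pattern(g1, l1, v) for v in g1]
    ps2 = [node_pattern(g2, l2, v) for v in g2]
    return sum(base_kernel(p, q) for p, q in product(ps1, ps2))
```

Replacing `base_kernel` with a richer comparison on deeper subgraphs is, in spirit, how different traditional graph kernels would be recovered as special cases.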

(2) Most existing skeleton-based representations for action recognition cannot effectively capture the spatio-temporal motion characteristics of joints and are not robust enough to noise from depth sensors and joint-estimation errors. Motivated by this, we propose a novel low-level representation of each joint's motion by tracking its trajectory and segmenting it into several semantic parts called motionlets. During this process, the disturbance of noise is reduced by trajectory fitting, sampling, and segmentation. We then construct an undirected complete labeled graph to represent a video by combining these motionlets and their spatio-temporal correlations. Furthermore, the proposed subgraph-pattern graph kernel (SPGK) is utilized to measure the similarity between graphs. Both our graph representation and graph kernel have rich semantics: each node corresponds to an action segment, and comparing nodes corresponds to comparing action segments; each subgraph corresponds to a sub-action, and comparing subgraphs corresponds to comparing sub-actions. Finally, SPGK is used directly as the kernel of an SVM to classify videos. To evaluate our method, we perform a series of experiments on several public datasets, where our approach achieves the best performance.
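The motionlet pipeline described above, smoothing a joint trajectory and then cutting it into segments, can be sketched roughly as follows. The function names and the fixed-length segmentation are illustrative assumptions: the thesis segments trajectories by their motion semantics, not by fixed windows, and operates on 3D coordinates rather than the 1D sequence used here.

```python
def smooth(traj, window=3):
    """Moving-average smoothing of one coordinate of a joint trajectory,
    a simple stand-in for the trajectory-fitting step that suppresses
    depth-sensor noise."""
    half = window // 2
    out = []
    for i in range(len(traj)):
        lo, hi = max(0, i - half), min(len(traj), i + half + 1)
        out.append(sum(traj[lo:hi]) / (hi - lo))
    return out

def segment_motionlets(traj, length=4):
    """Cut a smoothed trajectory into fixed-length motionlets;
    trailing samples shorter than `length` are dropped."""
    return [traj[i:i + length]
            for i in range(0, len(traj) - length + 1, length)]
```

A Gram matrix built from SPGK over such motionlet graphs could then be passed to an SVM with a precomputed kernel, for example scikit-learn's `SVC(kernel='precomputed')`.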

(3) We propose a novel graph-based representation, the multiview two-graph model, for human skeleton sequences, together with an original graph kernel, the hierarchical tree-pattern graph kernel (HTPGK), to measure their similarity. More specifically, we project a skeleton sequence onto several 2D planes; skeletal actions in different projections record different motion information of an action. For each projection, we extract the corresponding low-level motionlet features mentioned above. These motionlets are combined into two types of graphs, a temporal graph (TG) and a spatial graph (SG), which capture the local features and topological structure of actions from temporal-causal and spatial-layout perspectives. To compare two graphs effectively, our proposed graph kernel, HTPGK, is based on substructure decompositions at different levels, measuring the similarity of two graphs at different semantic levels. Finally, we employ efficient Bayesian multiple kernel learning to fuse the features from the various projection angles, the temporal and spatial graphs, and the different semantic levels, and feed the fused HTPGK into an SVM to classify actions. An extensive evaluation on the UTKinect-Action3D dataset shows that our proposed method not only outperforms the state-of-the-art approaches but also leads to several valuable findings.
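The fusion step can be illustrated by the simplest possible multiple-kernel combination: a convex mixture of precomputed kernel matrices, one per projection plane, graph type, or semantic level. This is a deliberate simplification of the thesis's approach, since the Bayesian multiple kernel learning it uses learns the weights from data, whereas here they are supplied by hand.

```python
def fuse_kernels(kernel_mats, weights):
    """Convex combination of precomputed kernel (Gram) matrices.
    A nonnegative weighted sum of valid kernels is itself a valid
    kernel, so the fused matrix can feed an SVM directly."""
    assert all(w >= 0 for w in weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    n = len(kernel_mats[0])
    fused = [[0.0] * n for _ in range(n)]
    for K, w in zip(kernel_mats, weights):
        for i in range(n):
            for j in range(n):
                fused[i][j] += w * K[i][j]
    return fused
```

In the learned-weight setting, the mixture coefficients would be inferred jointly with the classifier instead of being fixed in advance.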

Document type: Master's thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/14676
Collection: Graduates / Master's theses
Recommended citation (GB/T 7714):
王沛. 基于图的三维骨骼人体行为识别研究[D]. 北京: 中国科学院研究生院, 2017.
Files in this item:
硕士学位论文 王沛.pdf (8809 KB), thesis, access restricted, license CC BY-NC-SA, full text by request