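Research on Action Recognition Based on Spatio-Temporal Models (基于时空模型的行为识别研究)
Author: Cao Congqi (曹聪琦) 1,2
Degree type: Doctor of Engineering
Supervisor: Lu Hanqing (卢汉清)
2018-05-28
Degree-granting institution: University of Chinese Academy of Sciences
Place of conferral: Beijing
Keywords: action recognition; spatio-temporal model; feature extraction; sequence modeling; deep learning; convolutional neural network; recurrent neural network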
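Abstract: Action recognition is an important branch of computer vision research, with broad application prospects in autonomous driving, human-computer interaction, motion analysis and synthesis, intelligent video surveillance, and content-based video retrieval. Its main goal is to use machine learning methods so that computers can automatically analyze and understand what people are doing in videos captured by cameras.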
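Human actions in video are often affected by motion blur, illumination changes, and similar conditions, so extracting discriminative and robust spatio-temporal features is crucial for the subsequent recognition task. Recognizing human actions requires understanding and recognizing not only an individual's movements but also the interactions between people and their surroundings and among people. For complex activities composed of multiple sub-actions that occur simultaneously or sequentially, the spatial and temporal dependencies among the sub-actions must also be modeled. For a computer system, given the complexity of human actions, the diversity of surroundings, and differences in movement habits, accurately understanding and analyzing human actions in video is highly challenging. Research on human action recognition in video focuses mainly on two directions: feature construction and sequence modeling. This thesis reviews and analyzes existing work and, to address its open problems, proposes more expressive and discriminative spatio-temporal models. The main work and contributions are summarized as follows: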
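First, action recognition based on probabilistic graphical models. Two sequence recognition models based on probabilistic graphical models are proposed, one for multi-modal learning and one for composable action recognition. Specifically, a multi-modal gesture recognition method based on the coupled hidden Markov model is proposed; it fuses data from multiple modalities at the model level and uses the coupled hidden Markov model to discover the correlations and complementary information among modalities. When the test data contain only one modality, the model parameters learned from multi-modal data are transferred to a single-chain model through probability computation. Experiments on multiple datasets with different modality combinations verify the effectiveness of the proposed model. A composable action recognition method based on the spatio-temporal triangular-chain conditional random field is also proposed. Recognition of complex activities composed of multiple sub-actions arranged in space and time is treated as a multi-level sequence labeling problem: given an observation sequence, the model jointly predicts the activity class of the sequence and the sub-action class at each time step. The traditional temporal triangular-chain conditional random field is extended along the spatial dimension, using multiple chains to model the spatio-temporal dependencies among the sub-actions of different body parts. The proposed model accounts for more dependencies than existing methods. Experiments on a composable activity dataset verify its effectiveness and robustness.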
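Second, action recognition based on gated convolutional neural networks. An action recognition model based on a gated convolutional neural network is proposed for skeleton sequence recognition. After analyzing and comparing the properties of various deep models, the sequence recognition problem is recast as an image classification problem, and a convolutional neural network with linear skip gated connections is designed to recognize human actions in skeleton sequences. The human skeleton sequence in a video is represented as a single color image that encodes spatio-temporal information. When generating the image, the effect of different skeleton joint orderings on recognition results is considered, and a permutation network is added to learn the optimal ordering automatically for each input. In addition, the gating unit of the gated convolutional neural network is improved: a linear skip gated connection is proposed that better supports gradient back-propagation during training.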
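Third, action recognition based on body joints and 3D convolutional neural networks. For the problem of spatio-temporal feature representation of action videos, an action recognition method based on joint-pooled 3D convolutional neural network features is proposed. Body joint positions are used to pool the convolutional layers of a 3D convolutional neural network, producing discriminative video descriptors. A coordinate mapping method is proposed that takes into account the kernel size, kernel stride, and padding of each layer of the network; it yields more accurate corresponding point positions, and hence more precise feature representations, than proportional scaling. Experiments on multiple datasets, with both annotated joint data and estimated joint data that contain errors, verify the effectiveness, discriminative power, and robustness of the proposed features. To improve generality, a two-stream bilinear 3D convolutional neural network guided by body joints is further proposed. The model automatically learns key-point location knowledge from the joint data of the training set while extracting spatio-temporal features; the feature pooling process is expressed as a bilinear product, so the whole model can be jointly optimized end to end. The proposed network transfers effectively to datasets that lack joint annotations or are too small to train a deep network. Experiments on multiple datasets show that the model achieves end-to-end spatio-temporal feature extraction from key regions without relying on complex joint estimation algorithms.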
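Fourth, egocentric interactive gesture recognition. For egocentric (first-person) action recognition, an interactive gesture recognition method based on a recurrent spatio-temporal transformer module and a recurrent 3D convolutional neural network is proposed. It focuses on the recognition difficulties caused by head motion in the egocentric view: a 3D convolutional neural network and a spatio-temporal transformer module with recurrent connections apply homography transformations to feature maps to extract more discriminative spatio-temporal features, and a recurrent neural network exploits the long-term and short-term dependencies in the time series. To address the shortage of existing data, a large-scale multi-modal egocentric interactive gesture dataset was designed, collected, and annotated, and classification and detection algorithms were implemented for a range of models across modalities, including traditional hand-crafted features, 2D convolutional neural networks, 3D convolutional neural networks, recurrent neural networks, and spatio-temporal transformer modules. The expressive power and transferability of the models across scenes are explored.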
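
The skeleton-to-image encoding mentioned in the second contribution can be illustrated with a short sketch. This is not the thesis's actual pipeline: the function name `skeleton_to_image`, the `(T, J, 3)` input layout, the assignment of x/y/z to the R/G/B channels, and the per-channel min-max normalization are all assumptions made here for illustration. The hypothetical `joint_order` argument stands in for the joint ordering that the thesis investigates and that its permutation network learns.

```python
import numpy as np


def skeleton_to_image(seq, joint_order=None):
    """Encode a skeleton sequence as a color image (illustrative sketch).

    seq: array-like of shape (T, J, 3) with (x, y, z) for J joints over T frames.
    joint_order: optional permutation of range(J); an assumed stand-in for a
        hand-picked or learned joint arrangement.
    Returns a uint8 array of shape (J, T, 3): rows are joints, columns are
    frames, and the three channels hold the normalized x, y, z coordinates.
    """
    seq = np.asarray(seq, dtype=np.float64)
    if seq.ndim != 3 or seq.shape[2] != 3:
        raise ValueError("expected an array of shape (T, J, 3)")
    if joint_order is not None:
        # Reorder the joints; the row order of the image changes accordingly.
        seq = seq[:, list(joint_order), :]
    # Min-max normalize each coordinate channel over the whole sequence
    # (an assumed normalization) and quantize to 0..255.
    lo = seq.min(axis=(0, 1), keepdims=True)
    hi = seq.max(axis=(0, 1), keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    img = np.rint(255.0 * (seq - lo) / span).astype(np.uint8)
    # (T, J, 3) -> (J, T, 3): joints along rows, time along columns.
    return img.transpose(1, 0, 2)
```

The resulting array can be fed to any image classifier; changing `joint_order` only permutes image rows, which is why the arrangement of joints can matter to a convolutional network with local receptive fields.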

Other abstract: Recognizing the actions performed in video is one of the most popular research fields in computer vision, owing to its promising applications in autonomous vehicles, human-computer interaction, kinematic analysis and synthesis, intelligent video monitoring, content-based video retrieval, and so on. Action recognition aims to have computers automatically analyze and understand what humans are doing in videos, using machine learning approaches.
Since action recognition can be affected by motion blur and illumination changes in videos, it is crucial to extract discriminative and robust spatio-temporal features. Recognizing actions requires not only recognizing the individual movements of a single subject, but also recognizing the interactions between subjects and their surroundings. Recognizing a composable activity, which is a temporal and spatial arrangement of simple actions, requires recognizing the individual actions and capturing their spatio-temporal relationships. Due to the complexity of actions, the variety of surroundings, and the diversity of movements, it is challenging for computers to understand and analyze actions in videos correctly. Research in action recognition mainly focuses on feature extraction and sequence modeling. To deal with the challenges and issues of existing methods, this thesis proposes more powerful and discriminative spatio-temporal models for action recognition. The main contributions are summarized as follows:
1) Two sequence learning frameworks based on graphical models are proposed, one for multi-modal learning and one for composable action recognition. Specifically, a multi-modal learning framework with a model-level fusion strategy is proposed for multi-modal gesture recognition. A coupled hidden Markov model (CHMM) is employed to discover the correlation and complementary information across different modalities. When only one modality is available during testing, the combined transition probabilities learned from multiple modalities are transferred to single-chain transition probabilities through probability computation. Experiments on real-world gesture datasets with different combinations of modalities demonstrate the effectiveness of the proposed multi-modal learning framework. In addition, a framework that jointly models the spatio-temporal relationships of multilevel labels in a unified hierarchical model is proposed for composable action/activity recognition. We extend the traditional temporal Triangular-Chain CRF (TriCRF) to the spatial dimension corresponding to multiple body parts, obtaining the spatio-temporal TriCRF (ST-TriCRF). The model both explicitly encodes dependencies and preserves uncertainty between actions and activity. More spatio-temporal relationships are taken into consideration than in competing methods. Experiments on a composable human activity dataset demonstrate the effectiveness and robustness of the proposed framework.
2) A classification framework with a gated convolutional neural network is proposed for skeleton-based action recognition. Based on an analysis of different neural networks, we choose to solve the sequence learning problem as an image classification task using convolutional neural networks. For better learning ability, we build a classification network from stacked residual blocks with a special design, called the linear skip gated connection, which benefits information propagation across multiple residual blocks. The effect of different arrangement orders is systematically investigated when the coordinates of the body joints in one frame are arranged into a skeleton feature. Furthermore, a fully-convolutional permutation network is designed to learn an optimized order for data rearrangement. Without any bells and whistles, the proposed model achieves state-of-the-art performance on benchmark datasets, outperforming existing methods significantly.
3) An efficient way of pooling activations in 3D feature maps based on body joint positions is proposed to generate video descriptors for action recognition. The features are called joints-pooled 3D deep convolutional descriptors (JDDs). A novel method maps the body joint positions in videos to points in feature maps for pooling by taking the kernel sizes, stride values, and padding sizes of the 3D CNN layers into account, which is more appropriate than directly using ratio scaling. Promising experimental results on multiple datasets, with either annotated or estimated body joints, demonstrate the effectiveness and robustness of JDD in video-based human action recognition. To improve the generality of the method, a two-stream bilinear model is proposed that learns the guidance from body joints automatically (through a 3D attention stream) and captures spatiotemporal features simultaneously. The pooling process in the proposed two-stream bilinear 3D CNN is formulated as a generalized bilinear product operation, making the model end-to-end trainable. The 3D attention stream can be transferred effectively to datasets that are not annotated with body joints or are too small to train a deep network. Experiments demonstrate the effectiveness and robustness of the proposed model in video-based action recognition, independent of complex skeleton estimation algorithms.
4) A novel end-to-end trainable recurrent 3D convolutional neural network that deals effectively with egocentric motion is proposed for egocentric gesture recognition. A spatio-temporal transformer module with recurrent connections (RSTTM) between neighboring time slices is specially designed. The proposed RSTTM can actively transform a 3D feature map into a canonical view in both the spatial and temporal dimensions. We further extend spatial affine transformers to spatiotemporal homography transformers for better learning ability. To handle the issue of insufficient data, a new benchmark dataset named EgoGesture, with sufficient size, variation, and realism, is introduced. The performance of several representative approaches (hand-crafted features based on different modalities, 2D CNN, 3D CNN, RNN, etc.) is systematically evaluated on two tasks: gesture classification in segmented data and gesture detection in continuous data. An in-depth analysis of model selection and domain adaptation between different scenes is provided.
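
The joint-to-feature-map coordinate mapping described in contribution 3 can be sketched for a single axis as follows. This is an illustrative reconstruction under stated assumptions, not the thesis's exact formula: coordinates are taken to be pixel-centre indices, each layer is summarized by a hypothetical `(kernel_size, stride, padding)` triple, and the function names `map_to_feature_map` and `ratio_scale` as well as the example layer stack are invented here.

```python
from typing import Iterable, Tuple

# (kernel_size, stride, padding) of one layer along a single axis.
Layer = Tuple[int, int, int]


def map_to_feature_map(coord: float, layers: Iterable[Layer]) -> float:
    """Map an input coordinate to the corresponding feature-map coordinate.

    For one layer, output index o has its receptive-field centre at
    stride * o - padding + (kernel_size - 1) / 2 in that layer's input.
    Inverting this relation layer by layer carries a point in the input
    down to the chosen feature map.
    """
    for kernel, stride, padding in layers:
        coord = (coord + padding - (kernel - 1) / 2.0) / stride
    return coord


def ratio_scale(coord: float, layers: Iterable[Layer]) -> float:
    """Baseline: divide by the product of strides, ignoring kernels and padding."""
    total_stride = 1
    for _, stride, _ in layers:
        total_stride *= stride
    return coord / total_stride


# Hypothetical stack: two (3-wide conv, pad 1) + (2-wide pool, stride 2) pairs.
example_layers = [(3, 1, 1), (2, 2, 0), (3, 1, 1), (2, 2, 0)]
mapped = map_to_feature_map(100.0, example_layers)
scaled = ratio_scale(100.0, example_layers)
```

In this hypothetical stack the two estimates differ by a fraction of a feature-map cell, and the gap depends on the kernel sizes and padding actually used. Rounding to an integer index and clamping to the feature-map bounds are left to the caller. The same per-axis mapping would apply to the temporal axis of a 3D CNN.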
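Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/20943
Collection: Graduates_Doctoral Theses
Author affiliations: 1. Institute of Automation, Chinese Academy of Sciences
2. University of Chinese Academy of Sciences
Recommended citation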
GB/T 7714
曹聪琦. 基于时空模型的行为识别研究[D]. 北京. 中国科学院大学,2018.
Files in this item
File name/size | Document type | Version type | Access type | License
Thesis-20180530-签字版.(13483KB) | Thesis | | Not yet open | CC BY-NC-SA