基于深度学习的人体行为识别研究 (Research on Human Action Recognition Based on Deep Learning)
朱佳刚
Subtype: Doctoral dissertation
Thesis Advisor: 邹伟
2020-05-25
Degree Grantor: University of Chinese Academy of Sciences (中国科学院大学)
Place of Conferral: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
Degree Discipline: Control Theory and Control Engineering
Keyword: action recognition; two-stream network; human joints; multi-task learning; action recognition system
Abstract

Human action recognition is a hot topic in computer vision, with potential applications in intelligent video surveillance, patient monitoring, human-computer interaction, virtual reality, smart homes, intelligent security, and supplementary training for athletes. Deep learning methods for action recognition have advanced rapidly in recent years, but both RGB-based and pose-based methods still have shortcomings: two-stream methods remain deficient in model input, feature aggregation, and decision fusion; pose-based methods cannot effectively extract global joint relations; and since RGB-based and pose-based action recognition are complementary to some extent, how to combine the two tasks is still worth exploring. To address these problems, this thesis studies deep learning based human action recognition from the perspectives of effective video feature extraction, multi-stream prediction fusion strategies, global unit-relation extraction, and multi-task interaction, which is of significant theoretical value. The main work and contributions of the thesis are as follows:

(1) Two-stream methods use too few frames to represent a video during training, which leads to insufficient training; they also lack end-to-end learning, and their features are restricted to a single temporal scale and a high dimensionality. To address these problems, an end-to-end video-level representation learning method, Deep networks with Temporal Pyramid Pooling (DTPP), is proposed. The network sparsely samples a sufficient number of RGB images and optical flow stacks across the whole video as input, and uses a temporal pyramid pooling layer to aggregate frame-level features into a video-level feature with multiple temporal scales that is both sequence-aware and low-dimensional. Comparative experiments fully validate DTPP: whether pre-trained on ImageNet or on Kinetics, it achieves the best results on UCF101 and HMDB51. A visualization analysis on HMDB51 shows that DTPP outperforms TSN (Temporal Segment Network) on videos with temporally ordered patterns.

(2) The fixed weighted averaging used to fuse the outputs of a two-stream network adapts poorly to video data. To address this, a gated two-stream network, Gated TSN, is proposed. It consists mainly of a two-stream network, chosen here to be TSN, and a gating network, which fuses the two prediction vectors in a gated manner. The gating network takes the fused features of the two streams' intermediate convolutional layers as input and has two independent fully connected layers, one outputting the fusion weights and the other learning action classification, so it can be trained in a multi-task manner. Experiments show that the proposed fusion method outperforms fixed weighted averaging to some extent.

(3) Action recognition methods based on recurrent, convolutional, and graph convolutional neural networks cannot effectively extract global joint relations. To address this, a Convolutional Relation Network that considers global joint relations is proposed for pose-based action recognition. The network aggregates unit-pair information in the skeleton graph with group-specific dilated 1×2 convolutions; by enumerating all pairwise relations, unit-pair interactions are captured explicitly and globally. Temporal, spatial, and channel attention mechanisms are introduced to reduce over-fitting. The final prediction fuses the predictions of four convolutional relation networks whose inputs are nodes, edges, node frame differences, and edge frame differences, respectively. Confusion analysis, timing analysis, and per-category accuracy visualization are conducted, and comparative experiments show that the proposed network outperforms other methods on the NTU RGB+D and Kinetics action recognition datasets.

(4) To combine the advantages of RGB-based and pose-based action recognition, a multi-task action recognition algorithm, Action Machine, is proposed. The algorithm contains three jointly trainable tasks: RGB-based action recognition, human pose estimation, and pose-based action recognition. Action Machine uses I3D to extract video features and RoIAlign to obtain features for an arbitrary number of human regions, which are used both for RGB-based action recognition and for human pose estimation. The pose estimation sub-network applies temporally shared operations to the human-region features and outputs multi-frame human poses; the pose sequence is then treated as an image and fed into a CNN for pose-based action recognition. The predictions of the two action recognition branches are summed to obtain the final prediction. The algorithm's effectiveness is validated on the COCO human pose dataset and on the NTU RGB+D, N-UCLA, MSR Daily, and AVA action recognition datasets. In cross-dataset experiments, the proposed method is 7-10% more accurate than the I3D baseline, indicating that Action Machine is less prone to over-fitting objects and scenes and generalizes well to different scenes.

(5) An online action recognition system that handles multi-person natural scenes is implemented. The system comprises a human detection module, a human tracking module, and an action recognition module: detection uses SSD-MobileNetV2, tracking uses the Deep SORT multi-object tracking algorithm, and action recognition uses the proposed multi-task algorithm Action Machine. The system can simultaneously detect and track the persons in a scene and recognize their actions online. It has been deployed on an NVIDIA GeForce RTX 2080Ti GPU and on the embedded platform Jetson Nano; with acceleration optimizations, it runs efficiently on both hardware platforms and performs action recognition stably in indoor and outdoor scenes.

Overall, the first and second pieces of work focus on whole-image action recognition and study how to improve the performance of two-stream networks: the first addresses the deficiencies of two-stream networks in model input and feature aggregation with an end-to-end video-level learning method; the second addresses their fixed fusion weights with an adaptive gating-based fusion mechanism. Because whole-image models easily over-fit scenes and objects, the third and fourth pieces of work focus on analyzing the persons in videos: the third addresses the inability of existing pose-based methods to effectively extract global joint relations with a method that considers global joint relations; the fourth combines the advantages of RGB images and human poses in a multi-task action recognition method. The fifth piece of work combines the method of the fourth with human detection and tracking to implement an action recognition system on embedded devices.

Other Abstract

Action recognition in videos has been an active topic in computer vision due to its potential applications in video surveillance, patient monitoring, human-computer interaction, virtual reality, smart homes, intelligent security, and supplementary training for athletes. Recently, deep learning methods have become very popular in action recognition. However, both RGB-based and pose-based methods have drawbacks: two-stream CNN based methods have problems in model input, feature aggregation, and prediction fusion; pose-based methods cannot effectively capture joint interactions globally; moreover, how to combine the two complementary tasks, RGB-based action recognition and pose-based action recognition, remains an open question. To solve these problems, we study video feature extraction, two-stream prediction fusion, and multi-task training based on deep learning, which are of great theoretical value. The main work and contributions are as follows:

(1) An end-to-end video-level representation learning approach, Deep networks with Temporal Pyramid Pooling (DTPP), is proposed. Current two-stream CNN based methods suffer from the confusion caused by partial-observation training, lack end-to-end learning, are restricted to single-temporal-scale modeling, and produce high-dimensional features. DTPP addresses these problems. Specifically, RGB images and optical flow stacks are first sparsely sampled across the whole video; a temporal pyramid pooling layer then aggregates the frame-level features, which carry spatial and temporal cues. The trained model thus yields a compact video-level representation with multiple temporal scales that is both sequence-aware and low-dimensional. Experimental results show that DTPP achieves state-of-the-art performance on two challenging action datasets, UCF101 and HMDB51, with either ImageNet or Kinetics pre-training. In a visualization experiment on HMDB51, DTPP outperforms TSN (Temporal Segment Network) on videos with sequential patterns.
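
To make the aggregation step concrete, here is a minimal sketch of temporal pyramid pooling over pre-extracted frame-level features; the pyramid levels (1, 2, 4), max pooling, and feature sizes are illustrative assumptions, not the exact DTPP configuration.

```python
import torch
import torch.nn as nn

class TemporalPyramidPooling(nn.Module):
    """Aggregate T frame-level features into one fixed-size video-level feature."""
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels  # number of segments per pyramid level

    def forward(self, frame_feats):                     # frame_feats: (T, D)
        T, _ = frame_feats.shape
        pooled = []
        for level in self.levels:
            # Split the T frames into `level` contiguous segments and
            # max-pool each one, keeping coarse temporal order intact.
            bounds = torch.linspace(0, T, level + 1).long().tolist()
            for i in range(level):
                hi = max(bounds[i + 1], bounds[i] + 1)  # at least one frame
                pooled.append(frame_feats[bounds[i]:hi].max(dim=0).values)
        return torch.cat(pooled)                        # (sum(levels) * D,)

frame_feats = torch.randn(25, 512)                      # 25 sparsely sampled frames
video_feat = TemporalPyramidPooling()(frame_feats)      # shape: (7 * 512,)
```

The concatenated output has a fixed size regardless of T, which is what makes end-to-end training on videos of varying length possible.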

(2) A gated two-stream CNN, Gated TSN, is proposed to learn the fusion weights adaptively. Gated TSN consists of a two-stream CNN (TSN) and a gating CNN. The gating CNN takes the combination of convolutional-layer features from the spatial and temporal nets as input and outputs two fusion weights. To reduce the over-fitting of the gating CNN caused by parameter redundancy, a new multi-task learning scheme is designed that jointly learns the gating fusion weights for the two streams and trains the gating CNN for action classification. Experiments show that the proposed gated fusion method is superior to the weighted averaging scheme to some extent.
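
The gating idea can be sketched as follows, assuming the mid-level convolutional features of the two streams have already been pooled into vectors; the layer sizes, softmax gate, and auxiliary classification head are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatingNet(nn.Module):
    """Predict per-sample fusion weights for the two streams (multi-task)."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.weight_head = nn.Linear(feat_dim * 2, 2)          # fusion weights
        self.cls_head = nn.Linear(feat_dim * 2, num_classes)   # auxiliary classifier

    def forward(self, rgb_feat, flow_feat, rgb_pred, flow_pred):
        fused = torch.cat([rgb_feat, flow_feat], dim=-1)
        w = F.softmax(self.weight_head(fused), dim=-1)         # adaptive, per sample
        aux_logits = self.cls_head(fused)                      # multi-task branch
        final_pred = w[..., 0:1] * rgb_pred + w[..., 1:2] * flow_pred
        return final_pred, aux_logits

gate = GatingNet(feat_dim=1024, num_classes=101)
rgb_f, flow_f = torch.randn(4, 1024), torch.randn(4, 1024)
rgb_p, flow_p = torch.randn(4, 101), torch.randn(4, 101)
pred, aux = gate(rgb_f, flow_f, rgb_p, flow_p)                 # pred: (4, 101)
```

Unlike a fixed weighted average, the gate output depends on the video itself, so the relative trust in the RGB and flow streams can vary per sample.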

(3) A pose-based action recognition approach, the Convolutional Relation Network, is proposed. Previous pose-based methods cannot explicitly capture joint interactions in a global manner. We address this by considering the potential relations of all node pairs and edge pairs in the skeleton graph. A group-specific dilated 1×2 convolution module is proposed to aggregate relation messages of all unit pairs in the skeleton graph; by enumerating all pairwise relations, unit interactions can be learned explicitly and globally. The model is further enhanced with attention mechanisms, including temporal, spatial, and channel attention. Finally, late fusion of four streams combines the predictions over different inputs: nodes, edges, and their corresponding frame differences. Experiments including confusion analysis, timing, and per-category accuracy visualization are conducted. The proposed method achieves competitive performance on two large-scale datasets, NTU RGB+D and Kinetics.
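
Below is a minimal sketch of one plausible reading of the "group-specific dilated 1×2 convolution": a kernel of size (1, 2) with dilation d along the joint axis relates joint v to joint v+d, so sweeping d over 1..V-1 enumerates all joint pairs. Using a separate convolution per dilation and summing the pair messages are assumptions, not the thesis's exact design.

```python
import torch
import torch.nn as nn

class PairRelationConv(nn.Module):
    """Aggregate messages from all joint pairs via dilated 1x2 convolutions."""
    def __init__(self, in_ch, out_ch, num_joints):
        super().__init__()
        # One 1x2 conv per dilation; dilation d pairs joint v with joint v+d.
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=(1, 2), dilation=(1, d))
            for d in range(1, num_joints)
        )

    def forward(self, x):                                   # x: (N, C, T, V)
        # Each conv emits one message per (v, v+d) pair; sum over pairs.
        msgs = [conv(x).sum(dim=3) for conv in self.convs]  # each: (N, C_out, T)
        return torch.stack(msgs, dim=0).sum(dim=0)          # (N, C_out, T)

x = torch.randn(2, 64, 30, 25)                              # 25 joints (NTU RGB+D)
out = PairRelationConv(64, 128, num_joints=25)(x)           # (2, 128, 30)
```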

(4) A multi-task action recognition method, Action Machine, is proposed to combine the advantages of RGB-based and pose-based action recognition. Action Machine comprises three tasks trained jointly: RGB-based action recognition, human pose estimation, and pose-based action recognition. It employs I3D for video feature extraction and RoIAlign to obtain human-region features, which are used for RGB-based action recognition and, at the same time, as input to the human pose estimation sub-network. The sub-network outputs multi-frame human poses, which are treated as an image and fed into a CNN for pose-based action recognition. The predictions of the two action recognition branches are summed to produce the final output. Experiments are conducted on COCO for human pose estimation and on NTU RGB+D, N-UCLA, MSR Daily, and AVA for action recognition. In cross-dataset experiments, the proposed approach improves over the strong I3D baseline by 7-10% in accuracy, demonstrating that our method is less prone to over-fitting objects and scenes and generalizes better across datasets.
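
The multi-task routing can be sketched with torchvision's RoIAlign, assuming per-frame 2D feature maps in place of the full I3D features; the head shapes, pooling, and spatial scale are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class MultiTaskHeads(nn.Module):
    """Share RoIAlign person features between an action head and a pose head."""
    def __init__(self, in_ch=256, num_classes=60, num_joints=17):
        super().__init__()
        self.action_head = nn.Linear(in_ch, num_classes)
        # Pose head: per-cell joint heatmaps over the RoI grid.
        self.pose_head = nn.Conv2d(in_ch, num_joints, kernel_size=1)

    def forward(self, feat_map, boxes):
        # feat_map: (T, C, H, W) frame features; boxes: list of (K_t, 4) per frame.
        rois = roi_align(feat_map, boxes, output_size=(7, 7), spatial_scale=1 / 16)
        action_logits = self.action_head(rois.mean(dim=(2, 3)))  # (sum K_t, classes)
        heatmaps = self.pose_head(rois)                          # (sum K_t, J, 7, 7)
        return action_logits, heatmaps

feats = torch.randn(8, 256, 14, 14)                              # 8 frames
boxes = [torch.tensor([[0., 0., 100., 200.]]) for _ in range(8)] # 1 person per frame
logits, heatmaps = MultiTaskHeads()(feats, boxes)
```

Because both heads consume the same person-region features, the pose task regularizes the action features toward the person rather than the background.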

(5) An online real-time action recognition system for multiple persons in natural scenes is built. The system is composed of human detection, human tracking, and action recognition: SSD-MobileNetV2 is used for detection, the multi-object tracking algorithm Deep SORT for tracking, and the proposed multi-task approach Action Machine for action recognition. The system can detect and track multiple persons in video while simultaneously recognizing their actions. It is deployed on an NVIDIA GeForce RTX 2080Ti GPU and on a Jetson Nano, respectively; with acceleration techniques, it runs efficiently on both hardware platforms. Action recognition experiments in indoor and outdoor scenes demonstrate the effectiveness and robustness of the overall system.
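
A minimal sketch of the pipeline's control flow is shown below; detect_persons, update_tracks, and recognize_action are hypothetical stand-ins for SSD-MobileNetV2, Deep SORT, and Action Machine, and the clip length is an assumption.

```python
from collections import defaultdict, deque
import numpy as np

CLIP_LEN = 16  # frames buffered per track before running recognition (assumed)

def detect_persons(frame):            # stand-in for SSD-MobileNetV2
    return [(0, 0, 100, 200)]         # list of (x1, y1, x2, y2) boxes

def update_tracks(detections):        # stand-in for Deep SORT association
    return {track_id: box for track_id, box in enumerate(detections)}

def recognize_action(clip):           # stand-in for Action Machine
    return "walking"

def run(frames):
    # Per-track rolling buffers of cropped person images.
    buffers = defaultdict(lambda: deque(maxlen=CLIP_LEN))
    for frame in frames:
        tracks = update_tracks(detect_persons(frame))
        for track_id, (x1, y1, x2, y2) in tracks.items():
            buffers[track_id].append(frame[y1:y2, x1:x2])
            if len(buffers[track_id]) == CLIP_LEN:
                print(track_id, recognize_action(list(buffers[track_id])))

run([np.zeros((480, 640, 3), dtype=np.uint8)] * 20)
```

Keeping one clip buffer per track is what lets the system recognize each person's action independently and online, rather than classifying the whole frame.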

In summary, the first and second works both perform action recognition on the whole image and concentrate on improving the performance of two-stream CNNs. Addressing the drawbacks of two-stream CNNs in model input and feature aggregation, the first work proposes an end-to-end video-level representation learning approach; addressing their fixed fusion weights, the second proposes a gated fusion scheme with adaptive weights. However, models trained on whole images without focusing on the person easily over-fit scenes and objects, so in the third and fourth works we pay more attention to the persons in videos. Addressing the inability of previous pose-based methods to explicitly capture joint interactions globally, the third work proposes a method that considers global joint interactions. The fourth work combines the advantages of RGB-based and pose-based action recognition in a multi-task method. The fifth work applies the method of the fourth, combined with human detection and tracking, to build an online action recognition system on an embedded device.

Pages: 128
Language: Chinese
Document Type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/39111
Collection: 精密感知与控制研究中心_精密感知与控制 (Research Center for Precision Sensing and Control)
Recommended Citation (GB/T 7714):
朱佳刚. 基于深度学习的人体行为识别研究[D]. 中国科学院自动化研究所, 中国科学院大学, 2020.
Files in This Item:
File Name/Size: 朱佳刚-博士学位论文.pdf (14733 KB) | DocType: Doctoral dissertation | Access: Open Access | License: CC BY-NC-SA