CASIA OpenIR > 毕业生 (Graduates) > 博士学位论文 (Doctoral Dissertations)
行为识别轻量化模型研究 (Research on Lightweight Models for Action Recognition)
程科 (Cheng Ke)
Subtype: 博士 (Doctoral)
Thesis Advisor: 卢汉清
2022-05-23
Degree Grantor: 中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences)
Place of Conferral: 中国科学院自动化研究所
Degree Name: 工学博士学位 (Doctor of Engineering)
Degree Discipline: 模式识别与智能系统 (Pattern Recognition and Intelligent Systems)
Keyword: 行为识别 (action recognition), 轻量化模型 (lightweight models), 时空模型 (spatiotemporal models), 图卷积神经网络 (graph convolutional networks), 网络结构设计 (network architecture design)
Abstract

Recognizing human actions is an important branch of computer vision research; it studies how a computer can automatically analyze and understand what the people in an image or video are doing. Action recognition has broad application prospects in intelligent surveillance and security, human-computer interaction, assisted driving, and content-based video retrieval.

In recent years, deep-learning-based action recognition methods have achieved remarkable gains in accuracy and attracted wide attention from academia and industry. However, as recognition performance improves, model structures grow increasingly complex and their computational cost rises accordingly. Practical applications such as intelligent surveillance, human-computer interaction, assisted driving, and video retrieval place high demands on the computational efficiency of action recognition models. Especially in scenarios with limited computing resources or strict real-time requirements, the heavy computational cost and large memory footprint of action recognition models have become the main obstacles to their deployment. Research on lightweight action recognition models therefore has significant practical value for promoting the application of action recognition technology across these fields.

To address the heavy computational cost of current mainstream action recognition models, this thesis conducts in-depth research on efficient spatiotemporal modeling units and compact spatiotemporal network structures, and proposes lightweight action recognition models with lower computational cost, faster inference, and smaller storage footprint. The research content and contributions are summarized as follows:

1. A lightweight skeleton-based action recognition method based on shift graph convolution. Existing skeleton-based action recognition methods have two problems: first, their computational cost is too heavy for deployment on mobile devices; second, the receptive fields in the spatial and temporal dimensions are defined heuristically, which limits the expressive power of the network. This thesis proposes the shift graph convolutional network to solve both problems. It uses shift operations to enlarge the receptive field of convolution kernels, and the receptive field can be adjusted flexibly by tuning the shift amount. The network consists of spatial shift graph convolution, which models the relations among human joints, and temporal shift graph convolution, which models the temporal relations between adjacent frames. Experiments show that, compared with contemporaneous skeleton-based action recognition methods, the proposed shift graph convolutional network achieves higher accuracy with more than 10x less computation.

2. A skeleton-based action recognition method with a lightweight network structure. To remove the computational redundancy in the network structure, this thesis further compresses and accelerates the shift graph convolutional network, and proposes an ultra-lightweight shift graph convolutional network with an even smaller computational cost. First, a lightweight architecture is obtained through a network architecture search strategy. Then, an explicit spatial position encoding mechanism, dynamic shift graph convolution, and a margin-ReLU distillation strategy are proposed, which markedly improve the spatiotemporal modeling capability of the lightweight network at almost no extra computational cost. Since a network's theoretical computational cost does not strictly correspond to its measured runtime, this thesis also examines the measured inference time and memory usage of multiple skeleton-based action recognition algorithms on mobile devices, providing a benchmark for the practical efficiency of this field. Experiments show that, compared with the shift graph convolutional network, the ultra-lightweight version reduces computation by a further 6x and doubles the measured speed while preserving accuracy.

3. A skeleton-based action recognition method based on decoupled graph convolution and graph regularization. In existing GCN-based action recognition methods, all feature channels share the same adjacency matrix, which limits spatial modeling capability. This thesis proposes a decoupled graph convolution model that decouples the adjacency matrices of different feature channels, significantly improving spatial modeling capability without increasing computation, GPU memory usage, or inference time. In addition, mainstream GCN-based action recognition models commonly suffer from over-fitting, and existing regularization methods work poorly on them. This thesis proposes a graph regularization method, together with an attention-guided enhancement mechanism that further strengthens the regularization effect, significantly improving model performance without adding inference computation or measured latency. Experiments show that, compared with contemporaneous methods, the proposed method achieves higher accuracy with lower computational complexity.

4. A lightweight action recognition method for compressed videos. In video-based action recognition, optical-flow extraction and spatiotemporal convolution are both time-consuming, which makes efficient deployment of video action recognition models difficult. This thesis proposes an efficient motion complementary network that achieves higher recognition accuracy at a lower computational cost. The time cost of video action recognition is analyzed and divided into data preparation time and network computation time. To reduce data preparation time, motion-vector maps from compressed videos are used as an efficient substitute for optical flow, and a fixed-length accumulated motion-vector technique is proposed to improve the clarity of the motion-vector maps. To reduce network computation time, the characteristics of the RGB stream and the motion-vector stream are analyzed, and a balanced two-stream policy is proposed to allocate computational complexity between the two streams appropriately. Experiments show that, compared with contemporaneous methods, the efficient motion complementary network achieves higher accuracy with less computation and also has an advantage in measured speed.

Other Abstract

Recognizing human actions is an important branch of computer vision, which aims to automatically analyze and understand what people are doing in images or videos. Action recognition is widely used in many applications, such as intelligent surveillance, human-computer interaction, assisted driving, and content-based video retrieval.

Recently, deep-learning-based action recognition methods have achieved significant performance improvements and attracted much attention in academia and industry. However, as recognition performance improves, network structures become more and more complex and the computational cost of action recognition models grows. Many applications, such as intelligent surveillance, human-computer interaction, assisted driving, and video retrieval, place high demands on the computational efficiency of these models. The expensive computational cost and large memory usage become the major obstacles to deploying action recognition models, especially on resource-constrained devices and in real-time applications. Therefore, research on lightweight action recognition models can promote the application of action recognition in various fields and has important theoretical and practical value.

Aiming at reducing the large computational cost of mainstream action recognition models, this thesis carries out a series of studies on efficient spatiotemporal modules and compact spatiotemporal network structures. The action recognition methods proposed in this thesis have lower computational complexity, faster computation speed, and smaller memory usage. The main contributions are summarized as follows:


1. Shift graph convolutional network for lightweight skeleton-based action recognition. Existing GCN-based skeleton action recognition models have two shortcomings: 1) their computational cost is too heavy, which hinders deployment on mobile devices; 2) the receptive fields in both the spatial and the temporal dimension are pre-defined heuristically, which limits the expressiveness of the network. This thesis proposes a shift graph convolutional network (ShiftGCN) to address both shortcomings. ShiftGCN uses shift operations to expand the receptive fields of convolution kernels, and the receptive field can be adjusted flexibly by tuning the shift parameters. The network consists of spatial shift graph convolution, which models the relations among human joints, and temporal shift graph convolution, which models the relations between frames. Compared with contemporaneous methods, ShiftGCN achieves higher accuracy with more than ten times less computational cost.
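The shift idea can be sketched outside any deep-learning framework. Below is a minimal NumPy illustration, not the thesis implementation: the function names, the channel-partition ratio in the temporal shift, and the roll-by-channel-index pattern in the spatial shift are illustrative assumptions.

```python
import numpy as np

def temporal_shift(x, shift_div=4):
    """Temporal shift on a skeleton feature tensor x of shape (C, T, V).

    One fraction of the channels is shifted one frame backward in time,
    another fraction one frame forward, and the rest stay in place, so
    adjacent frames exchange information at zero FLOPs.
    """
    c = x.shape[0]
    fold = c // shift_div
    out = np.zeros_like(x)
    out[:fold, :-1] = x[:fold, 1:]               # pull features from the next frame
    out[fold:2 * fold, 1:] = x[fold:2 * fold, :-1]  # pull features from the previous frame
    out[2 * fold:] = x[2 * fold:]                # untouched channels
    return out

def spatial_shift(x):
    """Non-local spatial shift: channel c at joint v reads from joint
    (v + c) mod V, so each joint's feature vector mixes information
    from every other joint without any graph-specific weights."""
    c, t, v = x.shape
    out = np.empty_like(x)
    for ch in range(c):
        out[ch] = np.roll(x[ch], -ch, axis=1)    # roll along the joint axis
    return out
```

Both operations are pure memory movement, so they enlarge the receptive field at essentially zero FLOPs; the learnable feature mixing is left to the pointwise convolutions that follow in the actual network.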


2. An efficient skeleton-based action recognition model with a lightweight network structure. To remove the computational redundancy in the network structure, this thesis further accelerates ShiftGCN and proposes a more lightweight model, coined ShiftGCN++. First, a network architecture search strategy is introduced to find a lightweight architecture for skeleton-based action recognition. Then, explicit spatial position encoding, dynamic shift graph convolution, and margin-ReLU distillation are proposed; these three techniques notably improve spatiotemporal modeling ability at negligible extra computational cost. Because network latency is not equivalent to theoretical computational complexity, this thesis also examines the latency and memory usage of skeleton-based action recognition methods on mobile devices, providing a benchmark for the practical efficiency of this field. Compared with ShiftGCN, ShiftGCN++ achieves comparable accuracy with six times less computational cost and twice the measured speed.
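The point that theoretical FLOPs and measured runtime diverge suggests timing models directly. The following is a minimal, hypothetical latency-measurement harness, not the thesis's benchmark code; the warm-up count, run count, and use of the median are illustrative choices.

```python
import time
import numpy as np

def measure_latency(fn, x, warmup=3, runs=10):
    """Return the median wall-clock latency of fn(x) in milliseconds.

    Warm-up iterations are executed first and discarded, so one-time
    costs (memory allocation, caches, JIT) do not skew the result;
    the median is reported because it is robust to scheduler jitter.
    """
    for _ in range(warmup):
        fn(x)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(x)
        samples.append((time.perf_counter() - t0) * 1e3)
    return float(np.median(samples))
```

Comparing such measured numbers against FLOP counts is what reveals, for example, that memory-bound operators can dominate latency even when their theoretical cost is tiny.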


3. Decoupled GCN with a DropGraph module for skeleton-based action recognition. In existing GCN-based action recognition methods, all channels share one adjacency matrix, which limits spatial modeling ability. This thesis proposes a decoupled graph convolutional network, which decouples the adjacency matrices of different channels and largely increases the expressiveness of spatial modeling with no extra computation, no extra latency, and no extra GPU memory cost. Another prevalent problem of mainstream GCN-based action recognition methods is over-fitting, and existing regularization methods are not effective for them. This thesis proposes the DropGraph regularization method, together with an attention-guided mechanism that enhances the regularization effect, which notably improves performance with no extra inference computation or latency. The proposed method achieves higher accuracy with less computational cost than contemporaneous methods.
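The decoupling idea, per-channel-group adjacency matrices instead of one shared matrix, can be sketched in NumPy as follows. This is an illustrative simplification, not the thesis implementation; the tensor layout and group count are assumptions.

```python
import numpy as np

def decoupled_graph_conv(x, adjs):
    """Decoupled graph aggregation (sketch).

    x:    (C, T, V) features -- channels, frames, joints.
    adjs: (G, V, V) adjacency matrices, one per channel group.

    Channels are split into G groups and each group aggregates joints
    with its own adjacency, instead of all C channels sharing a single
    matrix.  FLOPs match the coupled version: every channel is still
    multiplied by exactly one (V, V) matrix.
    """
    c, t, v = x.shape
    g = adjs.shape[0]
    assert c % g == 0, "channels must divide evenly into groups"
    size = c // g
    out = np.empty_like(x)
    for i in range(g):
        grp = x[i * size:(i + 1) * size]          # (size, T, V)
        # out[..., v'] = sum_v adjs[i][v', v] * grp[..., v]
        out[i * size:(i + 1) * size] = grp @ adjs[i].T
    return out
```

Setting G = 1 recovers the conventional coupled graph convolution, which is why the decoupled variant adds no computation over the baseline.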


4. A lightweight action recognition method for compressed videos. Optical-flow extraction and spatiotemporal convolution are two time-consuming processes in video action recognition methods, which hinders efficient deployment. This thesis proposes an efficient motion complementary network (EMCN), which achieves higher accuracy with less computational cost. The time cost of action recognition is analyzed and divided into data preparation time and network computation time. To reduce data preparation time, this thesis uses motion vectors from compressed videos as an efficient alternative to optical flow, and proposes a fixed-length motion accumulation technique to improve the clarity of motion-vector pictures. To reduce network computation time, this thesis analyzes the characteristics of the RGB stream and the motion-vector stream, and proposes a balanced motion policy that allocates computational complexity between the two streams appropriately. Compared with contemporaneous methods, EMCN achieves higher accuracy with less computational cost and faster measured speed.
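The fixed-length accumulation of motion vectors can be sketched as a sliding-window sum over per-frame motion-vector fields. This is a deliberately simplified illustration, not the thesis's technique: accumulating displacements properly follows motion trajectories, whereas summing in place, as below, assumes small per-frame motion.

```python
import numpy as np

def accumulate_motion_vectors(mvs, length=4):
    """Fixed-length motion-vector accumulation (simplified sketch).

    mvs: (N, H, W, 2) array of per-frame motion-vector fields (dx, dy)
         decoded from a compressed video.

    For each frame t, the motion vectors of the next `length` frames
    are summed, approximating displacement over a fixed temporal span.
    Accumulated fields are stronger and less noisy than single-frame
    motion vectors, which is the motivation for accumulating at all.
    """
    n = mvs.shape[0]
    out = np.zeros_like(mvs)
    for t in range(n):
        end = min(t + length, n)        # window is clipped at the clip end
        out[t] = mvs[t:end].sum(axis=0)
    return out
```

Because motion vectors come for free from the compressed bitstream, this replaces the most expensive part of data preparation (optical-flow computation) with a cheap array sum.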

Pages: 126
Language: 中文 (Chinese)
Document Type: 学位论文 (Dissertation)
Identifier: http://ir.ia.ac.cn/handle/173211/48890
Collection: 毕业生_博士学位论文 (Graduates: Doctoral Dissertations)
Recommended Citation (GB/T 7714):
程科. 行为识别轻量化模型研究[D]. 中国科学院自动化研究所, 2022.
Files in This Item:
行为识别轻量化模型研究.pdf (8804 KB), DocType: 学位论文 (dissertation), Access: 限制开放 (restricted), License: CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.