Research on Motion Detection and Recognition Based on Deep Neural Networks (基于深度神经网络的运动检测与识别研究)
杜杨
2019-05-25
Pages: 161
Degree type: Doctoral
Chinese Abstract

Motion detection and recognition in video are important tasks in the field of computer vision. They have attracted wide attention from academia and industry and can be widely applied to human-computer interaction, intelligent surveillance, security, and other important fields. Motion detection is the technique of segmenting dynamic foreground objects from the background in video; it is also known as dynamic object detection and is one of the foundations of subsequent high-level computer vision technologies. As the dominant actors in society, humans are the principal dynamic objects, so recognizing the actions of human targets has important research significance and application value. In recent years, the revival of deep neural network models has triggered a new artificial intelligence revolution in computer vision and injected new vitality into dynamic object detection and action recognition. Although deep neural network models were dormant for a period of time, the further development of hardware technology has unlocked more of their potential, and more practical applications are expected to be realized. Therefore, studying dynamic object detection and human action recognition in the context of deep neural networks has important theoretical significance and application value.
Starting from an examination of the characteristics of motion detection and recognition in video, this thesis studies dynamic object detection and human action recognition based on deep neural network models. The main work and contributions of the thesis are summarized as follows:
1. A dynamic object detection algorithm based on a deep spatio-temporal self-organizing map network
In dynamic object detection, modeling the spatio-temporal characteristics of complex background motion is highly challenging. This thesis finds that complex background motion exhibits global variation in space and local variation in time. Accordingly, 1) a new spatio-temporal self-organizing map whose weights are shared by all pixels in a video frame is proposed. The network is trained with the sequence of changes of whole video frames and the sequence of changes at each pixel position over time, so that it can model and track complex backgrounds effectively. 2) A method based on Bayesian parameter estimation is proposed to automatically learn, for each pixel position, the threshold that decides between foreground and background. 3) To model complex background motion more accurately, the proposed single-layer spatio-temporal self-organizing map is extended to a deep network structure, achieving leading results on international public datasets.
2. A multi-shot action recognition algorithm based on feature extraction with a hierarchical nonlinear orthogonal adaptive-subspace self-organizing map network
For action recognition of human targets, feature extraction is a key step in multi-shot action recognition based on local feature description. Traditional hand-crafted features are often limited by their fixed forms, while deep learning features have stronger representational power but usually require large-scale labeled data. This thesis therefore proposes a new hierarchical nonlinear orthogonal adaptive-subspace self-organizing map network that adaptively learns effective features from large-scale data without supervision. 1) By constructing a nonlinear orthogonal mapping layer, the improved network can handle nonlinear input data; a kernel trick avoids defining an explicit form of the nonlinear mapping function and guarantees the orthogonality of the mapped basis vectors. 2) The objective loss function is modified so that the network can effectively learn feature patterns from large-scale data without supervision. 3) A hierarchical deep structure is proposed to extract more representative high-level features. Results on international public datasets show that the proposed unsupervised features outperform traditional hand-crafted features and some deep learning features.
3. A multi-shot action recognition algorithm based on an interaction-aware spatio-temporal pyramid attention network
For action recognition of human targets, many multi-shot action recognition methods based on end-to-end networks neglect the localization and detection of key human action regions. Video frames contain a large amount of information irrelevant to human actions, while considerable temporal action information exists between frames. The self-attention mechanism allows a deep neural network to detect key regions; it computes attention scores from a weighted sum (or another function) of the elements within a feature, but it does not consider the correlation among features. However, local features at neighboring spatial positions of a feature map are highly correlated because their receptive fields overlap heavily. Accordingly, 1) this thesis proposes an interaction-aware attention mechanism inspired by principal component analysis (PCA) to eliminate this correlation and extract the key local features of a feature map. 2) Feature maps of different scales in the deep network are used to construct a spatial feature pyramid, and the multi-scale information yields more accurate attention scores. 3) The proposed layer is independent of the number of input feature maps and is therefore extended to a spatio-temporal version. 4) The proposed layer can be embedded into general deep convolutional neural networks to form a video-level deep neural network. Results on international public datasets demonstrate the generality and effectiveness of the proposed network.
4. A few-shot action recognition algorithm based on a feature transform measure network
For action recognition of human targets, the limited number of samples is another problem. Too few samples cause the deep network to overfit, which scatters the feature distribution of each new action category and makes high-precision classification difficult. Cosine similarity and Euclidean distance each consider only one aspect of feature measurement, either the angle or the geometric distance, which leads to inaccurate measurement. This thesis proposes a new metric-learning-based few-shot classification model to address these limitations. 1) A feature transform network is proposed that reduces the intra-class distance by shrinking the distance among features of the same class and increases the inter-class distance by shifting the locations of class centers. Specifically, in the training stage the network learns the nonlinear residual between each feature and the center of its correct action class; in the test stage the predicted nonlinear residual is added to the original feature so that it moves to the shifted position of its correct class center. 2) A feature measure network is proposed that effectively learns measurement parameters adapted to the data and uses cosine similarity as a weight on the Euclidean distance. The new measure considers both the angle between features and the geometric distance. Significantly improved experimental results on international public datasets demonstrate the effectiveness of the proposed few-shot action recognition framework and the generality of the measure network.
In summary, this thesis analyzes several key problems in video motion detection and recognition that urgently need to be solved and proposes effective solutions. The proposed algorithms substantially improve the performance of dynamic object detection and human action recognition and achieved the best results at the time on multiple international public datasets. In addition, the proposed multi-shot human action recognition algorithm has been deployed in practice at MeiTu Technology Co., Ltd. and has produced certain economic benefits.
English Abstract
Motion detection and recognition in video are important tasks in the field of computer vision. They have attracted wide attention from academia and industry and can be widely applied to human-computer interaction, intelligent surveillance, security, and other important fields. Motion detection, also known as dynamic object detection, is the technique of segmenting dynamic foreground objects from the background in video, and it is one of the foundations of subsequent high-level computer vision technologies. As the dominant actors in society, humans are the principal dynamic objects, so recognizing the actions of human targets has important research significance and application value. In recent years, the revival of deep neural network models has triggered a new artificial intelligence revolution in computer vision and injected new vitality into dynamic object detection and action recognition. Although deep neural network models were dormant for a period of time, the further development of hardware technology has unlocked more of their potential, and more practical applications are expected to be realized. Therefore, research on dynamic object detection and human action recognition in the context of deep neural networks has important theoretical significance and application value.
This thesis examines the characteristics of motion detection and recognition in video and then studies dynamic object detection and human action recognition technologies based on deep neural networks. The main work and contributions are summarized as follows:
1. Spatio-Temporal Self-Organizing Map Deep Network for Dynamic Object Detection from Videos
In dynamic object detection, it is challenging to construct an effective model that characterizes the spatio-temporal properties of the background. This work observes that the motion of complex background exhibits global variation in space and local variation in time. Accordingly, 1) a novel Spatio-Temporal Self-Organizing Map (STSOM) whose weights are shared by all pixels in a video frame is presented. The STSOM is trained with the sequence of whole frames and the sequence of each pixel over time, so that it models and tracks the variation of complex background effectively. 2) A method based on Bayesian parameter estimation is presented to automatically learn a threshold for each pixel that filters out the background. 3) To model the complex background more accurately, the single-layer STSOM is extended to a deep network. The proposed model obtains state-of-the-art results on international public datasets.
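To make the background-modeling idea concrete, the following minimal Python sketch (assuming NumPy) shows how a single prototype grid shared by all pixels can flag a pixel as foreground when its distance to the best-matching prototype exceeds a per-pixel threshold. It is an illustrative simplification: the Bayesian threshold estimation and the deep multi-layer extension described above are not reproduced, and all names are placeholders.

```python
import numpy as np

class STSOMSketch:
    """Minimal single-layer SOM for background modeling (illustrative only).

    One grid of prototype vectors is shared by every pixel; a pixel is declared
    foreground when its distance to the best-matching prototype exceeds a
    per-pixel threshold supplied by the caller.
    """

    def __init__(self, grid=(5, 5), dim=3, lr=0.1, sigma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = rng.random((grid[0], grid[1], dim))   # prototypes shared by all pixels
        self.lr, self.sigma = lr, sigma
        gy, gx = np.mgrid[0:grid[0], 0:grid[1]]
        self.coords = np.stack([gy, gx], axis=-1).astype(float)

    def bmu(self, x):
        """Return the best-matching unit index and its distance to sample x."""
        d = np.linalg.norm(self.weights - x, axis=-1)
        idx = np.unravel_index(np.argmin(d), d.shape)
        return idx, d[idx]

    def update(self, x):
        """Standard SOM update: pull nodes near the BMU toward the sample."""
        idx, _ = self.bmu(x)
        h = np.exp(-np.sum((self.coords - np.array(idx)) ** 2, axis=-1)
                   / (2 * self.sigma ** 2))
        self.weights += self.lr * h[..., None] * (x - self.weights)


def foreground_mask(som, frame, thresholds):
    """Label a pixel foreground when its BMU distance exceeds its own threshold.

    frame: (H, W, C) array of pixel values in [0, 1].
    thresholds: (H, W) array, e.g. estimated from each pixel's history of BMU
    distances (a stand-in for the Bayesian estimation of the thesis).
    """
    h, w, _ = frame.shape
    mask = np.zeros((h, w), dtype=bool)
    for i in range(h):
        for j in range(w):
            _, d = som.bmu(frame[i, j])
            mask[i, j] = d > thresholds[i, j]
    return mask
```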

2. Hierarchical Nonlinear Orthogonal Adaptive-Subspace Self-Organizing Map based Feature Extraction for Multi-shot Human Action Recognition
Feature extraction based on local feature description is a key step in multi-shot human action recognition. Hand-crafted features are often restricted by their fixed forms, and deep learning features are more effective but usually require large-scale labeled data for training. Therefore, this thesis proposes a new hierarchical Nonlinear Orthogonal Adaptive-Subspace Self-Organizing Map (NOASSOM) to adaptively learn effective features from data without supervision. 1) By adding a nonlinear orthogonal map layer, NOASSOM can handle nonlinear input data; a kernel trick avoids defining an explicit form of the nonlinear orthogonal map and preserves the orthogonality of the mapped basis vectors. 2) The loss function is modified so that NOASSOM effectively learns statistical patterns from data without supervision. 3) A deep hierarchical NOASSOM is proposed to extract more representative local features. Experimental results on widely used datasets show that the proposed method outperforms many state-of-the-art methods based on hand-crafted features and deep learning features.
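As an illustration of the adaptive-subspace idea that NOASSOM builds on, the NumPy sketch below implements a plain linear adaptive-subspace competitive layer: each node keeps an orthonormal basis, the node whose subspace best reconstructs a patch wins, and the winner's basis is rotated toward the patch. The kernel-based nonlinear orthogonal map, the modified loss, and the hierarchical stacking of the thesis are deliberately omitted; all names are placeholders.

```python
import numpy as np

def orthonormalize(B):
    """Re-orthonormalize basis vectors (columns of B) via QR decomposition."""
    Q, _ = np.linalg.qr(B)
    return Q

class SubspaceNode:
    """One adaptive-subspace node: an orthonormal basis spanning a local subspace."""

    def __init__(self, dim, subspace_dim, rng):
        self.B = orthonormalize(rng.standard_normal((dim, subspace_dim)))

    def energy(self, x):
        # squared norm of the projection of x onto this node's subspace
        return float(np.sum((self.B.T @ x) ** 2))

    def update(self, x, lr=0.05):
        # rotate the basis toward the sample, then restore orthonormality
        self.B += lr * np.outer(x, self.B.T @ x)
        self.B = orthonormalize(self.B)

def train_assom_layer(patches, n_nodes=16, subspace_dim=4, epochs=5, seed=0):
    """Competitive training: the node whose subspace best reconstructs the patch wins."""
    rng = np.random.default_rng(seed)
    dim = patches.shape[1]
    nodes = [SubspaceNode(dim, subspace_dim, rng) for _ in range(n_nodes)]
    for _ in range(epochs):
        for x in patches:
            winner = max(nodes, key=lambda n: n.energy(x))
            winner.update(x)
    # a patch is described by its projection energy on every node's subspace
    features = np.array([[n.energy(x) for n in nodes] for x in patches])
    return features, nodes
```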
3. Interaction-aware Spatio-temporal Pyramid Attention Networks for Multi-shot Action Classification
Many multi-shot action recognition methods based on end-to-end networks ignore the localization and detection of key human action regions. Video frames contain a large amount of information irrelevant to human action, while considerable temporal action information exists between frames. Self-attention can focus on key features, but its attention scores are often obtained by weighting each feature individually, without considering the interaction among features. However, local features at neighbouring spatial positions of a feature map are often highly correlated because of their overlapping receptive fields. Accordingly, 1) this thesis proposes an interaction-aware self-attention model inspired by PCA, which eliminates this correlation in order to extract key features. 2) Feature maps of different scales are used to construct a spatial pyramid, and more accurate attention scores are obtained from the multi-scale information. 3) The proposed attention layer is not restricted by the number of input feature maps, so it is easily extended to a temporal version. Finally, the layer is embedded into general CNNs to form end-to-end attention networks for action classification. Experiments on international public datasets show the generality and effectiveness of the proposed model.
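The NumPy sketch below shows, under stated assumptions, how a PCA-style decorrelation step can yield interaction-aware attention scores for the local features of one feature map, followed by a simple pyramid helper that averages scores across scales. This is an illustrative stand-in for the idea, not the attention layer defined in the thesis; the helper assumes all pyramid levels have been resized so they flatten to the same number of positions.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def interaction_aware_attention(feats, n_components=8):
    """Illustrative interaction-aware attention for one feature map.

    feats: (N, C) local features (N spatial positions, C channels). The
    features are decorrelated by projecting them onto the leading principal
    directions of their own covariance, so the score reflects interaction
    among correlated neighbouring features instead of each feature alone.
    """
    centered = feats - feats.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(feats) - 1, 1)   # channel covariance
    eigvals, eigvecs = np.linalg.eigh(cov)                 # ascending eigenvalues
    top = eigvecs[:, -n_components:]                       # leading directions
    decorrelated = centered @ top                          # (N, n_components)
    scores = softmax(np.linalg.norm(decorrelated, axis=1)) # one score per position
    attended = (scores[:, None] * feats).sum(axis=0)       # attention-weighted pooling
    return attended, scores

def pyramid_scores(feature_maps):
    """Average attention scores over feature maps of different scales.

    feature_maps: list of (H_i, W_i, C) arrays already resized/pooled so every
    level flattens to the same number of positions (an assumption of this
    sketch, not a detail taken from the thesis).
    """
    all_scores = [interaction_aware_attention(fm.reshape(-1, fm.shape[-1]))[1]
                  for fm in feature_maps]
    return np.mean(all_scores, axis=0)
```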
4. Feature Transform Measure Networks for Few-shot Action Recognition
The limited number of samples is another problem for human action recognition. Too few samples lead to over-fitting of the deep network, which scatters the feature distribution of each new action category and makes high-precision classification difficult. Cosine similarity and Euclidean distance each consider only one aspect of feature measurement, either the angle or the geometric distance, which results in inaccuracy. This thesis proposes a new metric-learning model for few-shot action classification to address these limitations. 1) A feature transform network is proposed to reduce the intra-class distance by shrinking the margin among features of the same action class, and to increase the inter-class distance by shifting the locations of action class centers. The nonlinear residual between each feature and its correct action class center is learned during training; at test time, the residual predicted by the network is added to the original feature so that it approaches its correct shifted action class center. 2) A feature measure network is proposed to learn an effective distance measure from the data, where cosine similarity is used as a weight on the Euclidean distance. The new measure considers both the angle and the geometric distance. Experimental results on international public datasets show the effectiveness of the proposed deep network framework for few-shot action recognition and the generality of the measure network.
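The toy Python sketch below illustrates the two components in simplified form: a residual-based feature shift and a distance that mixes cosine similarity with Euclidean distance. The exact weighting formula and the learned measurement parameters are not specified in the abstract, so the formula and helper names here are assumptions, not the thesis's definitions.

```python
import numpy as np

def cosine_weighted_distance(x, c, eps=1e-8):
    """Fuse angle and geometric distance (smaller is better).

    The abstract only states that cosine similarity weights the Euclidean
    distance; the form below, where low angular similarity inflates the
    effective distance, is an illustrative assumption.
    """
    cos = np.dot(x, c) / (np.linalg.norm(x) * np.linalg.norm(c) + eps)
    return np.linalg.norm(x - c) * (2.0 - cos)

def classify_query(query, class_centers, transform=None):
    """Few-shot classification sketch.

    class_centers: dict mapping class name -> center vector (e.g. the mean of
    that class's support features). `transform`, if given, stands in for the
    feature transform network: it predicts a residual that shifts the query
    toward its correct class center.
    """
    if transform is not None:
        query = query + transform(query)       # add the predicted nonlinear residual
    distances = {k: cosine_weighted_distance(query, c) for k, c in class_centers.items()}
    return min(distances, key=distances.get)

# Usage with toy vectors (purely illustrative):
centers = {"run": np.array([1.0, 0.0]), "jump": np.array([0.0, 1.0])}
print(classify_query(np.array([0.9, 0.2]), centers))   # -> "run"
```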
In summary, this thesis analyzes several key problems in video motion detection and recognition and puts forward effective solutions. The proposed algorithms greatly improve the performance of dynamic object detection and human action recognition and obtained state-of-the-art results on multiple international public datasets at the time. In addition, the multi-shot human action recognition algorithm proposed in this thesis has been applied at MeiTu Technology Co., Ltd. and has achieved certain economic benefits.
Keywords: deep neural networks, video motion analysis, motion detection and recognition, dynamic object detection, human action recognition
Subject areas: Pattern Recognition; Computer Perception; Computer Neural Networks
Discipline category: Engineering :: Control Science and Engineering
Language: Chinese
Seven major directions (sub-direction classification): Image and Video Processing and Analysis
Document type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/23804
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems - Video Content Security
Corresponding author: 杜杨
Recommended citation (GB/T 7714):
杜杨. 基于深度神经网络的运动检测与识别研究[D]. 中国科学院自动化研究所, 2019.
Files in this item:
File name/size: 基于深度神经网络的运动检测与识别研究.p (9008KB); Document type: Dissertation; Access: Open access; License: CC BY-NC-SA