Research on Video Representation Methods for Complex Scenes and Their Applications
Author: 于廷照
Date: 2019-05
Pages: 146
Degree type: Doctoral
Chinese Abstract

Video data are ubiquitous across civil and military domains such as smart homes, driving assistance, urban surveillance, and missile guidance, and carry enormous application potential and market value. However, complex-scene videos contain large amounts of redundant information and irrelevant features, which poses a great challenge to accurate video understanding. How to extract robust representations of complex-scene videos is therefore a research hotspot in computer vision.

Video representation aims to mine the latent, valuable information in video content, but existing methods still face many problems. First, video backgrounds are usually cluttered and occlusions are prominent, so improving model robustness is the primary difficulty of video representation. Second, video subjects vary widely in pose and videos exhibit obvious spatial-temporal redundancy, so achieving accurate video representation is another research difficulty. Finally, video data are unevenly distributed and difficult to annotate, so building effective unsupervised video representation models is also an important challenge. This dissertation studies these problems; its main contributions are as follows:

1. A video representation algorithm based on a tensor pseudo-low-rank constraint is proposed. Its core idea is to use tensor low-rank constraints to improve the robustness of the representation model. Specifically, a video low-rank representation model is constructed on the basis of tensor nuclear norm regularization, decoupling video data into a low-rank static background, a sparse dynamic foreground, and frame-level noise. Guided by the output of this model, a pseudo tensor low-rank network is then constructed, enabling end-to-end learning of the model parameters. Finally, a purely data-driven network initialization strategy that requires no backpropagation is proposed, accelerating model convergence. The algorithm is applied to extreme low-resolution video action recognition, and comparative experiments show that it is more robust to noise and converges faster.

2. A video representation algorithm based on a spatial-temporal attention mechanism is proposed. Its core idea is to use attention to achieve accurate video representation. Specifically, a cascaded temporal-spatial network is built from tensor low-rank decomposition and channel-separation transforms, achieving temporal importance selection while reducing network size. Taking intermediate feature maps as semantic guidance, a semantic guidance module and a semantic guidance network are then constructed, achieving spatial importance selection and information filtering. Finally, on the basis of spatial-temporal convolution, a clip-level spatial-temporal attention network is proposed, in which a two-stage optimization algorithm jointly learns temporal and spatial importance. The algorithm is applied to action recognition in both ordinary and cross-domain scenarios; comparisons with more than thirty mainstream algorithms show that it effectively improves the accuracy of video representation.

3. An unsupervised video representation algorithm based on self-paced learning and generative networks is proposed. Its core idea is to construct unsupervised video representations according to the sample distribution. Specifically, a truncated self-paced regularizer is proposed, adaptively and precisely characterizing the learnability of each sample. On this basis, within the self-paced learning framework, an unsupervised self-paced feature embedding model is constructed, allowing the model to progress dynamically from simple to complex. Finally, to preserve channel consistency, a cross-channel color gradient loss is designed, and, guided by adversarial learning, a reverse pseudo two-stream generative network is built to achieve probabilistic video representation. The dissertation theoretically proves the soundness of the proposed self-paced regularizer and the convergence of the self-paced feature embedding model. The algorithm is applied to video captioning and video prediction; comparisons with more than twenty mainstream algorithms show that it outperforms mainstream unsupervised methods on multiple evaluation metrics.

English Abstract

Videos exist widely in civil and military applications such as intelligent homes, driving assistance, city surveillance, and missile guidance. However, complex-scene videos contain much redundant information and many irrelevant features, which makes accurate video understanding difficult. Consequently, obtaining robust video representations has become a hot topic in computer vision.

Video representation aims to mine the latent information in videos, but existing methods still face many challenges. First, how can the robustness of representation models be improved when videos suffer from cluttered backgrounds and occlusions? Second, how can an accurate video representation be obtained in the presence of strong spatial-temporal redundancy? Third, how can an effective unsupervised video representation be constructed given the inhomogeneous distribution of video data? In view of these issues, the contributions of this dissertation are as follows:

1. A pseudo low-rank model is proposed. The motivation behind this design is to achieve robust video representation through tensor nuclear norm regularization. Specifically, a Video Low Rank Representation (VLRR) model, which decomposes the original video into a low-rank background, a sparse foreground, and frame-level noise, is first constructed. Building on the output of VLRR, a pseudo Low Rank Network (pLRN), which learns its parameters in an end-to-end manner, is presented. Finally, a purely data-driven network initialization strategy, which accelerates network convergence, is introduced. The algorithm is applied to extreme Low Resolution (eLR) action recognition, and experiments demonstrate that the proposed model is more robust to noise and converges faster. (A minimal sketch of this family of decompositions is given as the first code example after this list.)

2. A joint spatial-temporal attention model is proposed. The motivation behind this design is to achieve accurate video representation via an attention mechanism. Specifically, a Cascaded Temporal-Spatial network (CTS), which selects the most important time steps and reduces model complexity, is first proposed. After that, three kinds of Semantic Guided Module (SGM) and a Semantic Guided Network (SGN), which focus on the most salient areas of videos and filter out irrelevant information, are carefully designed. Finally, a clip-level Joint Spatial-Temporal Attention (JSTA) model based on 3D convolution and recurrent networks is proposed; JSTA is optimized via a two-stage strategy. The algorithm is applied to both environment-constrained and cross-domain action recognition. Comparisons with more than thirty state-of-the-art methods show that the proposed method reduces model complexity while improving recognition accuracy. (The second sketch after this list illustrates the general idea of joint spatial-temporal attention.)

3. An unsupervised model relying on self-paced learning and a generative network is proposed. The motivation behind this design is to achieve unsupervised representations based on the video distribution. Specifically, an interceptive Self-Paced regularizer (iSPr), which adaptively characterizes sample weights, is proposed. Furthermore, a novel Self-pAced Feature Embedding (SAFE) method, which progresses from simple samples to complex ones, is constructed. Finally, a Cross-Channel Color Gradient (3CG) loss is introduced to keep consistency among video channels, and a Pseudo-Reverse Two-Stream (PRTS) network is presented to achieve probabilistic video representation. Theoretical analysis demonstrates the soundness of the proposed iSPr and the convergence of SAFE. The algorithm is applied to video captioning and video prediction, and extensive experiments against more than twenty state-of-the-art algorithms indicate that the proposed method achieves better results. (The third sketch after this list shows the generic self-paced weighting loop that such methods build on.)
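
To make the first contribution concrete: in a low-rank video model of this family, a video X is split as X = B + F + E, with a nuclear-norm penalty on the background B and an L1 penalty on the foreground F. The following is a minimal sketch under that reading; it uses the matrix case (frames x pixels) and a simplified alternating proximal scheme rather than the dissertation's tensor nuclear norm formulation, and all function names and constants are illustrative assumptions.

```python
# Minimal, illustrative sketch of a VLRR-style decomposition (matrix case,
# simplified alternating proximal scheme). NOT the dissertation's tensor
# nuclear norm model; parameters here are assumptions for illustration.
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Elementwise soft thresholding: proximal operator of the L1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def vlrr_sketch(X, lam=None, mu=1.0, n_iter=100):
    """Split X (n_frames x n_pixels) into low-rank background B,
    sparse foreground F, and residual noise E = X - B - F."""
    if lam is None:
        lam = 1.0 / np.sqrt(max(X.shape))   # common RPCA-style default
    B = np.zeros_like(X)
    F = np.zeros_like(X)
    for _ in range(n_iter):
        B = svt(X - F, 1.0 / mu)            # background: low-rank update
        F = soft(X - B, lam / mu)           # foreground: sparse update
    return B, F, X - B - F

# Usage: each grayscale frame is flattened into one row of X.
X = np.random.rand(40, 32 * 32)             # 40 synthetic 32x32 frames
B, F, E = vlrr_sketch(X)
print(B.shape, np.linalg.matrix_rank(B, tol=1e-3))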
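
For the second contribution, the sketch below shows only the generic idea of clip-level joint spatial-temporal attention over a 5D feature tensor: a softmax-normalized importance weight per time step, and a softmax-normalized saliency map per frame. It is not the dissertation's JSTA architecture (which couples 3D convolution with recurrent networks and a two-stage optimizer); the module name and layer choices are assumptions.

```python
# Generic sketch of joint spatial-temporal attention on features shaped
# (batch, channels, time, height, width). Illustrative only; NOT JSTA.
import torch
import torch.nn as nn

class JointSTAttentionSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Temporal branch: one importance score per time step.
        self.temporal_score = nn.Linear(channels, 1)
        # Spatial branch: 1x1x1 conv giving one saliency score per location.
        self.spatial_score = nn.Conv3d(channels, 1, kernel_size=1)

    def forward(self, x):                                  # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        # Temporal attention: softmax over the T time steps.
        pooled = x.mean(dim=(3, 4)).permute(0, 2, 1)       # (N, T, C)
        a_t = torch.softmax(self.temporal_score(pooled).squeeze(-1), dim=1)
        a_t = a_t.reshape(n, 1, t, 1, 1)
        # Spatial attention: softmax over the H*W locations of each frame.
        s = self.spatial_score(x).reshape(n, 1, t, h * w)
        a_s = torch.softmax(s, dim=-1).reshape(n, 1, t, h, w)
        return x * a_t * a_s                               # re-weighted features

# Usage on a random clip-level feature map:
feats = torch.randn(2, 64, 8, 14, 14)
out = JointSTAttentionSketch(64)(feats)
print(out.shape)                                           # (2, 64, 8, 14, 14)
```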
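
For the third contribution, recall the standard self-paced learning formulation: one alternately minimizes sum_i w_i * L_i(theta) + f(w; k) over model parameters theta and sample weights w_i in [0, 1]. With the classic hard regularizer f(w; k) = -(1/k) * sum_i w_i, the optimal weight is w_i = 1 if L_i(theta) < 1/k and 0 otherwise, and growing k gradually admits harder samples. The toy loop below shows this generic mechanism on robust mean estimation; it uses the standard hard rule, not the dissertation's iSPr (whose truncated form is defined in the thesis), and the data and constants are assumptions.

```python
# Toy sketch of the generic self-paced learning loop (standard hard
# regularizer), shown on robust mean estimation. The dissertation's iSPr
# modifies this weighting rule; its exact form is given in the thesis.
import numpy as np

rng = np.random.default_rng(0)
# 80 inlier samples around 0 plus 20 "hard" outlier samples around 5.
x = np.concatenate([rng.normal(0.0, 0.3, 80), rng.normal(5.0, 0.3, 20)])

mu, k = x.mean(), 0.2                       # initial model and pace parameter
for _ in range(6):
    losses = (x - mu) ** 2                  # per-sample losses
    w = (losses < 1.0 / k).astype(float)    # hard rule: w_i = 1[L_i < 1/k]
    if w.sum() > 0:
        mu = (w * x).sum() / w.sum()        # refit on currently "easy" samples
    k *= 1.5                                # grow pace: admit harder samples
print(mu)                                   # near 0: outliers stay excluded
```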

Keywords: video representation; spatial-temporal convolution; attention mechanism; low-rank decomposition; unsupervised learning
Language: Chinese
Sub-direction classification: Image and Video Processing and Analysis
Document type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/23779
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems / Advanced Spatio-temporal Data Analysis and Learning
Recommended citation (GB/T 7714):
于廷照. 复杂场景视频表示方法及其应用研究[D]. 中国科学院自动化研究所. 中国科学院大学, 2019.
Files in this item:
File name/size: 【Tsingzao】复杂场景视频表示方法 (25438 KB); Document type: Dissertation; Access: Open access; License: CC BY-NC-SA