复杂场景视频表示方法及其应用研究 (Research on Video Representation Methods for Complex Scenes and Their Applications)
Author: 于廷照
Subtype: Doctoral (博士)
Thesis Advisor: 潘春洪
Date: 2019-05
Degree Grantor: 中国科学院大学 (University of Chinese Academy of Sciences)
Place of Conferral: 中国科学院自动化研究所 (Institute of Automation, Chinese Academy of Sciences)
Degree Discipline: Pattern Recognition and Intelligent Systems
Keyword: video representation; spatial-temporal convolution; attention mechanism; low-rank decomposition; unsupervised learning
Abstract

Video data are ubiquitous in civilian and military domains such as smart homes, driver assistance, city surveillance, and missile guidance, and they carry enormous application potential and market value. However, videos of complex scenes contain large amounts of redundant information and irrelevant features, which poses a serious challenge to accurate video understanding. How to extract robust representations of complex-scene videos is therefore a research focus in computer vision.

Video representation aims to mine the latent, valuable information in video content, yet existing methods still face many problems. First, video backgrounds are typically cluttered and occlusions are prominent, so improving model robustness is the foremost difficulty of video representation. Second, video subjects vary greatly in pose and exhibit pronounced spatial-temporal redundancy, so achieving accurate video representation is another research difficulty. Finally, video data are unevenly distributed and hard to annotate, so building effective unsupervised video representation models is also an important challenge. This dissertation addresses these problems, and its main contributions are as follows:

1. A video representation algorithm based on a tensor pseudo-low-rank constraint is proposed. Its core idea is to use a tensor low-rank constraint to improve the robustness of the representation model. Specifically, the dissertation builds a video low-rank representation model on a tensor nuclear norm constraint, decoupling video data into a low-rank static background, a sparse dynamic foreground, and frame noise. Guided by the output of this model, a pseudo tensor low-rank network is then constructed, enabling end-to-end learning of the model parameters. Finally, a purely data-driven network initialization strategy that requires no back-propagation is proposed, accelerating model convergence. The algorithm is applied to action recognition in extreme low-resolution videos, and comparative experiments show that it is more robust to noise and converges faster.

2. A video representation algorithm based on a spatial-temporal attention mechanism is proposed. Its core idea is to use attention to achieve accurate video representation. Specifically, the dissertation builds a cascaded temporal-spatial network on tensor low-rank decomposition and a channel-separation transform, realizing temporal importance selection while reducing network size. Then, using intermediate feature maps as semantic guidance, it constructs a semantic guidance module and proposes a semantic guidance network, realizing spatial importance selection and information filtering. Finally, building on spatial-temporal convolution, it proposes a clip-level spatial-temporal attention network in which a two-stage optimization algorithm jointly captures temporal and spatial importance. The algorithm is applied to action recognition in both ordinary and cross-domain scenarios, and comparisons with more than thirty mainstream algorithms show that it effectively improves the accuracy of video representation.

3. An unsupervised video representation algorithm based on self-paced learning and generative networks is proposed. Its core idea is to build unsupervised video representations according to the sample distribution. Specifically, the dissertation proposes a truncated self-paced regularizer that adaptively and precisely characterizes how learnable each sample is. On this basis, within the self-paced learning framework, it constructs an unsupervised self-paced feature embedding model that improves dynamically from simple to complex. Finally, to preserve channel consistency, it constructs a cross-channel color gradient loss and, guided by an adversarial learning strategy, establishes a reverse pseudo two-stream generative network, realizing probabilistic video representation. The dissertation theoretically proves the soundness of the proposed self-paced regularizer and the convergence of the self-paced feature embedding model. The algorithm is applied to video captioning and video prediction, and comparisons with more than twenty mainstream algorithms show that it outperforms mainstream unsupervised methods on multiple evaluation metrics.

Other Abstract

Videos exist widely in every aspect of civilian and military applications, such as intelligent homes, driving assistance, city surveillance, and missile guidance, and they carry enormous application potential and market value. However, complex-scene videos contain much redundant information and many irrelevant features, which makes accurate video understanding difficult. Consequently, how to obtain a robust video representation has become a hot topic in computer vision.

Video representation aims to mine the latent information in videos, but existing methods still face many challenges. First, how can the robustness of representation models be improved in the presence of cluttered backgrounds and occlusions? Second, how can an accurate video representation be obtained despite spatial-temporal redundancy? Third, how can an effective unsupervised video representation be constructed given the inhomogeneous distribution of video data? In view of these issues, the contributions of this dissertation are as follows:

1. A pseudo low-rank model is proposed. The motivation behind this design is to achieve robust video representation via tensor nuclear norm regularization. Specifically, a Video Low Rank Representation (VLRR) model, which decomposes the original video into a low-rank background, a sparse foreground, and frame-level noise, is first constructed. Guided by the output of VLRR, a pseudo Low Rank Network (pLRN), which learns the network parameters in an end-to-end manner, is then presented. Finally, a fully data-driven network initialization strategy, which accelerates network convergence, is introduced. The algorithm is applied to extreme Low Resolution (eLR) action recognition, and experiments demonstrate that the proposed model is more robust to noise and converges faster.
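
To make the decomposition idea concrete, the following Python/NumPy sketch implements classic matrix Robust PCA via an inexact augmented-Lagrangian loop with singular value thresholding. It is a minimal simplification, not the thesis's VLRR model: VLRR uses a tensor nuclear norm, while this sketch unfolds the video into a matrix, and all function and parameter names here are hypothetical.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def shrink(M, tau):
    """Soft thresholding: proximal operator of the L1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def low_rank_decompose(X, lam=None, mu=1e-2, n_iter=200, tol=1e-6):
    """Split X (one vectorized frame per column) into a low-rank part L
    (static background) and a sparse part S (dynamic foreground); the
    remaining residual plays the role of frame-level noise."""
    if lam is None:
        lam = 1.0 / np.sqrt(max(X.shape))        # common default weight for RPCA
    L, S, Y = (np.zeros_like(X) for _ in range(3))
    for _ in range(n_iter):
        L = svt(X - S + Y / mu, 1.0 / mu)        # low-rank update
        S = shrink(X - L + Y / mu, lam / mu)     # sparse update
        R = X - L - S                            # residual / noise term
        Y += mu * R                              # dual ascent on the multiplier
        if np.linalg.norm(R) <= tol * np.linalg.norm(X):
            break
    return L, S, X - L - S

# Usage: for a clip of T frames of size H x W, X has shape (H * W, T).
```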

2. A joint spatial-temporal attention model is proposed. The motivation behind this design is to achieve accurate video representation via an attention mechanism. Specifically, a Cascaded Temporal Spatial network (CTS), which selects the most important time stamps and reduces model complexity, is proposed. After that, three kinds of Semantic Guided Module (SGM) and a Semantic Guided Network (SGN), which focus on the most salient areas of videos and filter out irrelevant information, are carefully designed. Finally, a clip-level Joint Spatial Temporal Attention (JSTA) model based on 3D convolution and recurrent networks is proposed; JSTA is optimized via a two-stage strategy. The algorithm is applied to both environment-constrained and cross-domain action recognition. Experiments comparing against more than thirty state-of-the-art methods show that the proposed method reduces model complexity and improves recognition accuracy.
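
As a rough illustration of joint spatial-temporal importance selection, here is a minimal PyTorch sketch: a per-frame spatial softmax highlights salient areas, and a temporal softmax then weights the time stamps of the clip. This is an illustrative module under assumed feature shapes, not the CTS/SGN/JSTA architecture itself; the class and layer names are hypothetical.

```python
import torch
import torch.nn as nn

class SpatialTemporalAttention(nn.Module):
    """Weight clip features over space, then over time.
    Expects features of shape (B, C, T, H, W), e.g. from a 3D CNN."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv3d(channels, 1, kernel_size=1)  # saliency map per frame
        self.temporal = nn.Linear(channels, 1)                # score per time stamp

    def forward(self, x):                                     # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        # Spatial importance: softmax over the H*W locations of each frame.
        s = torch.softmax(self.spatial(x).view(B, 1, T, H * W), dim=-1)
        f = (x * s.view(B, 1, T, H, W)).sum(dim=(3, 4))       # (B, C, T)
        # Temporal importance: softmax over the T attended frame descriptors.
        t = torch.softmax(self.temporal(f.transpose(1, 2)), dim=1)  # (B, T, 1)
        return (f.transpose(1, 2) * t).sum(dim=1)             # clip descriptor (B, C)
```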

3. An unsupervised model relying on self-paced learning and a generative network is proposed. The motivation behind this design is to achieve unsupervised representations grounded in the video distribution. Specifically, an interceptive Self-Paced regularizer (iSPr), which depicts sample weights adaptively, is proposed. Furthermore, a novel Self pAced Feature Embedding (SAFE) method, which progresses from simple samples to complex ones, is constructed. Finally, a Cross Channel Color Gradient (3CG) loss is introduced to keep the consistency among video channels, and a Pseudo-Reverse Two-Stream (PRTS) network is presented to achieve probabilistic video representation. Theoretical analysis demonstrates the soundness of the proposed iSPr and the convergence of SAFE. The algorithm is applied to video captioning and video prediction, and extensive experiments comparing against more than twenty state-of-the-art algorithms indicate that the proposed method achieves better results.
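
For readers unfamiliar with self-paced learning, the sketch below shows the textbook alternating scheme that SAFE builds on: samples are re-weighted by their current losses, the model is refit on the selected samples, and the "age" parameter grows so training proceeds from simple samples to complex ones. The hard binary regularizer used here is the classic one, not the proposed iSPr (whose truncated form is developed in the thesis), and the callables are hypothetical placeholders.

```python
import numpy as np

def hard_spl_weights(losses, lam):
    """Classic hard self-paced weights: v_i = 1 iff loss_i < lam.
    This is the closed-form minimizer of sum_i v_i * l_i - lam * sum_i v_i
    over v in [0, 1]^n."""
    return (losses < lam).astype(float)

def self_paced_train(fit_weighted, compute_losses, X, lam=0.1, growth=1.3, rounds=10):
    """Alternate between sample re-weighting and weighted model fitting,
    growing lam each round so harder samples are gradually admitted."""
    for _ in range(rounds):
        v = hard_spl_weights(compute_losses(X), lam)  # step 1: pick easy samples
        fit_weighted(X, v)                            # step 2: refit on them
        lam *= growth                                 # step 3: raise the age parameter
```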

Pages: 146
Language: Chinese (中文)
Document Type: Thesis (学位论文)
Identifier: http://ir.ia.ac.cn/handle/173211/23779
Collection: 模式识别国家重点实验室_先进数据分析与学习
Recommended Citation (GB/T 7714):
于廷照. 复杂场景视频表示方法及其应用研究[D]. 中国科学院自动化研究所. 中国科学院大学,2019.
Files in This Item:
File: 【Tsingzao】复杂场景视频表示方法 (25438 KB) · DocType: Thesis (学位论文) · Access: Open Access (开放获取) · License: CC BY-NC-SA