CASIA OpenIR > National Laboratory of Pattern Recognition > Advanced Data Analysis and Learning
Thesis Advisor: 潘春洪
Degree Grantor: University of Chinese Academy of Sciences
Place of Conferral: Institute of Automation, Chinese Academy of Sciences
Degree Discipline: Pattern Recognition and Intelligent Systems
Keyword: Video Representation; Spatial-Temporal Convolution; Attention Mechanism; Low Rank Decomposition; Unsupervised Learning

Abstract



1. A video representation algorithm based on tensor pseudo low rank constraints is proposed. Its core idea is to use tensor low rank constraints to improve the robustness of the representation model. Specifically, the dissertation builds a video low rank representation model on a tensor nuclear norm constraint, decoupling video data into a low rank static background, a sparse dynamic foreground, and frame-level noise. Guided by the output of this model, a pseudo tensor low rank network is then constructed so that the model parameters can be learned end to end. Finally, a purely data-driven network initialization strategy that requires no back-propagation is proposed, accelerating model convergence. The algorithm is applied to extreme low resolution action recognition, and comparative experiments show that it is more robust to noise and converges faster.

2. A video representation algorithm based on a spatial-temporal attention mechanism is proposed. Its core idea is to use attention to achieve accurate video representation. Specifically, based on tensor low rank decomposition and a channel separation transform, the dissertation builds a cascaded temporal-spatial network that selects the important time stamps and reduces network size. Using intermediate feature maps as semantic guidance, a semantic guided module and a semantic guided network are then constructed, realizing spatial importance selection and information filtering. Finally, built on spatial-temporal convolution, a clip-level joint spatial-temporal attention network is proposed, in which temporal and spatial importance are obtained jointly through a two-stage optimization algorithm. The algorithm is applied to action recognition in ordinary and cross-domain scenarios; comparisons with more than thirty mainstream methods show that it effectively improves the accuracy of video representation.

3. An unsupervised video representation algorithm based on self-paced learning and generative networks is proposed. Its core idea is to build unsupervised video representations according to the sample distribution. Specifically, the dissertation proposes a truncated self-paced regularizer that adaptively and precisely characterizes how learnable each sample is. On this basis, within the self-paced learning framework, an unsupervised self-paced feature embedding model is constructed that improves dynamically from simple to complex. Finally, from the perspective of channel consistency, a cross-channel color gradient loss is designed, and, guided by an adversarial learning strategy, a pseudo-reverse two-stream generative network is established to achieve probabilistic video representation. The dissertation theoretically proves the rationality of the proposed self-paced regularizer and the convergence of the self-paced feature embedding model. Applied to video captioning and video prediction, comparisons with more than twenty mainstream methods show that the algorithm outperforms mainstream unsupervised methods on multiple evaluation metrics.

Other Abstract

Videos are ubiquitous in social and military applications, such as intelligent homes, driving assistance, city surveillance and missile guidance. However, complex videos contain much redundant information and many irrelevant features, which makes accurate video understanding difficult. Consequently, how to obtain a robust video representation has become a hot topic in computer vision.

Video representation aims to extract the latent information in videos, but traditional methods still face many challenges. First, how to improve the robustness of representation models, which can be impacted by cluttered backgrounds and occlusions. Second, how to obtain an accurate video representation in the presence of spatial-temporal redundancy. Third, how to construct an effective unsupervised video representation when the video distribution is inhomogeneous. In view of these issues, the contributions of this dissertation are as follows:

1. A pseudo low rank model is proposed. The motivation behind this design is to achieve robust video representation based on tensor nuclear norm regularization. Specifically, a Video Low Rank Representation (VLRR) model, which decomposes the original video into a low rank background, a sparse foreground and frame-level noise, is first constructed. Building on the output of VLRR, a pseudo Low Rank Network (pLRN), which is capable of learning network parameters in an end-to-end manner, is presented. Finally, a totally data-driven network initialization strategy, which accelerates network convergence, is proposed. The algorithm is applied to extreme Low Resolution (eLR) action recognition, and experiments demonstrate that the proposed model is more robust and converges faster.
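The background/foreground split underlying VLRR follows the classic robust PCA formulation. A minimal sketch is given below — not the thesis code, and using the matrix rather than the tensor nuclear norm: an inexact-ALM iteration whose two proximal steps are singular value thresholding (nuclear norm) and soft thresholding (l1 norm). The function names and parameter defaults are illustrative assumptions.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def shrink(M, tau):
    """Soft thresholding: proximal operator of the l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def low_rank_sparse(V, lam=None, n_iter=100, rho=1.5):
    """Split a (pixels x frames) video matrix V into a low-rank
    background B and a sparse foreground F (robust PCA, inexact ALM)."""
    m, n = V.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = 1.25 / np.linalg.norm(V, 2)          # initial penalty weight
    B = np.zeros_like(V)
    F = np.zeros_like(V)
    Y = np.zeros_like(V)                      # Lagrange multipliers
    for _ in range(n_iter):
        B = svt(V - F + Y / mu, 1.0 / mu)     # low-rank update
        F = shrink(V - B + Y / mu, lam / mu)  # sparse update
        Y = Y + mu * (V - B - F)              # dual ascent on the residual
        mu = min(rho * mu, 1e7)               # tighten the penalty
    return B, F
```

On a clip whose frames share a static background, B recovers that (nearly rank-one) component while F isolates the few moving pixels; the end-to-end pLRN replaces this iterative solver with learned layers.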

2. A joint spatial-temporal attention model is proposed. The motivation behind this design is to achieve accurate video representation built on an attention mechanism. Specifically, a Cascaded Temporal Spatial network (CTS), which selects the most important time stamps and reduces model complexity, is proposed. After that, three kinds of Semantic Guided Module (SGM) and a Semantic Guided Network (SGN), which focus on the most salient areas of videos and filter out irrelevant information, are elaborately designed. Finally, a clip-level Joint Spatial Temporal Attention (JSTA) model based on 3D convolution and recurrent networks is proposed; JSTA is optimized via a two-stage strategy. The algorithm is applied to both environment-constrained and cross-domain action recognition. Experiments compared with more than thirty state-of-the-art methods illustrate that the proposed method reduces model complexity and improves recognition accuracy.
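The cascaded temporal-then-spatial selection can be illustrated at the shape level with a toy attention readout — a hedged sketch, not the CTS/SGN/JSTA implementation: frame-level scores pick important time stamps, then location-level scores pick salient areas of the time-pooled map. The projection vectors `w_t` and `w_s` stand in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_st_attention(feats, w_t, w_s):
    """Reduce a clip feature tensor (T x H x W x C) to a C-dim descriptor
    by cascaded temporal then spatial attention."""
    T, H, W, C = feats.shape
    # Temporal importance: one score per frame from its mean feature.
    frame_desc = feats.mean(axis=(1, 2))                     # (T, C)
    alpha = softmax(frame_desc @ w_t)                        # (T,)
    clip = (alpha[:, None, None, None] * feats).sum(axis=0)  # (H, W, C)
    # Spatial importance: one score per location of the pooled map.
    beta = softmax((clip @ w_s).reshape(-1)).reshape(H, W)   # (H, W)
    return (beta[..., None] * clip).sum(axis=(0, 1))         # (C,)
```

With zero projection vectors both softmaxes become uniform and the readout degenerates to plain average pooling, which makes the role of the learned weights easy to see.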

3. An unsupervised model relying on self-paced learning and a generative network is proposed. The motivation behind this design is to achieve unsupervised representations based on the video distribution. Specifically, an interceptive Self Paced regularizer (iSPr), which depicts sample weights adaptively, is proposed. Furthermore, a novel Self pAced Feature Embedding (SAFE) method, which grows from simple samples to complex ones, is constructed. Finally, a Cross Channel Color Gradient (3CG) loss is proposed to keep consistency among video channels, and a Pseudo-Reverse Two-Stream (PRTS) network is presented to achieve probabilistic video representation. Theoretical analysis demonstrates the rationality of the proposed iSPr and the convergence of SAFE. The algorithm is applied to video captioning and video prediction, and extensive experiments compared with more than twenty state-of-the-art algorithms indicate that the proposed method achieves better results.
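The self-paced mechanism can be made concrete with a toy regression: a truncated regularizer yields closed-form sample weights that are 1 for easy samples, ramp down near the age threshold, and vanish for hard ones, while the alternating loop grows the threshold so training proceeds from simple to complex. This is an illustrative sketch under assumed functional forms — the exact iSPr/SAFE formulations are in the thesis — and all names below are hypothetical.

```python
import numpy as np

def truncated_sp_weights(losses, lam, margin):
    """Truncated self-paced weights: 1 for losses below lam, a linear
    ramp down over [lam, lam + margin], and 0 beyond (hard samples stay
    excluded until the age parameter lam grows)."""
    return np.clip((lam + margin - losses) / margin, 0.0, 1.0)

def self_paced_fit(X, y, lam0=0.5, growth=1.5, margin=0.2, rounds=5):
    """Alternate weighted least squares and weight updates, growing lam
    so the model is trained on easy samples first, then harder ones."""
    n, d = X.shape
    w = np.zeros(d)
    lam = lam0
    for _ in range(rounds):
        losses = (X @ w - y) ** 2
        v = truncated_sp_weights(losses, lam, margin)  # sample weights
        A = X.T @ (v[:, None] * X) + 1e-6 * np.eye(d)  # weighted normal eqs
        w = np.linalg.solve(A, X.T @ (v * y))
        lam *= growth                                  # admit harder samples
    return w
```

The ramp is what distinguishes a truncated/soft regularizer from the classic hard self-paced scheme, which jumps from weight 1 to 0 at the threshold.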

Document Type: Dissertation (学位论文)
Recommended Citation (GB/T 7714):
于廷照. 复杂场景视频表示方法及其应用研究[D]. 中国科学院自动化研究所, 中国科学院大学, 2019.
Files in This Item:
File Name/Size: 【Tsingzao】复杂场景视频表示方法 (25438 KB)
DocType: Dissertation | Access: Open Access | License: CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.