Research on Self-Supervised Video Action Recognition Based on Multimodal Deep Contrastive Clustering
魏久桐
2022-05-20
Pages: 60
Subtype: Master's
Abstract

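With the development of the mobile Internet and multimedia technology, the rise of numerous video applications has driven explosive growth in the number of online videos. As a mainstream medium of information dissemination, video is now widely used in digital media, science and technology education, security surveillance, and many other fields. However, while the growing volume of video data meets user needs, it also poses great challenges for organizing, categorizing, and applying that data. Existing supervised and semi-supervised video action classification methods have achieved significant performance gains, but they usually require large-scale, high-quality labeled data for model training. How to accurately extract useful information from video data without category labels, and how to explore the intrinsic structure and category distribution of video data, has therefore become one of the hot research topics in computer vision and artificial intelligence.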
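This thesis focuses on self-supervised video action recognition based on multimodal feature learning and proposes a novel contrastive learning framework based on deep clustering. Based on the characteristics of the video feature matrix, contrastive learning is first divided into instance-level contrast and cluster-level contrast. To address background bias in videos, a positive-sample augmentation method based on background noise is designed in the instance-level contrast module. To address semantic bias in instance-level contrastive learning, a higher-level cluster contrast is designed to supplement and correct it. Finally, the framework is applied to the multimodal setting of audio-video fusion. A variety of transfer experiments demonstrate the effectiveness of the proposed algorithms.
Specifically, the main work and contributions of this thesis are as follows: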
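• A background-noise data augmentation method is proposed and applied to a self-supervised video framework. Actions in video data and the scenes in which they occur are often strongly dependent on each other. In this situation, a neural network does not need to learn complex spatiotemporal semantics; it can reach a certain classification accuracy from spatial appearance information alone, and this background bias leads to insufficient generalization. To address this problem, this thesis borrows the idea of mixing-based augmentation from the image domain: a static frame is randomly sampled from the training video and overlaid on every frame of the training sample, and the result is used as a positive sample for contrastive learning. The appearance of the augmented video changes significantly, but because the overlaid static frame has a pixel distribution similar to that of the training sample, the augmentation does not affect the optical flow much. The semantics of the augmented sample are therefore largely preserved, which pushes the model to mine the deep temporal semantics of the video. The effectiveness of this approach is verified on action recognition and video retrieval tasks.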
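• A cluster-level contrastive learning method for video is proposed. Most mainstream instance-level contrastive learning methods for video obtain negative samples by sampling directly from different videos, but such sampling often pushes videos that are actually semantically similar away from each other in the feature space, causing semantic confusion. To address this problem, this thesis exploits the fact that the columns of the feature matrix represent classes and designs a cluster-level contrastive loss. Specifically, an encoder first represents each video as a one-dimensional row vector, and these vectors form a two-dimensional feature matrix; contrastive learning is then performed on the column vectors of this matrix. Negative samples set up in this way are no longer restricted to different instances. Contrast is instead performed on higher-level cluster semantics, which pushes the model to learn and correct features at different semantic levels. Transfer experiments on recognition and retrieval are conducted on two public action recognition datasets, and the results are comparable to those of the latest methods.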
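• Cluster-level contrastive learning is combined with audio-video multimodal data, yielding a self-supervised video action recognition framework based on multimodal deep clustering. Mainstream multimodal contrastive learning frameworks usually map the data of the two modalities into the same feature space for alignment. However, multimodal data are complementary at the semantic level of the video rather than fully consistent, and strict alignment of multimodal features tends to destroy the manifold distribution of the multimodal data in the feature space. This thesis therefore applies cluster-level contrastive learning to audio-video multimodal contrastive learning. Specifically, the model extracts features from the audio input and the RGB input separately and performs cluster-level contrastive learning across the two. This cluster-level multimodal contrast alleviates the alignment problem described above while mining deeper information between modalities, improving the accuracy and generalization of the model's video semantic representations. To verify the effectiveness of the proposed method, transfer experiments on recognition and retrieval are conducted on two public video action recognition datasets; the method not only surpasses mainstream multimodal self-supervised methods but also yields a number of valuable conclusions.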
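
The background-noise augmentation in the first contribution can be illustrated with a short NumPy sketch. This is not the thesis implementation: the function name `add_background_noise`, the linear blend, the blending weight of 0.3, and the choice to draw the static frame from the clip itself are all assumptions made for illustration.

```python
import numpy as np

def add_background_noise(clip, rng, weight=0.3):
    """Blend one randomly chosen static frame into every frame of a clip.

    clip:   float array of shape (T, H, W, C), values in [0, 1].
    weight: blending weight of the static frame (hypothetical value).
    """
    t = rng.integers(clip.shape[0])   # pick the static frame at random
    static = clip[t]                  # shape (H, W, C)
    # Broadcasting adds the same static frame to all T frames.
    return (1.0 - weight) * clip + weight * static

rng = np.random.default_rng(0)
clip = rng.random((8, 16, 16, 3))     # toy clip: 8 frames of 16x16 RGB
augmented = add_background_noise(clip, rng, weight=0.3)

# Frame-to-frame differences (a crude stand-in for motion) are only rescaled.
motion_before = np.diff(clip, axis=0)
motion_after = np.diff(augmented, axis=0)
print(np.allclose(motion_after, 0.7 * motion_before))
```

Because the same frame is added at every time step, it cancels out of the temporal differences under this linear blend: the appearance of each frame changes, while the differences between consecutive frames are only scaled by `1 - weight`. This is one simple way to see why such an augmentation can leave motion cues largely intact.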
 

Other Abstract

With the development of the mobile Internet and multimedia technology, the rise of a large number of video applications has led to explosive growth in online video data. As one of the mainstream media for transmitting information, video is widely used in many fields, such as digital media, science and technology education, and security monitoring. However, while meeting users' needs, this volume of video also poses huge challenges for the organization, categorization, and application of video data. Although existing supervised and semi-supervised video classification methods have achieved significant performance gains, they usually require a large amount of high-quality labeled data for model training. Therefore, how to accurately obtain valid information from unlabeled video data and explore the essential structural characteristics and category distribution of video data has become a hot research topic in computer vision and artificial intelligence.

This thesis focuses on self-supervised video action recognition based on multimodal feature learning and proposes a novel contrastive learning framework based on deep clustering. First, according to the characteristics of the video feature matrix, we divide contrastive learning into instance-level contrast and cluster-level contrast. To address background bias in videos, we design a positive-sample augmentation method based on background noise in the instance-level contrastive module. To address semantic bias in instance-level contrastive learning, we design cluster-level contrastive learning as a complement. Finally, we apply the framework to multimodal audio-video fusion. We conduct a variety of transfer learning experiments to demonstrate the effectiveness of our approaches.
More specifically, the main contributions of this thesis are summarized as follows:


• We propose a data augmentation method based on background noise and apply it to self-supervised video learning. An action in a video and the context in which it occurs tend to be highly dependent. In this case, a neural network does not need to learn complex spatiotemporal semantics; it can rely on spatial appearance alone to achieve a certain classification accuracy. Such background bias leads to insufficient generalization of the model. Inspired by Mixup in the image domain, we randomly sample a static frame from the training video and overlay it on each frame of the training clip, then use the augmented clip as a positive sample for contrastive learning. Although the spatial appearance of the augmented video changes significantly, the static frame has a pixel distribution similar to that of the training sample, so the augmentation does not affect the optical flow much. The semantics of the augmented sample are therefore largely preserved, which prompts the model to exploit the spatiotemporal semantics of the video. We verify the effectiveness of this augmentation on action recognition and video retrieval tasks.


• We propose a novel cluster-level contrastive learning method for video understanding. Most existing contrastive learning methods for video representation learning sample negative clips from different videos. However, this sampling strategy tends to push videos with similar semantics away from each other in the feature space, resulting in semantic confusion. To solve this problem, we design a cluster-level contrastive loss based on the observation that the columns of the feature matrix represent class information. In particular, an encoder represents each video as a one-dimensional row vector, and these vectors form a two-dimensional feature matrix. The column vectors of the feature matrix are then used for contrastive learning. Negative samples constructed in this way are no longer limited to different instances; contrast is instead performed on higher-level cluster semantics, prompting the model to learn and refine features at different semantic levels. We perform action recognition and video retrieval transfer learning on two public datasets and achieve performance comparable to state-of-the-art approaches.


• We propose a self-supervised video action recognition framework based on multimodal deep clustering by combining audio and video with cluster-level contrastive learning. Most existing multimodal methods map data from different modalities into the same feature space for alignment. However, the semantic information carried by different modalities is complementary rather than fully consistent, and strict alignment tends to destroy the manifold distribution of the multimodal data in the feature space. Therefore, we apply cluster-level contrastive learning to audio-video multimodal contrastive learning. Specifically, we compute the feature matrices of the audio and RGB inputs separately and perform cluster-level contrastive learning across the two. This cluster-level multimodal contrast alleviates the alignment problem mentioned above. Moreover, it can capture deeper semantic information shared between the modalities and improve the accuracy and generalization of the learned video representations. To evaluate our method, we perform action recognition and video retrieval transfer learning on UCF101 and HMDB51. Our proposed method not only outperforms mainstream multimodal self-supervised approaches but also leads to several valuable conclusions.
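
The cluster-level contrast described in the second and third contributions can be sketched as follows. This is a simplified, InfoNCE-style illustration rather than the exact loss of the thesis: the function name `cluster_contrastive_loss`, the cosine similarity between columns, the symmetric averaging, and the temperature of 0.5 are assumptions. The two inputs may be feature matrices of two augmented views of the same videos, or of the RGB and audio streams in the cross-modal case.

```python
import numpy as np

def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def cluster_contrastive_loss(feat_a, feat_b, temperature=0.5):
    """feat_a, feat_b: (N, K) feature matrices; row = video, column = cluster."""
    # Treat each column as a cluster representation and L2-normalize it.
    col_a = feat_a / np.linalg.norm(feat_a, axis=0, keepdims=True)
    col_b = feat_b / np.linalg.norm(feat_b, axis=0, keepdims=True)
    logits = col_a.T @ col_b / temperature   # (K, K) scaled cosine similarities
    # Column k of A is the positive for column k of B; other columns are negatives.
    loss_ab = -np.diag(_log_softmax(logits, axis=1)).mean()
    loss_ba = -np.diag(_log_softmax(logits, axis=0)).mean()
    return 0.5 * (loss_ab + loss_ba)

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=64)              # toy cluster assignments
rgb = np.eye(4)[labels] + 0.1 * rng.random((64, 4))
audio = np.eye(4)[labels] + 0.1 * rng.random((64, 4))

aligned = cluster_contrastive_loss(rgb, audio)
# Cyclically shifting the audio columns breaks the cluster correspondence.
shuffled = cluster_contrastive_loss(rgb, np.roll(audio, 1, axis=1))
print(aligned < shuffled)
```

In this toy setup the loss is lower when the two matrices agree on which videos belong to which cluster, without requiring the individual row vectors of the two modalities to coincide.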
 

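Keyword: self-supervised learning, multimodal content understanding, action recognition, deep clustering
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/48694
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems_Video Content Security
Graduates_Master's Theses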
Recommended Citation
GB/T 7714
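魏久桐. Research on Self-Supervised Video Action Recognition Based on Multimodal Deep Contrastive Clustering [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2022.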
Files in This Item:
File Name/Size | DocType | Version | Access | License
魏久桐-毕业论文(签名).pdf (4465KB) | Thesis | | Open Access | CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.