With the development of the mobile Internet and multimedia technology, the rise of a large number of video applications has led to explosive growth in online video data. As one of the mainstream media for transmitting information, video is widely used in many fields, such as digital media, science and technology education, and security monitoring. However, while meeting the needs of users, the sheer volume of videos poses huge challenges to their organization, categorization, and application. Although existing supervised and semi-supervised video classification methods achieve strong performance, such methods usually require a large amount of high-quality labeled data for model training. Therefore, how to accurately extract valid information from unlabeled video data, and how to explore the essential structural characteristics and category distribution of video data, have become hot research topics in the fields of computer vision and artificial intelligence.
This thesis focuses on self-supervised video action recognition based on multimodal feature learning, and proposes a novel contrastive learning framework built on deep clustering. First, according to the characteristics of the video feature matrix, we divide contrastive learning into instance-level contrast and cluster-level contrast. To address background bias in videos, a positive-sample augmentation method based on background noise is designed in the instance-level contrastive module. To mitigate the semantic bias inherent in instance-level contrastive learning, we design a complementary cluster-level contrastive objective. Finally, we extend the framework to multimodal audio-visual fusion. We conduct a variety of transfer learning experiments to demonstrate the effectiveness of our approaches.
More specifically, the main contributions of this thesis are summarized as follows:
• We propose a data augmentation method based on background noise and apply it to video self-supervised learning. The action in a video and the background against which it happens tend to be highly correlated. As a result, a conventional neural network does not need to learn complex spatiotemporal semantics; it can reach a certain classification accuracy by relying on spatial (background) cues alone. Such background bias leads to poor generalization. Inspired by Mixup in the image domain, we randomly sample a static frame from the training video and overlay it onto each frame of the training clip, then use this augmented positive sample for contrastive learning. Although the spatial appearance of the augmented video changes significantly, the static frame remains similar in pixel distribution to the training sample, and the augmentation barely affects the optical flow. The semantics of the augmented sample are therefore largely preserved, which pushes the model to mine the spatiotemporal semantics of the video. We verify the effectiveness of this augmentation on action recognition and video retrieval tasks.
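The augmentation described above can be sketched in a few lines. This is a minimal illustration, not the thesis's exact recipe: the helper name `background_mix`, the blending weight `lam`, and sampling the static frame from the clip itself (rather than from anywhere in the source video) are all assumptions made for brevity.

```python
import random

def background_mix(clip, lam=0.5, rng=random):
    """Overlay one randomly sampled static frame onto every frame of a clip.

    clip: list of frames, each frame a flat list of pixel values.
    lam:  Mixup-style weight kept for the original frames (assumed value).
    """
    static = rng.choice(clip)  # the static "background noise" frame
    return [[lam * p + (1.0 - lam) * s for p, s in zip(frame, static)]
            for frame in clip]
```

In practice the frames would be image tensors and the blend would be applied per pixel per channel; because the same static frame is added to every frame of the clip, frame-to-frame differences (and hence the optical flow) are largely unchanged.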
• We propose a novel cluster-level contrastive learning method for video understanding. Most existing contrastive learning methods for video representation learning sample negative clips from different videos. However, this sampling strategy tends to push videos with similar semantics away from each other in the feature space, causing semantic confusion. To solve this problem, we design a cluster-level contrastive loss based on the observation that the columns of the feature matrix encode class information. Concretely, each video is encoded as a one-dimensional row vector, so a batch of videos forms a two-dimensional feature matrix, and the column vectors of this matrix are used for contrastive learning. Negatives sampled in this way are no longer restricted to individual instances but are contrasted at the higher level of cluster semantics, prompting the model to learn and refine features at different semantic levels. We perform action recognition and video retrieval transfer learning on two public datasets and achieve performance comparable to state-of-the-art approaches.
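The column-wise contrast can be sketched as follows, assuming the common InfoNCE form: each column of the N x K feature matrix is treated as one cluster's representation, and the positive for a column in one augmented view is the same-index column in the other view. The function names, the cosine similarity, and the temperature `tau` are illustrative assumptions, not the thesis's exact loss.

```python
import math

def _columns(mat):
    """Transpose an N x K matrix (list of rows) into a list of K columns."""
    return [list(col) for col in zip(*mat)]

def _cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def cluster_contrastive_loss(z_a, z_b, tau=0.5):
    """InfoNCE over cluster (column) vectors of two augmented views.

    z_a, z_b: N x K feature matrices (one row per video in the batch);
    the positive pair for column k of z_a is column k of z_b.
    """
    cols_a, cols_b = _columns(z_a), _columns(z_b)
    loss = 0.0
    for k, u in enumerate(cols_a):
        sims = [math.exp(_cosine(u, v) / tau) for v in cols_b]
        loss += -math.log(sims[k] / sum(sims))  # softmax over K clusters
    return loss / len(cols_a)
```

Note how the roles flip relative to instance-level contrast: rows (instances) are contrasted in the instance-level module, while here the K columns act as cluster prototypes, so negatives are other clusters rather than other videos.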
• We propose a self-supervised video action recognition framework based on multimodal deep clustering, combining audio and video through cluster-level contrastive learning. Most existing multimodal methods map multimodal data into the same feature space for alignment. However, the semantic information carried by different modalities is complementary rather than identical, so such alignment actually distorts the natural distribution of multimodal data in the feature space. We therefore apply cluster-level contrastive learning to audio-visual interaction: we compute the feature matrices of the audio and RGB inputs separately and perform cluster-level contrastive learning across the two modalities. This cluster-level multimodal contrast avoids the alignment problem above, captures richer cross-modal semantics, and improves the accuracy and generalization of the learned video representations. To evaluate our method, we perform action recognition and video retrieval transfer learning on UCF101 and HMDB51. Our method not only outperforms state-of-the-art approaches but also yields several valuable conclusions.
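The cross-modal variant of the cluster-level contrast can be sketched as below: the same column-wise InfoNCE is computed between the RGB and audio feature matrices, symmetrized over the two directions, so that matching cluster indices across modalities form the positive pairs while the instance features themselves are never forced into one aligned space. The symmetric weighting and the loss form are assumptions for illustration.

```python
import math

def _cluster_nce(cols_a, cols_b, tau):
    """InfoNCE over cluster (column) vectors: positive = same cluster index."""
    def cos(u, v):
        d = sum(x * y for x, y in zip(u, v))
        return d / (math.sqrt(sum(x * x for x in u)) *
                    math.sqrt(sum(y * y for y in v)))
    total = 0.0
    for k, u in enumerate(cols_a):
        sims = [math.exp(cos(u, v) / tau) for v in cols_b]
        total += -math.log(sims[k] / sum(sims))
    return total / len(cols_a)

def cross_modal_cluster_loss(z_rgb, z_audio, tau=0.5):
    """Symmetric cluster-level contrast between RGB and audio N x K matrices."""
    cols_r = [list(c) for c in zip(*z_rgb)]    # K cluster vectors, RGB
    cols_s = [list(c) for c in zip(*z_audio)]  # K cluster vectors, audio
    return 0.5 * (_cluster_nce(cols_r, cols_s, tau) +
                  _cluster_nce(cols_s, cols_r, tau))
```

Because only the K cluster columns are contrasted across modalities, each modality's instance rows keep their own distribution, which is the point of avoiding hard feature-space alignment.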