Efficient Action Recognition in the Video Compressed Domain Based on Convolutional Neural Networks
关俊洋
2022-05-20
Pages: 74
Subtype: Master's
Abstract

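      Human action recognition in video is a fundamental task in computer vision with promising applications in autonomous driving, human-computer interaction, intelligent surveillance, and security, and it has therefore drawn wide attention from researchers and engineers in both academia and industry. In recent years, the rise of convolutional neural networks has set off an artificial intelligence revolution in computer vision and has greatly advanced video action recognition. Because video contains a great deal of redundant information, action recognition algorithms based on convolutional neural networks usually require substantial computing resources, which hinders practical deployment. Videos on the Internet and on disk are stored in compressed form, so decoding them only partially saves the cost of full decoding and also yields compressed-domain information, such as I-frames, motion vectors, and residuals, that represents the appearance and motion of the video. This removes the burden of computing dynamic information such as optical flow from fully decoded image sequences. Compressed-domain information is therefore inherently efficient, and research on action recognition based on it has both theoretical significance and practical value.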

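      With efficient action recognition as its goal, this thesis studies action recognition in the video compressed domain based on convolutional neural networks. The main work and contributions are summarized as follows: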

      1. Partial decoding and extraction of H264 compressed-domain information

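      Action recognition based on compressed-domain information first requires obtaining that information, such as I-frames, motion vectors, residuals, and quantization parameters. Considering that more than 90% of videos on the Internet today use the H264 codec standard, and in view of project requirements and practical applications, this thesis studies and extracts H264 compressed-domain information. The principles of the H264 standard are studied first; then, based on the FFmpeg codec framework, the H264 decoding pipeline is traced and code is written to extract I-frames, motion vectors, residuals, macroblock partition modes, quantization parameters, and other information from the compressed domain.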

      2. Adaptive-sampling action recognition guided by compressed-domain motion vectors

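      Frame sampling is a basic step in video action recognition. Motion in a video is unevenly distributed along the time dimension, with moving objects concentrated in certain time spans. Conventional equal-interval or segment-based sampling treats every frame as equivalent and ignores this uneven distribution, while existing sampling methods based on motion distribution are computationally expensive and hard to deploy in practice. This thesis therefore proposes an adaptive-sampling action recognition algorithm guided by motion vectors. 1) Motion vectors from the video compressed domain are used to measure the motion in each frame, giving the distribution of motion over time. 2) Based on this distribution, more frames are sampled where motion is concentrated and fewer where it is sparse, so that the sampled frames adaptively capture how the action unfolds. Because motion vectors can be obtained with only partial decoding, the method is extremely fast. Its accuracy on several public international datasets also improves significantly, demonstrating the effectiveness of the proposed adaptive sampling guided by compressed-domain motion vectors.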

      3. Compressed-domain action recognition based on spatio-temporal feature fusion and a temporal difference attention network

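      The different kinds of information in the video compressed domain are complementary, yet most conventional compressed-domain action recognition algorithms process each kind separately and do not fuse them adequately. This thesis therefore proposes a compressed-domain action recognition algorithm based on spatio-temporal feature fusion and a temporal difference attention network. 1) The spatio-temporal feature fusion layer takes the I-frames and residuals of the compressed domain as input and, at the level of deep network features, fuses the spatial features of the I-frames with the local temporal dynamic features of the residuals to obtain local spatio-temporal features of the video. 2) The temporal difference attention mechanism computes the difference between the local spatio-temporal features of adjacent I-frames and converts it into attention weights that enhance the local features; a shift module then processes the local features to extract global spatio-temporal features of the video. The proposed algorithm achieved the best performance at the time on UCF101 and HMDB51, which demonstrates its effectiveness.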

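      Overall, this thesis analyzes several key problems in video compressed-domain action recognition and studies them in depth, and the proposed algorithms significantly improve the performance of compressed-domain action recognition. The algorithms implemented in this thesis have also been put into practical use at the National Computer Network and Information Security Management Center.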

Other Abstract

  Human action recognition in video is an important basic task in computer vision, with application prospects in autonomous driving, human-computer interaction, intelligent monitoring, security, and other fields. It has therefore received extensive attention from researchers and engineers in academia and industry. In recent years, the rise of convolutional neural networks has set off an artificial intelligence revolution in computer vision and has greatly promoted the development of human action recognition in video. Because video contains a lot of redundant information, action recognition based on convolutional neural networks requires a lot of computing resources, which makes it hard to deploy in practical applications. Since videos on the Internet and on hard disks are all compressed and encoded, partially decoding a video not only saves the computational overhead of full decoding but also yields compressed domain information such as I-frames, motion vectors, and residuals, which represent the appearance and dynamics of the video. This avoids the cost of computing dynamic information such as optical flow from the fully decoded frames. Video compressed domain information is therefore naturally efficient, and research on action recognition based on it has important theoretical significance and application value.

  To realize efficient action recognition algorithms, we study action recognition in the video compressed domain based on convolutional neural networks. The main work and innovations of this thesis are summarized as follows:

  1. Partial Decoding and Extraction of Video H264 Compressed Domain Information

  To implement action recognition algorithms based on video compressed domain information, that information must first be obtained, including I-frames, motion vectors, residuals, and quantization parameters. Considering that more than 90% of videos on the Internet today use the H264 standard, and in view of project needs and practical applications, we choose to study and extract information from the H264 compressed domain. First, the H264 standard is studied. Then, based on the FFmpeg framework, the H264 decoding process is traced, and code is implemented to extract I-frames, motion vectors, residuals, macroblock partition modes, and quantization parameters from the compressed domain.
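  As a rough illustration of working with frame types without fully processing a video, the sketch below asks the `ffprobe` tool that ships with FFmpeg for each frame's picture type and picks out the I-frames. It is not the extraction code described in the thesis, which is not given in this abstract; the function names are hypothetical, and motion vectors, residuals, and quantization parameters are not covered here (FFmpeg decoders can additionally export motion vectors as frame side data through the `export_mvs` flag of the `flags2` option).

```python
import subprocess


def ffprobe_frame_type_cmd(video_path):
    """Build an ffprobe command that prints one picture type (I/P/B) per video frame."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",            # first video stream only
        "-show_entries", "frame=pict_type",  # print only the picture type
        "-of", "csv=p=0",                    # bare CSV, no section prefix
        video_path,
    ]


def parse_frame_types(ffprobe_output):
    """Turn ffprobe's CSV output into a list such as ['I', 'P', 'B', ...]."""
    types = []
    for line in ffprobe_output.splitlines():
        field = line.split(",")[0].strip()   # tolerate trailing commas
        if field:
            types.append(field)
    return types


def i_frame_indices(frame_types):
    """Indices of intra-coded frames, which decode without reference to other frames."""
    return [i for i, t in enumerate(frame_types) if t == "I"]


def list_frame_types(video_path):
    """Run ffprobe on a real file (requires FFmpeg to be installed)."""
    result = subprocess.run(
        ffprobe_frame_type_cmd(video_path),
        capture_output=True, text=True, check=True,
    )
    return parse_frame_types(result.stdout)
```

  Calling `i_frame_indices(list_frame_types("clip.mp4"))` on a hypothetical file `clip.mp4` would give the positions of its I-frames; the parsing helpers can also be exercised on a captured output string without FFmpeg installed.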

  2. Action Recognition Based on Adaptive Sampling Guided by Motion Vectors in Compressed Domain

  Frame sampling is a fundamental step in video action recognition. Motion in a video is unevenly distributed along the time dimension, and moving objects are concentrated in particular time periods. Traditional equal-step or segment-based sampling treats every frame as equivalent, ignoring this uneven distribution. Existing sampling methods based on motion distribution, in turn, have high computational complexity, which makes them hard to deploy in practical applications. Therefore, we propose an action recognition algorithm based on adaptive sampling guided by motion vectors in the compressed domain. 1) Motion vectors in the video compressed domain are used to measure the motion of each frame, from which the motion distribution along the temporal dimension is calculated. 2) Based on the motion distribution, more frames are sampled where motion is concentrated and fewer frames where motion is sparse, so that the sampled frames adaptively capture how the action unfolds in the video. The method is extremely fast because only partial decoding is required to obtain motion vectors. Accuracy on multiple international public datasets is significantly improved, demonstrating the effectiveness of adaptive sampling guided by compressed-domain motion vectors.
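  The two steps above can be sketched as follows. The abstract does not state the exact sampling rule, so this sketch makes assumptions: each frame is given a single non-negative motion score (for example, the summed magnitude of its motion vectors), and frames are picked at evenly spaced quantiles of the cumulative motion, a standard inverse-CDF scheme. The names `motion_cdf`, `adaptive_sample`, and the smoothing constant `eps` are hypothetical.

```python
from bisect import bisect_left


def motion_cdf(motion, eps=1e-6):
    """Cumulative share of total motion up to and including each frame."""
    weights = [m + eps for m in motion]  # eps keeps an all-static clip well defined
    total = sum(weights)
    cdf, running = [], 0.0
    for w in weights:
        running += w
        cdf.append(running / total)
    return cdf


def adaptive_sample(motion, num_samples):
    """Pick frame indices at evenly spaced quantiles of cumulative motion.

    Stretches with a lot of motion cover more of the cumulative range,
    so more of the quantile targets fall inside them.
    """
    cdf = motion_cdf(motion)
    indices = []
    for k in range(num_samples):
        target = (k + 0.5) / num_samples   # midpoint of the k-th quantile bin
        idx = bisect_left(cdf, target)     # first frame whose CDF reaches target
        indices.append(min(idx, len(motion) - 1))
    return indices


# Motion is concentrated in frames 4..7 of a 12-frame clip.
motion = [0, 0, 0, 0, 10, 10, 10, 10, 0, 0, 0, 0]
print(adaptive_sample(motion, 4))
```

  With this toy input all four samples land in frames 4 to 7, whereas equal-interval sampling would spend part of its budget on static frames. When motion is uniform, the same rule reduces to evenly spaced sampling.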

  3. Video Compressed Domain Action Recognition Based on Spatio-temporal Feature Fusion and Temporal Difference Attention Mechanism Network

  The different kinds of information in the video compressed domain are complementary, yet most traditional compressed-domain action recognition algorithms handle each kind separately and do not fully integrate them. Therefore, we propose a compressed domain action recognition algorithm based on spatiotemporal feature fusion and a temporal difference attention mechanism network. 1) The spatiotemporal feature fusion layer takes the I-frames and residuals of the video compressed domain as input, and fuses the spatial features of the I-frames with the local temporal dynamic features of the residuals at the deep network feature level to obtain the local spatiotemporal features of the video. 2) The temporal difference attention mechanism calculates the difference between the local spatiotemporal features of adjacent I-frames and converts it into attention weights that enhance the local spatiotemporal features. A shift module then processes the local spatiotemporal features to extract the global spatiotemporal features of the video. At the time of writing, the proposed algorithm achieved state-of-the-art performance on UCF101 and HMDB51, demonstrating its effectiveness.
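  To make step 2 more concrete, here is a minimal pure-Python sketch of a difference-driven attention step followed by a channel shift along time. The abstract gives no formulas, so everything specific here is an assumption: the sigmoid gating, the residual form `f + f * w`, comparing each segment with the next one, and a shift that moves a fraction of channels one step in each temporal direction with zero padding (in the spirit of the Temporal Shift Module). Features are plain lists, one vector per I-frame segment, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def temporal_difference_attention(features):
    """Reweight each segment's feature by its difference to the next segment.

    features: list of T feature vectors (lists of floats), one per I-frame segment.
    """
    T = len(features)
    enhanced = []
    for t in range(T):
        nxt = features[min(t + 1, T - 1)]         # last segment is compared with itself
        diff = [b - a for a, b in zip(features[t], nxt)]
        weights = [sigmoid(d) for d in diff]      # per-channel attention in (0, 1)
        # Residual form: keep the original feature and add the attended part.
        enhanced.append([f + f * w for f, w in zip(features[t], weights)])
    return enhanced


def temporal_shift(features, fold):
    """Mix information across segments by shifting channels along time.

    The first `fold` channels take their value from the previous segment,
    the next `fold` channels from the following segment, and the remaining
    channels stay in place. Positions with no source segment are zero.
    """
    T, C = len(features), len(features[0])
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            if c < fold:
                src = t - 1
            elif c < 2 * fold:
                src = t + 1
            else:
                src = t
            if 0 <= src < T:
                out[t][c] = features[src][c]
    return out


local = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
global_features = temporal_shift(temporal_difference_attention(local), fold=1)
```

  The shift itself has no learnable parameters; in a real network it would be followed by convolutional layers that combine the shifted channels, which this sketch leaves out.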

  In summary, we analyze several key problems in video compressed domain action recognition and conduct in-depth research on them. In addition, the video compressed domain action recognition algorithms we implemented have been applied in practice at the National Computer Network and Information Security Management Center.

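Keyword: Convolutional Neural Network; Video Action Recognition; Video Compressed Domain
Subject Area: Computer Science and Technology; Computer Perception
MOST Discipline Catalogue: Engineering::Computer Science and Technology (degrees in engineering or science may be conferred)
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/48888
Collection: Graduates_Master's Theses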
Recommended Citation
GB/T 7714
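关俊洋. Efficient Action Recognition in the Video Compressed Domain Based on Convolutional Neural Networks[D]. Institute of Automation, Chinese Academy of Sciences, 2022.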
Files in This Item:
File Name/Size: 基于卷积神经网络的视频压缩域高效行为识别 (26160KB)
DocType: Thesis
Access: Restricted
License: CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.