Abstract

Object detection is an important research direction in the field of computer vision and is widely used in many tasks, including video surveillance, face recognition, autonomous driving, and so on. Recently, image object detection based on deep learning has made great progress. However, due to motion blur, partial occlusion, and rare poses in videos, existing image detectors find it difficult to handle these problems through frame-by-frame processing alone. In fact, videos contain rich temporal cues. Effectively exploiting this inherent temporal information for object detection in videos under the deep learning framework is currently a hot topic in the field of computer vision. This thesis studies the construction and optimization of spatio-temporal deep learning models, aiming to use the spatio-temporal cues in videos to improve the accuracy and speed of object detection.
The main contributions of this thesis are as follows:
1. This thesis presents a video object detection method based on Locally-Weighted Deformable Neighbours (LWDN) for spatio-temporal feature alignment. First, inspired by memory mechanisms of the human brain, we propose to iteratively propagate and update a memory feature between keyframes to improve the discriminative ability of the keyframe features. Then, LWDN is used to align the keyframe features to the non-keyframes along both the image space and the time axis, enhancing the detector's ability to learn features from the non-keyframes. In this process, low-level features of both keyframes and non-keyframes are utilized to learn the similarity weights and object location offsets. Finally, feature alignment is implemented through the linearly weighted object location offsets. Accordingly, a deep learning model is developed for video object detection, which integrates the above techniques for end-to-end training. Compared with flow-warping-based methods, the proposed method achieves better performance with fewer parameters.
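The LWDN alignment step described above can be sketched as follows. This is a minimal, illustrative NumPy implementation, not the thesis code: the function name, tensor shapes, and the assumption of integer offsets and pre-normalised similarity weights are all simplifications made here for clarity (in the actual method, offsets and weights are predicted by learned layers from low-level features).

```python
import numpy as np

def lwdn_align(key_feat, offsets, weights):
    """Toy sketch of Locally-Weighted Deformable Neighbours (LWDN)
    alignment: each non-keyframe position gathers K deformable
    neighbours from the keyframe feature map and sums them with
    learned similarity weights.

    key_feat: (H, W, C) keyframe feature map
    offsets:  (H, W, K, 2) integer (dy, dx) offsets per position
    weights:  (H, W, K) similarity weights (assumed normalised)
    returns:  (H, W, C) features aligned to the non-keyframe
    """
    H, W, C = key_feat.shape
    K = offsets.shape[2]
    aligned = np.zeros_like(key_feat)
    for y in range(H):
        for x in range(W):
            for k in range(K):
                dy, dx = offsets[y, x, k]
                # clamp the sampled neighbour location to the map
                sy = min(max(y + dy, 0), H - 1)
                sx = min(max(x + dx, 0), W - 1)
                # linearly weight the sampled keyframe feature
                aligned[y, x] += weights[y, x, k] * key_feat[sy, sx]
    return aligned

# toy usage: a 4x4 map with 8 channels and K=3 neighbours,
# with random values standing in for learned predictions
rng = np.random.default_rng(0)
key = rng.standard_normal((4, 4, 8))
off = rng.integers(-1, 2, size=(4, 4, 3, 2))
w = rng.random((4, 4, 3))
w /= w.sum(axis=2, keepdims=True)   # normalise weights per position
out = lwdn_align(key, off, w)
print(out.shape)  # (4, 4, 8)
```

Because the weights at each position sum to one, each aligned feature is a convex combination of sampled keyframe features, which is what makes the operation cheaper and more stable than dense flow warping.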
2. This thesis presents a video object detection method based on Learnable Spatio-Temporal Sampling (LSTS) for spatio-temporal feature alignment. First, the feature similarities between the pixels in the non-keyframe and their corresponding pixels in the keyframes, obtained with a location sampling strategy, are estimated. These similarities are then employed to linearly weight the features of the sampled pixels in the keyframes for pixel-level location alignment. Then, with the supervision of minimizing the object detection loss, the sampled locations are optimized and updated iteratively. The spatio-temporal alignment of features along the space and time axes is realized by propagating high-level features between different video frames. In this process, we further propose Sparsely Recursive Feature Updating (SRFU) and Dense Feature Aggregation (DFA) to improve the feature representations of all video frames, which helps achieve more accurate spatio-temporal correspondences. Based on the proposed spatio-temporal feature propagation method, a deep learning model for video object detection is established. Experimental results on the ImageNet VID benchmark demonstrate the effectiveness of the proposed method.
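The similarity-weighted aggregation at the core of LSTS can be sketched as below. This is a simplified NumPy illustration under assumptions made here: the sampling offsets are fixed integers rather than learnable parameters (in the thesis they are optimized by the detection loss), and dot-product similarity with a softmax stands in for the learned similarity estimation.

```python
import numpy as np

def lsts_aggregate(key_feat, nonkey_feat, sample_locs):
    """Toy sketch of Learnable Spatio-Temporal Sampling (LSTS)
    aggregation: each non-keyframe pixel compares its feature with
    keyframe features at sampled locations, then takes a
    similarity-weighted sum of those keyframe features.

    key_feat:    (H, W, C) keyframe features
    nonkey_feat: (H, W, C) non-keyframe features
    sample_locs: (K, 2) integer (dy, dx) sampling offsets
    returns:     (H, W, C) keyframe features propagated to the
                 non-keyframe
    """
    H, W, C = key_feat.shape
    out = np.zeros_like(nonkey_feat)
    for y in range(H):
        for x in range(W):
            q = nonkey_feat[y, x]
            sampled, sims = [], []
            for dy, dx in sample_locs:
                sy = min(max(y + dy, 0), H - 1)
                sx = min(max(x + dx, 0), W - 1)
                f = key_feat[sy, sx]
                sampled.append(f)
                sims.append(q @ f)           # dot-product similarity
            s = np.array(sims)
            s = np.exp(s - s.max())
            s /= s.sum()                     # softmax over samples
            out[y, x] = s @ np.stack(sampled)  # weighted sum
    return out

# toy usage: a fixed 3x3 sampling neighbourhood around each pixel
locs = np.array([(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)])
rng = np.random.default_rng(1)
key = rng.standard_normal((5, 5, 4))
nonkey = rng.standard_normal((5, 5, 4))
prop = lsts_aggregate(key, nonkey, locs)
print(prop.shape)  # (5, 5, 4)
```

Making the sampling locations learnable, as the thesis proposes, lets the network discover where in the keyframe each non-keyframe pixel should look, instead of relying on a hand-fixed neighbourhood like the 3x3 grid used in this sketch.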