Research on Video Object Detection Methods Based on Deep Learning
蒋正锴 (Zhengkai Jiang)
2020-05-29
Pages: 74
Degree type: Master
Abstract (Chinese)

Object detection is an important research direction in computer vision and is widely applied in video surveillance, face detection, autonomous driving, and other fields. In recent years, image object detection based on deep learning has achieved great success. However, videos suffer from motion blur, partial occlusion, and rare poses, and frame-by-frame image object detection struggles with these problems. Videos, on the other hand, contain rich temporal information, and video object detection that exploits this temporal information within the deep learning framework is currently a research hotspot in computer vision. For the video object detection task, this thesis studies the construction and optimization of spatio-temporal deep learning models, making full use of the spatio-temporal information in videos to improve both the accuracy and the speed of video object detection. The main contributions of this thesis are as follows:
1. A video object detection method based on locally-weighted deformable-neighbourhood spatio-temporal feature alignment is proposed. First, building on a brain-inspired feature memory mechanism, memory features are propagated and updated between keyframes to strengthen the discriminative power of keyframe features. Then, a locally-weighted deformable-neighbourhood operation aligns keyframe features to non-keyframes along both the spatial and temporal axes, enhancing feature learning on non-keyframes. In this process, the low-level features of keyframes and non-keyframes are further used to learn local-region similarity weights and object motion offsets, and the spatio-temporal alignment of moving objects is finally achieved through the weighted motion offsets. Based on this spatio-temporal information propagation method, the thesis builds a deep learning model for video object detection with locally-weighted deformable-neighbourhood spatio-temporal feature alignment. Compared with optical-flow-based feature propagation methods, the proposed method improves video object detection accuracy while reducing the total number of parameters of the deep neural network model.
2. A video object detection method based on spatio-temporal sampling feature alignment is proposed. First, based on spatio-temporal sampling of keyframe pixels, the feature similarity between corresponding pixels of the keyframe and the non-keyframe is estimated, and the keyframe features at the corresponding spatial locations are linearly weighted by this similarity. Then, under the supervision of minimizing the object detection loss, the sampling locations are optimized and updated, and the weighted features are aligned along both the spatial and temporal axes by propagating high-level features across frames. In this process, a sparsely recursive feature updating strategy for keyframes and a dense feature aggregation strategy for non-keyframes are further introduced to improve the feature representations of keyframes and non-keyframes, achieving high-accuracy spatio-temporal feature alignment. Based on this method, the thesis builds a deep learning model for video object detection with spatio-temporal sampling feature alignment. Comparative experiments on ImageNet VID, the standard video object detection benchmark, verify the effectiveness of the proposed method.

Abstract (English)

Object detection is an important research direction in computer vision and is widely used in many tasks, including video surveillance, face recognition, autonomous driving, and so on. Recently, image object detection based on deep learning has made great progress. However, due to motion blur, partial occlusion, and rare poses in videos, it is difficult for existing image detectors to handle these problems through frame-by-frame processing alone. Videos, in fact, contain rich temporal cues, and object detection in videos that makes effective use of this inherent temporal information under the deep learning framework is currently a hot topic in computer vision. This thesis studies the construction and optimization of spatio-temporal deep learning models, aiming to use the spatio-temporal cues in videos to improve both the accuracy and speed of object detection.


The main contributions of this thesis are as follows:


1. This thesis presents a video object detection method based on Locally-Weighted Deformable Neighbours (LWDN) for spatio-temporal feature alignment. First, inspired by the memory mechanism of the human brain, we propose to iteratively propagate and update memory features between keyframes to improve the discriminative ability of the keyframe features. Then, LWDN is used to align the keyframe features to the non-keyframes along both the spatial and temporal axes, enhancing the detector's feature learning ability on non-keyframes. In this process, the low-level features of both keyframes and non-keyframes are utilized to learn similarity weights and object location offsets, and feature alignment is finally implemented through the linearly weighted object location offsets. Accordingly, a deep learning model is developed for video object detection that integrates the above techniques for end-to-end training. Compared with flow-warping-based methods, the proposed method achieves better performance with fewer parameters.
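The LWDN alignment step can be sketched roughly as follows. This is a minimal NumPy illustration under simplifying assumptions, not the thesis implementation: the function name is hypothetical, nearest-neighbour sampling stands in for bilinear interpolation, and the offsets and similarity logits are taken as given rather than predicted by a network from low-level features.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lwdn_align(key_feat, offsets, weight_logits):
    """Align keyframe features to a non-keyframe, position by position.

    key_feat      : (C, H, W) keyframe feature map
    offsets       : (N, 2, H, W) per-position (dy, dx) for N deformable neighbours
    weight_logits : (N, H, W) per-neighbour similarity logits
    Returns the aligned (C, H, W) feature map.
    """
    C, H, W = key_feat.shape
    N = offsets.shape[0]
    w = softmax(weight_logits, axis=0)          # normalise neighbour weights
    ys, xs = np.mgrid[0:H, 0:W]                 # base sampling grid
    aligned = np.zeros_like(key_feat)
    for n in range(N):
        # nearest-neighbour sampling as a stand-in for bilinear interpolation
        sy = np.clip(np.round(ys + offsets[n, 0]).astype(int), 0, H - 1)
        sx = np.clip(np.round(xs + offsets[n, 1]).astype(int), 0, W - 1)
        aligned += w[n] * key_feat[:, sy, sx]   # weighted deformable gather
    return aligned
```

With zero offsets and uniform weights the operation reduces to an identity mapping of the keyframe features, which is a useful sanity check before training.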


2. This thesis presents a video object detection method based on Learnable Spatio-Temporal Sampling (LSTS) for spatio-temporal feature alignment. First, feature similarities are estimated between pixels in the non-keyframe and their corresponding pixels in the keyframe, which are obtained with a learnable location sampling strategy. These similarities are then used to linearly weight the features of the sampled keyframe pixels for pixel-level alignment. Under the supervision of minimizing the object detection loss, the sampled locations are optimized and updated iteratively, and the spatio-temporal alignment of features along the spatial and temporal axes is realized by propagating high-level features across video frames. In this process, we further propose Sparsely Recursive Feature Updating (SRFU) and Dense Feature Aggregation (DFA) to improve the feature representation of all video frames, which helps achieve more accurate spatio-temporal correspondences. Based on the proposed spatio-temporal feature propagation method, a deep learning model for video object detection is established. Experimental results on the ImageNet VID benchmark demonstrate the effectiveness of the proposed method.
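The LSTS aggregation can likewise be sketched. In this illustrative NumPy fragment the sampling offsets, which the thesis learns by backpropagating the detection loss, are plain inputs; `lsts_propagate` is a hypothetical name, and nearest-neighbour sampling again replaces differentiable bilinear sampling.

```python
import numpy as np

def lsts_propagate(key_feat, cur_feat, sample_offsets):
    """Propagate keyframe features to the current frame by learned sampling.

    key_feat       : (C, H, W) high-level features of the keyframe
    cur_feat       : (C, H, W) features of the current (non-key) frame
    sample_offsets : iterable of (dy, dx) sampling locations
    Returns the (C, H, W) similarity-weighted aggregation of sampled features.
    """
    C, H, W = key_feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    sims, samples = [], []
    for dy, dx in sample_offsets:
        sy = np.clip(np.round(ys + dy).astype(int), 0, H - 1)
        sx = np.clip(np.round(xs + dx).astype(int), 0, W - 1)
        s = key_feat[:, sy, sx]                  # sampled keyframe features
        samples.append(s)
        sims.append((cur_feat * s).sum(axis=0))  # dot-product similarity
    sims = np.stack(sims)                        # (N, H, W)
    w = np.exp(sims - sims.max(axis=0))          # softmax over the N samples
    w /= w.sum(axis=0)
    return sum(wi * si for wi, si in zip(w, samples))
```

A single zero offset makes the softmax weight 1 everywhere, so the output equals the keyframe features; adding more offsets lets each position pull in, and re-weight, spatially displaced evidence from the keyframe.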

Keywords: deep learning; video object detection; spatio-temporal feature alignment; spatio-temporal pixel sampling
Language: Chinese
Sub-direction classification: Object Detection, Tracking and Recognition
Document type: Degree thesis
Identifier: http://ir.ia.ac.cn/handle/173211/39268
Affiliation: State Key Laboratory of Multimodal Artificial Intelligence Systems, Advanced Spatio-Temporal Data Analysis and Learning
Corresponding author: 蒋正锴 (Zhengkai Jiang)
Recommended citation (GB/T 7714):
蒋正锴. 基于深度学习的视频目标检测方法研究[D]. 中国科学院自动化研究所. 中国科学院大学, 2020.
Files in this item:
Master_Thesis.pdf (6251 KB), degree thesis, open access, licensed under CC BY-NC-SA
