Research on Moving Object Detection in Complex Scenes
陈盈盈 1,2
Degree Type: Doctor of Engineering
Supervisor: 卢汉清
Date: 2018-05
Degree Grantor: Graduate University of Chinese Academy of Sciences
Place of Conferral: Beijing
Keywords: Moving Object Detection; Background Subtraction; Background Modeling; Deep Learning
Abstract: As images and videos are widely used as the most intuitive visual content and information-carrying media, how to automatically analyze and understand massive visual data in order to make full use of the valuable information it contains is a pressing scientific problem in computer vision. Moving object detection extracts the moving objects people care about (foreground) from the unattended scene (background) with pixel-level accuracy, and is the first step in the intelligent processing of visual information. It is also the foundational technology for various high-level video processing tasks and applications (such as object recognition and tracking, behavior recognition and analysis, video coding, and human-computer interaction), and it directly affects the final performance of the entire system.
Although researchers have explored moving object detection for many years, many difficulties and challenges remain. First, moving object detection requires an effective and general framework for handling complex real-world scenes. Existing methods are usually designed for specific application scenarios and apply different frameworks to static and dynamic scenes, yet real complex scenes often contain both, which poses a challenge for moving object detection. Second, moving object detection requires efficient feature descriptions. Commonly used low-level visual features such as color, texture, and edges lack high-level semantic priors, so they can neither effectively distinguish moving objects in complex scenes nor suppress interference from background noise; exploring more effective feature representations is therefore essential. Targeting moving object detection in complex scenes, this thesis explores the pixel-level spatial continuity and temporal variation patterns in video and proposes two general frameworks: a model sharing framework and a deep sequence learning framework. Both can cope with complex scene changes in video and model static and dynamic scenes simultaneously. In addition, the thesis introduces discriminative neural network features and explores the effective fusion of high-level semantic information with low-level visual features to obtain efficient feature representations. The main contributions are summarized as follows:
1. A Unified Model Sharing Framework for Moving Object Detection
Traditional background subtraction methods assume that neighboring pixels are spatially independent, ignoring the spatial continuity between pixels in video sequences; as a result, they are sensitive to background noise and often leave "holes" inside foreground objects. In addition, building a separate background model for every pixel introduces substantial model redundancy and increases both the space and time complexity of the algorithm. This thesis proposes a model sharing framework for moving object detection that establishes a many-to-one dynamic matching pattern between pixels and their surrounding background models: each pixel searches a shared region for its best-matching model. Different pixels may share the same background model in the current frame and freely match other background models in the next frame. The framework can be seamlessly embedded into any traditional method that builds pixel-level background models; experiments on the Gaussian mixture model and the sample consensus model show that the sharing strategy not only reduces the number of models by about two thirds but also increases the diversity of samples in the background model, which strengthens robustness to background noise and effectively improves detection performance.
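To make the many-to-one matching concrete, here is a minimal Python sketch of the sharing idea applied to a sample-consensus-style background model: each pixel searches a small shared neighborhood for the best-matching model and, on a match, conservatively updates that shared model. The neighborhood radius, thresholds, and update rate below are illustrative assumptions, not the thesis's settings.

```python
import numpy as np

H, W, N_SAMPLES = 240, 320, 20    # frame size and samples per shared model
RADIUS = 1                        # models are shared within a 3x3 neighborhood
MATCH_THR = 20.0                  # gray-level distance for a sample to match
MIN_MATCHES = 2                   # matching samples needed to call "background"

# One sample-consensus model per location; in the full framework the sharing
# lets the model bank be subsampled, which is where the roughly two-thirds
# reduction comes from. Real systems would initialize these samples from the
# first frames rather than at random.
models = np.random.randint(0, 256, (H, W, N_SAMPLES)).astype(np.float32)
rng = np.random.default_rng(0)

def detect(frame: np.ndarray) -> np.ndarray:
    """Classify each pixel of a grayscale frame via shared model matching."""
    fg = np.ones((H, W), dtype=bool)
    for y in range(H):
        for x in range(W):
            best, best_dist = None, np.inf
            # many-to-one matching: search the shared region for the best model
            for dy in range(-RADIUS, RADIUS + 1):
                for dx in range(-RADIUS, RADIUS + 1):
                    my, mx = y + dy, x + dx
                    if 0 <= my < H and 0 <= mx < W:
                        d = np.abs(models[my, mx] - frame[y, x])
                        if (d < MATCH_THR).sum() >= MIN_MATCHES and d.min() < best_dist:
                            best, best_dist = (my, mx), d.min()
            if best is not None:
                fg[y, x] = False                        # background pixel
                if rng.random() < 1 / 16:               # conservative update
                    models[best][rng.integers(N_SAMPLES)] = frame[y, x]
    return fg
```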
2. A Semantic-aware Moving Object Detection Method
Traditional moving object detection methods typically use one or more low-level visual features, such as color, texture, and edge features, when building background models. These features do not account for the characteristics of human visual attention, so they struggle to remove false detections caused by dynamic background factors and cannot separate foreground from background when the two look visually similar. Extracting higher-level semantic information about foreground objects and background scenes can therefore greatly benefit the moving object detection task. This thesis trains a deep encoder-decoder network on a large dataset to learn foreground and background semantics, and extracts semantic features offline on the test set. The complementary color and semantic features are then fused to obtain a more descriptive feature representation. The proposed method is robust to bad weather and varying illumination in video scenes, and a scene-specific fine-tuning strategy lets it adapt better to new scenes. The framework has good adaptability and extensibility, and experiments show that its performance improves substantially over traditional methods.
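As a rough illustration of these two ingredients, the Python sketch below pairs a toy encoder-decoder (an assumed architecture, much smaller than the network in the thesis) with a simple blend of the semantic map and a conventional color-based foreground score; the fixed weight `alpha` is likewise only illustrative and stands in for the thesis's fusion strategy.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Toy encoder-decoder producing a per-pixel foreground/background
    semantic map; the thesis's network is deeper and trained on large data."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):                               # x: (B, 3, H, W)
        return torch.sigmoid(self.dec(self.enc(x)))     # (B, 1, H, W) semantics

def fuse(semantic: torch.Tensor, color_score: torch.Tensor,
         alpha: float = 0.5) -> torch.Tensor:
    """Blend the semantic map with a color-model foreground score; a scalar
    blend is a placeholder for the feature-level fusion used in the thesis."""
    return alpha * semantic + (1 - alpha) * color_score

# usage with dummy inputs
frame = torch.rand(1, 3, 64, 64)          # RGB frame in [0, 1]
color_score = torch.rand(1, 1, 64, 64)    # e.g. from a background-subtraction model
mask = fuse(EncoderDecoder()(frame), color_score) > 0.5
```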
3. Pixel-wise Deep Sequence Learning for Moving Object Detection
Because real video scenes vary widely, the spatial relationships among pixels are complex, and traditional methods find it difficult to model foreground and background jointly from both the spatial relationships and the temporal changes in video; they therefore usually build pixel-level background models from the temporal evolution of each pixel alone. This thesis changes the perspective and treats moving object detection as a pixel-level sequence learning task: a semantic feature extraction network first produces discriminative feature maps, and the proposed attention-based convolutional long short-term memory (ConvLSTM) network then jointly models the spatial relationships and temporal changes in the video sequence. Built on this pixel-level deep sequence learning, the thesis presents a novel end-to-end moving object detection framework for general scenes; the extracted high-level semantic information is insensitive to camera motion, so the framework can handle moving object detection across different complex scenes. Experiments show that the proposed method performs well on both static-scene and dynamic-scene datasets.
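The sketch below shows one assumed form of such an attention ConvLSTM cell in PyTorch: convolutional gates preserve spatial structure, the recurrence carries temporal state, and a learned spatial map re-weights the input features before gating. The attention design, dimensions, and the 1x1 prediction head are illustrative assumptions rather than the thesis's exact formulation.

```python
import torch
import torch.nn as nn

class AttnConvLSTMCell(nn.Module):
    """ConvLSTM cell with a simple spatial-attention gate (illustrative)."""
    def __init__(self, in_ch: int, hid_ch: int, k: int = 3):
        super().__init__()
        pad = k // 2
        self.attn = nn.Conv2d(in_ch + hid_ch, 1, k, padding=pad)
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=pad)

    def forward(self, x, h, c):
        a = torch.sigmoid(self.attn(torch.cat([x, h], dim=1)))  # (B,1,H,W) map
        z = torch.cat([x * a, h], dim=1)        # attention-weighted input
        i, f, o, g = torch.chunk(self.gates(z), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# usage: feed per-frame feature maps, read the mask from the hidden state
B, C, HID, HH, WW, T = 1, 8, 16, 32, 32, 5
cell = AttnConvLSTMCell(C, HID)
head = nn.Conv2d(HID, 1, 1)                    # per-pixel foreground logit
h = torch.zeros(B, HID, HH, WW)
c = torch.zeros_like(h)
for t in range(T):
    feat = torch.randn(B, C, HH, WW)           # stand-in for semantic features
    h, c = cell(feat, h, c)
mask = torch.sigmoid(head(h))                  # (B,1,H,W) foreground probability
```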

Document Type: Doctoral Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/21045
Collection: Graduates / Doctoral Dissertations
Affiliations:
1. Institute of Automation, Chinese Academy of Sciences
2. University of Chinese Academy of Sciences
Recommended Citation (GB/T 7714):
陈盈盈. 复杂场景中的运动目标检测方法研究[D]. 北京: 中国科学院研究生院, 2018.
Files in This Item:
File Name/Size: Thesis.pdf (15104 KB); Document Type: Doctoral Dissertation; Access: Restricted; License: CC BY-NC-SA