Temporal Feature Modeling for Video Object Detection and Segmentation
Author: 何飞
Date: 2023-05-17
Pages: 141
Degree type: Doctoral

Abstract

Object detection and instance segmentation in videos are fundamental problems in computer vision; they play a critical role in video processing and analysis and carry both significant research value and broad practical utility. To date, object detection and instance segmentation algorithms have been developed mainly for static images, yet practical scenarios such as video surveillance, internet video, and autonomous driving have a more pressing need for video-based detection and segmentation. Applying state-of-the-art image algorithms directly to every frame of a video, however, raises new challenges. In terms of accuracy, videos suffer from motion blur, defocus, occlusion, and unusual viewpoints that degrade object appearance, making it hard for image-based detectors to recognize objects and lowering detection accuracy. In terms of speed, most image object detectors still run slowly, while applications such as driving and industrial scenarios demand real-time detection; running an image detector independently on every frame cannot meet this requirement. For video instance segmentation, previous methods attach an extra multi-object tracking model to an image instance segmentation model in order to obtain per-frame masks and associate instances across frames; this increases system complexity, and the separate task stages are difficult to optimize jointly, which caps the performance of the model. Videos, however, contain richer temporal information than static images, and modeling temporal features effectively is the key to the challenges above. Taking temporal feature modeling as its starting point, this thesis therefore carries out the following research on object detection and instance segmentation in videos.

1. High-accuracy video object detection based on adaptive feature aggregation. To improve video object detection under degraded appearance, existing methods aggregate temporal features from a fixed neighborhood of frames to enhance the features of degraded frames. Videos, however, are highly redundant and change in no fixed order, so such rigid aggregation is inefficient and prone to introducing extra noise, which weakens the enhancement. This thesis therefore proposes an adaptive feature aggregation method: the model first estimates the motion of objects in the current frame to flexibly select which frames to aggregate, and then dynamically samples high-quality features from those frames to enhance the features of the current frame, improving detection accuracy. The method lets the model exploit temporal information adaptively and enhance features efficiently; while aggregating only a few support frames per frame, it achieved the highest video object detection accuracy at the time. A code sketch of the idea follows.
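To make the aggregation concrete, here is a minimal PyTorch-style sketch. It is an illustration under stated assumptions, not the thesis implementation: the motion-guided frame selection is approximated by a top-k over feature correlation, the dynamic sampling by per-location similarity weighting, and the function name adaptive_aggregate and its interface are hypothetical.

import torch
import torch.nn.functional as F

def adaptive_aggregate(cur_feat, cand_feats, k=2):
    # cur_feat: (C, H, W) backbone feature of the current frame.
    # cand_feats: (T, C, H, W) features of candidate support frames.
    # Returns an enhanced (C, H, W) feature for the current frame.
    T, C, H, W = cand_feats.shape
    cur = F.normalize(cur_feat.flatten(1), dim=0)            # (C, HW)
    cands = F.normalize(cand_feats.flatten(2), dim=1)        # (T, C, HW)

    # 1) Adaptive frame selection: keep the k support frames whose
    #    features correlate best with the current frame (a stand-in
    #    for the motion-guided selection described above).
    frame_scores = (cands * cur.unsqueeze(0)).sum(dim=(1, 2))  # (T,)
    top = frame_scores.topk(min(k, T)).indices

    # 2) Dynamic feature sampling: per-location similarity weights
    #    decide how much each selected frame contributes.
    sim = (cands[top] * cur.unsqueeze(0)).sum(dim=1)         # (k, HW)
    w = torch.softmax(sim, dim=0).unsqueeze(1)               # (k, 1, HW)
    agg = (w * cand_feats.flatten(2)[top]).sum(dim=0)        # (C, HW)

    # Residual enhancement of the current frame's feature.
    return (cur_feat.flatten(1) + agg).view(C, H, W)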

2. Fast video object detection based on object feature propagation. Most image object detectors struggle to run in real time. In videos, objects move or deform slowly over time, so adjacent frames look very similar. Building on this observation, this thesis proposes an object feature propagation framework for fast video object detection. Specifically, the video sequence is dynamically divided into key frames and non-key frames: the model propagates object features from the sparse key frames to the dense non-key frames, reducing redundant computation on the non-key frames, and propagates object features from past key frames to the current key frame, where object relation modeling improves feature quality and thus detection accuracy. Unlike earlier feature-map or bounding-box propagation methods built on motion estimation models, the proposed approach needs only lightweight attention over object features to propagate them, and it shares computation with the detection network. Extensive experiments show that the method significantly improves both the accuracy and the speed of video object detection, confirming the efficiency of object feature propagation. A sketch of such a propagation module follows.
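As an illustration of propagation through lightweight attention over object features, below is a hedged PyTorch-style sketch. The module name ObjectFeaturePropagation, the bounded memory of past key-frame features, and the window size are assumptions made for the example, not the design in the thesis.

import torch
import torch.nn as nn

class ObjectFeaturePropagation(nn.Module):
    def __init__(self, dim=256, heads=8, window=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.window = window
        self.memory = []  # object features cached from past key frames

    def forward(self, obj_feats, is_key_frame):
        # obj_feats: (N, C) object (RoI/query) features of the current frame.
        if self.memory:
            mem = torch.cat(self.memory, dim=0).unsqueeze(0)  # (1, M, C)
            q = obj_feats.unsqueeze(0)                        # (1, N, C)
            # Propagation step: current objects attend to the cached
            # key-frame objects; non-key frames reuse this cheap path
            # instead of recomputing heavy per-frame features.
            out, _ = self.attn(q, mem, mem)
            obj_feats = self.norm(obj_feats + out.squeeze(0))
        if is_key_frame:
            # Only sparse key frames extend the memory.
            self.memory.append(obj_feats.detach())
            self.memory = self.memory[-self.window:]
        return obj_feats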

3. Online video instance segmentation based on temporally consistent instance features. Video instance segmentation aims to segment instances in every frame and associate them across frames at the same time. Existing methods usually adopt an explicit association strategy: an instance segmentation model produces per-frame instance masks, and a separate multi-object tracking model then associates the instances across frames. This raises model complexity, and because segmentation and association are modeled separately, the two tasks exchange too little information and temporal cues are underused. This thesis therefore proposes a video instance segmentation framework built on temporally consistent instance features. Using a temporal propagation mechanism for instance queries and proposals, the framework fully mines temporal information, learns instance features that are consistent over time, and models instance segmentation and instance association in a unified way, performing efficient implicit instance association. Experiments show that, compared with explicit association schemes, the method improves both accuracy and speed significantly, validating the joint modeling of instance segmentation and instance association. A minimal sketch of query propagation follows.
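The implicit association can be sketched with a DETR-style decoder: because the same query tensor is carried from one frame to the next, the i-th output embedding refers to the same instance in every frame, so no separate tracking model is needed. The class QueryPropagationVIS and all hyperparameters below are hypothetical, and the proposal (box) propagation that the framework also uses is omitted for brevity.

import torch
import torch.nn as nn

class QueryPropagationVIS(nn.Module):
    def __init__(self, num_queries=100, dim=256):
        super().__init__()
        self.init_queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)

    def forward(self, frame_feats):
        # frame_feats: list of (HW, C) flattened per-frame feature maps.
        # Returns one (Q, C) instance embedding set per frame; query i
        # keeps the same identity across frames (implicit association).
        queries = self.init_queries.weight.unsqueeze(0)      # (1, Q, C)
        outputs = []
        for feat in frame_feats:
            mem = feat.unsqueeze(0)                          # (1, HW, C)
            queries = self.decoder(queries, mem)             # refine and carry over
            outputs.append(queries.squeeze(0))
        return outputs

Per-frame classification and mask heads would consume these embeddings; propagating the refined queries instead of re-initializing them each frame is what makes the association implicit.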

Keywords: Temporal Feature Modeling; Video Object Detection; Video Instance Segmentation; Feature Aggregation; Feature Propagation
Language: Chinese
Sub-direction classification (seven major directions): Image and Video Processing and Analysis
State Key Laboratory planned research direction: Visual Information Processing
Associated dataset requiring deposit:
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/51935
Collection: Graduates / Doctoral Dissertations
Recommended citation (GB/T 7714):
何飞. 面向视频物体检测及分割的时序特征建模[D], 2023.
Files in this item:
Name/size: 202018014628089何飞.pd (38515 KB) · Document type: Thesis · Access: Restricted · License: CC BY-NC-SA
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.