Visual Object Detection Based on Information Transfer (基于信息传递的视觉目标检测)
Chen Yuntao (陈韫韬)
2021-06
Pages: 120
Subtype: Doctoral dissertation
Abstract

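Object detection is one of the most fundamental tasks in computer vision. Its goal is to recognize and localize objects in an image that belong to a predefined set of categories. From the application perspective, object detection has attracted wide attention in many fields, including public safety, video surveillance, autonomous driving, abnormal behavior recognition, and video understanding and retrieval. From the research perspective, many downstream vision tasks depend on the output of object detection, such as object classification, instance segmentation, object tracking, face recognition, person re-identification, and human-object interaction. Research on object detection therefore has considerable academic and practical value.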

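With the rise of deep learning and the release of large-scale object detection datasets, object detection has advanced substantially through developments in network architecture, innovations in detection algorithms, and improvements in optimization strategies. However, current object detection research leaves much of the available information underused, both in how data are used and in how models are trained, which severely limits further gains in detection performance. How to uncover the mechanisms for representing and transferring information among different kinds of data and among different models is therefore a key direction in object detection research. This dissertation focuses on information transfer in object detection and systematically studies it at several levels: within a single-modality model, across time within a single-modality sequence, between data of different modalities, and between models. The main contributions of this dissertation are as follows: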

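• An object detection network based on information transfer between objects of different scales within a single-modality model. To address the problem that information cannot flow between objects of the same category but different scales, this dissertation proposes a scale-aware, multi-branch, weight-sharing object detection network. Unlike scale-pyramid approaches, which handle objects of different scales separately in a divide-and-conquer manner, the proposed multi-scale parameter-sharing network learns features for objects of all scales with a single set of parameters, making more effective use of the label information shared among same-category objects of different scales. Experiments on a standard object detection benchmark show clear performance improvements over previous network architectures.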

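• A video object detection method based on information transfer between different views of the same object in a single-modality spatiotemporal sequence. To address the problem that information about the same object does not flow effectively across frames, this dissertation proposes an instance-level feature aggregation method over the spatiotemporal sequence. By exploiting the redundant information of the same object at different time steps, the video detector can still detect the target reliably when it moves quickly or is heavily occluded by other objects. Evaluations on a large-scale video object detection dataset show consistent improvements over previous methods.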

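• A self-supervised visual object detection method based on the transfer of heterogeneous information across multi-modal spatiotemporal sequences. To address the problem that information about the same object does not flow between different data modalities, the proposed method extracts object depth information from the point cloud modality and object motion information from the image modality and fuses them. This breaks down the information barrier between modalities and enables object detection under a self-supervised paradigm. Experiments on a large-scale autonomous driving dataset show that the proposed self-supervised method achieves performance comparable to detectors trained with human-annotated labels.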

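• A deep network acceleration method based on information transfer between models of different sizes. To address the limited ability of small networks to extract information from data, this dissertation proposes representing a model's information through the similarities between samples. By transferring these cross-sample similarities from a large network to a small one, information is passed from the large model to the small model. Experiments on several computer vision tasks, with object detection as a representative example, show that the proposed model-information representation and inter-model transfer mechanism effectively strengthen the small network's use of the information in the data, alleviating its limited information-extraction ability.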
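The scale-aware, multi-branch, weight-sharing idea in the first contribution above can be illustrated with a toy sketch. It assumes, purely for illustration, that the branches differ only in the dilation rate applied to one shared kernel (a common way to give branches different receptive fields); the abstract does not state how the dissertation achieves scale awareness, and the names `dilated_conv1d` and `SharedMultiBranch` are hypothetical. The sketch works on a 1-D signal in plain Python as a stand-in for 2-D feature maps.

```python
from typing import List, Sequence


def dilated_conv1d(signal: Sequence[float], kernel: Sequence[float], dilation: int) -> List[float]:
    """Zero-padded 1-D cross-correlation with an odd-length dilated kernel.

    The output has the same length as the input; taps that fall outside
    the signal read as zero.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for t, w in enumerate(kernel):
            idx = i + (t - center) * dilation  # taps are spaced `dilation` apart
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


class SharedMultiBranch:
    """Hypothetical toy block: every branch reuses ONE kernel (one parameter set).

    Branches differ only in dilation, so each sees a different receptive
    field; treating dilation as the source of scale awareness is an assumption.
    """

    def __init__(self, kernel: Sequence[float], dilations: Sequence[int] = (1, 2, 3)):
        self.kernel = list(kernel)  # the single, shared set of weights
        self.dilations = tuple(dilations)

    def num_parameters(self) -> int:
        # Adding branches does not add weights.
        return len(self.kernel)

    def forward(self, signal: Sequence[float]) -> List[List[float]]:
        # One output per branch; a larger dilation covers a wider context.
        return [dilated_conv1d(signal, self.kernel, d) for d in self.dilations]
```

Because every branch reads the same `kernel`, a training update driven by objects seen through one branch also changes what every other branch computes. In this toy setting, that is the sense in which labels of different scales share information, in contrast to a pyramid that keeps separate parameters per scale.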

 

Other Abstract

Object detection is one of the most fundamental tasks in computer vision. Its purpose is to identify and locate object instances in an image that belong to a predefined set of classes. From the application perspective, object detection has drawn attention in areas such as public safety, video surveillance, autonomous driving, abnormal behavior recognition, and video understanding and retrieval. From the research perspective, a large number of downstream vision tasks rely on object detection, such as instance classification, instance segmentation, object tracking, face recognition, person re-identification, and human-object interaction. Object detection is therefore a research topic of both academic and practical value.

With the rise of deep learning and the emergence of large-scale object detection datasets, object detection research has made great progress through advances in network architecture, detection paradigms, and optimization techniques. However, inefficient use of the available information, both in how data are processed and in how network models are trained, has hampered further progress. How to devise mechanisms for representing and transferring information between different types of data and between different models has therefore become a key topic in object detection. This dissertation studies information transfer-based visual object detection, covering information transfer within a single-modality model, across time within a single-modality spatiotemporal sequence, across modalities, and across models.

The contributions of this dissertation are summarized as follows:

• This dissertation proposes an object detection network based on information transfer between objects of different scales within a single-modality model. To mitigate the information blockage between objects of the same class but different scales, it proposes a scale-aware, multi-branch, weight-sharing object detection network. Unlike scale pyramid-based methods, which handle scale differences with a divide-and-conquer approach, the proposed multi-scale weight-sharing network uses a single set of parameters to learn features for objects of different scales. Evaluations on a widely used object detection benchmark show that the proposed network brings significant improvements over previous object detection networks.

• This dissertation proposes a video object detection method based on information transfer between different views of the same object in a spatiotemporal sequence. To address the information blockage of the same object across video frames, it proposes an instance-level feature aggregation method over the spatiotemporal sequence. By exploiting the redundant information of the same object in different frames, the proposed video object detector functions stably when the object of interest is under fast motion or occlusion. Experiments on a large-scale video object detection dataset show that the proposed method outperforms existing work.

• This dissertation proposes a self-supervised visual object detection method based on heterogeneous information transfer across multi-modal spatiotemporal sequences. To mitigate the information blockage between different data modalities of the same object, it extracts object depth information from the point cloud modality and object motion information from the image modality and fuses them. This bridges the information gap between modalities and realizes a self-supervised paradigm for object detection. Experiments on a large-scale autonomous driving dataset show that the proposed self-supervised method achieves results comparable to its fully supervised counterpart.

• This dissertation proposes an acceleration method for deep networks based on information transfer between models of different sizes. To mitigate the limited ability of small networks to extract information from data, it proposes a model information representation based on cross-sample similarity and transfers this similarity information from a large network to a small one. Experiments on multiple computer vision tasks, with object detection as a representative example, show that the proposed information representation and transfer mechanism improves how effectively small networks use the information in the data and alleviates their limited information-extraction ability.
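The cross-sample similarity transfer in the last contribution above can be sketched as a distillation loss. In this sketch, which is an assumption rather than the dissertation's exact formulation, each model's "information" about a batch is its matrix of pairwise cosine similarities between sample embeddings, and the small (student) network is penalized by the mean squared difference between its off-diagonal similarities and those of the large (teacher) network. The dissertation may use a different similarity measure or loss, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def cosine_similarity_matrix(embeddings: Sequence[Sequence[float]]) -> Matrix:
    """Pairwise cosine similarity between every pair of samples in a batch."""
    # A zero vector gets norm 1.0 so it yields similarity 0 instead of dividing by zero.
    norms = [math.sqrt(sum(x * x for x in v)) or 1.0 for v in embeddings]
    n = len(embeddings)
    return [
        [
            sum(a * b for a, b in zip(embeddings[i], embeddings[j])) / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def similarity_transfer_loss(
    teacher: Sequence[Sequence[float]], student: Sequence[Sequence[float]]
) -> float:
    """Hypothetical loss: mean squared gap between teacher and student similarities.

    Only sample-to-sample relations are compared (the diagonal is skipped),
    so teacher and student embeddings may have different dimensionalities.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must embed the same batch")
    s_t = cosine_similarity_matrix(teacher)
    s_s = cosine_similarity_matrix(student)
    n = len(teacher)
    pairs = n * (n - 1)  # ordered pairs (i, j) with i != j
    if pairs == 0:
        return 0.0
    total = sum(
        (s_t[i][j] - s_s[i][j]) ** 2 for i in range(n) for j in range(n) if i != j
    )
    return total / pairs
```

Because only relations between samples are matched, a 2-D student embedding that preserves the angles of a 3-D teacher embedding receives a loss near zero, while a student that maps two dissimilar samples onto the same direction is penalized. This is what lets a narrower network be supervised by a wider one without matching their feature sizes.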

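Keywords: object detection, information transfer, network architecture design, unsupervised object detection, video object detection
Language: Chinese
Sub-direction classification: Image and video processing and analysis
Document Type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/44884
Collection: Intelligent Perception and Computing (智能感知与计算)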
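Recommended Citation (GB/T 7714):
Chen Yuntao (陈韫韬). Visual Object Detection Based on Information Transfer (基于信息传递的视觉目标检测) [D]. Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所), 2021.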
Files in This Item:
File name: 陈韫韬毕业论文答辩后改版v0.2.pdf (26,811 KB); document type: dissertation; access: open access; license: CC BY-NC-SA