Object detection based on deep learning has long been one of the most fundamental and important research topics in computer vision. Given an input image and a set of categories of interest, it aims to output classification and localization results in the form of bounding boxes with corresponding confidence scores. In academic research, object detection serves as the basis of many high-level vision tasks, such as object tracking, instance segmentation, pedestrian detection, and image captioning, and its development has advanced these tasks to a certain extent. In industrial applications, it provides technical support for a large number of downstream tasks, such as intelligent security, face recognition, autonomous driving, and remote sensing monitoring. Research on deep-learning-based object detection therefore has considerable theoretical significance and application value. In both academic research and industrial applications, the input images for object detection usually come from real scenes, which contain objects of widely varying scales. Scale variation is thus one of the most severe challenges in object detection. Traditional methods are not robust to it and struggle to perform well on objects of all scales at once. To alleviate scale variation, multi-scale feature fusion, typified by the feature pyramid, is widely used as a basic structure in object detection models. This dissertation therefore studies the scale variation problem in object detection from three aspects: multi-scale feature extraction and utilization, multi-scale feature fusion, and multi-scale feature learning.
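For orientation, the top-down fusion at the heart of a generic feature pyramid can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the method proposed in this dissertation: feature maps are one-dimensional lists, each level is half the length of the previous one, and the lateral convolutions of a real pyramid are omitted. The names `upsample_nearest` and `top_down_fusion` are hypothetical.

```python
def upsample_nearest(feature, factor=2):
    """Nearest-neighbour upsampling of a 1-D feature map (illustrative only)."""
    return [value for value in feature for _ in range(factor)]


def top_down_fusion(levels):
    """Fuse pyramid levels from the coarsest level down to the finest.

    `levels` is ordered fine-to-coarse, and each level is assumed to be
    half the length of the one before it. Real feature pyramids work on
    2-D maps and apply learned lateral convolutions; both are omitted here.
    """
    fused = [None] * len(levels)
    fused[-1] = list(levels[-1])  # the coarsest level is passed through unchanged
    for i in range(len(levels) - 2, -1, -1):
        coarse = upsample_nearest(fused[i + 1])  # bring coarse context to the finer resolution
        fused[i] = [fine + ctx for fine, ctx in zip(levels[i], coarse)]
    return fused


# Toy pyramid: three levels at resolutions 4, 2 and 1.
pyramid = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0],
    [100.0],
]
fused = top_down_fusion(pyramid)
```

In this toy setting every fine-resolution position receives an additive contribution from all coarser levels, which is how the pyramid passes coarse semantic context to the high-resolution features used for small objects.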
The main contributions and contents of this dissertation are summarized as follows:
• We propose a multi-scale object detection framework based on an adaptive coarse-to-fine interactor. Multi-scale feature fusion methods typified by the feature pyramid extract multi-scale features insufficiently and utilize them poorly, which leads to weak multi-scale representations. To alleviate these problems, we propose the adaptive coarse-to-fine interactor. Compared with previous feature pyramids, the proposed interactor makes full use of the coarse-grained features at each scale to complement the scale information of features at different resolutions, and it also extracts precise fine-grained spatial information from the multi-scale features. This effective extraction and utilization gives the constructed pyramid a stronger multi-scale representation. Experimental validation and analysis show that the proposed interactor offers advantages on both object detection and instance segmentation tasks. It effectively improves multi-scale feature extraction and exploits the information inherent in multi-scale features, which alleviates scale variation to some extent.
• We propose a multi-scale object detection framework based on a dynamic sample-individualized connector. Hand-designed architectures with a fixed multi-scale interaction process are inflexible for feature fusion and may cause learning deviation across layers, especially when fed with diverse samples. To address these problems, we propose the dynamic sample-individualized connector. In contrast to the rigid feature fusion mechanisms of traditional methods, the proposed connector dynamically adjusts the multi-scale fusion process according to the input sample. First, it dynamically selects the appropriate multi-scale features for fusion according to the input sample. Then, it automatically activates informative data flow paths based on the extracted multi-level features, enabling flexible multi-scale fusion. By activating different data flow paths for the extraction and interaction of multi-scale features, the connector handles samples in a divide-and-conquer manner. Experimental validation and analysis show that the dynamic sample-individualized connector significantly improves detection performance on objects of various scales.
• We propose a multi-scale dense image prediction framework based on a semantic-aware decoupled Transformer pyramid, which is designed to let multi-scale features learn global information effectively. Transformers can effectively help features learn long-range global information. Accordingly, with the designed semantic perception mechanism, the high-level features can explore the diversity of global semantic information. In addition, the proposed cross-level decoupled interaction strategy effectively and efficiently promotes interactive learning among features of different levels in the decoupled space, so that each level learns global information from both itself and the other levels. As a result, both multi-scale representation and global perception are enhanced. Experimental validation and analysis show that the proposed semantic-aware decoupled Transformer pyramid effectively and efficiently alleviates scale variation and achieves improved performance on various dense image prediction tasks, such as object detection and semantic segmentation.
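The idea of sample-dependent fusion described in the second contribution can be illustrated with a small gating sketch. This is an assumption-laden toy, not the dissertation's connector: the gate score (a per-level weight times the mean activation), the softmax normalization, the `threshold` rule for switching paths off, and the names `softmax` and `gated_fusion` are all hypothetical choices made for illustration. The levels are assumed to have been resized to a common resolution already.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(levels, gate_weights, threshold=0.1):
    """Fuse same-length feature vectors with input-dependent gates.

    Hypothetical gate: score_k = gate_weights[k] * mean(levels[k]).
    Paths whose softmax gate falls below `threshold` are switched off,
    and the remaining gates are renormalized to sum to one.
    """
    scores = [w * sum(level) / len(level) for w, level in zip(gate_weights, levels)]
    gates = softmax(scores)
    active = [g if g >= threshold else 0.0 for g in gates]
    total = sum(active)
    if total == 0.0:  # threshold too strict: fall back to the soft gates
        active, total = gates, 1.0
    active = [g / total for g in active]
    fused = [
        sum(g * level[i] for g, level in zip(active, levels))
        for i in range(len(levels[0]))
    ]
    return fused, active


weights = [1.0, 1.0, 1.0]

# Sample A: one level dominates, so the other two paths are switched off.
fused_a, gates_a = gated_fusion([[1.0, 1.0], [0.0, 0.0], [5.0, 5.0]], weights)

# Sample B: all levels respond equally, so all paths stay active.
fused_b, gates_b = gated_fusion([[3.0, 3.0], [3.0, 3.0], [3.0, 3.0]], weights)
```

The two samples pass through the same function yet activate different paths, which is the sense in which fusion is individualized per sample. In a trained model the gate would be a learned module rather than the fixed rule assumed here.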