The rapid development of multimedia and the Internet, together with the high popularity of mobile devices, has led to an explosive growth of digital images and videos. How to efficiently process this redundant visual data and extract useful information from it has therefore become increasingly important. Object detection on images and videos provides technical support for the effective management of such large-scale data. It is thus of great practical value to design simple, efficient, and robust object detection algorithms that help users extract useful information from massive image data.
In essence, object detection can be viewed as a classification task over candidate object regions, so efficient and accurate feature representation is a key factor for the final performance. Traditional methods rely on manually designed features, such as Haar-like features, Local Binary Patterns (LBP), and Histograms of Oriented Gradients (HOG). To further enhance representational power, multiple handcrafted features, as in Integral Channel Features (ICF) and Aggregated Channel Features (ACF), are often integrated to make the final prediction. However, the generalization ability of handcrafted features is often insufficient: they are usually designed only for specific targets (such as faces or pedestrians), their representation ability is relatively weak, and they lack connections with high-level semantics, making it difficult to bridge the "semantic gap".

In recent years, the success of Convolutional Neural Networks (CNNs) in image classification has fully demonstrated their powerful discriminative and feature representation abilities. Despite the great success of CNN-based object detection, it still faces a series of challenges, such as scale variation, occlusion, deformation, pose changes, and complex background interference. Moreover, deep networks often involve a large number of parameters and high floating-point operations (FLOPs), which makes them difficult to deploy in practice, while small shallow networks are fast but have limited capacity and thus low accuracy. Therefore, based on deep learning techniques, this dissertation designs network structures and strategies tailored to specific problems in order to guide the feature learning of the network, so that the network can learn more accurate and robust semantic feature representations, thereby greatly improving object detection performance.
The main contributions of this dissertation are summarized as follows:
1. This dissertation proposes a scale-adaptive deconvolutional regression network to address the diversity of pedestrian scales in surveillance scenarios and the inadequacy of CNNs in describing objects of different scales. By introducing deconvolutional layers to up-sample the deep feature maps, the network not only recovers spatial resolution but also retains the high-level semantic information of the features, forming a feature pyramid structure within the network. According to the scale of each candidate pedestrian, features are adaptively extracted from the feature maps of the corresponding resolution, which effectively improves detection performance for pedestrians of different scales. Furthermore, considering that a CNN abstracts features of different semantic levels at different layers, the network integrates multi-depth convolutional features to provide both global and local details of the pedestrian, further improving the discriminative power of pedestrian features and enhancing the final classification performance. Extensive experiments demonstrate the effectiveness of the proposed approach, which achieves the best detection results on several public pedestrian datasets at the time.
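The scale-adaptive mechanism above can be illustrated with a minimal sketch. The thresholds, the nearest-neighbour up-sampling (standing in for a learned deconvolutional layer), and the function names below are illustrative assumptions, not the dissertation's actual implementation:

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour stand-in for a learned deconvolution layer:
    doubles the spatial resolution of a 2-D feature map."""
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def assign_level(box_height, thresholds=(80, 160)):
    """Pick a pyramid level from a candidate pedestrian's pixel height
    (thresholds are hypothetical): small pedestrians use the finest
    (up-sampled) maps, large pedestrians the coarsest deep maps."""
    for level, t in enumerate(thresholds):
        if box_height < t:
            return level
    return len(thresholds)
```

Under this scheme a 50-pixel pedestrian would be described by the high-resolution deconvolved maps, while a 300-pixel one would use the original deep maps, so each scale is matched to an appropriate feature resolution.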
2. This dissertation proposes a deep coupling network, named CoupleNet, to handle the varied object configurations in generic scenarios, such as intra-class variation, inter-class interference, and the occlusion, truncation, and deformation of objects. Through a simple and efficient coupling structure, the network simultaneously learns information at different levels of the object, including global structure, local parts, and context, and effectively couples these three sources of information into a comprehensive feature representation, which greatly improves the performance of generic object detection. Moreover, by designing different coupling strategies and normalization methods, the network supports end-to-end training and testing, and the complementary advantages of the different features are fully exploited, improving the detector's ability to model objects in various complex scenes. Experimental results show that the proposed method substantially outperforms classical CNN-based detection methods on several popular datasets and achieved the best single-model performance of its time.
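The coupling of the three branches can be sketched as follows. The specific normalization (L2) and coupling strategies (element-wise sum or concatenation) shown here are plausible instances of the design described above, not necessarily the exact choices made in the dissertation:

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    """Scale a feature vector to unit L2 norm so that branches with
    different magnitudes contribute comparably before coupling."""
    return v / (np.linalg.norm(v) + eps)

def couple(global_feat, local_feat, context_feat, strategy="sum"):
    """Couple the global-structure, local-part, and context branches
    after normalization. 'sum' adds them element-wise (same output
    dimension); 'concat' stacks them into a longer vector."""
    feats = [l2_normalize(f) for f in (global_feat, local_feat, context_feat)]
    if strategy == "sum":
        return np.sum(feats, axis=0)
    return np.concatenate(feats)
```

Element-wise summation keeps the classifier's input dimension fixed, while concatenation preserves each branch separately at the cost of a wider feature; normalizing first prevents any single branch from dominating the coupled representation.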
3. This dissertation proposes a fully convolutional attention coupling network to cope with complex backgrounds and crowded natural scenes. The network effectively integrates an attention mechanism with the structural information of the object, so it can automatically focus on foreground objects and enhance their feature representation while suppressing interference from the background and similar-looking regions. Specifically, a cascade attention module generates a series of category-independent attention maps that gradually guide the network to focus on the target regions to be detected. The attention maps are then applied to the convolutional feature maps to enhance the features of those regions and guide the learning of the backbone features, improving the network's perception of objects in complex scenes. Moreover, the cascaded attention module can easily be embedded into existing convolutional neural networks and optimized jointly with the detection loss. Experimental results show that the proposed method effectively improves the performance of existing object detectors.
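A minimal sketch of such a cascade follows. Deriving each attention map from the channel mean, the softmax normalization, and the residual reweighting `feat * (1 + att)` are all simplifying assumptions standing in for the learned convolutional attention branch described above:

```python
import numpy as np

def softmax2d(x):
    """Normalize a 2-D map into a spatial probability distribution."""
    e = np.exp(x - x.max())
    return e / e.sum()

def cascade_attention(feat, n_stages=3):
    """Apply a cascade of category-independent attention maps to a
    (C, H, W) feature tensor. Each stage derives a spatial map from
    the current features (here a channel-mean + softmax stand-in for
    a learned conv) and reweights the features with it, so attention
    is refined stage by stage."""
    for _ in range(n_stages):
        att = softmax2d(feat.mean(axis=0))   # (H, W) attention map
        feat = feat * (1.0 + att)            # residual reweighting
    return feat
```

The residual form `1 + att` lets attention amplify foreground responses without zeroing out the rest of the map, and the whole module is differentiable, so it can be trained jointly with the detection loss.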
4. Detectors with large backbones currently achieve good performance, but they suffer from large parameter counts and high computational complexity, whereas small networks are efficient but perform poorly. In view of this problem, this dissertation proposes an object detection framework based on knowledge distillation. The framework makes full use of the strong discriminative features learned by a large network to guide the learning of a small network, greatly improving the small network's detection performance and facilitating deployment in resource-constrained scenarios. Specifically, we design a mask-guided knowledge distillation scheme that effectively integrates global feature distillation and local feature distillation, helping the model pay more attention to the local regions containing objects during training while preserving whole-image information. This makes the transfer of knowledge more targeted and speeds up the convergence of the small network. Extensive experiments demonstrate that the proposed approach maintains good detection accuracy while achieving a significant model compression rate.
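The combination of global and mask-guided local feature imitation can be sketched as a single loss term. The squared-error imitation criterion, the binary box mask, and the weighting `alpha` are illustrative assumptions; the dissertation's exact loss formulation may differ:

```python
import numpy as np

def mask_guided_distill_loss(student, teacher, mask, alpha=0.5):
    """Distillation loss over (C, H, W) feature maps.

    global_loss imitates the teacher over the whole image;
    local_loss weights the imitation by a binary mask that is 1
    inside object boxes and 0 elsewhere, focusing the transfer on
    regions that contain objects. alpha balances the two terms."""
    diff2 = (student - teacher) ** 2
    global_loss = np.mean(diff2)
    local_loss = np.sum(mask * diff2) / (mask.sum() + 1e-8)
    return alpha * global_loss + (1.0 - alpha) * local_loss
```

During training this term would be added to the small network's detection loss, so the student both fits the ground truth and imitates the teacher's features, most strongly inside object regions.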