Abstract

Object detection is an important research direction in the field of computer vision and is widely used in many tasks, including video surveillance, face recognition, autonomous driving, and so on. Recently, image object detection based on deep learning has made great progress. However, due to motion blur, partial occlusion, and rare poses in videos, existing image detectors find it difficult to handle these problems through frame-by-frame processing alone. In fact, videos contain rich temporal cues. Effectively exploiting this inherent temporal information for object detection in videos under the deep learning framework is currently a hot topic in the field of computer vision. This thesis studies the construction and optimization of spatio-temporal deep learning models, aiming to use the spatio-temporal cues in videos to improve the accuracy and speed of object detection.
The main contributions of this thesis are as follows:
1. This thesis presents a video object detection method based on Locally-Weighted Deformable Neighbours (LWDN) for spatio-temporal feature alignment. First, inspired by memory mechanisms of the human brain, we propose to iteratively propagate and update a memory feature between keyframes to improve the discriminative ability of the keyframe features. Then, LWDN is used to align the keyframe features to the non-keyframes along both the image space and the time axis, enhancing the detector's ability to learn features from the non-keyframes. In this process, low-level features of both keyframes and non-keyframes are utilized to learn the similarity weights and object location offsets. Finally, feature alignment is implemented through the linearly weighted object location offsets. Accordingly, a deep learning model is developed for video object detection, which integrates the above techniques for end-to-end training. Compared with flow-warping-based methods, the proposed method achieves better performance with fewer parameters.
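The LWDN alignment step described above can be sketched as follows. This is a minimal, illustrative NumPy implementation, not the thesis code: the function name, tensor shapes, and the assumption of integer offsets and pre-normalised similarity weights are all simplifications made here for clarity (in the actual method, offsets and weights are predicted by learned layers from low-level features).

```python
import numpy as np

def lwdn_align(key_feat, offsets, weights):
    """Toy sketch of Locally-Weighted Deformable Neighbours (LWDN)
    alignment: each non-keyframe position gathers K deformable
    neighbours from the keyframe feature map and sums them with
    learned similarity weights.

    key_feat: (H, W, C) keyframe feature map
    offsets:  (H, W, K, 2) integer (dy, dx) offsets per position
    weights:  (H, W, K) similarity weights (assumed normalised)
    returns:  (H, W, C) features aligned to the non-keyframe
    """
    H, W, C = key_feat.shape
    K = offsets.shape[2]
    aligned = np.zeros_like(key_feat)
    for y in range(H):
        for x in range(W):
            for k in range(K):
                dy, dx = offsets[y, x, k]
                # clamp the sampled neighbour location to the map
                sy = min(max(y + dy, 0), H - 1)
                sx = min(max(x + dx, 0), W - 1)
                # linearly weight the sampled keyframe feature
                aligned[y, x] += weights[y, x, k] * key_feat[sy, sx]
    return aligned

# toy usage: a 4x4 map with 8 channels and K=3 neighbours,
# with random values standing in for learned predictions
rng = np.random.default_rng(0)
key = rng.standard_normal((4, 4, 8))
off = rng.integers(-1, 2, size=(4, 4, 3, 2))
w = rng.random((4, 4, 3))
w /= w.sum(axis=2, keepdims=True)   # normalise weights per position
out = lwdn_align(key, off, w)
print(out.shape)  # (4, 4, 8)
```

Because the weights at each position sum to one, each aligned feature is a convex combination of sampled keyframe features, which is what makes the operation cheaper and more stable than dense flow warping.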
2. This thesis presents a video object detection method based on Learnable Spatio-Temporal Sampling (LSTS) for spatio-temporal feature alignment. First, the feature similarities between the pixels in the non-keyframe and their corresponding pixels in the keyframes, obtained with a location sampling strategy, are estimated. These similarities are then employed to linearly weight the features of the sampled pixels in the keyframes for pixel-level location alignment. Then, with the supervision of minimizing the object detection loss, the sampled locations are optimized and updated iteratively. The spatio-temporal alignment of features along the space and time axes is realized by propagating high-level features between different video frames. In this process, we further propose Sparsely Recursive Feature Updating (SRFU) and Dense Feature Aggregation (DFA) to improve the feature representations of all video frames, which helps achieve more accurate spatio-temporal correspondences. Based on the proposed spatio-temporal feature propagation method, a deep learning model for video object detection is established. Experimental results on the ImageNet VID benchmark demonstrate the effectiveness of the proposed method.
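The similarity-weighted aggregation at the core of LSTS can be sketched as below. This is a simplified NumPy illustration under assumptions made here: the sampling offsets are fixed integers rather than learnable parameters (in the thesis they are optimized by the detection loss), and dot-product similarity with a softmax stands in for the learned similarity estimation.

```python
import numpy as np

def lsts_aggregate(key_feat, nonkey_feat, sample_locs):
    """Toy sketch of Learnable Spatio-Temporal Sampling (LSTS)
    aggregation: each non-keyframe pixel compares its feature with
    keyframe features at sampled locations, then takes a
    similarity-weighted sum of those keyframe features.

    key_feat:    (H, W, C) keyframe features
    nonkey_feat: (H, W, C) non-keyframe features
    sample_locs: (K, 2) integer (dy, dx) sampling offsets
    returns:     (H, W, C) keyframe features propagated to the
                 non-keyframe
    """
    H, W, C = key_feat.shape
    out = np.zeros_like(nonkey_feat)
    for y in range(H):
        for x in range(W):
            q = nonkey_feat[y, x]
            sampled, sims = [], []
            for dy, dx in sample_locs:
                sy = min(max(y + dy, 0), H - 1)
                sx = min(max(x + dx, 0), W - 1)
                f = key_feat[sy, sx]
                sampled.append(f)
                sims.append(q @ f)           # dot-product similarity
            s = np.array(sims)
            s = np.exp(s - s.max())
            s /= s.sum()                     # softmax over samples
            out[y, x] = s @ np.stack(sampled)  # weighted sum
    return out

# toy usage: a fixed 3x3 sampling neighbourhood around each pixel
locs = np.array([(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)])
rng = np.random.default_rng(1)
key = rng.standard_normal((5, 5, 4))
nonkey = rng.standard_normal((5, 5, 4))
prop = lsts_aggregate(key, nonkey, locs)
print(prop.shape)  # (5, 5, 4)
```

Making the sampling locations learnable, as the thesis proposes, lets the network discover where in the keyframe each non-keyframe pixel should look, instead of relying on a hand-fixed neighbourhood like the 3x3 grid used in this sketch.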