With the development of the Internet, both the amount and the diversity of video data are increasing dramatically, bringing new challenges for processing and understanding such data. Human tracking and action analysis have been active research topics in computer vision for over three decades. Tracking is a middle-level vision problem that provides motion trajectories and pose parameters for subsequent high-level tasks. Human action recognition is one such high-level task, built upon low-level features and middle-level tracking modules. The major contributions of this thesis are summarized as follows:

1. We propose a new tracking algorithm that combines object and background information. Object and background appearance models are built simultaneously by non-parametric kernel density estimation. Our major contribution is a novel bidirectional learning framework that discriminates the object from the background. It provides a mechanism to detect occlusion and distraction, and performs feature selection that makes the tracker more robust to outliers. Through this learning framework, we are able to embed discriminative information into the generative appearance models.

2. Tracking is formulated as a problem of learning to discriminate the object from its nearby background. We propose a novel semi-supervised tracking algorithm that combines Semi-supervised Discriminant Analysis (SDA) with an online boosting framework. By exploiting the local geometric structure of the samples, the SDA-based weak classifier is made more robust to outliers. Meanwhile, we design an incremental updating mechanism for SDA so that it can adapt to appearance changes. We further propose an Extended SDA (ESDA) algorithm, which gives better discrimination ability.

3. Local space-time features have recently become a popular video representation for action recognition.
We evaluate and compare four different feature detectors and six local feature descriptors using a standard bag-of-features SVM approach. Among other interesting observations, we demonstrate that regular sampling of space-time features consistently outperforms all tested space-time interest point detectors for human actions in realistic settings. We also demonstrate a consistent ranking for the majority of methods across different datasets and discuss their advantages and limitations.

4. We propose to describe videos by dense trajectories. Specifically, we sample dense points from each frame and track them based ...
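The appearance models in contribution 1 rest on non-parametric kernel density estimation: the object and the background each get a density estimate built from their own pixel samples, and a new pixel can be compared against both. A minimal sketch of that idea, assuming grayscale intensities, a Gaussian kernel, and a fixed hand-picked bandwidth (all simplifications; the bidirectional learning framework and occlusion handling sit on top of this and are not shown):

```python
import math

def kde(samples, bandwidth):
    """Return a Gaussian kernel density estimate built from 1-D samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(
            math.exp(-((x - s) / bandwidth) ** 2 / 2.0) for s in samples
        )
    return density

# Toy intensities: a dark object against a bright background (illustrative data).
object_pixels = [30, 35, 32, 28, 31]
background_pixels = [200, 210, 205, 198, 215]

p_obj = kde(object_pixels, bandwidth=5.0)
p_bg = kde(background_pixels, bandwidth=5.0)

def label(intensity):
    """Classify a pixel by comparing the two density estimates."""
    return "object" if p_obj(intensity) > p_bg(intensity) else "background"

print(label(33))   # near the object mode -> "object"
print(label(207))  # near the background mode -> "background"
```

In practice the densities would be built over color or feature vectors and updated online as the tracker runs; the comparison of the two likelihoods is what allows discriminative information to be embedded into the generative models.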
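The bag-of-features pipeline in contribution 3 quantizes local space-time descriptors against a visual codebook and feeds the resulting histograms to an SVM. A minimal sketch of the quantization and histogram step, using toy 2-D descriptors and a hand-picked codebook (in the actual pipeline the codebook is learned, typically with k-means, and the SVM classification stage is omitted here):

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codeword with the smallest Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(descriptor, codebook[i]))

def bag_of_features(descriptors, codebook):
    """L1-normalized histogram of codeword assignments for one video."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_codeword(d, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

# Hypothetical codebook of three visual words and four local descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, -0.2), (0.9, 1.1), (5.2, 4.8), (1.2, 0.8)]

print(bag_of_features(descriptors, codebook))  # -> [0.25, 0.5, 0.25]
```

The same fixed-length histogram representation works regardless of which detector or descriptor produced the local features, which is what makes the detector/descriptor comparison in this contribution possible under one common classifier.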