With the rapid development of the Internet and mobile terminals, the amount of video data is increasing dramatically, which poses new challenges for video processing. Human action recognition is an essential issue and a key step in video processing, directly affecting other fields such as video content understanding and scene understanding. Meanwhile, human action recognition is a critical technology for computer vision and pattern recognition applications, for example video surveillance, video indexing, and human-computer interaction. Recently, researchers have conducted a great deal of work on human action recognition; however, the semantic information has not been fully explored. To overcome this drawback, we study semantic methods that bridge the gap between computers and humans. The main contributions of this thesis are summarized as follows:

1. To overcome the drawbacks of the traditional bag-of-words model, a novel coding strategy called context-constrained linear coding is proposed. This method introduces the concept of contextual distance, which explicitly considers the spatio-temporal relationships among feature points. In addition, the proposed method uses several nearest codewords to linearly reconstruct each local feature, which alleviates the quantization error.

2. To address the problem that traditional maximum margin clustering (MMC) treats each frame independently and neglects the temporal relationship between contiguous frames in the same action video, contextual maximum margin clustering (CMMC) is proposed. A temporal regularization term is added to the objective of traditional MMC, so that CMMC not only finds maximum-margin hyperplanes but also explicitly considers the temporal information between contiguous frames.

3. We propose a novel method for cross-view action recognition via a continuous virtual path.
All the virtual views on the continuous virtual path are concatenated into an infinite-dimensional feature, which acts as the final feature vector for classification. A virtual view kernel is proposed to compute the similarity between two infinite-dimensional features. We derive the virtual view kernel under an information-theoretic framework that maximizes discrimination. Furthermore, we present a constraint strategy to exploit the visual information contained in the unlabeled samples.

4. The traditional multi-task learning methods neglect the constraints t...
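The local coding step of contribution 1 can be sketched as follows. This is a minimal illustration only: plain Euclidean nearest-codeword search stands in for the thesis's contextual distance, and the function and variable names are hypothetical, not taken from the thesis.

```python
import numpy as np

def context_linear_code(x, codebook, k=5):
    """Encode a local feature by linearly reconstructing it from its
    k nearest codewords (simplification: Euclidean distance replaces
    the contextual distance used in the thesis)."""
    # Distance from the feature to every codeword in the codebook.
    d = np.linalg.norm(codebook - x, axis=1)
    idx = np.argsort(d)[:k]          # indices of the k nearest codewords
    B = codebook[idx]                # (k, dim) local basis
    # Least-squares weights that reconstruct x from the local basis,
    # normalized so the code is shift-invariant (weights sum to one).
    w, *_ = np.linalg.lstsq(B.T, x, rcond=None)
    w /= w.sum()
    code = np.zeros(len(codebook))
    code[idx] = w                    # sparse code over the full codebook
    return code
```

Because each feature is reconstructed from several nearby codewords rather than assigned to a single one, the quantization error of hard assignment is reduced.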
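The effect of the temporal regularization in contribution 2 can be illustrated with a toy objective. The hinge-loss form, the label-disagreement penalty, and all names below are simplifying assumptions for illustration; the actual CMMC formulation optimizes cluster labels jointly within the MMC framework.

```python
import numpy as np

def cmmc_objective(margins, labels, lam=1.0):
    """Toy objective in the spirit of CMMC: a per-frame hinge loss on
    the margins plus a penalty on label changes between contiguous
    frames of the same action video (illustrative only)."""
    # Standard large-margin term: hinge loss over the frames.
    hinge = np.maximum(0.0, 1.0 - labels * margins).sum()
    # Temporal regularization: count contiguous frame pairs whose
    # cluster labels disagree.
    temporal = np.sum(labels[:-1] != labels[1:])
    return hinge + lam * temporal
```

A frame sequence whose labels flip back and forth pays the temporal penalty even when every frame is on the correct side of the margin, which is exactly the smoothness that frame-independent MMC cannot express.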