With the rapid growth of multimedia and network technology, the number of videos on the Internet and in our daily lives is increasing rapidly, and videos of human actions make up a large percentage of them. It is therefore urgent to recognize and locate human actions in these videos efficiently. Human action recognition (HAR) is an active topic in computer vision: it aims to recognize and analyze individual actions, interactive actions, and group actions in images or videos using computer vision techniques. The essence of the problem is how to analyze and represent human actions efficiently, so as to bridge the gap between low-level features and high-level semantics. Besides its research value, HAR has many potential applications, such as smart surveillance, automatic analysis of sports events, and tracking.

Recently, with the development of HAR and the widespread use of the bag-of-visual-words (BOVW) model, the vital role of local features has become obvious. Most previous work uses only the local features themselves, or simple additional geometric information such as their coordinates; little attention has been paid to the relations among the features. In this thesis we focus on the relations among local features, propose strategies to describe these relations, and combine them with local feature vectors to mine local information more comprehensively. The main contributions of our work, both based on the BOVW framework, are summarized as follows:

(1) We propose a new measurement to evaluate the relations among local features and use it to compute the visual similarity between each pair of actions. In our framework, each local feature is indexed by the action label of its source video; we extract the visual similarity of each feature and assign it a weight. Combining all the similarities with the global feature distribution of each action, we compute the visual similarity between each pair of actions.
Inspired by metric learning, this similarity is embedded into the Euclidean space so as to enlarge the distance between two features when they come from different but similar actions. Thus we obtain a more discriminative visual vocabulary. Experiments on the Weizmann and KTH datasets show that our approach outperforms the traditional vocabulary-based approach by about 5%.

(2) We propose a novel representation of the local feature neighborhood. We find several linearly independent nearest features around each local feature and utilize their spatial or temporal information to co...
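As background for both contributions, the BOVW representation they build on can be sketched as follows. This is a minimal illustrative sketch only: the function name `build_bovw_histogram`, the random toy descriptors, and the fixed 4-word vocabulary are assumptions for demonstration, not the thesis's actual pipeline (which would learn the vocabulary from training features, e.g. by k-means clustering).

```python
import numpy as np

def build_bovw_histogram(descriptors, vocabulary):
    """Quantize local feature descriptors against a visual vocabulary
    and return the normalized bag-of-visual-words histogram."""
    # Pairwise Euclidean distances: (num_descriptors, num_words).
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    # Assign each descriptor to its nearest visual word (hard assignment).
    words = dists.argmin(axis=1)
    # Count word occurrences and L1-normalize into a histogram.
    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
    return hist / hist.sum()

# Toy data: 50 random 3-D local descriptors and a 4-word vocabulary.
rng = np.random.default_rng(0)
descriptors = rng.standard_normal((50, 3))
vocabulary = rng.standard_normal((4, 3))
hist = build_bovw_histogram(descriptors, vocabulary)
```

Each video is thus summarized by one fixed-length histogram, which is what a standard classifier (e.g. an SVM) consumes; the thesis's contributions modify how the vocabulary is formed and how neighborhood relations enrich the descriptors before this quantization step.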