Action Recognition Based on the Bag-of-Visual-Words Model


	Action Recognition Based on the Bag-of-Visual-Words Model
Alternative title	The Bag-of-Visual-Words Model based Human Action Recognition
	原春锋
	2010-12-03
Degree type	Doctor of Engineering
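Chinese abstract	Human action recognition (HAR) is one of the most closely watched frontier directions and most active research topics in computer vision. It refers to using computer vision techniques to recognize and understand, from images or video sequences, individual human actions, interactions between people, and interactions between people and their environment. In the visual analysis of moving objects, HAR occupies an extremely important position: it belongs to the high-level processing part of vision and is the ultimate goal of motion analysis. Besides its theoretical research value, HAR has great application prospects and potential economic value in intelligent surveillance, motion analysis, human-computer interaction, virtual reality, and related areas; studying human behavior patterns will bring entirely new modes of interaction to people's lives. In recent years action recognition has developed rapidly, and the bag-of-visual-words (BOVW) approach built on local features has increasingly become one of the mainstream methods. BOVW-based recognition avoids the dependence of traditional action recognition methods on foreground segmentation, object detection, and object tracking; compared with global-feature-based methods, it is also more robust to noise, occlusion, and intra-class variation. This thesis takes human action recognition in video as its research topic and studies in depth the key and difficult issues of BOVW-based recognition systems, including feature extraction, motion representation, and action recognition. The main work and contributions of the thesis are as follows: (1) A new local spatio-temporal feature is proposed: a spatio-temporal region descriptor based on covariance and the Log-Euclidean Riemannian metric. It computes the covariance matrix of low-level features to represent the regions of interest detected in a video sequence; because covariance matrices do not lie in a Euclidean space, we introduce the Log-Euclidean Riemannian metric to compute the distance between two covariance matrices. This descriptor can fuse several low-level features at once, such as optical flow and gradients, whereas most earlier descriptors collect statistics of only one low-level feature. In addition, we use the Earth Mover's Distance (EMD) to match pairs of video sequences; compared with the widely used Euclidean distance, EMD performs better when matching histograms of different sizes. (2) A pyramid vocabulary tree is proposed for building the vocabulary. For the BOVW model, the vocabulary size strongly affects the recognition result. In general, a large vocabulary is more discriminative between action classes, while a smaller vocabulary tolerates intra-class variation better and is more robust to noise. We propose a pyramid-shaped vocabulary tree to model local spatio-temporal features; compared with a conventional single vocabulary, the vocabulary tree combines the advantages of large and small vocabularies, distinguishing inter-class differences while tolerating intra-class variation. Furthermore, by incorporating the spatio-temporal information of the local features, we propose a sparse spatio-temporal pyramid matching kernel (SST-PMK) to measure the similarity between video sequences. Experiments show that the proposed SST-PMK outperforms the other kernel functions commonly used in SVM classifiers. (3) A spatio-temporal proximity distribution matrix is proposed to capture the spatial geometric distribution of local spatio-temporal features; it also characterizes the appearance features of the action classes. This matrix overcomes one of the main weaknesses of the BOVW approach, namely that its lack of geometric constraints prevents it from distinguishing actions of different classes that share the same spatio-temporal features but differ in the spatial distribution of those features. In addition, a corresponding spatio-temporal proximity distribution kernel is designed to measure the similarity between pairs of videos. The proposed action recognition algorithm based on the spatio-temporal proximity distribution achieved the best recognition accuracy reported to date on the KTH dataset. (4) A new fusion strategy is proposed: two complementary action representations are fused through a context-based fusion mechanism for action recognition. On the one hand, we adopt two complementary action representations: action modeling based on the appearance information of spatio-temporal interest points, and action modeling based on the location information of spatio-temporal interest points...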
English abstract	Human action recognition (HAR) is an active topic in computer vision. HAR aims to recognize and analyze individual actions, interactions, and group actions from images or video using computer vision techniques. HAR plays an important role in the visual analysis and understanding of human motion; it belongs to the high-level part of vision and is the final objective of motion analysis. Besides, it has many potential applications, such as smart surveillance, automatic analysis of sports events, human-computer interfaces, and virtual reality. Research in HAR will bring new modes of interaction to people's lives. HAR has developed rapidly in recent years, and one trend in action recognition is the bag-of-visual-words (BOVW) appearance-based approach, which exploits local spatio-temporal features. BOVW-based approaches avoid several difficult preprocessing steps such as foreground segmentation, object detection, and object tracking. Moreover, BOVW-based approaches are more robust to noise, occlusion, and action variation, including geometric variation, than global-feature-based ones. In this thesis, we study human action recognition, which involves many important and difficult problems, e.g. feature extraction, action representation, and action recognition. The main contributions of our work are summarized as follows: (1) We propose a new local spatio-temporal feature to describe the cuboids detected in video sequences. Specifically, the descriptor uses the covariance matrix to characterize the low-level features within each cuboid. Covariance matrices do not lie in a Euclidean space, so the Log-Euclidean Riemannian metric is employed to measure the distances between covariance matrices. Moreover, the Earth Mover's Distance (EMD) is employed to match pairs of video sequences for the first time. Compared with the widely used Euclidean distance, EMD is more robust in matching histograms/distributions of different sizes. Experimental results on three datasets demonstrate the effectiveness of the proposed framework. (2) We propose a pyramid vocabulary tree to model local spatio-temporal features. For BOVW-based methods, it is crucial to determine the size of the vocabulary. Usually, a large vocabulary is more discriminative for inter-class action classification, while a small one is more robust to noise and thus more tolerant of intra-class variance. The proposed pyramid vocabulary tree can both characterize the inter-class difference and allow intra-c...
Keywords	Spatio-temporal Covariance Descriptor; Riemannian Metric; Earth Mover's Distance; Pyramid Vocabulary Tree; Sparse Spatio-temporal Pyramid Matching Kernel; Spatio-temporal Proximity Distribution; Bag-of-visual-words; Features Fusion; Action Recognition
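Language	Chinese
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/6311
Collection	Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714)	原春锋. Action Recognition Based on the Bag-of-Visual-Words Model [D]. Institute of Automation, Chinese Academy of Sciences. Graduate University of Chinese Academy of Sciences, 2010.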

Files in this item
File name/size	Document type	Version type	Access type	License
CASIA_20071801462807 (5826 KB)			Not yet open	CC BY-NC-SA
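
Illustrative sketch
The first contribution in the abstract describes a cuboid by the covariance matrix of its low-level features and compares two such matrices with the Log-Euclidean Riemannian metric. The Python sketch below shows one common way to compute that distance, namely the Frobenius norm of the difference of the matrix logarithms. It is not taken from the thesis: the function names, the small ridge term `eps`, and the random stand-in feature arrays are all assumptions made for illustration.

```python
import numpy as np


def covariance_descriptor(features, eps=1e-6):
    """Covariance of the low-level feature vectors sampled inside one cuboid.

    features: array of shape (n_samples, d), one row per sampled location.
    eps: small ridge added to the diagonal (an assumption of this sketch)
         so the matrix stays symmetric positive definite.
    """
    cov = np.cov(features, rowvar=False)
    return cov + eps * np.eye(cov.shape[0])


def spd_log(mat):
    """Matrix logarithm of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    # V @ diag(log(lambda)) @ V.T, written with broadcasting over columns.
    return (eigvecs * np.log(eigvals)) @ eigvecs.T


def log_euclidean_distance(a, b):
    """Log-Euclidean distance: Frobenius norm of log(a) - log(b)."""
    return np.linalg.norm(spd_log(a) - spd_log(b), ord="fro")


# Random stand-ins for per-location feature vectors (e.g. optical flow and
# gradient components) of two cuboids; real features would come from video.
rng = np.random.default_rng(0)
cuboid_a = rng.normal(size=(200, 4))
cuboid_b = 2.0 * rng.normal(size=(200, 4))

cov_a = covariance_descriptor(cuboid_a)
cov_b = covariance_descriptor(cuboid_b)
dist_ab = log_euclidean_distance(cov_a, cov_b)
dist_ba = log_euclidean_distance(cov_b, cov_a)
dist_aa = log_euclidean_distance(cov_a, cov_a)
```

Because the comparison happens after the matrix logarithm, the descriptors can be treated as ordinary vectors from that point on, which is what allows them to be clustered into a visual vocabulary with standard Euclidean tools.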