CASIA OpenIR > Graduates > Master's Theses
基于时空建模的行为识别研究 (Research on Action Recognition Based on Spatio-Temporal Modeling)
罗锦钊 (Luo Jinzhao)
2023-06
Pages: 61
Degree type: Master's

Abstract (Chinese)

With the growth of computing power and the rapid increase in video data, intelligent video analysis has become an important research direction in computer vision. As one of its sub-directions, human action recognition has broad application prospects in human-computer interaction, virtual reality, video analysis, intelligent surveillance, and health and elderly care.

Deep learning methods for action recognition have developed rapidly in recent years, but both RGB-video-based and skeleton-based approaches have shortcomings. RGB-based methods extract spatio-temporal information from video with deep neural networks such as recurrent neural networks and convolutional neural networks, or with Transformer architectures. However, these methods are easily affected by environmental factors such as complex backgrounds and lighting conditions, and they also suffer from redundant input data and insufficient modeling of spatio-temporal contextual relationships. With the continued development of sensors, human skeleton data, as a compact and efficient representation of human structure, has gradually become one of the important basic data sources for action recognition. Skeleton-based action recognition avoids the influence of complex backgrounds and expresses spatio-temporal features more efficiently, but it models local action features and long-distance joint relationships insufficiently. This thesis therefore explores local feature modeling and spatio-temporal relationship modeling of actions. The main research work and contributions are as follows:

(1) To address the insufficient modeling of local action features and the inadequate representation of spatial relationships in current skeleton-based methods, a skeleton action recognition method based on a Temporal-Channel Topology Enhanced Network is proposed. The network uses channel attention to increase the weight of key nodes and joints in classification, and introduces a channel distance matrix to dynamically model long-distance node relationships under different actions, building a robust local action feature representation and improving recognition accuracy. The method is evaluated on the skeleton action recognition datasets NTU RGB+D, NTU RGB+D 120, and FineGym, and shows superior performance compared with existing methods.

(2) To address the redundancy of RGB video input and the insufficient modeling of context in complex action scenes, an action recognition and detection method based on a Temporal Difference Fusion Network is proposed. The network uses a dual-stream architecture to extract static appearance information and motion information separately; a temporal difference module models the appearance information of action interaction regions to reduce input redundancy; a channel fusion attention module extracts spatio-temporal action features and models action-scene context relationships; and action recognition and detection are implemented on the YOLO detection framework. The model is validated on the UCF101-24, J-HMDB-21, and AVA datasets, reaching accuracies of 82.1%, 78.1%, and 18.8%, respectively.

Both studies focus on modeling spatio-temporal action features with convolutional neural networks. The first emphasizes local action feature modeling, the basis for recognizing complex actions; the skeleton modality it uses can extend and complement the RGB modality used in the second. The second focuses on modeling the spatio-temporal context of complex actions; its modular design can integrate different modalities, strengthening spatio-temporal feature modeling and improving the accuracy of video action recognition.

Abstract (English)

With the improvement of computing power and the rapid growth of video data, intelligent video analysis has become an important research direction in computer vision. Human action recognition, one of its sub-directions, has broad application prospects in human-computer interaction, virtual reality, video analysis, intelligent surveillance, and health and elderly care. Deep learning methods for action recognition have developed rapidly in recent years, but both RGB-based and skeleton-based methods have certain deficiencies. RGB-based methods extract spatio-temporal information from video with deep neural networks such as recurrent neural networks and convolutional neural networks, or with Transformer architectures. However, these methods are easily affected by environmental factors such as complex backgrounds and lighting conditions, and they also suffer from serious redundancy in their input data and insufficient modeling of spatio-temporal contextual relationships. With the continuous development of sensors, human skeleton data has gradually become one of the important basic data sources for action recognition thanks to its compact and efficient representation of human structure. Although skeleton-based action recognition avoids the influence of complex backgrounds and expresses spatio-temporal features more efficiently, it still models local action features and long-distance joint relationships insufficiently. Therefore, this study explores local feature modeling and spatio-temporal relationship modeling of actions. The main research work and contributions are as follows:

(1) The paper proposes a skeleton-based action recognition method that overcomes the limitations of existing methods in modeling local action features and representing spatial relationships. The proposed Temporal-Channel Topology Enhanced Network employs channel attention to prioritize critical nodes and joints during classification, and introduces a channel distance matrix to dynamically model distant node relationships under different actions. The network thereby constructs a robust local action feature representation that improves recognition accuracy. The method is evaluated on the skeleton action recognition datasets NTU RGB+D, NTU RGB+D 120, and FineGym, and shows superior performance compared with existing methods.
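As a rough illustration of the two ingredients described above, the sketch below combines channel attention over per-joint skeleton features with a per-channel affinity matrix standing in for the channel distance matrix, so that distant joints can exchange information channel by channel. This is a minimal numpy sketch under assumed shapes, not the thesis's implementation; all names, dimensions, and the random stand-in for learned parameters are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: C feature channels, V skeleton joints.
C, V = 8, 25                              # e.g. 25 joints, as in NTU RGB+D
rng = np.random.default_rng(0)
feat = rng.standard_normal((C, V))        # per-joint features

# 1) Channel attention: squeeze over joints, then gate each channel so
#    channels that encode key joints carry more weight in classification.
squeeze = feat.mean(axis=1)               # (C,)
attn = 1.0 / (1.0 + np.exp(-squeeze))     # sigmoid gate, one weight per channel
feat = feat * attn[:, None]

# 2) Channel-wise distance matrix: one V x V affinity per channel, letting
#    each channel model its own long-distance joint relationships instead
#    of sharing a single fixed skeleton topology.
dist = rng.standard_normal((C, V, V))     # stand-in for learned parameters
adj = softmax(dist, axis=-1)              # row-normalized affinities per channel
out = np.einsum('cuv,cv->cu', adj, feat)  # per-channel graph aggregation

print(out.shape)                          # same (C, V) layout, now mixing distant joints
```

In a real network the affinity would be produced from the input features and refined during training; the fixed random matrix here only shows the aggregation pattern.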

(2) The paper proposes an action recognition and detection method based on a Temporal Difference Fusion Network, addressing the information redundancy of RGB video input and the difficulty of modeling complex action-scene context. The model uses a dual-stream architecture to extract static appearance information and motion information separately. A Temporal Difference Module models the appearance information of action interaction regions to reduce input redundancy, and a Channel Fusion Attention Module extracts spatio-temporal action features and models action-scene context relationships. Finally, the YOLO detection framework is employed to implement action recognition and detection. The model is evaluated on the UCF101-24, J-HMDB-21, and AVA datasets, achieving accuracies of 82.1%, 78.1%, and 18.8%, respectively.
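The core idea of the two modules above can be sketched in a few lines: frame-to-frame differencing keeps motion cues while suppressing the redundant static appearance, and a channel-wise gate fuses the motion and appearance streams. This is a simplified numpy sketch under assumed shapes, not the network described in the thesis; the gating rule and all dimensions are hypothetical.

```python
import numpy as np

# Hypothetical shapes: T frames, C channels, H x W spatial grid.
T, C, H, W = 8, 4, 7, 7
rng = np.random.default_rng(0)
clip = rng.standard_normal((T, C, H, W))   # feature maps of one video clip

# Temporal difference: subtracting adjacent frames keeps motion cues and
# suppresses the static appearance that makes raw video input redundant.
motion = clip[1:] - clip[:-1]              # (T-1, C, H, W)

# Static appearance stream: a single aggregated frame representation.
appearance = clip.mean(axis=0)             # (C, H, W)

# Channel fusion attention (simplified): weight each channel by how much
# motion energy it carries, then blend the two streams per channel.
energy = np.abs(motion).mean(axis=(0, 2, 3))        # (C,) motion energy
gate = 1.0 / (1.0 + np.exp(-energy))                # sigmoid gate per channel
fused = gate[:, None, None] * motion.mean(axis=0) + \
        (1.0 - gate)[:, None, None] * appearance    # (C, H, W) fused feature

print(fused.shape)
```

In the actual method the fused spatio-temporal feature would then feed a YOLO-style detection head; that stage is omitted here.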

In general, the two studies focus on modeling spatio-temporal features for action recognition with convolutional neural network architectures. The first study emphasizes the modeling of local action features, which is essential for recognizing complex actions; the skeleton modality it uses serves as an extension of and complement to the RGB modality used in the second study. The second study focuses on modeling complex action-scene context relationships, and its modular design allows different modalities to be integrated easily, further strengthening the model's spatio-temporal feature extraction and improving the accuracy of video-based action recognition.

Keywords: Action Recognition; Human Skeleton; Attention Mechanism; Spatio-Temporal Modeling
Language: Chinese
Sub-direction (of the seven major directions): Image and Video Processing and Analysis
State Key Laboratory planning direction: Visual Information Processing
Associated dataset to be deposited: (not specified)
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/52288
Collection: Graduates > Master's Theses
Recommended citation (GB/T 7714):
罗锦钊. 基于时空建模的行为识别研究[D], 2023.
Files in this item:
Thesis.pdf (11026 KB) | Document type: Thesis | Access: Restricted | License: CC BY-NC-SA