Human Pose and Action Perception Based on Spatiotemporal Information Analysis (基于时空信息分析的人的姿态与行为感知)


	Si Chenyang (司晨阳)
	2021-05-19
Pages	142
Degree type	Doctoral
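Abstract (Chinese)	Vision-based human perception is a complex and important research topic of great scientific value. The American psychologist Albert Mehrabian proposed that communication consists of 7% words, 38% tone and pace of voice, and 55% facial expression and body movement; representing and understanding pose and action is therefore especially important in human perception tasks. Although related research has made progress, existing methods and techniques still have shortcomings, and the underlying theory and technology leave considerable room for improvement. Spatiotemporal information is the basic constituent of human pose and action, and in unconstrained environments the complex states of human images and videos in the spatial and temporal domains pose great challenges for vision-based human perception. This thesis addresses how to design effective learning methods for learning and understanding the spatiotemporal information of human pose and action, and studies human spatiotemporal structure systematically and in depth. First, at the level of spatial multi-view perception, it studies pose understanding and representation in multi-view human action image synthesis. Next, based on spatial-temporal hierarchical perception and spatial-temporal cooperative perception respectively, it explores the automatic extraction of spatiotemporal action features from human action sequence data. Finally, because human action perception relies heavily on large amounts of labeled data, it investigates semi-supervised spatiotemporal perception. The main results of this thesis are the following four.
Multi-view human action image synthesis involves the problem of spatial multi-view perception of the human body. The thesis therefore proposes a pose-based multi-view human action image synthesis method that, by analyzing and understanding low-dimensional human pose, addresses the challenges that the variability of human actions and the large differences between 2D poses under different views pose for multi-view human image synthesis. In addition, to keep the appearance of the synthesized new-view image consistent with the original-view image, the method uses modular networks and a multi-stage adversarial learning strategy to ensure that the human image has the correct appearance. Specifically, the method comprises three stages that progressively synthesize the final high-dimensional human image from low-dimensional human pose. In the first stage, a pose transformer network synthesizes the target-view human skeleton from the input source-view skeleton and the target view information. In the second stage, a foreground transformer network synthesizes the target-view human foreground image from the predicted target-view skeleton and the source-view human appearance. In the third stage, a background transformer network generates the target image with a clear background. To address the image blur caused by the mean squared error loss, adversarial training is used in multiple stages to improve synthesis quality.
Skeleton-based action recognition extracts motion features from given skeleton video data to predict the action class. To extract complex spatiotemporal features from skeleton sequences efficiently, the thesis proposes hierarchical spatial reasoning and temporal stack learning networks, which model the spatial structure features and the temporal dynamic features of the human body separately, in the manner of spatial-temporal hierarchical perception. Human actions are accomplished by the coordination of body parts; walking, for example, requires the legs and arms to coordinate. The hierarchical spatial reasoning network uses a hierarchical residual graph neural network to mine the spatial structural dependencies of the human body and thus represent its spatial features effectively. Temporal motion information is another very important discriminative cue in action recognition, and the temporal stack learning network captures the detailed motion features of long skeleton sequences. For training, a clip-based incremental loss is further proposed that improves the learning ability of the temporal stack learning network and offers an effective approach to optimization over long sequences.
The above work shows that the spatial structural dependencies and temporal dynamics of the human body are very important for action recognition. Although that method performs very well, it ignores cooperative perception across the temporal and spatial domains. Considering that skeleton videos naturally contain several kinds of dependencies, namely spatial dependencies within each frame, temporal dependencies between frames, and co-occurrence between the spatial and temporal domains, the thesis proposes an attention-enhanced graph convolutional recurrent neural network for skeleton-based action recognition. The network not only captures the spatial dependencies and temporal dynamics of skeleton sequences effectively, but also explores dependencies that occur simultaneously in the spatial and temporal domains. Moreover, the model uses a visual attention mechanism to adaptively select joint information relevant to the action class and strengthens the propagation of this information through the network, which promotes the learning of more discriminative fine-grained motion features.
Given that action recognition currently requires large amounts of labeled data to train models, the thesis investigates semi-supervised action recognition. Self-supervised learning has been shown to learn rich semantic features from large-scale unlabeled data via pretext tasks. This thesis therefore proposes, for the first time, combining self-supervised learning with semi-supervised human action recognition: an adversarial self-supervised learning method that couples self-supervised learning into semi-supervised action recognition through adversarial learning and the exploration of neighborhood consistency. Specifically, an effective self-supervised method is first designed to learn the semantic information of unlabeled data; it improves representation learning for skeleton-based action recognition by exploring sample relations within a neighborhood. A regularization based on adversarial training is then proposed to resolve the mismatch between the representation distributions of labeled and unlabeled samples. Extensive experiments show that the proposed method outperforms existing semi-supervised methods on semi-supervised action recognition.
In summary, this thesis studies in depth the scientific problems of spatial multi-view perception, spatial-temporal hierarchical perception, spatial-temporal cooperative perception, and semi-supervised spatiotemporal perception of the human body, and achieves superior performance on human perception tasks including multi-view human image synthesis, human action recognition, and semi-supervised action recognition.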
Abstract (English)	Vision-based human perception is a complex and important research topic of considerable scientific value. According to research by Albert Mehrabian, personal communication consists of 7% spoken words, 38% tone of voice, and 55% body language. The representation and understanding of human action is therefore particularly important in vision-based human perception. Although recent studies have made considerable progress, existing methods still have limitations, and much room for improvement remains in both theory and technique. Spatial and temporal information is the basic constituent of human action. In unrestricted environments, the complex states of the human body in the spatial and temporal domains pose major challenges for vision-based human perception. This thesis focuses on how to design effective methods to learn and understand the spatial and temporal information of human action. Specifically, it discusses three issues related to the understanding of spatiotemporal information. First, for spatial multi-view perception, it studies pose understanding and representation in multi-view human action image synthesis. Then, based on spatial-temporal hierarchical perception and spatial-temporal cooperative perception, it discusses the problem of automatically extracting spatiotemporal features from human action sequences. Finally, because recent human action perception methods rely heavily on manual annotations, it discusses the problem of semi-supervised spatiotemporal perception. The main contributions of this thesis are summarized as follows.
Multi-view human action image synthesis involves the problem of spatial multi-view perception in 2D space. The thesis therefore proposes a pose-based human image synthesis method. By analyzing and understanding low-dimensional human pose data, the proposed method addresses the challenges that the variability of human posture poses for human image synthesis. In addition, the method adopts modular networks and multi-stage adversarial learning to ensure the correct appearance of the generated human image. Specifically, the proposed method contains three networks for three stages. (1) In the first stage, a pose transformer network synthesizes the 2D target pose for another view from the condition pose. (2) In the second stage, given the predicted target pose, a foreground transformer network synthesizes the target human foreground from the condition human foreground. (3) In the third stage, a background transformer network generates the full target image, taking the condition image and the generated foreground image as input. The method applies multi-stage adversarial losses separately to foreground and background generation to overcome the average prediction problem caused by mean squared error, which contributes substantially to generating rich image details.
Skeleton-based action recognition aims to explore the inherent motion characteristics of given skeleton sequences. To model spatiotemporal features efficiently, this thesis proposes a novel model with hierarchical spatial reasoning and temporal stack learning networks, which models spatial structure and temporal dynamics in the manner of spatial-temporal hierarchical perception. Human behavior is accomplished through the coordination of all parts of the body. For example, walking requires the legs to step and the arms to swing to keep the body balanced. The hierarchical spatial reasoning network employs a hierarchical residual graph neural network to mine the spatial structural dependencies of the human body and thus represent spatial features effectively. In addition, the temporal dynamics of human actions play another significant role in human action recognition. The temporal stack learning network models the detailed temporal dynamics of the skeleton sequence. During training, a clip-based incremental loss is proposed to optimize the model, which speeds up convergence and improves performance. Extensive experiments on five challenging benchmarks verify the effectiveness of the proposed method.
The above work shows that exploring the spatial and temporal features of skeleton sequences is vital for action recognition. Despite its significant performance improvement, it ignores the co-occurrence relationship between spatial and temporal features. Considering the abundant body structural information within each skeleton frame, the temporal dependency between different frames, and the co-occurrence relationship between the spatial and temporal domains, this thesis proposes a novel Attention Enhanced Graph Convolutional LSTM Network (AGC-LSTM) for human action recognition from skeleton data. The proposed AGC-LSTM not only captures discriminative features in spatial configuration and temporal dynamics, but also explores the co-occurrence relationship between the spatial and temporal domains. Furthermore, to select discriminative spatial information, an attention mechanism is employed to enhance the information of key joints in each AGC-LSTM layer, which helps improve spatiotemporal representations.
Recent methods for skeleton-based action recognition rely heavily on manual annotations that are costly to acquire. This thesis considers the problem of semi-supervised human action recognition. Self-supervised learning has proved very effective at learning representations of unlabeled data by defining and solving various pretext tasks. This thesis proposes Adversarial Self-Supervised Learning (ASSL), a novel framework that tightly couples self-supervised learning and the semi-supervised scheme via neighbor relation exploration and adversarial learning. Specifically, an effective self-supervised learning scheme is designed to improve the discriminative capability of the learned representations for action recognition by exploring the data relations within a neighborhood. Furthermore, an adversarial regularization is proposed to align the feature distributions of labeled and unlabeled samples. Comparison results confirm its advantage over state-of-the-art semi-supervised methods in the few-label regime for skeleton-based action recognition.
In summary, this thesis focuses on the problems of spatial multi-view perception, spatial-temporal hierarchical perception, spatial-temporal co-occurrence perception, and semi-supervised spatiotemporal perception. It achieves strong performance on human perception tasks such as multi-view human action image synthesis, human action recognition, and semi-supervised human action recognition.
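Illustrative note: the abstract states that the hierarchical spatial reasoning network uses a hierarchical residual graph neural network to capture dependencies between body parts. The sketch below is only a rough illustration of that general idea, not the architecture from the thesis. It assumes a toy 5-joint skeleton, a made-up edge list, plain neighbor averaging in place of learned weights, and hypothetical function names (`normalized_adjacency`, `aggregate`, `residual_step`).

```python
# Toy skeleton with 5 joints. The edges are hypothetical and NOT taken from the thesis:
# 0 = head, 1 = neck, 2 = torso, 3 = left arm, 4 = right arm.
EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]


def normalized_adjacency(num_joints, edges):
    """Build a row-normalized adjacency matrix with self-loops (each row sums to 1)."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_joints)] for i in range(num_joints)]
    for a, b in edges:
        adj[a][b] = 1.0
        adj[b][a] = 1.0
    return [[v / sum(row) for v in row] for row in adj]


def aggregate(features, adj):
    """One message-passing step: each joint averages features over itself and its neighbors."""
    dim = len(features[0])
    n = len(features)
    return [
        [sum(adj[i][j] * features[j][d] for j in range(n)) for d in range(dim)]
        for i in range(n)
    ]


def residual_step(features, adj):
    """Residual variant: add the aggregated neighborhood features back onto the input."""
    agg = aggregate(features, adj)
    return [[x + y for x, y in zip(f, a)] for f, a in zip(features, agg)]


if __name__ == "__main__":
    adj = normalized_adjacency(5, EDGES)
    joints = [[1.0], [2.0], [3.0], [4.0], [5.0]]  # one scalar feature per joint
    print(residual_step(joints, adj))
```

A real model of this kind would replace the fixed averaging with learned weights and nonlinearities and stack several such steps; the sketch only shows how skeleton connectivity lets each joint's feature depend on its neighbors.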
Keywords	multi-view human image synthesis; adversarial learning; human action recognition; human pose; semi-supervised learning
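Language	Chinese
Seven major research directions: sub-direction	Image and video processing and analysis
Document type	Thesis
Item identifier	http://ir.ia.ac.cn/handle/173211/44443
Collection	Pattern Recognition Laboratory (模式识别实验室)
Corresponding author	Si Chenyang (司晨阳)
Recommended citation (GB/T 7714)	Si Chenyang. Human Pose and Action Perception Based on Spatiotemporal Information Analysis[D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2021.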

Files in this item
File name/size	Document type	Version type	Access type	License
Thesis__scy-基于时空信息分析 (28881 KB)	Thesis		Open access	CC BY-NC-SA