Other Abstract | Vision-based human perception is a complex research topic of considerable scientific value. According to research by Albert Mehrabian, personal communication is conveyed 7% by spoken words, 38% by tone of voice, and 55% by body language. The representation and understanding of human action is therefore particularly important in vision-based human perception. Although recent studies have made considerable progress, existing methods still have limitations, and there remains substantial room for improvement in both theory and technique.
Spatial and temporal information are the basic constituents of human action. In unconstrained environments, the complex states of the human body in the spatial and temporal domains pose major challenges for vision-based human perception. This thesis focuses on how to design effective methods for learning and understanding the spatial and temporal information of human action. Specifically, it addresses three issues related to the understanding of spatiotemporal information. First, for spatial multi-view perception, it studies pose understanding and representation in multi-view human action image synthesis. Second, based on spatial-temporal hierarchical perception and spatial-temporal cooperative perception, it examines how to automatically extract spatiotemporal features from human action sequences. Finally, because recent human action perception methods rely heavily on manual annotations, it addresses semi-supervised spatiotemporal perception. The main contributions of this thesis are summarized as follows:
- Multi-view human action image synthesis involves spatial multi-view perception in 2D space. The thesis therefore proposes a pose-based human image synthesis method. By analyzing and understanding low-dimensional human pose data, the method addresses the challenges that the variability of human posture poses for human image synthesis. It also adopts a modular network and multistage adversarial learning to ensure the correct appearance of the generated human image. Specifically, the method consists of three networks applied in three stages. (1) In the first stage, a pose transformer network synthesizes the 2D target pose in other views from the condition pose. (2) In the second stage, given the predicted target pose, a foreground transformer network synthesizes the target human foreground from the condition human foreground. (3) In the third stage, a background transformer network generates the full target image, taking the condition image and the generated foreground image as input. The method applies multistage adversarial losses separately to foreground and background generation to overcome the averaged predictions caused by the mean squared error, which contributes substantially to generating rich image details.
- Skeleton-based action recognition aims to explore the inherent motion characteristics of given skeleton sequences. To model spatiotemporal features efficiently, this thesis proposes a novel model that combines hierarchical spatial reasoning and temporal stack learning networks to capture spatial structure and temporal dynamics through spatial-temporal hierarchical perception. Human behavior is accomplished through the coordination of all parts of the body. For example, walking requires the legs to step and the arms to swing to keep the body balanced. The hierarchical spatial reasoning network employs a hierarchical residual graph neural network to mine the spatial structural dependencies of the human body and thereby represent spatial features effectively. The temporal dynamics of human actions also play a significant role in action recognition, and the temporal stack learning network models the detailed temporal dynamics of the skeleton sequence. During training, a clip-based incremental loss is proposed to optimize the model, which speeds up convergence and improves performance. Extensive experiments on five challenging benchmarks verify the effectiveness of the proposed method.
- The above work shows that exploring the spatial and temporal features of skeleton sequences is vital for action recognition. Despite its significant performance improvement, it ignores the co-occurrence relationship between spatial and temporal features. Considering the abundant body structural information within each skeleton frame, the temporal dependency between frames, and the co-occurrence relationship between the spatial and temporal domains, this thesis proposes a novel Attention Enhanced Graph Convolutional LSTM Network (AGC-LSTM) for human action recognition from skeleton data. AGC-LSTM not only captures discriminative features in spatial configuration and temporal dynamics but also explores the co-occurrence relationship between the spatial and temporal domains. Furthermore, to select discriminative spatial information, an attention mechanism enhances the information of key joints in each AGC-LSTM layer, which helps improve the spatiotemporal representations.
- Recent methods for skeleton-based action recognition rely heavily on manual annotations, which are costly to acquire. This thesis therefore considers semi-supervised human action recognition. Self-supervised learning has proved very effective at learning representations from unlabeled data by defining and solving various pretext tasks. This thesis proposes Adversarial Self-Supervised Learning (ASSL), a novel framework that tightly couples self-supervised learning and the semi-supervised scheme via neighbor relation exploration and adversarial learning. Specifically, an effective self-supervised learning scheme is designed to improve the discriminative capability of the learned representations for action recognition by exploring the data relations within a neighborhood. Furthermore, an adversarial regularization is proposed to align the feature distributions of labeled and unlabeled samples. Comparative results confirm its advantage over state-of-the-art semi-supervised methods in the few-label regime for skeleton-based action recognition.
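The adversarial regularization in the last contribution, which aligns the feature distributions of labeled and unlabeled samples, can be illustrated with a generic GAN-style objective. The sketch below is only an assumption about the general shape of such a loss, not the formulation used in the thesis: the linear `discriminator`, its `weights` and `bias`, and both loss functions are hypothetical names introduced here for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def discriminator(feature, weights, bias):
    """Hypothetical linear discriminator.

    Returns the estimated probability that `feature` was extracted
    from a labeled sample (1 = labeled, 0 = unlabeled).
    """
    score = sum(w * f for w, f in zip(weights, feature)) + bias
    return sigmoid(score)


def discriminator_loss(labeled, unlabeled, weights, bias):
    """Binary cross-entropy for telling labeled from unlabeled features."""
    eps = 1e-12  # avoids log(0)
    loss = 0.0
    for f in labeled:
        loss -= math.log(discriminator(f, weights, bias) + eps)
    for f in unlabeled:
        loss -= math.log(1.0 - discriminator(f, weights, bias) + eps)
    return loss / (len(labeled) + len(unlabeled))


def alignment_loss(unlabeled, weights, bias):
    """Adversarial term for the feature extractor (assumed form).

    It is small when unlabeled features look like labeled ones to the
    discriminator, which pushes the two feature distributions together.
    """
    eps = 1e-12
    total = sum(math.log(discriminator(f, weights, bias) + eps) for f in unlabeled)
    return -total / len(unlabeled)


# Toy 2-D features: the discriminator separates samples along the first axis.
w, b = [1.0, 0.0], 0.0
labeled_feats = [[2.0, 0.0], [1.5, 0.3]]
far_unlabeled = [[-2.0, 0.1]]   # clearly different from labeled features
near_unlabeled = [[1.8, 0.2]]   # already close to labeled features
print(alignment_loss(far_unlabeled, w, b), alignment_loss(near_unlabeled, w, b))
```

In a full training loop the two losses would be minimized alternately, the first with respect to the discriminator and the second with respect to the feature extractor. Here the example only shows that the alignment term is larger for unlabeled features that the discriminator can easily tell apart.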
In summary, this thesis focuses on spatial multi-view perception, spatial-temporal hierarchical perception, spatial-temporal co-occurrence perception, and semi-supervised spatiotemporal perception, and it achieves strong performance on human perception tasks such as multi-view human action image synthesis, human action recognition, and semi-supervised human action recognition. |
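As a rough illustration of the residual graph reasoning over the human skeleton mentioned in the contributions above, the following sketch applies one residual graph-convolution layer to per-joint features. It is a minimal example under stated assumptions, not the architecture of the thesis: the row-normalized adjacency, the layer form H' = H + ReLU(A H W), the three-joint toy skeleton, and all function names are hypothetical.

```python
def normalized_adjacency(num_joints, bones):
    """Build a row-normalized adjacency matrix with self-loops.

    `bones` is a list of (joint_a, joint_b) index pairs.
    """
    adj = [[1.0 if i == j else 0.0 for j in range(num_joints)]
           for i in range(num_joints)]
    for a, b in bones:
        adj[a][b] = 1.0
        adj[b][a] = 1.0
    return [[v / sum(row) for v in row] for row in adj]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def residual_graph_layer(features, adj, weight):
    """One residual graph-convolution step: H' = H + ReLU(A H W).

    `features` holds one row per joint. `weight` must be square so the
    residual sum is well defined.
    """
    aggregated = matmul(adj, features)       # average over neighboring joints
    transformed = matmul(aggregated, weight)  # shared linear map
    return [[x + max(0.0, t) for x, t in zip(x_row, t_row)]
            for x_row, t_row in zip(features, transformed)]


# Toy chain skeleton: joint 0 - joint 1 - joint 2, one feature per joint.
adj = normalized_adjacency(3, [(0, 1), (1, 2)])
feats = [[1.0], [0.0], [0.0]]
out = residual_graph_layer(feats, adj, [[1.0]])
print(out)
```

After one layer, information from joint 0 has reached its neighbor joint 1 but not joint 2. Stacking such layers, or applying them to body parts at different levels of granularity, is one way to let features propagate across the whole skeleton.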