基于多标签分类的属性识别问题研究 (Research on Attribute Recognition Based on Multi-Label Classification)
李乔哲
2019-12-04
Pages: 100
Degree type: Doctoral
Chinese Abstract

Visual attribute recognition is an important research direction in computer vision and plays a key role in tasks such as pedestrian retrieval, scene understanding, and crowd event analysis. Attribute recognition is essentially a multi-label classification task; therefore, accurately recognizing the multiple attribute labels in an image or video is the central problem of the field. To address this problem, this thesis carries out a series of studies on attribute recognition methods based on multi-label classification. In attribute recognition, different attributes are usually correlated with different visual cues, while a variety of real-world factors severely interfere with the extraction of attribute-relevant features. Hence, how to extract effective visual information from images or videos is the first major difficulty of attribute recognition. As a form of mid-level semantic description, attributes typically exhibit complex visual and semantic relations with one another. These relations complement the visual cues and provide important evidence for recognition. Therefore, how to exploit inter-attribute relations to perform effective relational reasoning is another key issue. In view of these characteristics, this thesis proposes a series of attribute recognition methods from two perspectives, feature representation and attribute relation modeling, and applies them to pedestrian attribute recognition and crowd attribute recognition tasks. The research work can be summarized as follows:

(1) Pedestrian attribute recognition based on visual-semantic graph reasoning. This work formulates attribute recognition, a multi-label classification problem, as a sequential attribute prediction problem and proposes a visual-semantic graph reasoning framework to solve it. A spatial graph describes the spatial relations between different local image regions, a semantic graph describes the latent semantic relations between attributes, and graph convolutional networks perform reasoning on the two graphs respectively. To model visual and semantic relations jointly, an end-to-end architecture embeds the representations of the spatial graph and the semantic graph into each other's nodes so that the two graphs guide each other's learning. Unlike traditional sequential prediction models that rely on recurrent networks to describe latent high-order attribute relations, this work uses graph convolutional networks to describe pairwise attribute relations, which enables a more efficient reasoning process. Experiments verify the effectiveness of the proposed visual-semantic graph reasoning framework.

(2) Pedestrian attribute recognition based on joint visual-semantic reasoning and knowledge distillation. This work focuses on building a more efficient visual-semantic reasoning module and on exploiting human structural knowledge to improve pedestrian attribute recognition. Compared with the previous work, it proposes a more efficient graph-based global reasoning module to model the latent visual-semantic relations between pedestrian attributes. To exploit the latent constraints among attributes, the attributes are first grouped according to their semantics or the body regions they describe. The attribute groups are then modeled on a graph, with each node representing one group. To bridge the gap between visual features and semantic attributes, a projection function maps the visual features corresponding to different attributes onto different graph nodes. By aggregating the visual features of multiple local regions as the representation of each semantic node, different attribute nodes can adaptively associate with the relevant regions. After reasoning, separate classifiers predict the attributes of the corresponding nodes. To make effective use of human structural knowledge, an additional regularization term is added on top of the reasoning module; it guides the visual-semantic reasoning process by distilling human parsing knowledge and enhances the representation ability of the network. Experiments demonstrate the effectiveness of the proposed recognition model.

(3) Sequential crowd attribute recognition based on spatio-temporal attention. Traditional crowd attribute recognition methods usually train the spatial and temporal features of crowd videos with separate network branches, so the spatio-temporal structure of the video cannot be described effectively. In addition, traditional methods cast multi-label crowd attribute recognition as a combination of multiple binary classification tasks and thus ignore the relations between attributes. To address these shortcomings, this work proposes a sequential crowd attribute recognition model based on a spatio-temporal attention mechanism. To describe the spatio-temporal structure of crowd scenes, a convolutional long short-term memory network (ConvLSTM) is used for feature representation. To describe the semantic relations between attributes, as well as the relations between attributes and spatio-temporal features, a sequential prediction model with a bidirectional attention mechanism predicts the crowd attributes in order. Experimental results show that the proposed method clearly outperforms traditional crowd attribute recognition methods.

English Abstract

Visual attribute recognition is an important topic in computer vision and is closely related to a variety of applications, including pedestrian retrieval, scene understanding, and crowd behavior analysis. It is intrinsically a multi-label classification problem; thus, accurately recognizing the multiple attributes present in images or videos is the key problem in attribute recognition. Different attributes may be correlated with different visual cues in images or videos, yet a variety of real-world challenges make effective feature representation difficult to achieve. In addition, attributes exhibit latent spatial (or spatio-temporal) and semantic relations. These relations are complementary to visual cues and can play an important role in recognition. Motivated by these difficulties, we study attribute recognition from two perspectives, feature representation and relational modeling, and propose three methods that are verified on pedestrian attribute recognition and crowd attribute recognition tasks. Our work is summarized as follows:

(1) Visual-semantic graph reasoning model. This work treats pedestrian attribute recognition as a sequential attribute prediction problem and proposes a novel visual-semantic graph reasoning framework to address it. The framework contains a spatial graph and a directed semantic graph. By performing reasoning with Graph Convolutional Networks (GCNs), one graph captures the spatial relations between image regions while the other learns the potential semantic relations between attributes. An end-to-end architecture performs mutual embedding between the two graphs so that they guide each other's relational learning. Unlike existing methods, which employ RNNs to characterize latent high-order dependencies, the proposed method captures pairwise attribute relations with GCNs, allowing more efficient reasoning. Experiments show the superiority of the proposed method over state-of-the-art methods and the effectiveness of the joint GCN structure for sequential attribute prediction.
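The following is a minimal PyTorch sketch of the two-graph reasoning idea described in (1): one GCN layer reasons over region features on a spatial graph, a second GCN layer reasons over attribute nodes on a semantic graph, and pooled visual context is injected into the semantic nodes as a simple stand-in for the mutual-embedding step. All module names, dimensions, adjacency matrices, and the pooling-based embedding are illustrative assumptions, not the implementation used in the thesis.

```python
# Minimal sketch of visual-semantic graph reasoning (assumed structure, not the thesis code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        # h: (N, in_dim) node features, adj: (N, N) normalized adjacency
        return F.relu(self.fc(adj @ h))

class VisualSemanticReasoning(nn.Module):
    def __init__(self, num_regions, num_attrs, dim=256):
        super().__init__()
        self.spatial_gcn = GCNLayer(dim, dim)            # reasoning over image regions
        self.semantic_gcn = GCNLayer(dim, dim)           # reasoning over attribute nodes
        self.attr_embed = nn.Parameter(torch.randn(num_attrs, dim))
        self.classifier = nn.Linear(dim, 1)              # one logit per attribute node

    def forward(self, region_feats, spatial_adj, semantic_adj):
        # region_feats: (num_regions, dim) local visual features
        v = self.spatial_gcn(region_feats, spatial_adj)
        # inject pooled visual context into every semantic node (simplified mutual embedding)
        s = self.semantic_gcn(self.attr_embed + v.mean(0, keepdim=True), semantic_adj)
        return self.classifier(s).squeeze(-1)            # (num_attrs,) attribute logits

# toy usage with identity adjacencies (placeholders for learned/statistical graphs)
model = VisualSemanticReasoning(num_regions=4, num_attrs=10)
logits = model(torch.randn(4, 256), torch.eye(4), torch.eye(10))   # shape: (10,)
```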

(2) Pedestrian attribute recognition model based on joint visual-semantic reasoning and knowledge distillation. This work focuses on how to achieve more efficient reasoning and how to exploit auxiliary human knowledge to boost attribute recognition performance. It presents a more efficient graph-based global reasoning framework that jointly models the potential visual-semantic relations of attributes and distills auxiliary human parsing knowledge to guide the relational learning. The reasoning framework models attribute groups on a graph and learns a projection function to adaptively assign local visual features to the graph nodes. After feature projection, graph convolution performs global reasoning over the attribute groups to model their mutual dependencies. The learned node features are then projected back to the visual space to facilitate knowledge transfer. An additional regularization term distills human parsing knowledge from a pre-trained teacher model to enhance the feature representations. Experiments show that our method achieves state-of-the-art results.
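Below is a condensed PyTorch sketch of the group-based global reasoning step described in (2): local features are softly assigned to attribute-group nodes by a learned projection, a graph convolution mixes information between the groups, and the node features are projected back to the visual space with a residual connection; a generic soft-label KL term stands in for the parsing-knowledge distillation loss. The dimensions, the soft-assignment scheme, and the temperature are assumptions for illustration only.

```python
# Sketch of group-level global reasoning plus a placeholder distillation loss (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupGlobalReasoning(nn.Module):
    def __init__(self, num_groups, dim=256):
        super().__init__()
        self.assign = nn.Linear(dim, num_groups)                        # projection to group nodes
        self.node_mix = nn.Linear(num_groups, num_groups, bias=False)   # graph reasoning over groups
        self.node_update = nn.Linear(dim, dim)

    def forward(self, feats):
        # feats: (L, dim) local visual features, e.g. a flattened conv grid
        B = F.softmax(self.assign(feats), dim=1)                        # (L, G) soft assignment
        nodes = B.t() @ feats                                           # (G, dim) group node features
        nodes = F.relu(self.node_update(self.node_mix(nodes.t()).t()))  # reason over group nodes
        return feats + B @ nodes                                        # project back, residual fusion

def parsing_distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between softened predictions of student and a (hypothetical) parsing teacher
    p = F.log_softmax(student_logits / T, dim=1)
    q = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(p, q, reduction="batchmean") * T * T

# toy usage: refine 49 local features with 6 attribute-group nodes
refined = GroupGlobalReasoning(num_groups=6)(torch.randn(49, 256))      # (49, 256)
```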

(3) Recurrent crowd attribute prediction model based on spatio-temporal attention. Traditional deep methods treat this recognition problem as a set of independent binary classification problems and represent a video by vectorizing and fusing separately learned spatial and temporal features in fully connected layers, so the correlations between attributes may not be well captured. In this work, a bidirectional recurrent prediction model with a semantic-aware attention mechanism is proposed to exploit the spatio-temporal and semantic relations between attributes for more accurate recognition. A ConvLSTM is introduced for feature representation to capture the spatio-temporal structure of crowd videos and to facilitate visual attention. A bidirectional recurrent attention module performs sequential attribute prediction by iteratively associating each subcategory of attributes with semantically related regions. Experiments show that our approach significantly outperforms state-of-the-art methods.
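A compact PyTorch sketch of the recurrent-attention pipeline described in (3) follows: a simplified ConvLSTM cell summarizes the spatio-temporal structure of a clip, and a GRU with spatial attention then predicts attributes step by step. The single-direction GRU, the dimensions, and the plain spatial attention are simplifying assumptions; the thesis describes a bidirectional, semantic-aware attention module.

```python
# Sketch of ConvLSTM encoding followed by recurrent attention-based attribute prediction (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class RecurrentAttributePredictor(nn.Module):
    def __init__(self, in_ch, hid_ch, num_attrs):
        super().__init__()
        self.convlstm = ConvLSTMCell(in_ch, hid_ch)
        self.att = nn.Conv2d(hid_ch, 1, 1)               # spatial attention scores
        self.gru = nn.GRUCell(hid_ch, hid_ch)
        self.cls = nn.Linear(hid_ch, num_attrs)

    def forward(self, clip, steps):
        # clip: (T, C, H, W) per-frame feature maps
        T, C, H, W = clip.shape
        h = c = clip.new_zeros(1, self.convlstm.hid_ch, H, W)
        for t in range(T):                               # spatio-temporal encoding
            h, c = self.convlstm(clip[t:t + 1], (h, c))
        s = clip.new_zeros(1, self.gru.hidden_size)
        logits = []
        for _ in range(steps):                           # sequential attribute prediction
            a = F.softmax(self.att(h).flatten(2), dim=-1)   # (1, 1, H*W) attention weights
            ctx = (h.flatten(2) * a).sum(-1)                # (1, hid) attended context
            s = self.gru(ctx, s)
            logits.append(self.cls(s))
        return torch.cat(logits, dim=0)                  # (steps, num_attrs)

# toy usage: 8 frames of 64-channel 7x7 features, 4 prediction steps over 5 attributes
model = RecurrentAttributePredictor(in_ch=64, hid_ch=128, num_attrs=5)
out = model(torch.randn(8, 64, 7, 7), steps=4)           # (4, 5)
```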

Keywords: Attribute Recognition; Multi-Label Classification; Pedestrian Attribute Recognition; Crowd Attribute Recognition
Language: Chinese
Research sub-direction: Image and Video Processing and Analysis
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/28372
Collection: Graduates - Doctoral Dissertations
Recommended citation (GB/T 7714):
李乔哲. 基于多标签分类的属性识别问题研究[D]. 北京: 中国科学院自动化研究所, 2019.
Files in this item:
File name (size) | Document type | Access type | License
基于多标签分类的属性识别问题研究-李乔哲 (10540 KB) | Thesis | Restricted access | CC BY-NC-SA