Research on Attribute Recognition of Face and Pedestrian Images
谭资昌
2021-05
Pages: 124
Degree type: Doctoral
Chinese Abstract

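Human-centered attribute analysis is a key research topic in computer vision. Depending on the type of input data, it can be divided into face attribute recognition and pedestrian attribute recognition, which take face images and pedestrian images as input, respectively. Face and pedestrian attribute recognition yields high-level semantic descriptions of a target person, such as age, gender, clothing type, and bag type, and is widely used in practical scenarios such as public security and criminal investigation, person retrieval, and human-computer interaction. This dissertation focuses on these two tasks and uses deep learning to study several of their current difficulties systematically and in depth. The main contributions are as follows: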

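1. To address the correlation and similarity between adjacent ages, an age estimation method based on an age group encoding and decoding framework is proposed. First, an age group encoding strategy divides all ages into overlapping, regularly arranged groups, so that adjacent ages fall into the same age group. Next, a multi-output age group classification network classifies images by age group; during classification, images in the same age group are treated as one class, which allows the correlation and similarity between adjacent ages to be explored. Finally, based on the correspondence between ages and age groups defined by the encoding strategy, global and local age decoding strategies are proposed, which decode a fine-grained predicted age from the coarse age group predictions. Local age decoding further improves decoding efficiency without loss of prediction accuracy.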

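2. Because most existing face attribute recognition algorithms emphasize global features and ignore local detail, a face attribute recognition method based on deep hybrid features is proposed. The method introduces a deep hybrid-aligned network with three branches that learn global, local, and global-local features from different face regions (both global and local). It effectively learns the global, holistic information of the face while preserving local detail, which is of great help in improving face attribute recognition performance. Since other face regions may not be fully aligned even when the input image is aligned, an aligned region pooling layer is proposed to produce aligned face region features. In addition, an independent loss function is added to each sub-network to learn independent features, and a recurrent fusion scheme is used to mine the latent correlations between different face regions.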

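3. Because captured pedestrian images often vary widely in pose, camera angle, and occlusion, an attention-based pedestrian attribute recognition method is proposed. The framework introduces three attention mechanisms (parsing, label, and spatial attention) and integrates them into a single network for joint optimization. Parsing attention uses pedestrian parsing to obtain pixel-level parsing information, which guides the extraction of information from the relevant body part regions. Label attention and spatial attention focus on individual attributes and on all attributes, respectively, and learn attention masks that strengthen feature learning in attribute-relevant regions and suppress it in irrelevant ones.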

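4. To address the difficulty of exploring correlations in pedestrian attribute recognition, a pedestrian attribute recognition method that explores correlations with graph convolutional neural networks is proposed. The method introduces an attribute correlation module and a contextual correlation module, which explore multi-attribute correlations and contextual correlations, respectively. Specifically, the attribute correlation module first learns attribute-separated features for each attribute with a constrained loss function, and treats each attribute feature as a graph node to build an attribute graph. The contextual correlation module uses a graph projection strategy to map the two-dimensional planar features to multiple graph nodes, each corresponding to some image regions or pixels. Once the graphs are built, both modules use graph convolutional neural networks to explore correlations, leading to more reliable pedestrian attribute recognition.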
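
The overlapping age group encoding and the two decoding strategies in contribution 1 can be illustrated with a small sketch. The abstract does not give the exact grouping rule, so the scheme below is an assumption made for illustration only: n shifted partitions, each cutting the age axis into runs of n consecutive ages, with one classifier output per partition. The function names (`encode_age`, `one_hot_scores`, `decode_global`, `decode_local`) are hypothetical and are not taken from the thesis.

```python
from typing import List, Sequence


def encode_age(age: int, n: int) -> List[int]:
    """Return one group label per shifted partition (assumed scheme).

    Partition k groups ages into runs of n consecutive ages, shifted by k,
    so neighbouring ages share a group in most partitions.
    """
    return [(age + k) // n for k in range(n)]


def one_hot_scores(age: int, n: int, max_age: int) -> List[List[float]]:
    """Idealized classifier outputs: score 1.0 for the true group, else 0.0."""
    labels = encode_age(age, n)
    scores = []
    for k in range(n):
        num_groups = (max_age + k) // n + 1  # groups needed by partition k
        scores.append([1.0 if g == labels[k] else 0.0 for g in range(num_groups)])
    return scores


def age_score(scores: Sequence[Sequence[float]], age: int, n: int) -> float:
    """Sum the scores of the groups that contain this age."""
    return sum(scores[k][(age + k) // n] for k in range(n))


def decode_global(scores: Sequence[Sequence[float]], n: int, max_age: int) -> int:
    """Global decoding (assumed form): score every candidate age."""
    return max(range(max_age + 1), key=lambda a: age_score(scores, a, n))


def decode_local(scores: Sequence[Sequence[float]], n: int, max_age: int) -> int:
    """Local decoding (assumed form): score only the ages inside the
    top-scoring group of partition 0, which costs n evaluations."""
    top = max(range(len(scores[0])), key=lambda g: scores[0][g])
    candidates = range(top * n, min(top * n + n, max_age + 1))
    return max(candidates, key=lambda a: age_score(scores, a, n))
```

With n = 3, `encode_age(7, 3)` gives `[2, 2, 3]`, and both decoders recover age 7 from the idealized scores. In this sketch the local decoder evaluates only n candidates instead of every age, which is one plausible reading of the efficiency gain described above.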

English Abstract

Human-centric attribute analysis is an important research topic in computer vision. According to the type of input data, it can be divided into Face Attribute Recognition (FAR) and Pedestrian Attribute Recognition (PAR), which take face and pedestrian images as inputs, respectively. High-level semantic descriptions (e.g., age, gender, clothing type, bag type) can be obtained by analyzing the attributes of a face or a pedestrian, which has a wide range of practical applications in public security and criminal investigation, face or person retrieval, human-computer interaction, and so on. In this dissertation, we conduct in-depth and systematic research on several current difficulties in FAR and PAR using deep learning methods. The main contributions are as follows:

1. To exploit the correlation and similarity among adjacent ages, we propose a new age group encoding and decoding framework for facial age estimation. First, an age group-n encoding (AGEn) strategy is proposed to divide ages into different age groups, so that adjacent ages fall into the same age group. Then, a network with multiple outputs is proposed for age group classification. In this stage, images of the same age group are treated as one category, so that the correlation and similarity among adjacent ages can be explored. Finally, we propose Global Age Decoding (GAD) and Local Age Decoding (LAD) to decode the predicted age from the results of age group classification. Compared with GAD, LAD further improves decoding efficiency without losing prediction accuracy.

2. Since most existing studies in FAR emphasize global semantics while ignoring local details, a new framework based on deeply learned hybrid representations is proposed. In this framework, a Deep Hybrid-Aligned Architecture (DHAA) is proposed, which contains three branches that learn global, local, and global-local features from different face regions (including global and local face regions). It not only learns the global information of the face effectively but also retains local details, which is of great help in improving the performance of FAR. Note that even if the input image is aligned, other face regions may not be fully aligned. To generate aligned face features for all regions, an Aligned Region Pooling (ARP) layer is proposed. In addition, an independent loss function is added to each sub-network to learn independent features, and a recurrent fusion method is used to explore potential correlations among different face regions.

3. To address the large variations (e.g., poses, camera angles, and occlusions) in pedestrian images, we propose an attention-based framework for PAR. In this framework, three attention mechanisms are proposed, namely Parsing Attention (PA), Label Attention (LA), and Spatial Attention (SA), and all of them are jointly learned and optimized (denoted as Joint Learning of Parsing attention, Label attention, and Spatial attention, JLPLS). PA first uses a pedestrian parsing network to generate pixel-level parsing maps, which are then used to guide the network to learn discriminative features from important regions. LA and SA, which focus on each attribute and on all attributes, respectively, learn attention masks that enhance the effective features from important regions while suppressing irrelevant features.

4. To capture the relations (including attribute and contextual relations) in PAR, we construct a new network based on Graph Convolutional Networks (GCN), in which an Attribute Relation Module (ARM) and a Contextual Relation Module (CRM) are proposed to learn attribute and contextual relations, respectively. In ARM, constrained losses are employed to learn attribute-specific features, from which an attribute graph is constructed with each node denoting a specific attribute. CRM employs a graph projection scheme to project the 2-D feature map into a set of nodes, with each node representing several image regions or pixels. After the graphs are constructed, both ARM and CRM use GCN to capture relations, which helps the network achieve more reliable recognition.
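
To make the graph convolution step in contribution 4 concrete, here is a minimal sketch of one graph convolution layer applied to a small attribute graph. The abstract does not state which GCN variant the thesis uses, so this assumes the widely used propagation rule H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W) with self-loops and symmetric normalization. The helper names (`matmul`, `normalize_adjacency`, `gcn_layer`) and the toy graph are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj: Matrix) -> Matrix:
    """Return D^{-1/2} (A + I) D^{-1/2} for a non-negative adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # at least 1 because of the self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj: Matrix, features: Matrix, weight: Matrix) -> Matrix:
    """One layer: aggregate neighbour features, apply weights, then ReLU."""
    propagated = matmul(normalize_adjacency(adj), features)
    return [[max(0.0, v) for v in row] for row in matmul(propagated, weight)]


# Toy attribute graph (assumed): two related attributes joined by one edge,
# each node carrying a 2-dimensional attribute-specific feature.
adj = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for illustration only
output = gcn_layer(adj, features, weight)
```

After one layer each node's feature becomes the normalized average of its own and its neighbour's features, which is how information about related attributes (or related image regions, in the contextual case) gets mixed.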

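Keywords: face attribute recognition; pedestrian attribute recognition; deep learning; image recognition
Language: Chinese
Seven major directions, sub-direction classification: Image and video processing and analysis
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/45034
Collection: National Laboratory of Pattern Recognition_Biometrics and Security Technology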
Recommended citation
GB/T 7714
谭资昌. Research on Attribute Recognition of Face and Pedestrian Images [D]. Institute of Automation, Chinese Academy of Sciences, 2021.
Files in this item
File name/size: Thesis-谭资昌.pdf (6904KB)
Document type: Thesis
Access type: Open access
License: CC BY-NC-SA