CASIA OpenIR > Graduates > Doctoral Dissertations
基于自适应细粒度语义对齐的行人重识别研究 (Research on Person Re-Identification Based on Adaptive Fine-Grained Semantic Alignment)
朱宽
2023-05
Pages: 146
Degree type: Doctoral
Chinese Abstract

Person re-identification (ReID) aims to match all images of the same person captured by different cameras and in different scenes. With the exponential growth of visual surveillance data, person ReID has gradually demonstrated its wide range of application scenarios and significant research value, providing important technical support for security, criminal investigation, and smart cities.

Due to the nature of the task, person ReID inherently faces interference factors such as variations in pose and viewpoint, occlusion by obstacles, and errors of the person detector. These factors lead to one of the most challenging problems in person ReID: semantic misalignment. This has motivated researchers to study semantics-aligned person ReID methods, which first locate local semantic parts and then extract the corresponding local semantic features, achieving semantic alignment at the level of local features. However, existing methods either locate local semantics only coarsely or fail to recognize non-human semantic parts (e.g., backpacks and handbags) that are crucial for person ReID; that is, they cannot achieve adaptive, fine-grained semantic alignment to cope with unpredictable and complex application scenarios. In addition, with the emergence of the vision Transformer architecture, how to exploit its structural advantages and strong adaptability to data to bring new solutions and performance gains for adaptive fine-grained semantic alignment is also a core problem that urgently needs to be solved in person ReID.

Based on deep learning, this dissertation proposes corresponding solutions to the above problems. The main contributions include:

1. To address the problems that existing methods locate local semantics only coarsely and cannot recognize non-human semantics, this dissertation proposes an identity-guided human semantic parsing method for person ReID, which can locate both human body parts and non-human parts at the pixel level using only image-level identity annotations. The method designs a cascaded clustering module to generate pseudo-labels of local semantic parts on the image feature maps. Specifically, for the pixels on the feature maps of all images of the same person, cascaded clustering first divides them into foreground and background according to their activations, and then clusters the foreground pixels into several local semantic parts. The clustering results are then used as pseudo-labels of the local semantic parts to supervise the learning of part estimation. Finally, local features of both human and non-human parts are obtained according to the part locations predicted by the network. The method iterates between pseudo-label generation and network optimization, so that the performance of the two modules increases alternately. During retrieval, only the semantic parts visible in both the query image and the gallery images participate in the similarity computation. Extensive experiments verify the outstanding performance of the method on major benchmarks.

2. To achieve adaptive fine-grained semantic alignment more efficiently, this dissertation proposes a person ReID method based on semantics-consistent horizontal stripes and semantic self-refinement. The semantics-consistent stripe module adaptively partitions the input image into horizontal stripes according to the locations of local semantics, with each stripe corresponding to a specific semantic part. Specifically, the method clusters the rows of the image feature map to obtain row-level pseudo-labels, and then uses these pseudo-labels to learn a row classifier that partitions the image. Likewise, pseudo-label generation and network optimization are performed iteratively so that their performance increases alternately. In addition, a semantic self-refinement module is designed to remove background noise from the stripes online. Specifically, the row classifier outputs, for every pixel, the probability of belonging to a local semantic part (foreground) or the background; the output is called a class activation map. Only the most confident regions in the class activation map are assigned foreground or background pseudo-labels to supervise the learning of semantic self-refinement. Finally, by intersecting the semantics-consistent stripes with the foreground regions, the method obtains pixel-level localization of local semantics and extracts fine-grained local semantic features. Experimental results show that the method not only achieves fine-grained semantic alignment more efficiently but also further improves ReID performance.

3. To address the problem that existing Transformer-based person ReID methods cannot achieve semantic alignment, this dissertation proposes an auto-aligned Transformer that automatically locates local semantic parts at the patch level online and extracts the corresponding local features. First, this dissertation introduces the concept of the ``part token'', a learnable vector that, during the self-attention computation, interacts with only a subset of image patches rather than all of them, and can therefore learn the local feature representation of that patch subset for the Transformer. Then, to adaptively partition the image patches into subsets, this dissertation designs the auto-aligned Transformer (AAformer). Specifically, AAformer regards the part tokens as class prototypes of local semantics and uses a fast optimal transport algorithm to assign image patches to part tokens online, so that patches containing the same semantics are grouped to the same part token. In this way, the method harmoniously integrates local semantic alignment into the self-attention computation. Experimental results verify the effectiveness of part tokens and the performance advantages of the method.

4. To address the problem that existing self-supervised pre-training methods cannot provide fine-grained local semantic features for person ReID, this dissertation proposes a part-aware self-supervised pre-training method tailored to person ReID, which endows the model, through pre-training, with the ability to adaptively extract fine-grained local features. The method first divides a person image into several local areas; local views cropped from the same local area are assigned the same specific part token, while global views cropped from the whole image are assigned all part tokens. The method learns to match the outputs of the same part token from the local and global views; in other words, the part token output from a local view only learns to match the one corresponding part token output from the global views, rather than all part tokens. As a result, each part token can focus on a specific local area and extract fine-grained local features from it. Experimental results show that models pre-trained with this method achieve state-of-the-art performance on major downstream person ReID tasks.

To address the most challenging problem of semantic misalignment in person ReID, this dissertation designs a series of person ReID methods based on adaptive fine-grained semantic alignment, which steadily improve the performance of ReID models and have attracted attention from researchers in the field.

English Abstract

Person Re-Identification (ReID) aims to match all images of a person captured under different cameras and in different scenes. It can provide important technical support for security, criminal investigation, and smart cities. With the exponential growth of visual surveillance data, person ReID has gradually demonstrated its important research implications and wide range of applications.

Due to the characteristics of the task, person ReID naturally has to face interference factors such as person pose variations, viewpoint variations, occlusion by obstacles, and errors of the person detector. These factors lead to one of the most challenging problems in this task: semantic misalignment. Extracting partial features from person images has been validated as effective in alleviating this problem, that is, first locating the positions of local semantics and then extracting the partial features, to realize semantics-aligned feature matching. However, existing methods can only locate local semantics roughly, or cannot identify non-human semantic parts (such as backpacks and handbags) that are very important for person ReID. In other words, existing methods cannot achieve adaptive fine-grained semantic alignment to deal with unpredictable and complex application scenarios. In addition, with the advent of the vision Transformer, how to use its structural advantages and strong adaptability to data to bring new solutions to fine-grained semantic alignment and obtain performance improvements is also an urgent need in current person ReID.

Based on deep learning technology, this dissertation proposes corresponding solutions to the above problems. The main innovations include:

1. To address the problems of rough local semantic localization and the inability to recognize non-human semantics in existing methods, this dissertation proposes the identity-guided human semantic parsing approach (ISP) to locate both human body parts and personal belongings at the pixel level for aligned person ReID with only person identity labels. ISP designs cascaded clustering on feature maps to generate the pseudo-labels of human parts. Specifically, for the pixels of all images of a person, cascaded clustering first groups them into foreground or background and then groups the foreground pixels into human parts. The cluster assignments are subsequently used as pseudo-labels of human parts to supervise the part estimation. Finally, local features of both human body parts and personal belongings are obtained according to the self-learned part estimation. ISP iteratively learns the feature maps and generates the pseudo-labels, so the performance of the two modules increases alternately. In testing, only features of the commonly visible parts are used. Extensive experiments validate the superiority of ISP in performance.
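The cascaded clustering step above can be sketched in a few lines. The following is a minimal, illustrative NumPy version under assumed details: the names `cascaded_clustering` and `kmeans`, the L2-norm activation score, and the `fg_ratio` threshold are stand-ins for this sketch, not ISP's actual implementation.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's k-means, used only to produce cluster assignments."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels

def cascaded_clustering(pixel_feats, k_parts=3, fg_ratio=0.5):
    """Assign pseudo part labels to the pixels of all images of one identity.

    pixel_feats: (N, D) pixel features. The activation (here, the L2 norm)
    decides foreground vs. background; foreground pixels are then clustered
    into k_parts parts. Label 0 = background, labels 1..k_parts = parts.
    """
    activation = np.linalg.norm(pixel_feats, axis=1)
    thresh = np.quantile(activation, 1.0 - fg_ratio)  # top fg_ratio pixels -> fg
    fg_mask = activation >= thresh
    labels = np.zeros(len(pixel_feats), dtype=int)    # 0 = background
    labels[fg_mask] = kmeans(pixel_feats[fg_mask], k_parts) + 1
    return labels
```

In the actual method, these pseudo-labels would then supervise a pixel-wise part estimation head, and pseudo-label generation and training alternate.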

2. To achieve adaptive fine-grained semantic alignment more efficiently, this dissertation proposes the semantics-consistent stripes with foreground refinement (SCS+) algorithm for person re-identification, and makes two contributions. (i) A semantics-consistent stripe learning method (SCS). Given an image, SCS partitions it into adaptive horizontal stripes, each of which corresponds to a specific semantic part. Specifically, SCS first clusters the rows into human parts or background to generate pseudo part labels for the rows. Then, SCS learns a row classifier to partition the person images, supervised by the latest pseudo-labels of the rows. SCS iteratively conducts these two processes, and this iterative scheme makes the accuracy of the two modules increase alternately. (ii) A self-refinement method (SCS+) to remove the background noise in stripes. SCS+ employs the row classifier to generate the probabilities of pixels belonging to human parts (foreground) or background, called the Class Activation Map (CAM). Only the most confident areas of the CAM are assigned foreground/background labels to guide the human part refinement. Finally, by intersecting the semantics-consistent stripes with the foreground areas, SCS+ locates human parts at the pixel level, obtaining a more robust part-aligned representation. Extensive experiments validate that SCS+ not only achieves fine-grained semantic alignment more efficiently, but also further improves the performance of person ReID.
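The refinement step, intersecting the stripes with the confident foreground of the CAM, can be illustrated as follows. The name `refine_stripes`, the single confidence threshold, and the label conventions (-1 = ignored) are assumptions made for this sketch, not SCS+'s actual code.

```python
import numpy as np

def refine_stripes(stripe_of_row, fg_prob, conf=0.8):
    """Intersect semantics-consistent stripes with confident foreground.

    stripe_of_row: (H,) stripe (part) index for each image row.
    fg_prob: (H, W) per-pixel foreground probability (the CAM).
    Returns:
      parts:  (H, W) pixel-level part labels; 0 = background/uncertain,
              p >= 1 inside stripe p where the CAM is confidently foreground.
      pseudo: (H, W) fg/bg pseudo-labels used to supervise refinement;
              -1 = ignored, 0 = confident background, 1 = confident foreground.
    """
    pseudo = np.full(fg_prob.shape, -1, dtype=int)
    pseudo[fg_prob >= conf] = 1          # most confident foreground areas
    pseudo[fg_prob <= 1 - conf] = 0      # most confident background areas
    parts = np.where(fg_prob >= conf, stripe_of_row[:, None] + 1, 0)
    return parts, pseudo
```

Only the confident pixels contribute supervision; the uncertain band in between is left unlabeled, mirroring the "most confident areas" selection described above.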

3. To address the problem that none of the existing Transformer-based ReID methods achieve semantic alignment, this dissertation proposes the Auto-Aligned Transformer (AAformer) to automatically locate both human and non-human parts at the patch level and extract the corresponding partial features. First, this dissertation introduces the ``Part Token ([PART])'', a learnable vector for extracting part features in the Transformer. A part token only interacts with a local subset of patches in self-attention and learns to be the part representation. Then, to adaptively group the image patches into different subsets, this dissertation designs AAformer. AAformer employs a fast variant of the Optimal Transport algorithm to online cluster the patch embeddings into several groups, with the part tokens as their prototypes. AAformer thus harmoniously integrates part alignment into self-attention. Extensive experiments validate the effectiveness of part tokens and the superiority of AAformer over various state-of-the-art methods.
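A toy version of the patch-to-part-token assignment can be written with a Sinkhorn-style balanced normalization. The function names, the plain NumPy setting, and the hard-assignment pooling are illustrative assumptions, not AAformer's implementation, which performs the assignment inside self-attention.

```python
import numpy as np

def sinkhorn_assign(patch_emb, part_tokens, iters=10, eps=0.05):
    """Softly assign patch embeddings to part tokens (their prototypes).

    Alternating column/row normalization (Sinkhorn-Knopp) balances the
    columns so every part token receives a subset of patches; the final
    row normalization makes each patch's assignment sum to 1.
    """
    sim = patch_emb @ part_tokens.T          # (N, K) patch-token similarity
    Q = np.exp(sim / eps)
    for _ in range(iters):
        Q /= Q.sum(axis=0, keepdims=True)    # balance the part tokens
        Q /= Q.sum(axis=1, keepdims=True)    # normalize each patch
    return Q

def part_features(patch_emb, Q):
    """Pool each part token's feature from its assigned patch subset."""
    hard = Q.argmax(axis=1)                  # hard patch-to-part assignment
    K = Q.shape[1]
    return np.stack([
        patch_emb[hard == k].mean(axis=0) if np.any(hard == k)
        else np.zeros(patch_emb.shape[1])
        for k in range(K)
    ])
```

With well-separated patch groups, the balanced assignment sends each group to its own prototype, which is the behavior the part tokens rely on.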

4. To address the problem that existing self-supervised pre-training methods cannot provide fine-grained partial features for ReID, this dissertation proposes a ReID-specific pre-training method, Part-Aware Self-Supervised pre-training (PASS), which generates part-level features to offer fine-grained information and is more suitable for ReID. PASS divides the images into several local areas, and the local views randomly cropped from each area are assigned a specific learnable [PART] token. The global views cropped from the original image are assigned all the [PART] tokens. PASS learns to match the outputs of the local and global views on the same [PART]; that is, the [PART] learned from the local views of a local area is matched only with the corresponding [PART] learned from the global views. As a result, each [PART] can focus on a specific local area of the image and extract fine-grained information from that area. Experiments show that PASS sets new state-of-the-art performance on various ReID tasks such as supervised, UDA, and USL ReID.
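The matching objective can be sketched as a cross-entropy between local-view and global-view [PART] outputs. The name `pass_matching_loss`, the temperature value, and the plain-softmax formulation are simplifications assumed for this sketch; the actual pre-training framework has more moving parts.

```python
import numpy as np

def softmax(x, t):
    """Temperature-scaled softmax over the last axis (numerically stable)."""
    z = (x - x.max(axis=-1, keepdims=True)) / t
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pass_matching_loss(local_parts, global_parts, area_ids, temp=0.1):
    """Match each local view's [PART] output only with the corresponding
    [PART] output of the global view.

    local_parts:  (V, D) one [PART] output per local view.
    global_parts: (K, D) all K [PART] outputs of one global view.
    area_ids:     (V,) index of the area each local view was cropped from.
    """
    target = softmax(global_parts[area_ids], temp)  # corresponding global [PART]
    pred = softmax(local_parts, temp)               # local-view [PART] output
    return -(target * np.log(pred + 1e-9)).sum(axis=-1).mean()
```

The key design choice shown here is the indexing `global_parts[area_ids]`: a local view is pulled toward the one [PART] that owns its crop area, not toward all [PART] tokens, so each token specializes to its area.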

To alleviate the most challenging problem of semantic misalignment in person re-identification, this dissertation designs a series of methods based on adaptive fine-grained semantic alignment, which steadily improve the performance of person ReID models and have attracted attention from researchers in the field.

Keywords: Person Re-Identification; Semantic Alignment; Pseudo-Label Generation; Transformer; Self-Supervised Learning
Language: Chinese
Sub-direction classification (of the seven major directions): Object Detection, Tracking and Recognition
State Key Laboratory planned research direction: Visual Information Processing
Associated dataset to be deposited:
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/51926
Collection: Graduates / Doctoral Dissertations
Recommended citation (GB/T 7714):
朱宽. 基于自适应细粒度语义对齐的行人重识别研究[D],2023.
Files in this item:
File name/size | Document type | Version | Access | License
博士毕业论文_明版_最终版_2_0.pd (6509 KB) | Dissertation | | Restricted access | CC BY-NC-SA