Neural Network Architecture Design for Computer Vision
李志航
2021-05-22
Pages | 158 |
Subtype | Doctoral
Abstract | Neural network architecture design has been one of the most active research directions in machine learning and computer vision in recent years. Effective network architectures have greatly advanced downstream tasks such as classification, recognition, and detection. For example, residual networks made it possible to train networks with hundreds of layers, surpassing human-level accuracy on large-scale image classification datasets. Deep learning models have become the engine of visual recognition, and neural architecture design plays a crucial role in their performance. However, neural architecture design still faces many theoretical and practical challenges: designing customized architectures for domain-specific problems remains difficult, automated neural architecture search suffers from slow search speed, and the robustness of current architectures in uncontrolled scenarios still lags far behind that of humans. To address these challenges, this thesis studies neural architecture design and its application in open-world scenarios. The main contributions are as follows:

1. A Gaussian-process-based neural architecture search framework is proposed to reduce the computational complexity of architecture search. The framework uses Gaussian process theory to model the distribution of model performance conditioned on a given architecture: the correlation between performance and architecture is modeled by the mean function, while the kernel function measures the correlation between any two architectures in the search space. To accelerate the estimation of the Gaussian process hyperparameters, a sampling strategy based on mutual information maximization is further proposed; it enlarges the distance between sampled architectures, and it is proved theoretically that the hyperparameters can be estimated efficiently with a minimal number of sub-networks. An alternating optimization strategy is then introduced, which iteratively updates the posterior distribution of the parameters and guides the model to sample and optimize step by step. As a result, the model transfers well: hyperparameters learned on a small dataset can be transferred to large-scale datasets, greatly reducing search time. The trained model can estimate the performance of an arbitrary architecture, and the search process is decoupled from parameter estimation, so when the search objective changes there is no need to retrain sub-networks or the search model. Experiments show that the framework greatly shortens search time and finds models on classification and recognition tasks that are more effective than hand-designed networks.

2. A simulated-annealing-based feature pyramid architecture search model is proposed, enabling architecture search directly on large-scale object detection tasks. Because the design space of feature pyramids in object detection is huge, and the optimal pyramid often differs across tasks and datasets, a data-driven framework for automatic feature pyramid design is proposed. The framework first designs a hierarchical search space consisting of an outer topology search and an inner fusion-cell search: the outer search focuses on the selection of input features, while the inner cell searches the feature fusion operations, including parameter-free pooling, element-wise addition, and various convolutions. This search space also covers existing hand-designed structures. Second, to search efficiently on large-scale datasets, a probability-based simulated annealing algorithm is proposed, in which a temperature parameter dynamically adjusts the convergence speed and helps the optimizer avoid local minima. The method is a general pyramid-network search framework applicable to mainstream one-stage and two-stage detectors. Experiments show that, for different input resolutions and multiple backbone architectures, the method adaptively finds better structures within only 2.2 GPU days and further improves detector performance.

3. Two handcrafted neural network architectures are proposed for face detection and face completion in extreme scenarios. The first is a feature pyramid structure with enriched semantic features that improves the representational power of face detection. Since faces of different sizes are detected on features with different receptive fields, shallow features lack semantic information while deep features lack spatial detail. The proposed attention-based multi-level feature aggregation architecture connects shallow and deep features through skip connections, introduces importance weights for each level, and adaptively fuses features across levels. The method achieves state-of-the-art results on four standard face detection benchmarks, especially for small faces. The second is a face completion algorithm for occluded faces based on feature disentangling and fusing, built on an encoder-decoder structure. In the disentangling network, an occluded face is disentangled by the encoder into latent representations of the clean face and the occlusion, and the decoder progressively restores image resolution from the latent features. For arbitrary clean faces and occlusion images, the fusing network learns to combine the two into the corresponding occluded face. Disentangling and fusing are unified in a dual learning framework and trained in an unsupervised manner. Experiments verify the effectiveness of the proposed framework both qualitatively and quantitatively.
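As a rough illustration of the predictor at the core of contribution 1, the hedged sketch below models accuracy as a Gaussian process over fixed-length architecture encodings and greedily samples where predictive uncertainty is highest, a common proxy for mutual-information-maximizing sampling. The integer encoding, the RBF kernel, and the `train_and_eval` stub are illustrative assumptions, not the thesis implementation.

```python
# Hedged sketch of a GP-based architecture performance predictor.
# Assumptions (not from the thesis): architectures are fixed-length
# integer vectors, an RBF kernel is used, and greedy max-variance
# sampling stands in for mutual-information maximization.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def train_and_eval(arch):
    """Placeholder: train sub-network `arch`, return its accuracy."""
    rng = np.random.default_rng(abs(hash(arch.tobytes())) % (2**32))
    return float(rng.uniform(0.85, 0.95))  # stand-in for real training

# Candidate pool: each row encodes one architecture (e.g. op per layer).
pool = np.random.default_rng(0).integers(0, 4, size=(200, 10)).astype(float)

gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(length_scale=2.0),
                              alpha=1e-3, normalize_y=True)

X, y = [], []
idx = 0  # seed with one architecture, then sample informatively
for _ in range(8):
    X.append(pool[idx]); y.append(train_and_eval(pool[idx]))
    gp.fit(np.array(X), np.array(y))
    _, std = gp.predict(pool, return_std=True)
    idx = int(np.argmax(std))  # most uncertain = most informative next sample

mean, _ = gp.predict(pool, return_std=True)
print("predicted-best architecture:", pool[int(np.argmax(mean))])
```

Because the predictor is cheap to query once fit, changing the search objective only requires re-ranking the pool, not retraining sub-networks, which mirrors the decoupling of search from parameter estimation described above.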
Other Abstract | Neural architecture design is one of the most significant research directions in machine learning and computer vision. Over the past several years, better architectures have resulted in considerable progress in image classification, object detection, semantic segmentation, and other tasks. For example, ResNet successfully trains neural networks with 100 and even 1000 layers, surpassing human performance in large-scale image classification. Deep convolutional neural networks have been the engine of visual recognition, which is greatly attributed to effective neural network architectures. However, there still exist many challenges in the theory and application of neural network architecture design, such as customized network design for particular domains, the expensive computation of neural architecture search, and the robustness of neural networks in hard cases. To address these problems, we investigate neural architecture design from both handcrafted and automated perspectives, as well as its application in uncontrolled scenarios. The main work of this thesis is summarized as follows:

1. We present a novel Gaussian Process based neural architecture search (GP-NAS) to reduce the computational complexity of architecture search. A Gaussian Process (GP) is introduced to model the distribution of performance conditioned on given architectures. Under this framework, the correlation between performance and architectures is modeled by the mean function of the GP, while the correlation between different architectures is measured by the kernel function. To speed up the search process, we further propose an efficient mutual-information-based sampling method, which intuitively enlarges the distance between obtained samples. We theoretically prove that, with only a small set of samples, we can obtain a highly accurate universal performance predictor for a given search space. GP-NAS is solved in an alternating estimation manner that recursively updates the posterior distribution of the learnable hyperparameters, enabling us to approach the optimal distribution of performance conditioned on given architectures. This further ensures the good transferability of the GP-NAS framework: after updating the hyperparameters on the CIFAR-10 dataset, we can use them as prior knowledge for the learning problem on ImageNet, which enables the distribution of hyperparameters to adapt quickly to the new problem with fewer samples. GP-NAS disentangles the training and search processes, so effective deep models can be efficiently deployed for different tasks and platforms without retraining or re-searching. Experiments demonstrate that the network architectures searched by GP-NAS achieve competitive results on CIFAR-10 and ImageNet with high efficiency.

2. We present a Simulated Annealing based Network Architecture Search method (SA-NAS) to automatically search a feature pyramid network for object detection directly on the challenging COCO dataset. Since the design space of the feature pyramid network is so large that the optimal structure for varied tasks is hard to find by hand, we propose a data-driven framework that searches the feature pyramid network automatically. To search the in-cell structure and the outer topology at the same time, we design a new combinatorial search space for the feature pyramid network. The outer search focuses on the selection of input features, while the inner cell search focuses on the fusing operations, including pooling, element-wise summation, and convolution. Our search space is general and includes most handcrafted feature pyramid networks. To search architectures directly on the COCO dataset at affordable cost, we introduce a fast Simulated Annealing (SA) algorithm. The convergence speed of the algorithm can be dynamically adjusted by the temperature hyperparameter, which helps the optimizer avoid falling into local minima. SA-NAS is a general feature pyramid network search framework that can be directly applied to current mainstream one-stage and two-stage detectors. Experiments on COCO demonstrate that our AutoDet spends only 2.2 GPU days and outperforms other state-of-the-art one-stage and two-stage approaches under different input resolutions and backbones.

3. We propose two neural networks for tiny face detection and occluded face completion in uncontrolled scenarios. The first is an attention-guided, semantically enriched feature aggregation architecture that improves the feature representation for face detection. Since objects of various scales are detected on distinct layers, the features of shallower layers are semantically deficient while high-level feature maps lack spatial details. We present a multi-level feature aggregation framework that directly integrates the shallower and deeper layers through skip connections. In addition, an attention mechanism is employed as a gate to emphasize relevant features and suppress useless ones during fusion. Extensive experiments across different aggregation architectures on four challenging face detection benchmarks demonstrate the superiority of our framework over state-of-the-art methods. The second is a pair of disentangling and fusing networks for face completion under structured occlusions, which adopt an encoder-decoder structure. In the disentangling network, an occluded face is encoded into disentangled representations by an encoder, and two decoders then generate the corresponding clean face and occlusion, respectively. For any samples in the domains of clean faces and occlusions, the fusing network simply concatenates their latent representations and synthesizes the corresponding occluded face. The disentangling and fusing processes are unified into a dual learning framework and trained in an unsupervised manner. Quantitative and visual evaluations demonstrate the effectiveness of the method for face completion under structured occlusions.
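The annealing loop that SA-NAS builds on follows the classic Metropolis acceptance rule. The sketch below, with an invented one-dimensional `evaluate` cost and a geometric cooling schedule, only illustrates how the temperature parameter trades exploration against convergence; it is not the thesis's detector-search code, where the state would be a feature pyramid architecture and the cost its negative detection accuracy.

```python
# Minimal simulated annealing over a discrete search space.
# `evaluate` and `mutate` are illustrative placeholders, not the
# SA-NAS objective or mutation operators.
import math, random

def evaluate(state):
    """Placeholder cost: distance from an arbitrary 'optimal' config."""
    return sum((s - 2) ** 2 for s in state)

def mutate(state):
    """Randomly change one decision in the encoded architecture."""
    s = list(state)
    s[random.randrange(len(s))] = random.randrange(4)
    return s

state = [random.randrange(4) for _ in range(10)]
cost = evaluate(state)
T, cooling = 1.0, 0.995

for step in range(5000):
    cand = mutate(state)
    c = evaluate(cand)
    # Always accept improvements; accept worse candidates with
    # probability exp(-delta / T), so high temperature = exploration,
    # low temperature = convergence.
    if c < cost or random.random() < math.exp(-(c - cost) / max(T, 1e-8)):
        state, cost = cand, c
    T *= cooling  # geometric cooling schedule

print("best found:", state, "cost:", cost)
```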
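For the attention-guided aggregation in contribution 3, the sketch below shows one common way to realize "learned per-level importance" fusion: project pyramid features to a shared width, resize them to a common resolution, weight them with softmax-normalized learnable scalars, and sum. The channel counts, 1x1 projections, and nearest-neighbor upsampling are assumptions for illustration, not the exact thesis module.

```python
# Hedged sketch of adaptive multi-level feature fusion with learned
# per-level importance weights. Shapes and projections are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    def __init__(self, in_channels, out_channels, num_levels):
        super().__init__()
        # Project every pyramid level to a shared channel width.
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])
        # One learnable importance scalar per pyramid level.
        self.level_weights = nn.Parameter(torch.zeros(num_levels))

    def forward(self, feats):
        # Upsample all levels to the spatial size of the finest map.
        target = feats[0].shape[-2:]
        aligned = [F.interpolate(p(f), size=target, mode="nearest")
                   for p, f in zip(self.proj, feats)]
        w = torch.softmax(self.level_weights, dim=0)  # importance weights
        return sum(wi * fi for wi, fi in zip(w, aligned))

# Toy pyramid: three levels with decreasing resolution.
feats = [torch.randn(1, c, s, s) for c, s in [(64, 64), (128, 32), (256, 16)]]
fused = WeightedFusion([64, 128, 256], 128, 3)(feats)
print(fused.shape)  # torch.Size([1, 128, 64, 64])
```

An attention gate as described in the abstract could replace the scalar weights with spatial attention maps; the scalar variant is shown only because it is the smallest self-contained instance of adaptive fusion.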
Keyword | Neural Network Architecture Design; Neural Architecture Search; Image Classification; Object Detection
Language | Chinese
Sub direction classification | Image and Video Processing and Analysis
Document Type | Thesis
Identifier | http://ir.ia.ac.cn/handle/173211/44400 |
Collection | Graduates - Doctoral Dissertations
Corresponding Author | 李志航 |
Recommended Citation GB/T 7714 | 李志航. 面向计算机视觉的神经网络架构设计[D]. 中国科学院自动化研究所. 中国科学院大学,2021. |
Files in This Item:
File Name/Size | DocType | Version | Access | License
明版.pdf (35133KB) | Thesis | | Restricted Access | CC BY-NC-SA