Research on Neural Architecture Search Algorithms for Efficient Image Understanding

CASIA OpenIR

俞宏远 (Hongyuan Yu)

2022-05-21

Pages

146

Degree type

Doctoral (Ph.D.)

Chinese Abstract

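In recent years, deep learning, represented by deep neural networks, has brought a series of breakthroughs in computer vision, and on some tasks such as image recognition and face recognition it has even surpassed human performance. Much of this success stems from the emergence of new neural network architectures. However, designing such architectures depends on the experience and expertise of human experts and requires repeated experimentation, so it is also computationally expensive. With the rapid growth of the Internet of Things, embedded devices such as smartphones and autonomous vehicles place high real-time demands on image understanding algorithms. Although deep neural networks perform very well on many image understanding tasks, their high computational cost makes large-scale deployment in production and everyday settings difficult. How to design efficient neural network architectures matched to different tasks is therefore an important research problem. Motivated by neural architecture search and the practical needs of efficient image understanding, this thesis carries out the following original research; its specific contributions are summarized below: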

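A cyclic differentiable neural architecture search algorithm is proposed. Differentiable search algorithms are far more efficient than traditional neural architecture search algorithms. During search, a differentiable algorithm first finds the best architecture in a shallow search network and then builds a deep evaluation network from that architecture. This makes the search and evaluation processes independent, so the searched architectures tend to perform relatively poorly. To address this, this thesis proposes a cyclic differentiable search algorithm that builds an information feedback path between the search network and the evaluation network: first, the search network produces an initial architecture, from which the evaluation model is built; second, the trained evaluation model serves as a teacher that, together with supervised labels, trains the search model. This cycle repeats until convergence. The approach jointly optimizes the search and evaluation models, alleviating the problems caused by optimizing them independently in differentiable search, and its effectiveness is verified on multiple image classification datasets.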

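A sampling-based self-distillation (introspective distillation) neural architecture search algorithm is proposed. Sampling-based one-shot neural architecture search greatly shortens the time needed to evaluate architecture performance by sharing weights within a super-network. However, weight sharing also tightly couples the parameters of the sub-networks in the super-network, so the sub-networks cannot be fully trained. The performance measured for a sub-network during evaluation then does not reflect its true performance, which in turn reduces search efficiency. This thesis therefore proposes a self-distillation neural architecture search algorithm that, during training, dynamically selects the best-performing sub-network as a teacher model to guide super-network training. This alleviates the low search efficiency caused by insufficient super-network training, and the algorithm's effectiveness is verified on image classification, object detection, and semantic segmentation.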

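A text detection and tracking algorithm based on neural architecture search is proposed. Traditional neural architecture search algorithms usually search on image classification and then transfer the searched architecture to other image understanding tasks. Searching on a single task other than the target task may fail to find the most suitable architecture and cannot fully exploit supervision from multiple tasks. For video text detection, this thesis proposes an end-to-end search algorithm for a video text detection and tracking model. Its feature descriptor module jointly models the detection and tracking tasks, and searching the backbone architecture on top of this module further improves model performance.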

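A multi-objective, multi-branch neural architecture search algorithm for image segmentation is proposed. The search spaces of traditional neural architecture search algorithms usually contain only single-branch models, and the search optimizes performance alone, which limits the application scenarios of the searched models and reduces search efficiency. For segmentation, this thesis proposes a new search space that contains lightweight self-attention operators and lightweight convolution operators and can search over combinations of branches at different resolutions. In addition, the algorithm is multi-objective, covering inference speed, model size, and performance. Unlike previous search algorithms, it finds a series of small, medium, and large models with different branches on the Pareto front, which makes it more flexible in practical applications.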

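Building on the above research, the proposed neural architecture search algorithms achieved state-of-the-art (at the time) or leading accuracy on public datasets for multiple image understanding tasks. In addition, this thesis pays particular attention to the computational efficiency of model architectures: the architectures designed by its algorithms have significant advantages over other methods in computation cost and running speed. Finally, the algorithms and results of this thesis can be extended to other machine learning tasks based on deep neural networks, to further improve target-task performance or reduce computation cost.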

English Abstract

Deep learning techniques, especially deep neural networks, have emerged as a powerful strategy for learning feature representations directly from data and have led to remarkable breakthroughs in many image understanding tasks. Most currently employed deep neural networks are designed by human experts, which is time-consuming and error-prone. At the same time, with the rise of the Internet of Things (IoT), embedded devices such as smartphones and uncrewed vehicles demand low-power, real-time implementations of image understanding algorithms. Although deep neural networks perform very well on many image understanding tasks, their high computational cost makes them difficult to deploy on edge devices. Therefore, how to efficiently design neural network architectures suited to different tasks is an important research topic. This thesis focuses on neural architecture search algorithms for efficient image understanding. Its specific contributions are summarized as follows:

This thesis proposes a cyclic differentiable architecture search algorithm. Differentiable architecture search is far more efficient than traditional neural architecture search. During search, a differentiable algorithm first finds an optimal architecture in a shallow search network and then builds a deep evaluation network from the searched architecture. This practice makes the search and evaluation processes independent of each other, so the resulting architectures tend to perform poorly. To alleviate this problem, the proposed algorithm constructs an information feedback path between the search network and the evaluation network. First, the search network generates an initial architecture for evaluation, and the weights of the evaluation network are optimized. Second, the architecture weights in the search network are further optimized by label supervision in classification, together with regularization from the evaluation network through feature distillation. Repeating this cycle jointly optimizes the search and evaluation networks, so the architecture evolves to fit the final evaluation network. This alleviates the problems caused by optimizing the search and evaluation networks independently in differentiable architecture search, and the method's effectiveness is verified on multiple image classification datasets.
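To make the feedback loop concrete, here is a minimal Python sketch under several assumptions the abstract does not confirm: a DARTS-style softmax relaxation on a single edge, argmax discretization, and mean-squared-error feature distillation. The callables `train_eval_net` and `update_alphas` and the weight `lam` are hypothetical placeholders for full network training, not the thesis's implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, alphas, ops):
    """DARTS-style relaxation: softmax(alphas)-weighted sum of candidate ops on one edge."""
    return sum(w * op(x) for w, op in zip(softmax(alphas), ops))


def derive_architecture(alphas, op_names):
    """Discretize the edge by keeping the op with the largest architecture weight."""
    best = max(range(len(alphas)), key=lambda i: alphas[i])
    return op_names[best]


def search_loss(logits, label, f_search, f_eval, lam=1.0):
    """Label cross-entropy plus MSE feature distillation toward the evaluation network.

    `lam` is a hypothetical weight balancing label supervision and distillation.
    """
    ce = -math.log(softmax(logits)[label])
    distill = sum((a - b) ** 2 for a, b in zip(f_search, f_eval)) / len(f_search)
    return ce + lam * distill


def cyclic_search(alphas, op_names, train_eval_net, update_alphas, num_cycles=3):
    """Alternate: derive architecture -> train evaluation net -> update alphas with it as teacher.

    `train_eval_net` and `update_alphas` are hypothetical stand-ins for real training;
    in a real system `update_alphas` would minimize search_loss by gradient descent.
    """
    history = []
    for _ in range(num_cycles):
        arch = derive_architecture(alphas, op_names)
        teacher = train_eval_net(arch)
        alphas = update_alphas(alphas, teacher)
        history.append(arch)
    return alphas, history
```

The point the sketch tries to capture is that the evaluation network is not a dead end: each cycle's trained evaluation network feeds back into the next update of the architecture weights.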

This thesis proposes introspective distillation for sampling-based one-shot neural architecture search. Sampling-based one-shot methods use weight sharing in a super-network, which greatly shortens the time needed to evaluate candidate architectures. However, weight sharing tightly couples the parameters of the sub-networks in the super-network, so the sub-networks cannot be fully trained. As a result, a sub-network's accuracy during evaluation does not reflect its true performance, which reduces search efficiency. The proposed introspective distillation algorithm dynamically selects the best-performing sub-network during training as a teacher model to guide super-network training. This alleviates the low search efficiency caused by insufficient super-network training and yields significant improvements in image classification, object detection, and semantic segmentation.
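The sketch below illustrates using the best sub-network found so far as a teacher during one-shot super-network training. Uniform path sampling, a Hinton-style temperature-scaled KL distillation loss, and the hypothetical helpers `BestSubnetTeacher`, `evaluate`, and `train_step` are assumptions made for illustration; the thesis may select and apply the teacher differently.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a distillation temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    return kl * temperature ** 2


class BestSubnetTeacher:
    """Hypothetical helper that remembers the best-scoring sub-network seen so far."""

    def __init__(self):
        self.arch = None
        self.score = float("-inf")

    def update(self, arch, score):
        if score > self.score:
            self.arch, self.score = arch, score
        return self.arch


def train_supernet(search_space, evaluate, train_step, steps, seed=0):
    """Uniformly sample sub-networks and distill each from the current best one.

    `evaluate(arch)` returns a validation score; `train_step(arch, teacher_arch)` would
    update the shared weights with a label loss plus kd_loss (skipped when teacher is None).
    Both are hypothetical callables.
    """
    rng = random.Random(seed)
    teacher = BestSubnetTeacher()
    for _ in range(steps):
        # One choice per layer, sampled uniformly from the search space.
        arch = tuple(rng.choice(choices) for choices in search_space)
        train_step(arch, teacher.arch)  # teacher = best sub-network found before this step
        teacher.update(arch, evaluate(arch))
    return teacher
```

Because the teacher is re-selected as training proceeds, the distillation target improves together with the super-network rather than being fixed in advance.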

This thesis proposes a text detection and tracking algorithm based on neural architecture search. Traditional neural architecture search algorithms generally search on image classification and then transfer the searched architectures to other image understanding tasks. Searching on a single task that is not the target task may yield an architecture that is suboptimal for the target task, and it cannot leverage supervision from multiple tasks. For video text detection, this thesis proposes an end-to-end video text detection and tracking model whose feature descriptor jointly models detection and tracking. On this basis, a backbone architecture search is performed, which significantly improves model performance.
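The abstract does not say how the shared descriptor links text instances across frames. One common option, assumed here purely for illustration, is to compare per-instance descriptor vectors between consecutive frames with cosine similarity and match them greedily one-to-one; the function names and the 0.5 threshold are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two descriptor vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def associate(prev_desc, curr_desc, threshold=0.5):
    """Greedy one-to-one matching of text instances between two frames.

    Returns a dict mapping current-frame index -> previous-frame index.
    """
    # All (similarity, prev_index, curr_index) pairs, most similar first.
    pairs = sorted(
        ((cosine(p, c), i, j) for i, p in enumerate(prev_desc) for j, c in enumerate(curr_desc)),
        reverse=True,
    )
    matches, used_prev = {}, set()
    for sim, i, j in pairs:
        if sim < threshold:
            break  # remaining pairs are even less similar
        if i in used_prev or j in matches:
            continue
        matches[j] = i
        used_prev.add(i)
    return matches
```

Current-frame instances left unmatched would start new tracks. Sharing one descriptor between detection and tracking means this matching step reuses features the detector already computes.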

This thesis proposes a multi-objective, multi-branch neural architecture search algorithm for image segmentation. The search spaces of traditional neural architecture search algorithms often contain only single-branch models, and the search takes performance as its only optimization objective, which limits the application scenarios of the searched architectures and reduces search efficiency. This thesis proposes a novel search space for image segmentation that contains a lightweight self-attention module and a lightweight convolution module and can search over combinations of branches at different resolutions. The algorithm is also multi-objective, optimizing model inference speed, model size, and performance. Compared with previous search algorithms, it finds a series of small, medium, and large models with different branch configurations on the Pareto front, which makes it more flexible to select models according to the needs of actual scenarios.
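Multi-objective search returns a set of non-dominated models rather than a single winner. The sketch below, assuming three objectives (latency and parameter count to minimize, segmentation accuracy such as mIoU to maximize), filters candidates to the Pareto front and then picks a model under a latency budget. The `Candidate` fields and all numbers are hypothetical toy values, not results from the thesis.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    latency_ms: float  # lower is better
    params_m: float    # lower is better (millions of parameters)
    miou: float        # higher is better (segmentation accuracy)


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on every objective and strictly better on one."""
    no_worse = a.latency_ms <= b.latency_ms and a.params_m <= b.params_m and a.miou >= b.miou
    better = a.latency_ms < b.latency_ms or a.params_m < b.params_m or a.miou > b.miou
    return no_worse and better


def pareto_front(cands: list) -> list:
    """Keep candidates that no other candidate dominates (a never dominates itself)."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def pick_under_budget(front: list, max_latency_ms: float) -> Optional[Candidate]:
    """Most accurate front model that meets a latency budget, or None if none fits."""
    feasible = [c for c in front if c.latency_ms <= max_latency_ms]
    return max(feasible, key=lambda c: c.miou) if feasible else None


# Toy candidates for illustration only.
cands = [
    Candidate("A", 10, 2, 70),
    Candidate("B", 20, 5, 75),
    Candidate("C", 30, 8, 78),
    Candidate("D", 25, 6, 74),  # dominated by B: slower, larger, and less accurate
]
front = pareto_front(cands)
```

Here the front holds A, B, and C, playing the roles of small, medium, and large models; a deployment budget then selects among them instead of rerunning the search.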

These innovations achieve leading results on multiple public datasets for image understanding tasks. In addition, this thesis focuses on the computational efficiency of neural network architectures; the architectures designed by its algorithms have significant advantages over other methods in computation cost. Finally, the methods in this thesis can also be extended to other machine learning tasks based on deep neural networks to further improve target-task performance or reduce computation cost.

Keywords

Neural architecture search; cyclic differentiable search; self-distillation; multi-task search; multi-objective search

Language

Chinese

Document type

Doctoral dissertation

Item identifier

http://ir.ia.ac.cn/handle/173211/48478

Collection

Institute of Automation, Chinese Academy of Sciences

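Recommended citation
GB/T 7714

Yu Hongyuan. Research on Neural Architecture Search Algorithms for Efficient Image Understanding[D]. Institute of Automation, Chinese Academy of Sciences, 2022.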

Files in this item
File name/size	Document type	Version	Access	License
面向高效图像理解的神经网络结构搜索算法研 (9743 KB)	Dissertation		Open access	CC BY-NC-SA