Research on Art Painting Image Captioning Algorithms Based on Parallel Learning (基于平行学习的艺术绘画图像描述算法研究)
Lu Yue (鲁越)
2023-05-23
Pages: 142
Degree type: Doctoral
Chinese abstract

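With the development of information technology and the public's growing demand for art appreciation, online platforms such as digital museums have come to offer massive numbers of painting images, making it easier for the public to access and view paintings. At the same time, this volume of painting images poses challenges for their systematic and intelligent management. Against this background, the painting image captioning task has emerged. It aims to automatically generate human natural language in text form from painting images, converting image-modality information into text-modality information that describes the content and emotion of a painting. It has important theoretical and practical value for tasks such as building large intelligent models for painting images and automatic classification and retrieval. At present, however, research on the key technologies of painting image captioning is still in its infancy and faces problems such as scarce annotated data and the difficulty of extracting content and emotion information. More in-depth research is urgently needed on training data construction and on content and emotion feature learning.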

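In recent years, researchers have proposed many machine learning and deep learning methods for multimodal perception tasks such as painting image captioning. Among them, parallel learning is a machine learning theoretical framework based on the idea of parallel systems; it defines three main learning methods: descriptive learning, predictive learning, and prescriptive learning. Specifically, parallel learning theory uses descriptive learning to construct an artificial system corresponding to the real system, uses predictive learning to conduct computational experiments in the artificial system, and uses prescriptive learning to facilitate data generation and model inference in the real system. It thereby offers feasible methods for problems in painting image captioning such as training data construction and content and emotion feature learning.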

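Taking painting images as the research object, and targeting the scarcity of annotated data, the difficulty of extracting content features, and the difficulty of extracting emotion features in the painting image captioning task, this dissertation conducts research based on descriptive learning, predictive learning, and prescriptive learning, respectively. The main work is as follows: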

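(1) Painting image content captioning based on descriptive learning and virtual-real independent normalization. To cope with the scarcity of training data in painting image content captioning, current research mainly relies on methods such as image-text cross-indexing and template filling, whose output text lacks flexibility and diversity. Descriptive learning can construct an artificial system from real data and thereby produce annotated virtual data. Based on this idea, this dissertation proposes a virtual painting dataset generation method based on artistic style transfer: style transfer serves as the computational experiment method in the artificial system, and an artificial system for generating virtual painting images is constructed, which alleviates both the scarcity of painting training data and the reliance on real painting training data. Meanwhile, because natural images and virtual painting images differ in style, the gap between their style-feature distributions limits model performance during joint training. To address this, the dissertation proposes a virtual-real independent normalization training method, which uses separate normalization layers for natural images and virtual painting images during feature extraction and training. Finally, the model is evaluated on a real painting dataset. The experimental results show that it outperforms several mainstream image captioning models, improving on the baseline model by 26.08%, 2.78%, and 12.96% in BLEU4, CIDEr, and SPICE, respectively.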

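(2) Painting image content captioning based on predictive learning and virtual-real semantic alignment. Painting images are often abstract, distorted, and artistically stylized, so their content features are hard to extract, and traditional painting image content captioning models fall short in both captioning performance and data utilization efficiency. Predictive learning can use computational experiments in the artificial system, together with their interaction with the real system, to improve model performance in the parallel system, which offers a way to address these problems. Based on this idea, and on the consistency of content semantics between a virtual painting image and its corresponding natural image, this dissertation designs a virtual-real semantic alignment loss function and proposes a virtual-real semantic alignment training method. It then builds a painting image content captioning model with virtual-real semantic alignment that uses natural image features to improve painting content captioning. Finally, the model is tested in two data-scarce settings, unsupervised and few-shot. Experimental results on public datasets show that it outperforms mainstream image captioning models on painting image captioning, with improvements of 14.38%, 17.58%, and 16.60% in BLEU4, CIDEr, and SPICE, respectively. The model also performs well in data utilization efficiency and interpretability.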

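(3) Painting image emotion captioning based on prescriptive learning and multi-level virtual data. Painting image emotion captioning builds on content captioning to reach a deeper understanding of a painting, and it faces two challenges: emotion features are hard to extract, and training data are insufficient. This dissertation takes portraiture, an emotionally rich genre, as its case study. Prescriptive learning can improve model performance by using virtual data to guide the training of a machine learning model. Based on this idea, the dissertation designs methods for both feature extraction and model training. First, current research mainly uses content-oriented feature extraction, which captures too little emotion information. The dissertation therefore proposes an emotion-enhanced feature extraction method for painting images, which fuses facial expression features and human pose features with conventional object features to supply more comprehensive emotion-related information. Second, because training data are insufficient, models tend to overfit at both the sentence level and the word level. The dissertation therefore proposes a training method guided by multi-level virtual data. Based on real painting data and the parameters of the painting emotion captioning model, the method generates virtual data from sentence-level and word-level feedback, helping the model achieve effective and robust emotion captioning. Tested on a public dataset, the model outperforms several mainstream image captioning models on painting image emotion captioning, improving on the baseline model by 7.19%, 26.30%, and 31.99% in BLEU4, CIDEr, and SPICE, respectively. In addition, auxiliary validation experiments show that the model is robust to image perturbations to a certain degree.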

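This dissertation addresses the painting image captioning task and its problems of scarce annotated data and hard-to-extract content and emotion information. Using the descriptive learning, predictive learning, and prescriptive learning methods of parallel learning, it proposes, respectively, a painting image content captioning model with virtual-real independent normalization, a painting image content captioning model with virtual-real semantic alignment, and a painting image emotion captioning model guided by multi-level virtual data, which generate more accurate natural-language descriptions of the content and emotion of painting images.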

English abstract

With the development of information technology and the public's growing demand for art appreciation, online platforms such as digital museums now provide large numbers of digital fine art paintings. This makes paintings easier for the public to access and view, but it also poses challenges for their systematic and intelligent management. Against this background, the painting captioning task has emerged. It aims to automatically generate natural-language text from painting images, converting the image modality into the text modality to describe the content and emotion of paintings. It has significant theoretical and practical value for tasks such as building large intelligent models for painting images and automatic classification and retrieval. However, research on the key technologies of painting captioning is still at a preliminary stage and faces scarce annotated data and difficulty in extracting content and emotion information. More in-depth research on training data construction and on content and emotion feature learning is urgently needed.

In recent years, researchers have proposed many machine learning and deep learning methods for multimodal perception tasks such as painting captioning. Among them, parallel learning is a machine learning theoretical framework based on the idea of parallel systems. It defines three main learning methods: descriptive learning, predictive learning, and prescriptive learning. Specifically, parallel learning theory constructs an artificial system corresponding to the real system through descriptive learning, employs predictive learning to conduct computational experiments in the artificial system, and then uses prescriptive learning to facilitate data generation and model inference for the real system. Parallel learning theory thus provides a potentially feasible approach to the problems of training data construction and content and emotion feature learning in painting captioning.

To address the lack of annotated data and the difficulty of extracting content features and emotion features in painting captioning, this dissertation investigates these three problems using descriptive learning, predictive learning, and prescriptive learning, respectively. The main work is as follows:

(1) Painting content captioning based on descriptive learning and virtual-real independent normalization. To cope with insufficient training data in painting content captioning, current research mainly relies on methods such as image-text cross-indexing and template filling, whose output text lacks flexibility and diversity. Descriptive learning can construct an artificial system from real data and thereby generate annotated virtual data. Based on this idea, this dissertation proposes a virtual painting dataset generation method based on style transfer: style transfer serves as the computational experiment method in the artificial system, and an artificial system for generating virtual painting images is constructed, alleviating the scarcity of annotated painting datasets and the reliance on real painting training data. Meanwhile, because natural images and virtual painting images differ in style, the gap between their style-feature distributions limits model performance during joint training. Therefore, a virtual-real independent normalization training method is proposed, which uses separate normalization layers for natural images and virtual painting images during feature extraction and training. Finally, the model is evaluated on a real painting dataset. The experimental results show that it outperforms several mainstream image captioning models, with improvements of 26.08%, 2.78%, and 12.96% in BLEU4, CIDEr, and SPICE, respectively, compared to the baselines.
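
The idea of normalizing natural and virtual painting features separately can be illustrated with a small sketch. The abstract does not specify the layers used in the dissertation, so everything here is an assumption for illustration: the function names `standardize` and `independent_normalize`, the domain labels, and the choice of plain per-column standardization (in the spirit of batch normalization without learned parameters).

```python
from statistics import fmean, pvariance


def standardize(rows, eps=1e-5):
    """Standardize each feature column of `rows` to zero mean, unit variance."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [(pvariance(col) + eps) ** 0.5 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def independent_normalize(batch, domains):
    """Normalize each domain's samples with that domain's own statistics.

    Hypothetical helper: `domains` holds one label per row of `batch`.
    """
    out = [None] * len(batch)
    for domain in sorted(set(domains)):
        idx = [i for i, d in enumerate(domains) if d == domain]
        for i, row in zip(idx, standardize([batch[i] for i in idx])):
            out[i] = row
    return out


# Toy features: the "virtual" rows are shifted and scaled relative to "natural".
batch = [[0.0, 1.0], [2.0, 3.0], [10.0, 20.0], [14.0, 28.0]]
domains = ["natural", "natural", "virtual", "virtual"]
normalized = independent_normalize(batch, domains)
```

With shared statistics, the offset between the two domains would dominate the normalized values. With per-domain statistics, both groups end up centered at zero, so later layers see comparable inputs from either domain.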

(2) Painting content captioning based on predictive learning and virtual-real semantic alignment. Paintings are often abstract, distorted, and artistically stylized, which makes their content features difficult to extract and leaves traditional painting content captioning models lacking in both performance and data utilization efficiency. Predictive learning can exploit computational experiments in the artificial system and their interaction with the real system to improve model performance in the parallel system, providing an opportunity to address these problems. Based on this idea, and on the consistency of content semantics between a virtual painting image and its corresponding natural image, this dissertation designs a virtual-real semantic alignment loss function and proposes a training method with virtual-real semantic alignment. It then constructs a painting content captioning model with virtual-real semantic alignment that uses natural image features to improve painting content captioning. Finally, the model is tested in two data-scarce settings, unsupervised and few-shot. Experimental results on public datasets show that the model outperforms several mainstream image captioning models, with improvements of 14.38%, 17.58%, and 16.60% in BLEU4, CIDEr, and SPICE, respectively, compared to the baselines. The model also shows promising results in data utilization efficiency and interpretability.
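
As a rough illustration of what an alignment loss between paired features might look like, the sketch below penalizes the cosine distance between the feature vector of a natural image and that of its style-transferred virtual painting. The abstract does not give the form of the dissertation's loss, so the cosine formulation and the names `cosine_similarity` and `alignment_loss` are assumptions, not the author's definition.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def alignment_loss(natural_feats, virtual_feats):
    """Mean cosine distance over paired (natural, virtual) feature vectors.

    Assumed form for illustration: 0 when each pair points in the same
    direction, larger as the paired features diverge.
    """
    pairs = list(zip(natural_feats, virtual_feats))
    return sum(1.0 - cosine_similarity(n, v) for n, v in pairs) / len(pairs)
```

In training, a term like this would typically be added to the captioning loss with a weighting factor, so that the encoder is encouraged to map a painting and its source photograph to similar semantic features.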

(3) Painting emotion captioning based on prescriptive learning and multi-level virtual data. Painting emotion captioning builds on content captioning to reach a deeper understanding of paintings, and it faces challenges such as difficulty in emotion feature extraction and insufficient training data. This dissertation takes portraiture, an emotionally rich genre, as its case study. Prescriptive learning can improve the performance of machine learning models by guiding the training process with virtual data. Based on this idea, the method is designed in terms of both feature extraction and model training. First, current research mainly uses content-oriented feature extraction methods, which capture insufficient emotion information. To this end, this dissertation proposes an emotion-information-enhanced feature extraction method for painting images, which fuses facial expression features and human pose features with traditional object features to provide more comprehensive emotion-related information for painting emotion captioning. Second, current research suffers from insufficient training data, which makes models prone to overfitting at the sentence level and the word level. To this end, this dissertation proposes a multi-level virtual-data-guided training method, which generates virtual data from both sentence-level and word-level feedback based on real painting data and the parameters of the painting emotion captioning model, helping the model achieve effective and robust painting emotion captioning. The model is tested on a publicly available dataset and outperforms several mainstream image captioning models in painting emotion captioning, with improvements of 7.19%, 26.30%, and 31.99% in BLEU4, CIDEr, and SPICE, respectively, compared to the baselines. In addition, auxiliary verification experiments show that the model has a certain degree of robustness to image corruptions.
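
The fusion of facial expression, pose, and object features described in (3) could be sketched as follows. The abstract does not say how the features are fused, so this example assumes the simplest option, concatenation after L2 normalization of each feature vector; the names `l2_normalize` and `fuse_features` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(object_feat, face_feat, pose_feat):
    """Concatenate unit-normalized object, facial expression, and pose features.

    Assumed fusion scheme for illustration only.
    """
    fused = []
    for feat in (object_feat, face_feat, pose_feat):
        fused.extend(l2_normalize(feat))
    return fused
```

Normalizing each vector before concatenation keeps one feature type from dominating the fused vector purely because of its scale.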

This dissertation addresses the painting image captioning task and its problems of scarce annotated data and difficulty in extracting content and emotion information. Using the descriptive learning, predictive learning, and prescriptive learning methods of parallel learning, it proposes, respectively, a painting content captioning model with virtual-real independent normalization, a painting content captioning model with virtual-real semantic alignment, and a painting emotion captioning model guided by multi-level virtual data, which generate more accurate natural-language descriptions of the content and emotion of painting images.
 

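Keywords: parallel learning; art painting; image captioning; content captioning; emotion captioning
Subject area: Artificial intelligence
Discipline category: Engineering
Language: Chinese
Seven major directions, sub-direction classification: Image and video processing and analysis
State Key Laboratory planning direction classification: Fundamental and frontier theory of artificial intelligence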
Is there a thesis-associated dataset that needs to be deposited
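Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/52101
Collection: Graduates_Doctoral dissertations
Recommended citation
GB/T 7714
Lu Yue (鲁越). Research on Art Painting Image Captioning Algorithms Based on Parallel Learning (基于平行学习的艺术绘画图像描述算法研究)[D], 2023.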
Files in this item
File name / size: 201818014628084鲁越.pd (15730KB)
Document type: Thesis
Access type: Restricted access
License: CC BY-NC-SA