With the development of information technology and the public's growing demand for art appreciation, online platforms such as digital museums now offer large numbers of digitized fine art paintings. This makes paintings easier for the public to access and view, but it also poses challenges for their systematic and intelligent management. Against this background, the painting captioning task has emerged. It aims to automatically generate natural language text from painting images, converting image-modality information into text that describes the content and emotion of a painting. The task has significant theoretical and practical value for applications such as building large intelligent models for painting images and automatic classification and retrieval. However, research on the key technologies for painting captioning is still at a preliminary stage: annotated data is scarce, and content and emotion information is difficult to extract. More in-depth research on training data construction and on content and emotion feature learning is urgently needed.

In recent years, researchers have proposed many machine learning and deep learning methods for multimodal perception tasks such as painting captioning. Among them, parallel learning is a theoretical machine learning framework built on the idea of parallel systems. It defines three main learning methods: descriptive learning, predictive learning, and prescriptive learning. Specifically, parallel learning constructs artificial systems corresponding to the real system through descriptive learning, employs predictive learning to conduct computational experiments in the artificial systems, and then uses prescriptive learning to support data generation and model reasoning for the real system. Parallel learning theory thus offers a potentially feasible approach to training data construction and to content and emotion feature learning in painting captioning.

To address the lack of annotated data and the difficulty of extracting content and emotion features for painting captioning, this dissertation explores methods based on descriptive learning, predictive learning, and prescriptive learning. The main work is as follows:

(1) Painting content captioning based on descriptive learning and virtual-real independent normalization. To cope with insufficient training data in painting content captioning, current research mainly relies on methods such as image-text cross-indexing and template filling, whose output text lacks flexibility and diversity. Descriptive learning can construct artificial systems that generate virtual annotated data from real data. Based on this idea, this dissertation proposes a virtual painting dataset generation method based on style transfer. It uses style transfer as the computational experiment method and builds an artificial system that generates virtual painting images, which alleviates the scarcity of annotated painting datasets and reduces the reliance on real painting training data. Meanwhile, natural images and virtual painting images differ in style, and the resulting gap between their style-feature distributions limits model performance under joint training. Therefore, a virtual-real independent normalization training method is proposed, which extracts and trains features using independent normalization layers for the two kinds of data. Finally, the model is evaluated on a real painting dataset. The experimental results show that it outperforms several mainstream image captioning models, with improvements of 26.08%, 2.78%, and 12.96% in the BLEU4, CIDEr, and SPICE metrics, respectively, compared to the baselines.
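The idea of independent normalization in (1) can be illustrated with a minimal sketch. It assumes a batch-style normalization in which each data domain keeps its own statistics; the class name `IndependentNorm`, the domain labels `"natural"` and `"virtual"`, and the plain-list feature format are hypothetical choices for illustration, not the dissertation's implementation.

```python
from statistics import fmean, pvariance


class IndependentNorm:
    """Normalizes each domain's batch with that domain's own statistics.

    Hypothetical sketch: the two domains never share a mean or variance,
    so a style-induced shift in one domain does not distort the other.
    """

    def __init__(self, domains=("natural", "virtual"), eps=1e-5):
        self.eps = eps
        # Most recent (means, variances) seen for each domain.
        self.stats = {domain: None for domain in domains}

    def __call__(self, batch, domain):
        # batch: list of equal-length feature vectors from a single domain.
        columns = list(zip(*batch))
        means = [fmean(col) for col in columns]
        variances = [pvariance(col) for col in columns]
        self.stats[domain] = (means, variances)
        return [
            [(x - m) / (v + self.eps) ** 0.5
             for x, m, v in zip(row, means, variances)]
            for row in batch
        ]


norm = IndependentNorm()
natural = norm([[0.0, 10.0], [2.0, 14.0]], "natural")
virtual = norm([[100.0, 1.0], [104.0, 3.0]], "virtual")
```

After both calls, each domain's features are centered on zero even though the raw values sit in very different ranges, which is the property that joint training with a single shared normalization layer would not guarantee.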

(2) Painting content captioning based on predictive learning and virtual-real semantic alignment. Paintings often use abstract, distorted, and artistic forms of expression, which makes content features difficult to extract and leaves traditional painting content captioning models with insufficient performance and data utilization efficiency. Predictive learning can exploit computational experiments in artificial systems, together with their interactions with real systems, to improve model performance in parallel systems, which offers a way to address these problems. Based on this idea, and on the consistency of content semantics between a virtual painting image and its corresponding natural image, this dissertation designs a virtual-real semantic alignment loss function and proposes a training method with virtual-real semantic alignment. It then constructs a painting content captioning model with virtual-real semantic alignment, which uses natural image features to improve painting content captioning. Finally, the model is tested in two data-scarce settings: unsupervised and few-shot. Experimental results on public datasets show that the model outperforms several mainstream image captioning models, with improvements of 14.38%, 17.58%, and 16.60% in the BLEU4, CIDEr, and SPICE metrics, respectively, compared to the baselines. The model also shows promising results in data utilization efficiency and model interpretability.
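The abstract does not give the form of the virtual-real semantic alignment loss. As a rough sketch only, one common way to express such a constraint is to penalize the distance between the features of a virtual painting and those of the natural image it was generated from. The mean-squared-distance form, the function names `alignment_loss` and `total_loss`, and the `weight` parameter below are all assumptions made for illustration.

```python
def alignment_loss(virtual_feats, natural_feats):
    """Mean squared distance between paired feature vectors.

    Assumed form: item i of virtual_feats is the virtual painting
    generated from the natural image behind item i of natural_feats.
    """
    if len(virtual_feats) != len(natural_feats):
        raise ValueError("feature lists must be paired one-to-one")
    total = 0.0
    for v, n in zip(virtual_feats, natural_feats):
        total += sum((a - b) ** 2 for a, b in zip(v, n)) / len(v)
    return total / len(virtual_feats)


def total_loss(caption_loss, virtual_feats, natural_feats, weight=0.5):
    """Captioning loss plus a weighted alignment term (hypothetical weighting)."""
    return caption_loss + weight * alignment_loss(virtual_feats, natural_feats)


same = alignment_loss([[1.0, 2.0]], [[1.0, 2.0]])
apart = alignment_loss([[0.0, 0.0]], [[2.0, 0.0]])
combined = total_loss(1.0, [[0.0, 0.0]], [[2.0, 0.0]], weight=0.5)
```

Under this assumed form, identical feature pairs contribute nothing to the loss, so minimizing it pulls painting features toward the content semantics already captured for the natural image.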

(3) Painting emotion captioning based on prescriptive learning and multi-level virtual data. Painting emotion captioning builds on content captioning to reach a deeper understanding of paintings, and it faces challenges such as difficult emotion feature extraction and insufficient training data. This dissertation takes portraiture, an emotionally rich genre, as its case study for painting emotion captioning. Prescriptive learning can improve the performance of machine learning models by guiding the training process with virtual data. Based on this idea, the method is designed around both feature extraction and model training. First, current research mainly uses content-oriented feature extraction methods, which capture too little emotion information. To address this, this dissertation proposes an emotion-information-enhanced feature extraction method for painting images, which fuses facial expression features and human pose features with traditional object features to provide more comprehensive emotion-related information. Second, current research suffers from insufficient training data, which makes models prone to overfitting at the sentence level and the word level. To address this, this dissertation proposes a multi-level virtual-data-guided training method, which generates virtual data from both sentence-level and word-level feedback, based on real painting data and the parameters of the painting emotion captioning model, so that the model achieves effective and robust performance. The model is tested on a publicly available dataset and outperforms several mainstream image captioning models in painting emotion captioning, with improvements of 7.19%, 26.30%, and 31.99% in the BLEU4, CIDEr, and SPICE metrics, respectively, compared to the baselines. In addition, auxiliary verification experiments show that the model has a certain degree of robustness to image corruptions.
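The feature fusion step in (3) can be sketched as follows. The abstract says only that facial expression and human pose features are fused with object features; the choice of concatenation, the per-source L2 normalization, and the names `l2_normalize` and `fuse_emotion_features` are assumptions for illustration.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scales a vector to roughly unit length so no source dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def fuse_emotion_features(object_feat, face_feat, pose_feat):
    """Concatenates normalized object, facial expression, and pose features.

    Concatenation is an assumed fusion strategy, not the dissertation's.
    """
    fused = []
    for part in (object_feat, face_feat, pose_feat):
        fused.extend(l2_normalize(part))
    return fused


fused = fuse_emotion_features([3.0, 4.0], [0.0, 2.0], [5.0])
```

The fused vector keeps the three sources in fixed positions, so a downstream captioning model could learn separate weights for object, expression, and pose cues.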

The research in this dissertation targets the painting image captioning task and addresses the problems of scarce annotated data and the difficulty of extracting content and emotion information. Using the descriptive, predictive, and prescriptive learning methods of parallel learning, it proposes, respectively, a painting content captioning model based on descriptive learning and virtual-real independent normalization, a painting content captioning model based on predictive learning and virtual-real semantic alignment, and a painting emotion captioning model based on prescriptive learning and multi-level virtual data, in order to generate more accurate natural language descriptions of the content and emotion of painting images.

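Keywords: parallel learning; fine art painting; image captioning; content captioning; emotion captioning
GB/T 7714 citation:
鲁越 (Lu Yue). Research on Fine Art Painting Image Captioning Algorithms Based on Parallel Learning [D], 2023.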
