English Abstract | Visual aesthetic assessment of images is a highly challenging cognitive task in computer vision. Its ultimate goal is to enable an artificial intelligence agent to perceive, analyze, and make decisions based on the visual aesthetics of input images. Aesthetic assessment has great value in several applications, such as image retrieval, image generation, image processing, and machine emotional intelligence. Over the past few decades, research on this topic has made substantial progress. Nevertheless, owing to its inherently subjective nature, image aesthetic assessment remains largely an open and challenging research problem.
In this thesis, we attempt to address several fundamental issues in this task: the low efficiency of exploiting image-level aesthetic annotations in visual aesthetic modeling, the lack of proper regularization methods for image aesthetic assessment, and the inability to exploit massive unlabeled images to learn aesthetic-aware visual features. We propose new and effective solutions that mitigate these issues and yield practical knowledge and insights. With practical applications in mind, we also consider two common specific image domains: face images and food photos.
Specifically, the technical contributions of this thesis are as follows:
1. We propose an effective attention-based multi-patch aggregation method. Because image-level aesthetic annotations do not provide full information for reasoning about the visual aesthetics of input images, mainstream methods resort to using other image attributes to assess visual aesthetics. However, these auxiliary attributes rely on expert design or expensive manual annotation, which is neither flexible nor efficient. To make better use of image-level aesthetic annotations, we propose an attention-based multi-patch aggregation method: during optimization, we assign each image patch an adaptive training weight in a data-driven, end-to-end manner, allocating more computational resources to instances with low target confidence. Numerical results on the AVA benchmark indicate that our approach achieves the best performance, even surpassing approaches that use auxiliary information.
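The adaptive weighting idea above can be sketched as follows. This is a minimal illustration, not the thesis's actual model: the softmax-over-loss weighting and the `temperature` parameter are assumed stand-ins for whatever attention mechanism the method actually learns.

```python
import numpy as np

def adaptive_patch_weights(patch_losses, temperature=1.0):
    """Give larger weights to patches with higher loss (i.e., low target
    confidence), so harder patches receive more training signal.
    The softmax form and temperature are illustrative choices."""
    scaled = np.asarray(patch_losses, dtype=float) / temperature
    scaled -= scaled.max()              # subtract max for numerical stability
    w = np.exp(scaled)
    return w / w.sum()

def aggregate_loss(patch_losses, temperature=1.0):
    """Aggregate per-patch losses into one image-level training loss."""
    w = adaptive_patch_weights(patch_losses, temperature)
    return float(np.dot(w, patch_losses))
```

Because the weights concentrate on high-loss patches, the aggregated loss sits above the plain mean, which is what "allocating more computation to low-confidence instances" amounts to in this toy form.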
2. We propose an effective method to exploit unlabeled images for learning aesthetic-aware visual features. Because aesthetic annotation is costly, a desirable idea is to exploit massive unlabeled data to learn aesthetic-aware features. To the best of our knowledge, no prior work has explored this valuable idea. Building on the relation between negative visual aesthetics and several image-editing manipulations, we design an effective self-supervised learning scheme to learn aesthetic-aware features. For comparison, we also experiment with other typical self-supervised methods. Quantitative results indicate that our method outperforms these counterparts on three image aesthetic assessment benchmarks, even surpassing the direct use of the $1000$-way labels of ImageNet. We thereby verify the promise of using unlabeled data for image aesthetic assessment.
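A pretext task of this kind can be sketched as below. This is an assumed illustration: the three degradations (box blur, over-exposure, additive noise) and the idea of predicting which edit was applied are plausible stand-ins, not necessarily the manipulations or objective used in the thesis.

```python
import numpy as np

def degrade(img, op_id, rng):
    """Apply one aesthetics-degrading edit to a grayscale image; the
    pretext label is op_id. The three manipulations are illustrative."""
    img = img.astype(float)
    if op_id == 0:                      # crude 3x3 box blur
        pad = np.pad(img, 1, mode="edge")
        img = sum(pad[i:i + img.shape[0], j:j + img.shape[1]]
                  for i in range(3) for j in range(3)) / 9.0
    elif op_id == 1:                    # over-exposure
        img = np.clip(img * 1.8, 0, 255)
    else:                               # additive Gaussian noise
        img = np.clip(img + rng.normal(0, 25, img.shape), 0, 255)
    return img

def make_pretext_pair(img, rng):
    """Sample a manipulation; a network would then be trained to predict
    op_id (and/or to rank the original above the degraded copy),
    yielding aesthetic-aware features without human labels."""
    op_id = int(rng.integers(0, 3))
    return degrade(img, op_id, rng), op_id
```

The supervision is free: every unlabeled photo yields a (degraded image, edit label) pair, and distinguishing the edits forces the features to be sensitive to aesthetics-relevant factors such as sharpness, exposure, and noise.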
3. We establish the first large-scale dataset for food image aesthetic assessment and propose an effective regularization strategy to improve generalization. At present, only a few studies have addressed food image aesthetic assessment, despite its great practical potential. To support research on this topic, we establish the first large-scale dataset for food image aesthetic assessment, the Gourmet Photography Dataset (GPD), which contains 24,000 food photos with corresponding aesthetic annotations. We also design a simple yet effective regularization method that combats overfitting for better generalization. Extensive experiments demonstrate the value of the GPD and the effectiveness of our regularization method.
4. We evaluate the quality of face alignment results without their corresponding ground truth, via a learning-to-rank approach. Face alignment has been an area of intense research in computer vision. The outputs of various face alignment methods are often image-dependent or somewhat random, so it is desirable to evaluate the quality of face alignment results without ground truth; few studies have addressed this difficult problem. We do so by designing a feasible feature extraction scheme that measures the quality of face alignment results. The features are then used in various machine learning algorithms to rank different face alignment results. Experimental results show that our method is promising for ranking face alignment results and can pick out good ones. With our proposed evaluation model, we can enhance the overall performance of a face alignment method under a random strategy at moderate cost.
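The learning-to-rank step can be sketched with a RankSVM-style linear ranker trained on pairwise comparisons. This is an assumed, simplified stand-in: the features, the linear model, and the perceptron-style hinge updates are illustrative, not the feature scheme or learners used in the thesis.

```python
import numpy as np

def train_pairwise_ranker(feats, scores, lr=0.1, epochs=200, margin=1.0):
    """Learn a weight vector w such that w @ x_i > w @ x_j whenever
    alignment result i is better than result j.
    feats: (n, d) quality features extracted from alignment results.
    scores: ground-truth quality, used only at training time."""
    n, d = feats.shape
    w = np.zeros(d)
    # All ordered pairs (better, worse) derived from training scores.
    pairs = [(i, j) for i in range(n) for j in range(n)
             if scores[i] > scores[j]]
    for _ in range(epochs):
        for i, j in pairs:
            diff = feats[i] - feats[j]
            if w @ diff < margin:       # hinge: update only violated pairs
                w += lr * diff
    return w

def rank_results(w, feats):
    """Return indices of candidate alignment results, best first,
    using only features -- no ground truth needed at test time."""
    return list(np.argsort(-(feats @ w)))
```

At test time only `rank_results` is needed, which is the point: among several candidate alignments of the same face (e.g., from random restarts), the ranker picks a good one without any ground-truth landmarks.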