Research on Key Technologies of Image Annotation under Different Sample Conditions (不同样本环境下的图像标注关键技术研究)
王方心
2019-12-01
Pages: 120
Degree type: Doctoral
Chinese Abstract

With the continuous advance of digital network and information technology and the rapid development of multimedia technology, massive multimedia data are constantly enriching and changing people's lives. Modeling the relationship between visual and semantic information has long been a central concern in this field. Automatic image annotation analyzes large-scale image collections together with their multiple semantic tags to assign semantic labels to visual objects automatically, and has been widely applied to the semantic retrieval of images and videos. However, previous work has concentrated on the supervised learning framework; because of the limited quantity and quality of data in real applications, such methods cannot fully meet the annotation needs that arise under different sample conditions. This thesis therefore studies key techniques of image annotation under different sample conditions, with the goal of building an image annotation solution with strong extensibility. Specifically, we analyze the characteristics of samples in three settings, the traditional setting, the transductive zero-shot setting, and the inductive zero-shot setting, and for each setting we build a corresponding image annotation model that improves the relevant techniques. The main contributions and innovations of this thesis are as follows:

1. An image annotation model based on prediction paths

In the traditional setting, the samples in the source and target domains are similar and the tag sets are identical. However, in image annotation an image usually corresponds to multiple tags, whereas the visual features extracted by deep networks tend to describe the whole image, so features of some important objects are lost. In addition, most widely used annotation models assume that tags are mutually independent and only model the relationship between images and tags. In fact, tags are correlated to varying degrees, and exploiting the dependencies among tags can further improve annotation performance. We therefore propose an image annotation model based on prediction paths for this setting. On the one hand, it uses self-attention to extract local image features implicitly, which largely solves the problem of extracting object features; on the other hand, through the prediction path the model adaptively captures the dependencies among tags and combines them with the dependencies between images and tags, improving both the accuracy and the efficiency of annotation.
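The abstract does not disclose the concrete architecture of the prediction-path model. The PyTorch sketch below only illustrates the two ideas named above, self-attention over local region features and an autoregressive prediction path over tags; every module, dimension, and hyperparameter in it is a hypothetical choice rather than the thesis implementation.

```python
# Illustrative sketch only: self-attention pools local region features, and a GRU
# "prediction path" emits tags step by step so each prediction can condition on
# the tags already predicted. All shapes and modules are assumed, not taken from
# the thesis.
import torch
import torch.nn as nn


class PredictionPathAnnotator(nn.Module):
    def __init__(self, num_labels, feat_dim=2048, hidden_dim=512, max_steps=5):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)
        # self-attention implicitly highlights informative image regions
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.label_embed = nn.Embedding(num_labels + 1, hidden_dim)   # +1 for a <start> token
        self.decoder = nn.GRUCell(hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_labels)
        self.max_steps = max_steps
        self.start_id = num_labels

    def forward(self, region_feats):
        # region_feats: (B, R, feat_dim), e.g. a CNN feature map flattened into R regions
        h = self.proj(region_feats)
        attended, _ = self.self_attn(h, h, h)              # (B, R, hidden_dim)
        context = attended.mean(dim=1)                     # pooled visual context
        state = context
        prev = torch.full((region_feats.size(0),), self.start_id,
                          dtype=torch.long, device=region_feats.device)
        step_scores = []
        for _ in range(self.max_steps):                    # walk the prediction path
            state = self.decoder(self.label_embed(prev), state)
            logits = self.classifier(state + context)      # image-tag and tag-tag evidence
            step_scores.append(logits)
            prev = logits.argmax(dim=-1)                   # greedy choice of the next tag
        return torch.stack(step_scores, dim=1)             # (B, max_steps, num_labels)
```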

2. A transductive image annotation model based on semantic center alignment

In the transductive zero-shot setting, both the images and the tags differ considerably between the source and target domains, and the relationship between target-domain images and target-domain tags is unknown during training; together these characteristics cause the domain shift problem. To address it, we propose a transductive image annotation model based on semantic center alignment. By building a mapping from the semantic features of target-domain tags to their corresponding visual centers, the model obtains visual features for the target-domain tags, which both guarantees a reliable mapping in the target domain and reduces noise during prediction. Furthermore, considering the distribution of image annotation data, when obtaining the visual features for target-domain tags we generate visual feature centers with an overlapping clustering method and use greedy matching to find the correspondence between the semantic features of target-domain tags and the generated visual centers, which further alleviates domain shift.
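As a purely illustrative sketch of the clustering-and-matching step described above, the NumPy code below generates visual centers with a soft (overlapping) clustering procedure and then greedily pairs each target-domain tag, already projected into the visual space, with its most similar unused center. The soft k-means stand-in, the cosine-similarity criterion, and all names are assumptions; the thesis may use a different overlapping clustering algorithm.

```python
# Sketch under stated assumptions: soft k-means as a stand-in for overlapping
# clustering, plus greedy cosine matching between projected tag vectors and centers.
import numpy as np


def soft_kmeans(feats, k, iters=50, temp=10.0, seed=0):
    """Each visual feature contributes to several centers through soft memberships."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=-1)  # (N, k)
        logits = -temp * d
        logits -= logits.max(axis=1, keepdims=True)        # numerically stable softmax
        w = np.exp(logits)
        w /= w.sum(axis=1, keepdims=True)                  # soft, overlapping memberships
        centers = (w.T @ feats) / (w.sum(axis=0)[:, None] + 1e-8)
    return centers


def greedy_match(sem_proj, centers):
    """Greedily pair each projected tag vector with its most similar free center."""
    sem_n = sem_proj / (np.linalg.norm(sem_proj, axis=1, keepdims=True) + 1e-8)
    cen_n = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-8)
    sim = sem_n @ cen_n.T                                  # (L, k) cosine similarities
    assign, used = {}, set()
    for i, j in sorted(np.ndindex(*sim.shape), key=lambda ij: -sim[ij]):
        if i not in assign and j not in used:
            assign[i] = j
            used.add(j)
    return assign                                          # tag index -> visual-center index


# Hypothetical usage: match 5 projected target-tag vectors to 8 visual centers
feats = np.random.randn(1000, 64)                          # unlabeled target-domain visual features
centers = soft_kmeans(feats, k=8)
sem_proj = np.random.randn(5, 64)                          # target-tag semantics after projection
print(greedy_match(sem_proj, centers))
```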

3. An inductive image annotation model based on contextualized word vectors

In the inductive zero-shot setting, only the source-domain images, the source-domain tags, and the target-domain tags are known, the source and target tag sets are not identical, and the absence of target-domain images aggravates domain shift even further. We therefore propose an inductive image annotation model based on contextualized word vectors. To contextualize the word vectors, the model combines WordNet and Node2Vec to generate contextualized word embeddings. To cope with the domain shift caused by the missing target-domain images and the missing relationship between target-domain tags and images, we treat each source- and target-domain word vector as a vertex in a semantic relation graph and use a graph convolutional network to build a mapping from target-domain word vectors to source-domain image features, thereby obtaining the visual features that correspond to target-domain tags. In addition, to counter the loss of semantic information that arises when the graph convolutional network is trained under these sample conditions, we propose a semantic consistency loss that keeps the semantic relations among tag features unchanged before and after the mapping.
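The following PyTorch sketch illustrates the last two components in isolation: a two-layer graph convolution that maps tag word vectors into the visual feature space, and a semantic consistency term that penalizes changes in pairwise tag similarity across the mapping. It assumes the contextualized word vectors and a normalized adjacency matrix of the WordNet-based tag graph have already been produced (for example by a Node2Vec step); the layer sizes and the concrete form of the loss are illustrative assumptions, not the thesis formulation.

```python
# Illustrative sketch: GCN mapping from word vectors to visual prototypes plus a
# pairwise-similarity "semantic consistency" regularizer. All choices are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class WordToVisualGCN(nn.Module):
    def __init__(self, word_dim=300, hidden_dim=512, visual_dim=2048):
        super().__init__()
        self.w1 = nn.Linear(word_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, visual_dim)

    def forward(self, word_vecs, adj):
        # word_vecs: (L, word_dim) contextualized tag embeddings
        # adj: (L, L) symmetrically normalized adjacency of the tag graph
        #      (assumed to be built from WordNet relations among source/target tags)
        h = F.relu(adj @ self.w1(word_vecs))
        return adj @ self.w2(h)            # predicted visual prototypes, (L, visual_dim)


def semantic_consistency_loss(word_vecs, mapped_vecs):
    """Keep pairwise cosine similarities stable before and after the mapping."""
    a = F.normalize(word_vecs, dim=1)
    b = F.normalize(mapped_vecs, dim=1)
    return F.mse_loss(b @ b.T, a @ a.T)
```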

We evaluate the three models above on several public image annotation datasets. The experimental results show that, compared with previous methods, our methods achieve better performance and that the problems addressed are alleviated to a large extent.

English Abstract

With the rapid development of digital network and multimedia technologies, massive multimedia data have profoundly influenced human life. Modeling the relationship between images and semantic tags is a main concern in this area. Automatic image annotation assigns semantic tags to visual objects by building the relationship between an image and multiple tags, and is widely applied in the semantic search of images and videos. However, previous works mainly focus on research under the supervised learning framework and, because of the limited quality and quantity of data in practice, cannot fully meet the annotation needs under different sample conditions. Therefore, we focus on annotation techniques under different sample conditions, aiming to build a general image annotation system. To this end, we analyze the characteristics of the data in traditional image annotation, transductive zero-shot image annotation, and inductive zero-shot image annotation, respectively, and design three image annotation models accordingly. Our contributions are as follows:

1. An image annotation method based on prediction paths

Under the sample conditions of traditional image annotation, the data are very similar in the source and target domains, and the two domains share the same tags. However, since an image usually corresponds to multiple tags while the features extracted by a deep network tend to describe the whole image, the details of important objects are easily lost. Besides, most previous works regard the tags as independent and build the model only on the relations between images and tags. In fact, the relations among tags can help to further improve performance. We therefore propose a novel image annotation method that applies a self-attention mechanism to extract regional features implicitly and combines the relations between images and tags with the relations among tags described by the prediction path, improving annotation performance.

2. An image annotation method based on semantic center alignment

In transductive zero-shot image annotation, not only do the images and tags differ greatly between the two domains, but the relationship between the target-domain images and tags is also unknown, which leads to the domain shift problem. To solve this problem, we propose a transductive zero-shot image annotation method based on semantic center alignment. In our model, we first build the mapping from semantic features to visual centers on the source domain, then obtain the visual features corresponding to each target-domain tag through an overlapping clustering algorithm, and finally build the relationship between the word vectors and these visual features to train the model. In this way, the target-domain data can also be exploited during training, and the domain shift problem is alleviated.

3. An image annotation method based on contextualized word vectors

In inductive zero-shot image annotation, apart from the strict conditions of the transductive setting, the target-domain images are also unavailable during training, which further increases the challenge of domain shift. Therefore, we propose an inductive zero-shot image annotation model to solve this problem. In our model, we regard all the tags as nodes of the WordNet graph and extract contextualized word vectors with the Node2Vec model, without requiring a task-specific training corpus. We then feed these word vectors into a graph convolutional network to build the relationship between images and tags. Since this builds the mapping from the target-domain tags to the source-domain images, the transfer ability of the model is improved. Besides, we develop a semantic consistency loss to alleviate the loss of semantic information in this procedure.

We test our models on multiple public datasets, and the experimental results demonstrate that the proposed methods perform better than previous approaches and that all the problems of concern are alleviated.

Keywords: image annotation; prediction path; zero-shot; domain shift; word vector contextualization
Language: Chinese
Subdirection classification: Multimodal Intelligence
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/28347
Collection: 数字内容技术与服务研究中心_版权智能与文化计算
Recommended citation (GB/T 7714):
王方心. 不同样本环境下的图像标注关键技术研究[D]. 中国科学院自动化研究所, 2019.
Files in this item:
Thesis_签名.pdf (7308 KB): dissertation, open access, license CC BY-NC-SA