基于深度学习的多媒体数据感知与计算研究 (Research on Deep Learning Based Multimedia Data Perception and Computing)
Author: 杨小汕 (Yang Xiaoshan)
Date: 2016-05-28
Degree type: Doctor of Engineering
Abstract (Chinese): With the development of the mobile Internet, more and more smart devices are being connected to the Internet, which has greatly simplified the way users obtain and share information online. Against this background, a huge amount of media data, such as images, text and videos, has been uploaded by users to Web 2.0 social websites. The spread of these multimedia data accelerates the circulation of information, connects users around the world and lowers the cost of communication. For users and social media sites, however, managing, retrieving and analyzing these data remains an unsolved problem, because Web multimedia data are (1) cross-platform, (2) multi-modal, (3) separated by a "semantic gap" between low-level features and high-level semantics, and (4) noisy and incomplete. Faced with these complex characteristics, more effective data perception and computing methods are needed to extract and mine the useful information contained in the data. Existing multimedia analysis methods, however, still rely on contextual annotations or hand-crafted features and cannot truly perceive and understand the data content.
Starting from these four characteristics of Web multimedia data (cross-platform, multi-modal, the semantic gap, and heavy noise), and building on deep neural networks (mainly denoising autoencoders, convolutional neural networks and recurrent neural networks), which in recent years have achieved breakthrough progress on unstructured data such as images and speech, this thesis learns more effective feature representations for Web multimedia data analysis so that computers can better understand multimedia content, and applies these representations to the recognition and discovery of social events. Compared with existing methods, the main contributions of this thesis fall into the following six areas:
1. Cross-platform feature representation learning. The platform discrepancy of Web multimedia data is formulated as the feature distribution discrepancy between different domains in transfer learning, and boosted deep learning is used to reduce this discrepancy. The boosted deep learning algorithm combines the ideas of traditional boosting and deep feature learning. As the boosting iterations proceed, new samples are selected according to the sample distribution to train new feature representations, yielding shared representations that better reduce the discrepancy between source-platform and target-platform data. After the iterations finish, multiple feature representations and multiple weak classifiers are combined to classify test samples.
2. Multi-modal cross-platform feature representation learning. A unified feature learning framework that fuses multi-modal and cross-platform characteristics is proposed. By adding a modality correlation constraint and a platform consistency constraint to the same layer of a denoising autoencoder, the robustness of feature learning is effectively improved. The denoising autoencoder with multi-modal and cross-platform constraints can be solved efficiently in a marginalized form.
3. Semantic attribute learning for images. To address the semantic gap between the low-level features and the high-level semantics of multimedia data, a relative attribute learning algorithm based on deep convolutional neural networks is proposed. Within the neural network framework, the visual features of images are trained under a ranking loss that encodes relative attribute values. The ranking loss contains a contrastive constraint and a similarity constraint, corresponding to image pairs with different attribute values and image pairs with the same attribute value, respectively.
4. Semantic attribute learning for event videos. To construct the most effective visual attribute features for specific events in videos, an automatic visual semantic attribute learning algorithm is proposed. The text descriptions of videos are analyzed and segmented into phrases, and semantic attributes are mined automatically by computing the semantic stickiness of the phrases. Visual semantic attributes are then obtained by computing the visual representativeness of these attributes on an auxiliary Web photo dataset. Boosting and denoising autoencoders are used to select the visual semantic attributes most useful for event recognition, and the visual semantic representation of a test video is built from multiple feature representations and multiple attribute classifiers.
5. Semantic feature learning for event videos. A mapping function that generates semantic feature vectors from videos is learned from videos and their text descriptions. To this end, an embedding convolutional neural network is proposed that maps videos and their corresponding texts into the same semantic feature space, in which the distance between the semantic feature vectors of related videos and texts is minimized. The embedding convolutional network consists of two branches of neural networks, one for video feature representation and one for text feature representation. This method works well when the number of training videos is limited.
6. Social event analysis in Web photos. Temporal information is introduced into image-based event analysis, and event analysis is formulated as a temporal structured prediction problem. Temporal feature representations of events are obtained with recurrent and convolutional neural networks to reduce intra-class variation. A discriminative structured event model for multi-class event recognition, based on a discrete conditional random field, is proposed to mitigate inter-class confusion, and a one-class structured event model for discovering uncommon events, based on a continuous conditional random field, is proposed to alleviate sample scarcity. In these event models, the conditional random field serves as the loss function that constrains the training of the recurrent and convolutional neural networks in a unified framework.
Abstract (English): With the impressive progress of the mobile Internet, more and more smart devices have been connected, directly or indirectly, to the Internet, which greatly facilitates information sharing and propagation. As a result, a huge amount of social multimedia content, such as text, photos and videos, has been uploaded by users to Web 2.0 sites. The propagation of these multimedia data speeds up information exchange, connects users all over the world and reduces the cost of communication. However, due to the intrinsically complex characteristics of multimedia data, including (1) cross-platform distribution, (2) multi-modality, (3) the semantic gap between low-level features and high-level semantics, and (4) noisiness and incompleteness, it remains extremely difficult for users or social media sites to organize, retrieve or analyze these data. Given these characteristics, effective data sensing and understanding methods are needed to extract and mine useful information from them. Existing methods, however, rely mainly on contextual information and human-designed features, and still cannot truly sense and understand multimedia content.
In view of these four characteristics, this thesis aims to learn more effective feature representations for multimedia data analysis and understanding based on deep neural networks (including denoising autoencoders, convolutional neural networks and recurrent neural networks), which have achieved breakthrough progress in recognition tasks on unstructured data such as images and speech. We also apply the learned representations to social event recognition and discovery. Compared with previous methods, the main contributions of the thesis are as follows.
1. Cross-platform feature representation learning. The discrepancy of multimedia data across platforms is formulated as the domain shift between different domains in transfer learning, and is reduced with boosted deep feature learning. In each iteration of the boosting framework, features are learned from samples selected according to the current sample weights. The derived shared features reduce the discrepancy between the data distributions of the source platform and the target platform. After the iterations finish, the multiple learned features and weak classifiers are combined to predict the label of a new sample.
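As a rough, self-contained illustration of this boosted feature-learning loop (an assumed sketch, not the thesis's implementation): in the Python snippet below a PCA projection and a decision stump stand in for the deep feature learner and the weak classifier, binary labels in {0, 1} are assumed, and the weight update follows a standard AdaBoost-style rule.

# Illustrative sketch only: PCA + decision stumps replace the deep feature
# learner and weak classifiers actually used in the thesis.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def boosted_feature_learning(X, y, n_rounds=5, sample_frac=0.7, n_components=32):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # sample weights, uniform at start
    rounds = []                                   # (feature map, weak classifier, vote)
    for _ in range(n_rounds):
        # draw a training subset according to the current weight distribution
        idx = np.random.choice(n, size=int(sample_frac * n), replace=True, p=w)
        feat = PCA(n_components=min(n_components, X.shape[1])).fit(X[idx])
        clf = DecisionTreeClassifier(max_depth=1).fit(feat.transform(X[idx]), y[idx])
        miss = clf.predict(feat.transform(X)) != y
        err = np.clip(np.dot(w, miss), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)     # AdaBoost-style vote weight
        w *= np.exp(alpha * miss)                 # up-weight misclassified samples
        w /= w.sum()
        rounds.append((feat, clf, alpha))
    return rounds

def predict(rounds, X_target):
    # combine every learned feature map and weak classifier on target-platform data
    votes = sum(a * np.where(c.predict(f.transform(X_target)) == 1, 1.0, -1.0)
                for f, c, a in rounds)
    return (votes >= 0).astype(int)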
2. Multi-modal cross-platform feature learning. A unified feature learning framework that considers the multi-modality and cross-platform characteristics simultaneously is proposed. More effective and robust features can be learned by adding a modality correlation constraint and a platform consistency constraint to the denoising autoencoders. This kind of denoising autoencoder can be learned efficiently through marginalization.
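As a rough illustration of how the two constraints can enter such an objective, one plausible formulation (an assumption for exposition, not the thesis's exact loss) adds a modality correlation term and a platform consistency term to the marginalized denoising reconstruction loss:

\min_{W}\; \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\tilde{x}_i}\!\left[\|x_i - W\tilde{x}_i\|^2\right]
\;+\; \alpha \sum_{i=1}^{n}\|W\tilde{x}_i^{(v)} - W\tilde{x}_i^{(t)}\|^2
\;+\; \beta \left\|\frac{1}{|S|}\sum_{i\in S} W x_i - \frac{1}{|T|}\sum_{j\in T} W x_j\right\|^2

Here \tilde{x} is a randomly corrupted input, x^{(v)} and x^{(t)} are the visual and textual views of the same item, S and T index source-platform and target-platform samples, and \alpha, \beta are trade-off weights. Because every term is quadratic in W, the expectation over corruptions can be taken in closed form, which is the sense in which the constrained autoencoder can be solved by marginalization.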
3. Semantic attribute learning from photos. A relative attribute learning algorithm based on convolutional neural networks is proposed to relieve the semantic gap issue. The visual features of photos and the ranking loss function for the relative attribute are learned in a unified neural-network framework. The ranking loss consists of a contrastive constraint for photo pairs with different attribute values and a similarity constraint for photo pairs with the same attribute value.
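A common way to write such a ranking loss, given here only as an assumed illustration of the two constraints rather than the exact loss in the thesis: for a CNN scoring function f(.) that outputs the strength of one attribute,

L = \sum_{(i,j)\in\mathcal{O}} \max\bigl(0,\; 1 - (f(x_i) - f(x_j))\bigr) \;+\; \lambda \sum_{(i,j)\in\mathcal{S}} \bigl(f(x_i) - f(x_j)\bigr)^2

where \mathcal{O} contains ordered pairs in which image x_i shows the attribute more strongly than x_j (the contrastive constraint), \mathcal{S} contains pairs with the same attribute strength (the similarity constraint), and \lambda balances the two terms. The CNN features and f are trained jointly by back-propagating this loss.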
4. Semantic attribute learning from event videos. To create the most effective visual attributes for each event in videos, an automatic visual semantic attribute learning method is proposed. Semantic attributes are mined from the video-related text descriptions based on semantic stickiness, and visual semantic attributes are obtained from the visual representativeness computed on auxiliary photos from Flickr. The visual attributes most effective for event recognition in videos are selected by boosting and denoising autoencoders, and multiple features and multiple attribute classifiers are combined to represent a new video.
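The phrase-mining step can be pictured with a toy sketch. The exact definitions of semantic stickiness and visual representativeness are specific to the thesis; the snippet below merely assumes a PMI-style stickiness over bigrams of the video descriptions, purely for illustration.

# Toy sketch: score candidate two-word phrases from video descriptions with a
# PMI-style "stickiness"; high-scoring phrases become candidate attributes.
import math
from collections import Counter

def bigram_stickiness(tokenized_descriptions):
    unigrams, bigrams = Counter(), Counter()
    for tokens in tokenized_descriptions:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni, total_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (w1, w2), c in bigrams.items():
        p_joint = c / total_bi
        p1, p2 = unigrams[w1] / total_uni, unigrams[w2] / total_uni
        scores[(w1, w2)] = math.log(p_joint / (p1 * p2))   # pointwise mutual information
    return scores

descriptions = [["blowing", "birthday", "candles"],
                ["birthday", "candles", "and", "cake"]]
print(sorted(bigram_stickiness(descriptions).items(), key=lambda kv: -kv[1])[:3])

Candidate phrases surviving this step would then be ranked by how well auxiliary Web photos retrieved for each phrase can be visually classified, and finally filtered by the boosting and denoising-autoencoder selection described above.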
5. Semantic feature learning for event videos. A function that generates semantic feature vectors for videos is learned from the videos and their related text descriptions. To achieve this, an embedding convolutional neural network (ECNN) model is proposed that maps videos and texts into a common feature space in which the distances between videos and their related texts are minimized. The ECNN model consists of two branches of neural networks, one for videos and one for texts. The model is especially useful when training videos are scarce.
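A minimal two-branch sketch of the idea, assuming pre-extracted video and text features and a simple squared-distance matching loss (the actual ECNN architecture and loss in the thesis may differ):

# Two branches map video features and text features into one semantic space;
# paired (video, description) embeddings are pulled together.
import torch
import torch.nn as nn

class TwoBranchEmbedding(nn.Module):
    def __init__(self, video_dim=4096, text_dim=300, embed_dim=512):
        super().__init__()
        self.video_net = nn.Sequential(nn.Linear(video_dim, 1024), nn.ReLU(),
                                       nn.Linear(1024, embed_dim))
        self.text_net = nn.Sequential(nn.Linear(text_dim, 512), nn.ReLU(),
                                      nn.Linear(512, embed_dim))

    def forward(self, video_feat, text_feat):
        return self.video_net(video_feat), self.text_net(text_feat)

model = TwoBranchEmbedding()
video = torch.randn(8, 4096)     # e.g. pooled CNN features of 8 training videos
text = torch.randn(8, 300)       # e.g. averaged word vectors of their descriptions
v_emb, t_emb = model(video, text)
loss = ((v_emb - t_emb) ** 2).sum(dim=1).mean()   # distance between a video and its own text
loss.backward()

At test time the video branch alone produces the semantic feature vector of a new video, which can then be compared against event descriptions in the shared space.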
6. Social event analysis in Internet photos. By introducing the time stamps of photos, event analysis is formulated as a temporal structured prediction task. Temporal visual features of events are learned with recurrent neural networks and convolutional neural networks to reduce the intra-class discrepancy. A discriminative structured event model (DSEM) based on a discrete conditional random field is proposed to reduce the inter-class confusion, and a one-class structured event model (OSEM) based on a continuous conditional random field is proposed to relieve the sample-scarcity issue. In both the DSEM and the OSEM, the conditional random field serves as the loss function that constrains the training of the recurrent and convolutional neural networks.
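For concreteness, a standard linear-chain form of such a CRF loss (an assumed illustration; the thesis's exact potentials are not reproduced here): given photos x_1, ..., x_T ordered by time stamp, with CNN and RNN features h_t(x; \theta) and event labels y_t,

p(y_{1:T}\mid x_{1:T}) = \frac{1}{Z(x_{1:T})}\exp\!\Bigl(\sum_{t=1}^{T}\psi\bigl(y_t, h_t(x;\theta)\bigr) + \sum_{t=2}^{T}\phi(y_{t-1}, y_t)\Bigr),
\qquad L(\theta) = -\log p(y_{1:T}\mid x_{1:T})

where \psi is a unary potential linking a label to the temporal visual features, \phi is a pairwise potential over neighboring labels, and Z is the partition function. Minimizing L by back-propagation pushes gradients through the potentials into the recurrent and convolutional networks, which is the sense in which the CRF acts as the loss that constrains their training.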
Keywords: multimedia; feature representation; semantic attributes; deep learning; social event analysis
Language: Chinese
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/11759
Collection: Graduates_Doctoral Dissertations
Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714): 杨小汕. 基于深度学习的多媒体数据感知与计算研究[D]. 北京: 中国科学院研究生院, 2016.
Files in this item:
基于深度学习的多媒体数据感知与计算研究. (29606 KB), doctoral dissertation, restricted access, license: CC BY-NC-SA