基于深度学习的多媒体数据感知与计算研究 (Research on Deep Learning Based Multimedia Data Perception and Computing)
Author: 杨小汕 (Yang Xiaoshan)
Date: 2016-05-28
Degree type: Doctor of Engineering
Abstract (Chinese): With the development of the mobile Internet, more and more smart devices are being connected to the Internet, which has greatly simplified the way users obtain and share information online. Against this background, a huge amount of media data, such as images, text and videos, has been uploaded by users to Web 2.0 social websites. The spread of these multimedia data accelerates the circulation of information, connects users around the world and lowers the cost of communication. For users and social media sites, however, managing, retrieving and analyzing these data remains an unsolved problem, because Web multimedia data are (1) cross-platform, (2) multi-modal, (3) separated by a "semantic gap" between low-level features and high-level semantics, and (4) noisy and incomplete. Faced with these complex characteristics, more effective data perception and computing methods are needed to extract and mine the useful information contained in the data. Existing multimedia analysis methods, however, still rely on contextual annotations or hand-crafted features and cannot truly perceive and understand the data content.
Starting from these four characteristics of Web multimedia data (cross-platform, multi-modal, the semantic gap, and heavy noise), and building on deep neural networks (mainly denoising autoencoders, convolutional neural networks and recurrent neural networks), which in recent years have achieved breakthrough progress on unstructured data such as images and speech, this thesis learns more effective feature representations for Web multimedia data analysis so that computers can better understand multimedia content, and applies these representations to the recognition and discovery of social events. Compared with existing methods, the main contributions of this thesis fall into the following six areas:
1. Cross-platform feature representation learning. The platform discrepancy of Web multimedia data is formulated as the feature distribution discrepancy between different domains in transfer learning, and boosted deep learning is used to reduce this discrepancy. The boosted deep learning algorithm combines the ideas of traditional boosting and deep feature learning. As the boosting iterations proceed, new samples are selected according to the sample distribution to train new feature representations, yielding shared representations that better reduce the discrepancy between source-platform and target-platform data. After the iterations finish, multiple feature representations and multiple weak classifiers are combined to classify test samples.
2. Multi-modal cross-platform feature representation learning. A unified feature learning framework that fuses multi-modal and cross-platform characteristics is proposed. By adding a modality correlation constraint and a platform consistency constraint to the same layer of a denoising autoencoder, the robustness of feature learning is effectively improved. The denoising autoencoder with multi-modal and cross-platform constraints can be solved efficiently in a marginalized form.
3. Semantic attribute learning for images. To address the semantic gap between the low-level features and the high-level semantics of multimedia data, a relative attribute learning algorithm based on deep convolutional neural networks is proposed. Within the neural network framework, the visual features of images are trained under a ranking loss that encodes relative attribute values. The ranking loss contains a contrastive constraint and a similarity constraint, corresponding to image pairs with different attribute values and image pairs with the same attribute value, respectively.
4. Semantic attribute learning for event videos. To construct the most effective visual attribute features for specific events in videos, an automatic visual semantic attribute learning algorithm is proposed. The text descriptions of videos are analyzed and segmented into phrases, and semantic attributes are mined automatically by computing the semantic stickiness of the phrases. Visual semantic attributes are then obtained by computing the visual representativeness of these attributes on an auxiliary Web photo dataset. Boosting and denoising autoencoders are used to select the visual semantic attributes most useful for event recognition, and the visual semantic representation of a test video is built from multiple feature representations and multiple attribute classifiers.
5. Semantic feature learning for event videos. A mapping function that generates semantic feature vectors from videos is learned from videos and their text descriptions. To this end, an embedding convolutional neural network is proposed that maps videos and their corresponding texts into the same semantic feature space, in which the distance between the semantic feature vectors of related videos and texts is minimized. The embedding convolutional network consists of two branches of neural networks, one for video feature representation and one for text feature representation. This method works well when the number of training videos is limited.
6. Social event analysis in Web photos. Temporal information is introduced into image-based event analysis, and event analysis is formulated as a temporal structured prediction problem. Temporal feature representations of events are obtained with recurrent and convolutional neural networks to reduce intra-class variation. A discriminative structured event model for multi-class event recognition, based on a discrete conditional random field, is proposed to mitigate inter-class confusion, and a one-class structured event model for discovering uncommon events, based on a continuous conditional random field, is proposed to alleviate sample scarcity. In these event models, the conditional random field serves as the loss function that constrains the training of the recurrent and convolutional neural networks in a unified framework.
Abstract (English): With the impressive progress of the mobile Internet, more and more smart devices have been connected, directly or indirectly, to the Internet, which greatly facilitates information sharing and propagation. As a result, a huge amount of social multimedia content, such as text, photos and videos, has been uploaded by users to Web 2.0 sites. The propagation of these multimedia data speeds up information exchange, connects users all over the world and reduces the cost of communication. However, due to the intrinsically complex characteristics of multimedia data, including (1) cross-platform distribution, (2) multi-modality, (3) the semantic gap between low-level features and high-level semantics, and (4) noisiness and incompleteness, it remains extremely difficult for users or social media sites to organize, retrieve or analyze these data. Given these characteristics, effective data sensing and understanding methods are needed to extract and mine useful information from them. Existing methods, however, rely mainly on contextual information and human-designed features, and still cannot truly sense and understand multimedia content.
In view of these four characteristics, this thesis aims to learn more effective feature representations for multimedia data analysis and understanding based on deep neural networks (including denoising autoencoders, convolutional neural networks and recurrent neural networks), which have achieved breakthrough progress in recognition tasks on unstructured data such as images and speech. We also apply the learned representations to social event recognition and discovery. Compared with previous methods, the main contributions of the thesis are as follows.
1. Cross-platform feature representation learning. The discrepancy of multimedia data across platforms is formulated as the domain shift between different domains in transfer learning, and is reduced with boosted deep feature learning. In each iteration of the boosting framework, features are learned from samples selected according to the current sample weights. The derived shared features reduce the discrepancy between the data distributions of the source platform and the target platform. After the iterations finish, the multiple learned features and weak classifiers are combined to predict the label of a new sample.
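As a rough, self-contained illustration of this boosted feature-learning loop (an assumed sketch, not the thesis's implementation): in the Python snippet below a PCA projection and a decision stump stand in for the deep feature learner and the weak classifier, binary labels in {0, 1} are assumed, and the weight update follows a standard AdaBoost-style rule.

# Illustrative sketch only: PCA + decision stumps replace the deep feature
# learner and weak classifiers actually used in the thesis.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def boosted_feature_learning(X, y, n_rounds=5, sample_frac=0.7, n_components=32):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # sample weights, uniform at start
    rounds = []                                   # (feature map, weak classifier, vote)
    for _ in range(n_rounds):
        # draw a training subset according to the current weight distribution
        idx = np.random.choice(n, size=int(sample_frac * n), replace=True, p=w)
        feat = PCA(n_components=min(n_components, X.shape[1])).fit(X[idx])
        clf = DecisionTreeClassifier(max_depth=1).fit(feat.transform(X[idx]), y[idx])
        miss = clf.predict(feat.transform(X)) != y
        err = np.clip(np.dot(w, miss), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)     # AdaBoost-style vote weight
        w *= np.exp(alpha * miss)                 # up-weight misclassified samples
        w /= w.sum()
        rounds.append((feat, clf, alpha))
    return rounds

def predict(rounds, X_target):
    # combine every learned feature map and weak classifier on target-platform data
    votes = sum(a * np.where(c.predict(f.transform(X_target)) == 1, 1.0, -1.0)
                for f, c, a in rounds)
    return (votes >= 0).astype(int)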
2. Multi-modal cross-platform feature learning. A unified feature learning framework that considers the multi-modality and cross-platform characteristics simultaneously is proposed. More effective and robust features can be learned by adding a modality correlation constraint and a platform consistency constraint to the denoising autoencoders. This kind of denoising autoencoder can be learned efficiently through marginalization.
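As a rough illustration of how the two constraints can enter such an objective, one plausible formulation (an assumption for exposition, not the thesis's exact loss) adds a modality correlation term and a platform consistency term to the marginalized denoising reconstruction loss:

\min_{W}\; \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\tilde{x}_i}\!\left[\|x_i - W\tilde{x}_i\|^2\right]
\;+\; \alpha \sum_{i=1}^{n}\|W\tilde{x}_i^{(v)} - W\tilde{x}_i^{(t)}\|^2
\;+\; \beta \left\|\frac{1}{|S|}\sum_{i\in S} W x_i - \frac{1}{|T|}\sum_{j\in T} W x_j\right\|^2

Here \tilde{x} is a randomly corrupted input, x^{(v)} and x^{(t)} are the visual and textual views of the same item, S and T index source-platform and target-platform samples, and \alpha, \beta are trade-off weights. Because every term is quadratic in W, the expectation over corruptions can be taken in closed form, which is the sense in which the constrained autoencoder can be solved by marginalization.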
3. Semantic attribute learning from photos. A relative attribute learning algorithm based on convolutional neural networks is proposed to relieve the semantic gap issue. The visual features of photos and the ranking loss function for the relative attribute are learned in a unified neural-network framework. The ranking loss consists of a contrastive constraint for photo pairs with different attribute values and a similarity constraint for photo pairs with the same attribute value.
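A common way to write such a ranking loss, given here only as an assumed illustration of the two constraints rather than the exact loss in the thesis: for a CNN scoring function f(.) that outputs the strength of one attribute,

L = \sum_{(i,j)\in\mathcal{O}} \max\bigl(0,\; 1 - (f(x_i) - f(x_j))\bigr) \;+\; \lambda \sum_{(i,j)\in\mathcal{S}} \bigl(f(x_i) - f(x_j)\bigr)^2

where \mathcal{O} contains ordered pairs in which image x_i shows the attribute more strongly than x_j (the contrastive constraint), \mathcal{S} contains pairs with the same attribute strength (the similarity constraint), and \lambda balances the two terms. The CNN features and f are trained jointly by back-propagating this loss.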
4. Semantic attribute learning from event videos. To create the most effective visual attributes for each event in videos, an automatic visual semantic attribute learning method is proposed. Semantic attributes are mined from the video-related text descriptions based on semantic stickiness, and visual semantic attributes are obtained from the visual representativeness computed on auxiliary photos from Flickr. The visual attributes most effective for event recognition in videos are selected by boosting and denoising autoencoders, and multiple features and multiple attribute classifiers are combined to represent a new video.
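The phrase-mining step can be pictured with a toy sketch. The exact definitions of semantic stickiness and visual representativeness are specific to the thesis; the snippet below merely assumes a PMI-style stickiness over bigrams of the video descriptions, purely for illustration.

# Toy sketch: score candidate two-word phrases from video descriptions with a
# PMI-style "stickiness"; high-scoring phrases become candidate attributes.
import math
from collections import Counter

def bigram_stickiness(tokenized_descriptions):
    unigrams, bigrams = Counter(), Counter()
    for tokens in tokenized_descriptions:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni, total_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (w1, w2), c in bigrams.items():
        p_joint = c / total_bi
        p1, p2 = unigrams[w1] / total_uni, unigrams[w2] / total_uni
        scores[(w1, w2)] = math.log(p_joint / (p1 * p2))   # pointwise mutual information
    return scores

descriptions = [["blowing", "birthday", "candles"],
                ["birthday", "candles", "and", "cake"]]
print(sorted(bigram_stickiness(descriptions).items(), key=lambda kv: -kv[1])[:3])

Candidate phrases surviving this step would then be ranked by how well auxiliary Web photos retrieved for each phrase can be visually classified, and finally filtered by the boosting and denoising-autoencoder selection described above.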
5. Semantic feature learning for event videos. A function that generates semantic feature vectors for videos is learned from the videos and their related text descriptions. To achieve this, an embedding convolutional neural network (ECNN) model is proposed that maps videos and texts into a common feature space in which the distances between videos and their related texts are minimized. The ECNN model consists of two branches of neural networks, one for videos and one for texts. The model is especially useful when training videos are scarce.
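A minimal two-branch sketch of the idea, assuming pre-extracted video and text features and a simple squared-distance matching loss (the actual ECNN architecture and loss in the thesis may differ):

# Two branches map video features and text features into one semantic space;
# paired (video, description) embeddings are pulled together.
import torch
import torch.nn as nn

class TwoBranchEmbedding(nn.Module):
    def __init__(self, video_dim=4096, text_dim=300, embed_dim=512):
        super().__init__()
        self.video_net = nn.Sequential(nn.Linear(video_dim, 1024), nn.ReLU(),
                                       nn.Linear(1024, embed_dim))
        self.text_net = nn.Sequential(nn.Linear(text_dim, 512), nn.ReLU(),
                                      nn.Linear(512, embed_dim))

    def forward(self, video_feat, text_feat):
        return self.video_net(video_feat), self.text_net(text_feat)

model = TwoBranchEmbedding()
video = torch.randn(8, 4096)     # e.g. pooled CNN features of 8 training videos
text = torch.randn(8, 300)       # e.g. averaged word vectors of their descriptions
v_emb, t_emb = model(video, text)
loss = ((v_emb - t_emb) ** 2).sum(dim=1).mean()   # distance between a video and its own text
loss.backward()

At test time the video branch alone produces the semantic feature vector of a new video, which can then be compared against event descriptions in the shared space.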
6. Social event analysis in Internet photos. By introducing the time stamps of photos, event analysis is formulated as a temporal structured prediction task. Temporal visual features of events are learned with recurrent neural networks and convolutional neural networks to reduce the intra-class discrepancy. A discriminative structured event model (DSEM) based on a discrete conditional random field is proposed to reduce the inter-class confusion, and a one-class structured event model (OSEM) based on a continuous conditional random field is proposed to relieve the sample-scarcity issue. In both the DSEM and the OSEM, the conditional random field serves as the loss function that constrains the training of the recurrent and convolutional neural networks.
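For concreteness, a standard linear-chain form of such a CRF loss (an assumed illustration; the thesis's exact potentials are not reproduced here): given photos x_1, ..., x_T ordered by time stamp, with CNN and RNN features h_t(x; \theta) and event labels y_t,

p(y_{1:T}\mid x_{1:T}) = \frac{1}{Z(x_{1:T})}\exp\!\Bigl(\sum_{t=1}^{T}\psi\bigl(y_t, h_t(x;\theta)\bigr) + \sum_{t=2}^{T}\phi(y_{t-1}, y_t)\Bigr),
\qquad L(\theta) = -\log p(y_{1:T}\mid x_{1:T})

where \psi is a unary potential linking a label to the temporal visual features, \phi is a pairwise potential over neighboring labels, and Z is the partition function. Minimizing L by back-propagation pushes gradients through the potentials into the recurrent and convolutional networks, which is the sense in which the CRF acts as the loss that constrains their training.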
Keywords: multimedia; feature representation; semantic attributes; deep learning; social event analysis
Language: Chinese
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/11759
Collection: Graduates_Doctoral Dissertations
Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714): 杨小汕. 基于深度学习的多媒体数据感知与计算研究[D]. 北京: 中国科学院研究生院, 2016.
Files in this item:
基于深度学习的多媒体数据感知与计算研究. (29606 KB), doctoral dissertation, restricted access, license: CC BY-NC-SA