Research on Audio Classification Methods Based on Task-Related Feature Modeling
Dai Jia
Degree type: Doctor of Engineering
Supervisor: Liu Wenju
2017-06
Degree-granting institution: Graduate University of Chinese Academy of Sciences
Degree-granting place: Beijing
Keywords: audio classification; audio features; feature modeling; classifiers; deep learning
Abstract

With the development of the big-data era, audio content analysis plays an increasingly important role in storing and exploiting massive data. However, existing acoustic features can no longer adequately represent increasingly complex audio content, nor can they meet ever-higher demands on classification accuracy. Building on a careful survey of prior work on audio classification and feature transformation, this thesis proposes a series of task-related audio feature modeling methods: by modeling low-level features, high-level features better suited to the classification task at hand are obtained. The main work and contributions of this thesis are as follows:
(1) For environmental sound classification, we propose a feature modeling method that describes audio content by mining its fine details. We assume that audio content can be expressed through descriptive units ("element units") that capture fine-grained audio detail: different combinations of these units form different audio classes, and recognizing the distribution of the units allows the classes to be distinguished. On top of the traditional bag-of-words framework, this thesis proposes a new way of generating "element units", representing them with a self-organizing map neural network. Compared with the traditional approach of representing them by k-means cluster centers, this generation method can largely avoid local optima in the generated unit set, given enough training steps and a good training strategy. We then propose a multi-vote strategy for building the probability histogram that describes the unit distribution; compared with existing vector quantization methods, it is more robust to boundary points.
(2) For small-data classification tasks, we propose a deep-learning feature modeling method that improves music genre classification on small datasets. Training a robust deep neural network requires large amounts of labeled data; although raw data is abundant, annotating it at scale is too expensive, so many tasks cannot obtain enough labels. Too little data easily causes deep models to overfit, and although unsupervised methods can exploit unlabeled data, in many tasks their benefit is small. To address the shortage of labeled data, this thesis proposes a feature modeling method that combines semi-supervised transfer learning with a bottleneck DNN: unlabeled out-of-domain data is used to train the transfer model in a semi-supervised manner, and the bottleneck DNN then performs the feature modeling. Experiments achieve the best reported results on the benchmark database.
(3) For music genre classification, we propose a segment feature modeling method that fuses temporal features with statistical features. Temporal structure is an important property of audio files and plays a significant role in representing their content. To extract features containing temporal information, this thesis uses a long short-term memory recurrent neural network (LSTM-RNN) for sequence modeling; sequence modeling has shortcomings, however, that keep majority voting from clearly improving segment-level accuracy. We first tried fusing the temporal features with the original features directly, but directly fusing frame-level temporal features with frame-level original features that carry no temporal information causes confusion and hurts discriminability. This thesis therefore proposes fusing temporal features with statistical segment features: statistical segment features are first extracted from the frame features and then fused to obtain the final segment feature. Experiments show the fused segment feature is more discriminative than the originals and achieves good classification results.
(4) For music genre classification, we propose a segment feature modeling method that learns "element units" from out-of-domain data. A perfect representation of a music signal would capture many important properties, such as beat, melody, loudness, singer information, instrumentation, and other descriptors, but for some of these (e.g., melody) no good extraction method yet exists. Rather than learning these properties directly, this thesis takes another angle and describes music with finer-grained descriptive units ("element units"). Gaussian distributions are used to fit and generate the units: each music class is assumed to be describable by a Gaussian mixture model, and each Gaussian component can be viewed as one "element unit". Within an i-vector feature modeling framework, a GMM-based universal background model (GMM-UBM) learns the units in an unsupervised manner. In theory, given enough data and components, a Gaussian mixture model can fit any data distribution; to learn richer units, this thesis therefore also incorporates a multilingual model, training the GMM-UBM on large amounts of unlabeled music downloaded from the web and obtaining features based on multilingual-i-vector "element unit" representations. Experiments confirm that these features represent music content well, a result that is significant for exploiting unlabeled web data.

Other Abstract

With the development of big data, analysis of audio content is increasingly important, especially for the storage and utilization of large multimedia collections. But as audio data becomes more and more complex, the existing audio features no longer represent the content well, and classification with these features cannot meet the demand for high accuracy. Based on a comprehensive investigation of the state-of-the-art methods for audio classification and feature transformation, this thesis proposes a series of algorithms based on task-related feature modeling and reports the corresponding experiments. Through feature modeling, the low-level features are mapped into another feature space, yielding high-level features that are better suited to the target classification task. The main contributions and novelties of this thesis include:
(1) For the environmental sound classification task, we propose feature modeling based on a novel method of learning audio ``element units''. We assume that audio content can be described by micro-descriptors called ``element units'': different combinations of these units represent different kinds of audio, and recognizing their distribution lets us classify the audio data. To discover the units automatically, we propose a generation method based on the bag-of-words model, in which the neural nodes of a self-organizing feature map (SOFM) network represent the ``element units''. This is better than representing them with k-means cluster centers, since enough training steps and a good training scheme can ameliorate the local-optimum problem. We then propose an n-competition encoding strategy to form the probability histogram that describes the distribution of the units; compared with vector quantization encoding, n-competition encoding is more robust to boundary points.
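The SOFM-plus-histogram pipeline above can be sketched in a few dozen lines. The sketch below is illustrative only, not the thesis implementation: it trains a tiny self-organizing map with numpy on synthetic frame features (standing in for real acoustic frames), then encodes a clip with an n-competition histogram in which each frame casts distance-weighted votes for its n nearest units rather than a single hard vector-quantization vote. Grid size, learning-rate schedule, and the inverse-distance vote weights are all assumptions.

```python
import numpy as np

def train_som(frames, grid=(4, 4), epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Train a tiny self-organizing map; each map node is one 'element unit'."""
    rng = np.random.default_rng(seed)
    n_units = grid[0] * grid[1]
    weights = rng.normal(size=(n_units, frames.shape[1]))
    coords = np.array([(i, j) for i in range(grid[0])
                       for j in range(grid[1])], dtype=float)
    n_steps, t = epochs * len(frames), 0
    for _ in range(epochs):
        for x in rng.permutation(frames):
            lr = lr0 * (1 - t / n_steps)            # decaying learning rate
            sigma = sigma0 * (1 - t / n_steps) + 1e-3
            bmu = np.argmin(((weights - x) ** 2).sum(1))   # best-matching unit
            d2 = ((coords - coords[bmu]) ** 2).sum(1)
            h = np.exp(-d2 / (2 * sigma ** 2))      # neighborhood kernel
            weights += lr * h[:, None] * (x - weights)
            t += 1
    return weights

def n_competition_histogram(frames, units, n=3):
    """Each frame votes for its n nearest units with inverse-distance weights,
    so frames near a unit boundary spread their vote instead of flipping."""
    hist = np.zeros(len(units))
    for x in frames:
        d = np.sqrt(((units - x) ** 2).sum(1))
        nearest = np.argsort(d)[:n]
        w = 1.0 / (d[nearest] + 1e-8)
        hist[nearest] += w / w.sum()                # soft multi-vote per frame
    return hist / hist.sum()
```

With n=1 the encoder degenerates to ordinary vector quantization, which makes the robustness comparison in the text easy to reproduce.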
(2) For low-resource classification tasks, we propose a feature modeling method that improves classification accuracy on small databases. A large amount of data is usually needed to train a large, robust deep neural network, while only a small amount of labeled data can be acquired in practice: unlabeled data is plentiful, but annotation costs too much time and money, and for many tasks unsupervised methods do not help. The resulting shortage of data leads to overfitting. To solve this problem, we propose a model that combines transfer learning and a bottleneck DNN for feature modeling: large amounts of out-of-domain unlabeled data are used for semi-supervised training of the transfer network, and the idea of transfer learning is then combined with the bottleneck DNN to extract bottleneck features. The results outperform the best previously reported performance on the test database.
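The bottleneck-feature extraction step can be illustrated with a minimal numpy forward pass. This is a sketch under stated assumptions, not the thesis's network: the layer sizes (39-dim input, a 40-unit bottleneck, 10 output classes) are invented for the example, the weights are random stand-ins for weights the thesis would obtain via semi-supervised transfer training, and the "feature" is simply the activation vector of the narrow middle layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class BottleneckDNN:
    """input -> 256 -> 40 (bottleneck) -> 256 -> classes.
    After training, the 40-dim bottleneck activations serve as the
    task-related high-level feature for a downstream classifier."""

    def __init__(self, dims=(39, 256, 40, 256, 10)):
        # Random initialization only; real weights come from training.
        self.W = [rng.normal(0, 0.1, (a, b)) for a, b in zip(dims[:-1], dims[1:])]
        self.b = [np.zeros(b) for b in dims[1:]]

    def forward(self, x):
        acts = [x]
        for W, b in zip(self.W, self.b):
            x = relu(x @ W + b)
            acts.append(x)
        return acts

    def bottleneck_feature(self, x):
        # Index 2 = activations of the 40-unit bottleneck layer.
        return self.forward(x)[2]
```

The narrow layer forces the network to compress class-relevant information, which is why its activations, rather than the softmax output, are taken as the feature.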
(3) For the music genre classification task, we propose segment feature modeling that combines sequential knowledge with statistical features. Sequential knowledge is widely regarded as an important characteristic of audio data and can benefit the representation of audio content. To obtain a representation containing sequence information, we use a long short-term memory recurrent neural network (LSTM-RNN) for sequence modeling. However, drawbacks of sequence training mean that segment accuracy improves little when majority voting is used to obtain segment labels. To solve this problem, we combine the frame-level sequential feature with the frame-level initial feature; but since the initial feature carries no sequence information while the sequential feature does, they cannot simply be fused at the frame level. We therefore propose a fused segment feature method: statistical segment features are computed from the frame features and then fused. Experimental results show that the fused feature improves classification accuracy and outperforms sequence modeling with the LSTM-RNN alone.
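The fusion step above reduces to statistics pooling followed by concatenation. The sketch below is a minimal interpretation, not the thesis code: it assumes per-dimension mean and standard deviation as the segment statistics (the thesis does not list the exact statistics here), and it treats the LSTM outputs as just another frame-level stream to be pooled before fusion, so that both sides of the concatenation live at the segment level.

```python
import numpy as np

def segment_statistics(frames):
    """Statistical segment feature: per-dimension mean and std over all frames.
    frames: (n_frames, dim) array for one audio segment."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def fuse_segment_features(initial_frames, sequence_frames):
    """Pool both frame-level streams to segment level, then concatenate.
    Pooling first avoids mixing frames that do and do not carry
    temporal context, which is the confusion the text describes."""
    return np.concatenate([segment_statistics(initial_frames),
                           segment_statistics(sequence_frames)])
```

For example, 13-dim initial frames and 32-dim LSTM output frames yield a fused segment vector of length 2*13 + 2*32 = 90, regardless of segment length.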
(4) For the music genre classification task, we propose segment feature modeling using ``element units''. A perfect representation of a music signal would capture many characteristics, such as beats, rhythm, loudness, singer information, instrumentation, and so on, but these characteristics are difficult to extract directly. Instead, we capture them through micro-descriptors called ``element units'', learned and generated with a Gaussian mixture model (GMM): a GMM represents one kind of music, and each Gaussian component in the GMM represents one ``element unit''. The framework is based on i-vectors, using a GMM-based universal background model (GMM-UBM) for unsupervised learning of the units. In theory, given enough data and Gaussian mixtures, a GMM can fit any data distribution; so, to learn sufficiently many ``element units'', we further propose the multilingual i-vector model, trained on large amounts of unlabeled music downloaded from the web. Experimental results show that the multilingual-i-vector-based ``element unit'' learning method represents the music signal better.
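The GMM-UBM stage can be sketched with scikit-learn's `GaussianMixture`. This is a simplified stand-in, not the thesis pipeline: the "background" data here is synthetic noise playing the role of the downloaded unlabeled music, and the clip representation shown is only the zeroth-order occupancy statistics over the UBM components. Full i-vector extraction would additionally project first-order statistics through a learned total-variability matrix, which is omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for pooled frames of large unlabeled out-of-domain music.
background_frames = rng.normal(size=(2000, 13))

# UBM: each Gaussian component acts as one unsupervised "element unit".
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(background_frames)

def occupancy_vector(clip_frames):
    """Zeroth-order statistics of a clip: average posterior probability of
    each UBM component, i.e. how often each 'element unit' is active."""
    return ubm.predict_proba(clip_frames).mean(axis=0)
```

The occupancy vector is a fixed-length, segment-level descriptor whatever the clip length, which is the property the i-vector framework builds on.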

 

Subject area: Pattern Recognition and Intelligent Systems
Document type: Doctoral thesis
Identifier: http://ir.ia.ac.cn/handle/173211/14680
Collection: Graduates_Doctoral Dissertations
Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714):
Dai Jia. Research on Audio Classification Methods Based on Task-Related Feature Modeling [D]. Beijing: Graduate University of Chinese Academy of Sciences, 2017.
Files in this item:
File name/size | Document type | Version | Access | License
答辩后修改20170601-基于任务关联 (4900 KB) | Thesis | — | Restricted | CC BY-NC-SA; request full text

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.