CASIA OpenIR > Graduates > Doctoral Dissertations
Research on Transfer Learning Based Acoustic Models for Small-Data Speech (基于迁移学习的小数据语音声学模型研究)
易江燕
Thesis Advisor: 陶建华
2018-05-29
Degree Grantor: Graduate University of Chinese Academy of Sciences (中国科学院研究生院)
Place of Conferral: Beijing
Keywords: transfer learning, low-resource languages, accent adaptation, acoustic model, speech recognition
Abstract
Deep learning based acoustic models have driven major breakthroughs in speech recognition, but deep learning requires "big data". However, most languages are extremely resource-poor, and even for resource-rich languages, the uneven distribution of accent data means that little data is available for some accents. Collecting and annotating such small-sample data is clearly difficult and costly, so addressing this problem is both valuable and highly challenging. Building on deep learning based acoustic models, this thesis uses transfer learning to "transfer" knowledge from the big data of other languages so that the target acoustic model can learn better from "small data". The thesis studies the "small data" problem in two scenarios: cross-lingual transfer between different languages and cross-accent transfer within the same language. For these two scenarios, three improved transfer learning methods are proposed at the levels of bottleneck features, model parameters, and posterior probabilities, respectively, to improve the performance of "small data" acoustic models. The main innovations and contributions of this thesis can be summarized in three aspects:
(1) Mainstream bottleneck-feature transfer methods have two shortcomings: the similarity between the source and target languages is ignored, and the multilingual bottleneck features contain language-dependent information. To remedy these deficiencies, this thesis proposes a bottleneck-feature transfer method based on adversarial multilingual training. Its core idea is to use the two proposed shared-private bottleneck models as source acoustic models and to introduce an adversarial learning strategy into the multilingual training criterion, preventing the shared layers of the source acoustic model from learning language-dependent features. In addition, the relevance of the source and target languages is taken into account when selecting source languages: languages from the same language family as the target language are chosen. Experiments on the IARPA Babel datasets show that, compared with the classical bottleneck-feature transfer method, the proposed method reduces the word error rate by up to 8.9% relative.
(2) The classical cross-lingual parameter transfer method has two weaknesses: it ignores the fact that a multilingual model should also learn language-dependent features, and its shared hidden layers nevertheless learn many language-dependent features. To address these deficiencies, this thesis proposes a language-adversarial model parameter transfer method. The language-adversarial strategy is combined with transfer learning to train an adversarial shared-private model, and two new transfer strategies are proposed. The shared-private model learns both language-independent features and language-dependent information, while the language-adversarial strategy ensures that the shared layers learn as many universal features as possible. These language-independent universal features significantly improve the performance of the target acoustic model. Experiments on the IARPA Babel datasets show that, compared with the classical cross-lingual parameter transfer method, the proposed method reduces the word error rate by up to 9.7% relative.
(3) Directly adjusting the parameters of an end-to-end acoustic model based on connectionist temporal classification (CTC) may distort the model's probability distribution and cause over-fitting, which becomes more severe when adaptation data are scarce. To avoid this problem, this thesis proposes a posterior probability transfer method based on CTC-regularized accent adaptation. Its core idea is to add a regularization term to the standard CTC loss function, forcing the posterior distribution of the adapted model to stay as close as possible to that of the accent-independent model. In other words, posterior probabilities are transferred from the accent-independent model to assist the adapted model's learning. Experiments on the public Mandarin regional-accent corpora RASC863 and CASIA show that the proposed method not only clearly outperforms the accent-independent baseline but is also more effective than the L2 and linear hidden network (LHN) adaptation methods, especially when only 1000 adaptation utterances are available.
In addition to these research results, the proposed methods have been successfully applied in speech recognition systems. For cross-lingual transfer between different languages, the proposed bottleneck-feature and model-parameter transfer methods were used to build speech recognition systems for low-resource languages such as Cantonese, Shanghainese, and Mongolian. For cross-accent transfer within the same language, the proposed CTC-regularized posterior probability transfer method was used to adapt the acoustic model. The resulting Mandarin speech recognition system has reached practical usability and is currently deployed in customer-service quality inspection and dialogue systems.

 

Other Abstract
Deep neural network based acoustic models have brought significant improvements to automatic speech recognition (ASR) systems. However, deep neural networks are only highly effective when trained on large amounts of transcribed speech, and such annotations are not readily available for most languages. Moreover, even a resource-rich language like Mandarin suffers from annotation sparsity when ASR systems must handle accented speech. Since data collection and annotation are time-consuming and expensive, addressing this problem is both valuable and challenging. The goal of this thesis is to use transfer learning to address these challenges for low-resource deep neural network based acoustic models; in other words, knowledge is transferred from other languages to the target language. In particular, the thesis focuses on two transfer learning scenarios: transfer across different languages and transfer across accents within the same language. Three novel transfer learning approaches are proposed to improve performance in the low-resource setting. The main contributions of this thesis are as follows.
(1) Multilingual bottleneck features help improve the performance of low-resource speech recognition systems, but the standard method has two shortcomings: the relevance of the source and target languages is ignored, and the bottleneck features may contain unnecessary language-specific information. This thesis proposes adversarial multilingual training to alleviate these problems. Adversarial training ensures that the shared layers learn language-invariant features, and languages from the same language family as the target language are selected as source languages. Experiments on the IARPA Babel datasets show that the proposed adversarial multilingual BN model outperforms the baseline BN model by up to 8.9% relative word error rate (WER) reduction.
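The adversarial multilingual objective described above can be sketched as a multi-task loss in which the language classifier's cross-entropy is subtracted, which is the net effect of a gradient-reversal layer on the shared layers. The abstract does not give the exact formulation, so the function name, the trade-off weight `lam`, and the averaging scheme are illustrative assumptions, not the thesis's definitive recipe.

```python
import math

def adversarial_multilingual_loss(asr_losses, lang_probs, true_langs, lam=0.1):
    """Sketch of an adversarial multilingual training objective.

    asr_losses : per-language ASR losses from the shared-private source model
    lang_probs : softmax outputs of a language classifier fed with
                 shared-layer features (one distribution per utterance)
    true_langs : index of the true language for each utterance
    lam        : assumed trade-off weight for the adversarial term

    The shared layers are trained to *confuse* the language classifier
    (via gradient reversal in practice), so the language-classification
    cross-entropy is subtracted from the ASR loss.
    """
    asr = sum(asr_losses) / len(asr_losses)
    # cross-entropy of the language classifier on the true language labels
    lang_ce = -sum(math.log(p[t]) for p, t in zip(lang_probs, true_langs)) / len(lang_probs)
    return asr - lam * lang_ce
```

With a perfectly confused classifier (uniform language posteriors), the subtracted term reaches its maximum, which is exactly the regime the shared layers are pushed toward.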
(2) A target acoustic model trained with cross-lingual parameters transferred from the shared hidden layer model (SHL-Model) outperforms a model trained only on target language data, especially under low-resource conditions. However, the SHL-Model uses its hidden layers only to learn shared features, ignoring the fact that some features are language-specific; moreover, the shared features may still contain unnecessary language-dependent information. This thesis therefore proposes language-adversarial transfer learning to alleviate these problems. Shared-private source models are proposed to learn both language-dependent and language-independent features, and adversarial learning ensures that the shared layers of the shared-private model learn more language-invariant features. Experiments on the IARPA Babel datasets show that the target model trained with knowledge transferred from the adversarial shared-private model achieves up to 9.7% relative WER reduction over the target model trained with knowledge transferred from the SHL-Model.
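The parameter-transfer step can be sketched as copying the shared-layer weights of the (adversarial) shared-private source model into the target model, while private and output layers are trained from scratch on the target language. The abstract does not specify the two transfer strategies, so this generic copy-and-fine-tune scheme, and the parameter-dictionary naming, are assumptions for illustration only.

```python
def transfer_shared_parameters(source_params, target_params, shared_keys):
    """Initialize the target model's hidden layers with the shared-layer
    parameters learned by the shared-private source model.

    source_params : dict mapping parameter names to weights (source model)
    target_params : dict for the target model; non-shared entries
                    (private layers, output layer) are left untouched
    shared_keys   : names of the language-independent shared layers
    """
    for key in shared_keys:
        # copy shared (language-invariant) weights; the target model is
        # then fine-tuned on the small target-language dataset
        target_params[key] = source_params[key]
    return target_params
```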
(3) In general, directly adjusting network parameters on a small adaptation set leads to over-fitting. To avoid this, the thesis proposes a connectionist temporal classification (CTC) regularized adaptation method: a regularization term is added to the CTC training criterion, forcing the conditional probability distribution estimated by the adapted model to stay close to that of the accent-independent model. In other words, the probability distribution is transferred from the accent-independent model to the adapted model. Experiments on the RASC863 and CASIA regional accented speech corpora show that the proposed method clearly improves over a strong baseline model and also outperforms other adaptation methods such as L2 and linear hidden network (LHN) adaptation.
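The regularized criterion above can be sketched as the CTC loss plus a divergence penalty between the adapted model's frame posteriors and the accent-independent (SI) model's posteriors. The abstract only states that the distributions are forced close; the KL-divergence form and the weight name `rho` are assumptions, by analogy with KLD-regularized DNN adaptation.

```python
import math

def ctc_regularized_loss(ctc_loss, adapted_post, si_post, rho=0.5):
    """Sketch of the CTC-regularized adaptation criterion.

    ctc_loss     : standard CTC loss of the adapted model on the
                   small accent adaptation set
    adapted_post : per-frame posterior distributions of the adapted model
    si_post      : per-frame posteriors of the accent-independent model
    rho          : assumed regularization weight
    """
    # average per-frame KL divergence KL(SI || adapted); large values mean
    # the adapted model has drifted from the accent-independent posteriors
    kl = 0.0
    for p_si, p_ad in zip(si_post, adapted_post):
        kl += sum(s * math.log(s / a) for s, a in zip(p_si, p_ad))
    kl /= len(si_post)
    return ctc_loss + rho * kl
```

When the adaptation set is tiny (e.g. 1000 utterances), the penalty keeps the adapted posteriors anchored to the SI model and thereby limits over-fitting.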
In addition, the three transfer learning methods proposed for low-resource acoustic models have also been applied to real speech recognition systems. In the cross-lingual scenario, speech recognition systems have been built for low-resource languages such as Cantonese, Shanghai dialect, and Mongolian. In the cross-accent scenario, the CTC regularized adaptation method is used to perform accent adaptation for a Mandarin acoustic model. The resulting Mandarin speech recognition system serves as an important component of customer-service quality inspection and dialogue systems.

 

Language: Chinese
Document Type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/21012
Collection: 毕业生_博士学位论文 (Graduates — Doctoral Dissertations)
Recommended Citation
GB/T 7714
易江燕. 基于迁移学习的小数据语音声学模型研究[D]. 北京: 中国科学院研究生院, 2018.
Files in This Item:
File Name/Size DocType Version Access License
Thesis - 易江燕-2018053 (2091 KB) | Dissertation | Restricted Access | CC BY-NC-SA
 

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.