CASIA OpenIR > National Laboratory of Pattern Recognition
Research on Robust Acoustic Modeling Methods for Data Mismatch
Liu Bin
2020-05-29
Pages: 130
Degree type: Doctoral
Abstract
Speech is one of the most natural forms of human-computer interaction, and enabling machines to understand the meaning of speech has long been a goal. The powerful feature-learning and modeling capabilities of deep learning models have significantly improved the performance of speech interaction technologies such as keyword spotting and speech recognition, greatly improving the user experience. However, in practical applications, when the data distributions of the training set and the test set differ substantially, this data mismatch causes a dramatic degradation in system performance and hinders the move of speech technology from laboratory research to practical deployment. The acoustic model is the core component of both speech recognition and keyword spotting. Robust acoustic modeling is therefore an important technique for coping with data mismatch and improving the robustness of speech systems, and its study has significant theoretical and practical value.
Data mismatch arises mainly in three forms. Acoustic-environment mismatch is caused by diverse noise types and differing signal-to-noise ratios and reverberation times. Speaker mismatch is caused by differences in accent, speaking rate, and age, especially the large differences in pronunciation between children and adults. Class imbalance in the training samples is a further source of mismatch. The goal of robust acoustic modeling is to eliminate or narrow the mismatch between training and test data and thereby improve the performance of the speech system. Building on the basic theory and the state of the art of robust acoustic modeling, we target the data-mismatch problems that arise in practical applications. Our main contributions are as follows:
(1) We propose a robust acoustic modeling method based on deep adversarial training. Noise corrupts speech, creating a mismatch between the training and testing acoustic environments and degrading recognition performance. The most common remedy is a carefully designed speech enhancement front end for ASR. However, enhancement methods generally optimize signal-level criteria, which can distort the speech spectrum while suppressing noise; the resulting over-smoothed spectra lack the fine structure of real speech and often hurt ASR performance. Moreover, because the enhancement component is optimized separately from the recognizer, it cannot be trained toward the final recognition objective, which leads to a suboptimal solution. The proposed framework combines the strengths of adversarial learning and the acoustic model: deep adversarial training effectively reduces the distribution difference between noisy speech data and clean training data, improving the robustness of the acoustic model. Adversarial training removes the need for hand-engineered loss functions, adds no processing stages or complexity compared with enhancement pipelines, and requires no one-to-one parallel noisy and clean data, so it can serve as a general training framework for improving the noise robustness of existing acoustic models.
(2) We propose a robust end-to-end acoustic modeling method based on joint adversarial enhancement training. When the acoustic environment is mismatched, the performance of end-to-end speech recognition systems drops dramatically. The proposed method integrates the ability of deep neural networks to perceive the acoustic environment with the ability of adversarial learning to model the distribution of enhanced speech. By constructing joint optimization objectives for the end-to-end acoustic model and adversarial learning, front-end speech enhancement and back-end speech recognition are unified in a single framework, which improves the noise robustness of the system. The adversarial training strategy again avoids hand-engineered loss functions and captures underlying structural characteristics of the noisy signals. Through joint optimization of the recognition, enhancement, and adversarial losses, the proposed scheme learns robust representations well suited to the recognition task.
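The joint optimization just described can be sketched as a weighted combination of the three losses. This is an assumed form; the abstract does not state how the thesis actually combines or weights the terms:

```python
def joint_loss(l_asr, l_enh, l_adv, alpha=0.5, beta=0.1):
    """Illustrative joint objective for the unified framework:
    recognition loss plus weighted enhancement and adversarial terms.
    alpha and beta are hypothetical trade-off hyper-parameters."""
    return l_asr + alpha * l_enh + beta * l_adv
```

With `alpha = beta = 0` the objective reduces to plain end-to-end ASR training; raising the weights shifts capacity toward enhancement quality and distribution matching, which is what couples the front end to the recognition goal.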
(3) We propose a robust acoustic modeling method for children's speech based on adversarial feature learning. Acoustic models trained on adult speech perform unsatisfactorily on children's speech, because children and adults differ greatly in pronunciation, causing a speaker mismatch. Children's pronunciation is highly variable, inter-speaker differences among children are large, and children's speech corpora are scarce and difficult to collect, all of which make children's speech recognition difficult. Exploiting adult speech data to improve children's speech recognition is therefore a major challenge. The proposed adversarial feature learning method transfers knowledge from adult to children's speech at both a lower and a higher feature level, effectively improving transfer performance. Through joint optimization of the speech recognition objective and the adversarial losses at the two feature levels, it extracts acoustic features that are invariant to the speaker's pronunciation style, improving the robustness and practicality of children's speech recognition.
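A common way to realize such adversarial feature learning is a gradient reversal layer, as in domain-adversarial training; the abstract does not name the exact mechanism, so the sketch below is an assumed illustration rather than the thesis's method:

```python
# Gradient reversal layer (GRL): identity in the forward pass; in the
# backward pass the gradient coming from a speaker (adult/child)
# discriminator is negated, so the shared feature layers are pushed toward
# speaker-invariant representations instead of speaker-separable ones.

class GradientReversal:
    def __init__(self, lam=1.0):
        self.lam = lam                       # adversarial trade-off weight

    def forward(self, x):
        return x                             # pass features through unchanged

    def backward(self, grad_output):
        # flip the sign (and scale) of the discriminator's gradient
        return [-self.lam * g for g in grad_output]
```

Placed between the shared encoder and the speaker discriminator, the same backward pass that trains the discriminator then un-trains speaker information out of the shared features, matching the "pronunciation-invariant feature" goal above.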
(4) We propose a robust acoustic modeling method for keyword spotting (KWS) based on the focal loss. KWS is a critical component of human-computer interfaces: it detects a specific keyword in a continuous audio stream, and its goal is high detection accuracy at a low false alarm rate under small memory and computation budgets. DNN-based KWS systems face severe class imbalance during training, because keyword data are costly to collect and far scarcer than background speech, which overwhelms training and degrades the model. We train the KWS system with the focal loss, which automatically down-weights easy samples and focuses the model on hard ones; this effectively addresses the class imbalance, allows all available data to be used, and improves detection accuracy. In addition, to reduce false alarms we propose a double-edge-triggered detection method that exploits the characteristics of repeated keywords; compared with a single-threshold method, it significantly lowers the false alarm rate and improves the practicality of the system.
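The focal loss has a compact closed form. The sketch below, assuming the standard binary formulation FL(p_t) = -(1 - p_t)^γ · log(p_t), shows how confidently classified background frames are down-weighted relative to plain cross-entropy:

```python
import math

def cross_entropy(p, y):
    """Standard binary cross-entropy; p is the predicted keyword
    probability, y the true label (1 = keyword, 0 = background)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0):
    """Focal loss: cross-entropy scaled by (1 - p_t)**gamma, which
    shrinks the contribution of easy, well-classified samples."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# An easy background frame (p = 0.1, y = 0, so p_t = 0.9) is scaled by
# (1 - 0.9)**2 = 0.01 relative to cross-entropy, while a hard missed
# keyword (p = 0.1, y = 1) keeps almost its full weight.
```

Because easy background frames dominate KWS training data, this re-weighting is what lets all data be used without the background class overwhelming the keyword class.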
Keywords: robust acoustic modeling; speech recognition; adversarial learning; keyword spotting
Language: Chinese
Subject classification: speech recognition and synthesis
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/39093
Collection: National Laboratory of Pattern Recognition
Recommended citation (GB/T 7714):
Liu Bin. Research on Robust Acoustic Modeling Methods for Data Mismatch [D]. Institute of Automation, Chinese Academy of Sciences. University of Chinese Academy of Sciences, 2020.
Files in this item:
Thesis - 副本.pdf (2027 KB) · Document type: Thesis · Access: Open Access · License: CC BY-NC-SA
Except where otherwise noted, all content in this system is protected by copyright, and all rights are reserved.