CASIA OpenIR > Graduates > Doctoral Dissertations
面向情境化语音识别的建模方法研究 (Research on Modeling Methods for Contextualized Speech Recognition)
韩明伦 (Han Minglun)
2023-05
Pages: 136
Degree type: Doctoral
Chinese Abstract

As a key technology in human-computer interaction, automatic speech recognition (ASR) has undergone more than 70 years of development and plays a distinctive role in many areas of production and daily life. Over the past decade, with the accumulation of data, the growth of computing power, and the iteration of techniques, deep neural networks have been widely applied to ASR and have demonstrated extraordinary practical value. From the early hybrid ASR models based on deep neural networks and hidden Markov models to today's popular end-to-end ASR models, recognition models have grown steadily more powerful, while their construction has become simpler and their deployment more convenient. Thanks to characteristics such as a unified structure and global joint optimization, end-to-end ASR models have become the main research direction in both academia and industry. Although end-to-end models offer many advantages in performance, construction, and deployment, their training paradigm deeply couples their components, making it difficult to integrate additional data or external knowledge. Consequently, in specific contexts, end-to-end ASR models still struggle to fully exploit context-related data and external knowledge, and thus cannot adapt well to changes in context. Against this background, how to integrate various kinds of context-related information into end-to-end recognition models to build contextualized speech recognition systems has become a highly promising research topic.

This thesis focuses on contextualized speech recognition. Taking the continuous integrate-and-fire based model as the basic ASR modeling method, it progressively investigates the problems in both the basic modeling method and the contextual modeling methods for contextualized speech recognition, aiming to move end-to-end ASR from a research paradigm that analyzes and understands speech from the speech input alone toward a contextualized paradigm that integrates contextual cues. Specifically, to address three main problems, namely the insufficient modeling capability of the basic modeling method, the poor controllability and insufficient performance of contextual modeling methods, and the inadequate utilization of multimodal contextual information, the thesis presents four innovative contributions.

1. A basic ASR modeling method based on hierarchical knowledge distillation: The deep coupling of the modules in an end-to-end ASR model makes it difficult to exploit massive amounts of additional text data, limiting further improvement of its representational and modeling capability. Meanwhile, insufficient basic representational capability also caps the achievable performance of contextual modeling. To strengthen the model's basic representational capability, this thesis proposes a hierarchical knowledge distillation strategy that transfers knowledge from a language model pre-trained on large-scale text data into the ASR model. The method uses contrastive learning to transfer linguistic knowledge into the acoustic level of the ASR model, and a regression loss to transfer linguistic knowledge into its linguistic level. Experiments show that transferring linguistic knowledge into both levels simultaneously substantially improves the basic representational capability and the recognition performance.

2. A contextualized ASR modeling method based on collaborative decoding: Owing to end-to-end modeling, attention-based contextual modeling methods find it difficult to control the degree to which contextual modeling affects recognition, which complicates the practical application of contextualized ASR. This thesis proposes a contextual modeling method called collaborative decoding, which decouples the ASR model from the contextual model during training and combines their outputs in a controllable way during inference. The method not only inherits the strong representational power of end-to-end neural contextual modeling, but also offers better controllability over the contextual influence and greater flexibility in system construction. Experiments show that collaborative decoding significantly enhances the contextual adaptability of the ASR model and improves recognition performance in specific contexts.

3. A contextualized ASR modeling method based on fine-grained knowledge selection: Attention-based contextual modeling methods usually vectorize context-related phrases, extract a comprehensive contextual representation from the large set of phrase vectors, and use this representation to influence the final token-level prediction. In such methods, similar contextual phrases may confuse the model and yield ambiguous contextual representations, degrading performance. Building on collaborative decoding, this thesis proposes a fine-grained knowledge selection mechanism that improves the model's ability to discriminate between similar phrases and its contextualization performance. The mechanism processes contextual information hierarchically from the phrase level down to the token level, and finally extracts a fine-grained contextual representation. Experiments show that fine-grained knowledge selection effectively alleviates the problem of insufficient contextual modeling capability and further improves the performance of the collaborative decoding method.

4. A contextualized ASR modeling method based on multimodal perception fusion: Most previous contextual modeling methods exploit either linguistic or visual cues to improve ASR performance. This thesis proposes a contextual modeling method based on multimodal perception fusion, which can integrate contextual cues from one or more modalities. Specifically, a multimodal perception module first encodes cues from multiple modalities into vector representations. The decoder of the ASR model then integrates these multimodal representations through attention mechanisms. Finally, the decoder makes predictions based on the speech input and all contextual information. Experiments show that multimodal perception fusion effectively improves ASR performance in multimodal scenarios.

English Abstract

As a key technology of human-computer interaction, automatic speech recognition (ASR) has undergone more than 70 years of development and plays a unique role in many areas of production and daily life. In the past decade, with the accumulation of data, the improvement of computing power, and the iteration of technology, deep neural networks have been widely applied in ASR, demonstrating extraordinary application value. From the early hybrid ASR models based on deep neural networks and hidden Markov models to the currently popular end-to-end ASR models, the performance of ASR models has gradually become stronger, their construction has become simpler, and their deployment has become more convenient. Currently, thanks to characteristics such as a unified structure and global joint optimization, end-to-end ASR models have become the main research direction in both academia and industry. Although end-to-end ASR models have many advantages in terms of model performance, construction, and deployment, their training method deeply couples their components, making it difficult to integrate and utilize additional data or external knowledge. Therefore, in specific contexts, end-to-end ASR models still have difficulty utilizing various context-specific data or external knowledge, and thus cannot adapt well to changes in context. Against this background, how to integrate various kinds of context-specific information into the end-to-end ASR model to build a contextualized speech recognition system has become a highly promising research topic.

This thesis focuses on contextualized speech recognition (CSR). Employing the continuous integrate-and-fire based model as the basic ASR modeling method, the thesis progressively studies the problems in both the basic modeling method and the contextual modeling methods for CSR, with the aim of advancing the research paradigm from the isolated use of speech input for analyzing and understanding speech toward a contextualized paradigm that integrates contextual cues. Specifically, to address three main problems, namely the insufficient modeling capability of the basic modeling method, the poor controllability and performance of the contextual modeling methods, and the inadequate utilization of multimodal contextual information in contextual modeling methods, the thesis proposes four innovative approaches.

1. The basic ASR modeling method based on hierarchical knowledge distillation: The deep coupling of the modules in the end-to-end ASR model makes it challenging to use massive amounts of additional text data, limiting further improvement of its representational and modeling capability. Meanwhile, inadequate basic representational capability also restricts the performance ceiling of contextual modeling. To enhance the basic representational capability of the ASR model, this thesis proposes a hierarchical knowledge distillation strategy that transfers knowledge from a language model pre-trained on massive text data to the ASR model. This method utilizes contrastive learning to transfer linguistic knowledge into the acoustic level of the ASR model, and a regression loss to transfer linguistic knowledge into the linguistic level of the ASR model. Experiments show that transferring linguistic knowledge to both levels of the ASR model significantly improves the basic representational capability and the performance.
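The two distillation objectives above can be sketched as follows. This is a minimal illustrative sketch, not the thesis's implementation: the InfoNCE form of the contrastive loss, the MSE form of the regression loss, and all array shapes and weighting factors are assumptions for illustration.

```python
import numpy as np

def info_nce_loss(acoustic_feats, lm_embeds, temperature=0.1):
    """Contrastive (InfoNCE) loss at the acoustic level: acoustic unit i
    should match the pre-trained LM embedding i (positives on the diagonal)."""
    a = acoustic_feats / np.linalg.norm(acoustic_feats, axis=1, keepdims=True)
    t = lm_embeds / np.linalg.norm(lm_embeds, axis=1, keepdims=True)
    logits = (a @ t.T) / temperature                     # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def regression_loss(linguistic_feats, lm_embeds):
    """Regression (MSE) loss at the linguistic level."""
    return float(np.mean((linguistic_feats - lm_embeds) ** 2))

def hierarchical_kd_loss(acoustic_feats, linguistic_feats, lm_embeds,
                         alpha=1.0, beta=1.0):
    """Combined hierarchical distillation objective with assumed weights."""
    return (alpha * info_nce_loss(acoustic_feats, lm_embeds)
            + beta * regression_loss(linguistic_feats, lm_embeds))
```

During training this combined loss would be added to the usual ASR objective, so both levels of the model are pulled toward the LM's representation space.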

2. The contextual ASR modeling method based on collaborative decoding: Owing to end-to-end modeling, it is challenging for attention-based contextual modeling methods to control the degree of impact of contextual modeling, which complicates the practical application of CSR. This thesis proposes a contextual modeling method called collaborative decoding. This method decouples the ASR model and the contextual model during the training phase, and combines their outputs in a controllable way during inference. The method not only inherits the superior representational capability of deep neural networks in end-to-end contextual modeling, but also offers better control over the impact of contextual information and greater flexibility in system construction. Experiments show that the collaborative decoding method significantly enhances the contextual adaptation ability of the ASR model and improves ASR performance in specific contexts.
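A minimal sketch of the controllable combination at inference time. It assumes log-linear interpolation of the two models' per-step output distributions with a scalar weight `lam`; the exact combination rule and weighting scheme used in the thesis are not specified here, so treat this as one plausible instantiation.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over a 1-D logit vector."""
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def collab_decode_step(asr_logits, context_logits, lam=0.3):
    """One decoding step: combine the ASR model's and the contextual
    model's posteriors in the log domain. lam controls the strength of
    the contextual influence; lam=0 recovers the plain ASR decision."""
    combined = log_softmax(asr_logits) + lam * log_softmax(context_logits)
    return int(np.argmax(combined))
```

Because the two models are only coupled at inference, `lam` can be tuned, or set to zero, per deployment without retraining either model, which is the controllability property the method targets.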

3. The contextual ASR modeling method based on fine-grained knowledge selection: Attention-based contextual modeling methods usually encode context-specific phrases into vector representations, extract a comprehensive contextual representation from all phrase representations, and use it to influence the final token-level prediction. In such methods, similar contextual phrases may confuse the model and cause it to output ambiguous contextual representations, leading to performance degradation. Building on the collaborative decoding method, this thesis proposes a fine-grained knowledge selection mechanism to improve the model's ability to discriminate between similar phrases and its contextualization performance. The mechanism processes contextual information hierarchically from the phrase level to the token level, and finally extracts a fine-grained representation of contextual information. Experiments show that the fine-grained knowledge selection mechanism effectively alleviates the problem of insufficient contextual modeling and further improves the performance of the collaborative decoding method.
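The phrase-to-token hierarchy can be sketched as a two-stage attention pass. This is an illustrative sketch only: mean-pooling for phrase vectors, dot-product attention scores, and the additive combination are assumptions, not the thesis's exact architecture.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def fine_grained_selection(query, phrase_token_embeds):
    """Hierarchical knowledge selection: first attend over whole phrases,
    then attend over the tokens inside each phrase, weighting each
    phrase's token summary by its phrase-level score.

    query:               (d,) decoder state
    phrase_token_embeds: list of (n_i, d) token-embedding arrays, one per phrase
    returns:             (d,) fine-grained context vector
    """
    # Phrase level: score each phrase via its mean-pooled embedding.
    phrase_vecs = np.stack([toks.mean(axis=0) for toks in phrase_token_embeds])
    phrase_w = softmax(phrase_vecs @ query)
    # Token level: attend within each phrase, scaled by the phrase score.
    context = np.zeros_like(query)
    for w, toks in zip(phrase_w, phrase_token_embeds):
        token_w = softmax(toks @ query)
        context += w * (token_w @ toks)
    return context
```

Scoring tokens inside each phrase, rather than collapsing every phrase to a single vector first, is what lets similar phrases remain distinguishable at the token granularity.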

4. The contextual ASR modeling method based on multimodal perception fusion: Most previous contextual modeling methods have investigated using either linguistic or visual cues to improve ASR performance. This thesis proposes a multimodal perception fusion method that can integrate contextual cues from one or more modalities. Specifically, the multimodal perception module first encodes the multimodal cues into vector representations. Then, the decoder of the ASR model incorporates these representations through attention mechanisms. Finally, the decoder makes predictions based on the speech input and all contextual cues. Experiments show that the proposed multimodal perception fusion method effectively improves ASR performance in multimodal scenarios.
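The fusion step can be sketched as a single cross-attention over whichever modalities are present. This is an assumed simplification: real modality encoders, the attention parameterization, and how the fused vector enters the decoder are all left out.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def multimodal_fusion(decoder_state, modality_cues):
    """Cross-attend from the decoder state over all encoded cue vectors.

    decoder_state: (d,) current decoder hidden state
    modality_cues: dict mapping modality name -> (n, d) cue embeddings;
                   absent modalities are simply left out of the dict
    returns:       (d,) fused contextual vector
    """
    memory = np.concatenate(list(modality_cues.values()), axis=0)  # (N, d)
    attn = softmax(memory @ decoder_state)                         # (N,)
    return attn @ memory
```

Because the cue embeddings from all modalities are concatenated into one attention memory, the same code path handles the single-modality and multi-modality cases, which matches the "one or more modalities" claim above.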
 

Keywords: Automatic Speech Recognition; Contextualized Speech Recognition; Speech Recognition Customization; Multimodal Speech Recognition; Continuous Integrate-and-Fire Mechanism
Subject area: Speech Processing
Discipline: Engineering :: Control Science and Engineering
Language: Chinese
Sub-direction classification: Speech Recognition and Synthesis
State Key Laboratory planning direction: Speech and Language Processing
Document type: Doctoral dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/52057
Collection: Graduates_Doctoral Dissertations
Recommended citation (GB/T 7714):
韩明伦. 面向情境化语音识别的建模方法研究[D], 2023.
Files in this item:
201818014628025韩明伦.p (9191KB); document type: dissertation; open access type: restricted; license: CC BY-NC-SA
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.