Research on End-to-End Chinese-English Code-Switching Speech Recognition Methods
Zhang Shuai (张帅)
2022-05-17
Pages: 108
Degree type: Doctoral
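Chinese abstract (translated)

Chinese-English code-switching refers to switching between Chinese and English in the course of communication. As globalization proceeds, mixed Chinese-English expression has become a common way of communicating. It is especially frequent in certain settings, such as English-language classrooms, academic conferences, and meetings at multinational companies. This linguistic phenomenon lowers the cost of communication and promotes exchange between cultures. As the number of Chinese-English bilinguals keeps growing, code-switching is becoming more widespread and is a problem that speech recognition cannot ignore. Although monolingual Chinese and English speech recognition systems have reached practical use, they still cannot handle Chinese-English code-switching speech effectively. Given this pressing practical need and the poor recognition performance, this thesis studies Chinese-English code-switching speech recognition within the end-to-end speech recognition framework.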

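This thesis centers on the goal of "improving the performance of Chinese-English code-switching speech recognition methods" and analyzes the factors that affect recognition performance from several angles. After a detailed review of existing work, we carry out research on language identity information, language context, and semantic information. The proposed algorithms are not limited to the Chinese-English scenario; they can be extended to code-switching speech recognition systems for other languages and therefore generalize well. The research also has broad application prospects: it can advance code-switching speech recognition systems in many real-world scenarios and carries substantial social and economic benefits. Specifically, this thesis makes three main contributions.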

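1. A speech recognition method that models language identity and speech in a unified way. Existing methods that use language identity information to assist code-switching speech recognition have several problems, such as complex model structures, high computational cost, and only marginal gains in recognition performance from the language information. This thesis proposes a new joint language and speech modeling method: language labels are added to the target text of the training data, and a neural network transducer model learns speech recognition and language identification simultaneously. The method adds no separate language identification module and does not increase the computational cost of training or inference. In addition, the language identification and speech recognition tasks are tightly coupled, so language information can effectively improve Chinese-English code-switching speech recognition. During recognition, the language information guides the next decoding step, which further reduces the recognition error rate.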

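2. An efficient method for modeling Chinese-English code-switching language context. To address the complexity of multilingual context modeling and the shortage of training data in Chinese-English code-switching speech recognition, an end-to-end method that decouples speech from language is proposed, which improves the language context modeling ability of the end-to-end model. The method splits speech-to-text recognition into two parts: a speech-to-phoneme process and a phoneme-to-text process. The speech-to-phoneme process trains an acoustic encoder with the connectionist temporal classification (CTC) loss, using phonemes as modeling units. CTC assumes that output units are conditionally independent of one another, a property that reduces the mismatch between code-switching and monolingual speech and allows monolingual speech data to be used effectively. Phonemes are also designed according to pronunciation rules, so they model the acoustic distribution better. The phoneme-to-text process is trained on text-only corpora; it learns the mapping from phoneme sequences to text and models pronunciation lexicon information and language context information at the same time. The two processes are trained independently, the former on monolingual data and the latter on text-only data, which effectively alleviates the difficulty of modeling Chinese-English code-switching language context and greatly improves Chinese-English code-switching speech recognition performance.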

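3. An effective method for modeling Chinese-English code-switching semantics. Based on the semantic characteristics of Chinese-English code-switching expressions, an effective semantic information modeling method is proposed, which significantly improves Chinese-English code-switching speech recognition performance. Specifically, because switching between Chinese and English is random, the same sentence may correspond to several different code-switching expressions, yet these expressions are semantically consistent. Exploiting this property, synonymous sentences in several code-switching forms are constructed from text, and a semantic consistency loss designed on them takes part in model training. This semantic loss can be used to train the end-to-end speech recognition model as well as a neural network language model, and the two uses together improve the performance of the Chinese-English code-switching speech recognition system.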
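
The label insertion described in the first contribution can be illustrated with a small sketch. The abstract only states that language labels are added to the target text; the tag symbols `<zh>` and `<en>`, the placement of a tag at the start of each language segment, and the function name `tag_transcript` are all assumptions made here for illustration and are not taken from the thesis.

```python
import re

# Hypothetical tag symbols; the thesis abstract does not give the label inventory.
ZH_TAG, EN_TAG = "<zh>", "<en>"

# A token is one Chinese character or one run of Latin letters/apostrophes.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z']+")


def tag_transcript(text):
    """Tokenize a mixed transcript and put a language tag before each segment.

    Assumption: a tag is emitted whenever the language changes, including
    at the start of the utterance. Other characters (digits, punctuation)
    are ignored by this sketch.
    """
    tagged = []
    previous = None
    for token in _TOKEN.findall(text):
        # Chinese tokens are single characters in the CJK Unified Ideographs range.
        language = ZH_TAG if "\u4e00" <= token <= "\u9fff" else EN_TAG
        if language != previous:
            tagged.append(language)
            previous = language
        tagged.append(token)
    return tagged
```

For example, `tag_transcript("我们开个meeting吧")` returns `['<zh>', '我', '们', '开', '个', '<en>', 'meeting', '<zh>', '吧']`. A sequence of this kind could serve as the training target of a transducer model, so that the tags are predicted by the same output layer as the text tokens and no separate language identification module is needed.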

English abstract

Chinese-English code-switching refers to switching between Chinese and English in the course of communication. With the progress of globalization, Chinese-English code-switching has become a common way of communicating, especially in scenarios such as English teaching classrooms, academic conferences, and multinational enterprise meetings. It effectively reduces the cost of communication and promotes understanding between different cultures. As the number of Chinese-English bilinguals increases, code-switching is becoming more and more common, and automatic speech recognition (ASR) systems can no longer ignore it. Although monolingual ASR systems are in practical use, they are still unable to handle code-switching speech effectively. In view of the urgent practical need and the poor recognition performance, this thesis studies the Chinese-English code-switching ASR problem under the framework of end-to-end speech recognition.

Focusing on the goal of "improving the performance of Chinese-English code-switching ASR methods", we analyze the factors affecting recognition performance from multiple perspectives. Based on a detailed examination of existing research, we study several aspects: language identity information, language context information, and semantic information. The proposed algorithms are not limited to Chinese-English code-switching and can be extended to code-switching ASR systems for other languages, so they generalize well. The research also has broad application prospects: it can advance code-switching ASR systems in many real-world scenarios and carries substantial social and economic benefits. Specifically, this thesis makes three main contributions.

1. An ASR method that models language identity and speech in a unified way. Existing methods that use language information to assist code-switching ASR have several problems, such as complex model structures, high computational cost, and only marginal improvement in ASR performance from the language information. This thesis proposes a new joint language and speech modeling method: language labels are added to the target text of the training data, and a neural network transducer model learns ASR and language identification simultaneously. The method adds no separate language identification module and does not increase the computational cost of training or inference. In addition, the language identification and ASR tasks are tightly coupled, so language information can effectively improve Chinese-English code-switching ASR. During recognition, the language information guides the next decoding step, which further reduces the recognition error rate.

2. An efficient method for modeling code-switching language context. To address the complexity of multilingual context modeling and the lack of training data in Chinese-English code-switching ASR, a method that decouples pronunciation from language is proposed, which improves the language context modeling capability of the end-to-end model. The method splits speech-to-text recognition into two parts: a speech-to-phoneme process and a phoneme-to-text process. The speech-to-phoneme process uses the connectionist temporal classification (CTC) loss to train an acoustic encoder with phonemes as modeling units. CTC assumes that output units are conditionally independent of one another, which reduces the mismatch between code-switching and monolingual speech and allows monolingual speech data to be used effectively. Phonemes are also designed according to pronunciation rules, so they model the acoustic distribution better. The phoneme-to-text process is trained on text-only data; it learns the mapping from phoneme sequences to text and models pronunciation lexicon information and language context information at the same time. The two processes are trained independently, the former on monolingual data and the latter on text-only data, which effectively alleviates the difficulty of modeling code-switching language context and greatly improves the performance of Chinese-English code-switching ASR.

3. An effective method for modeling code-switching semantics. Based on the semantic characteristics of Chinese-English code-switching expressions, an effective semantic information modeling method is proposed, which significantly improves the performance of Chinese-English code-switching ASR. Specifically, because switching between Chinese and English is random, the same sentence may correspond to several different code-switching expressions, yet these expressions are semantically consistent. Exploiting this property, synonymous sentences in several code-switching forms are constructed from text, and a semantic consistency loss designed on them takes part in model training. This semantic loss can be used to train both the end-to-end ASR model and a neural network language model, and the two uses together improve the performance of the Chinese-English code-switching ASR system.
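
The interface between the two decoupled stages in the second contribution can be pictured with the standard CTC collapsing rule: a CTC-trained encoder emits one label per frame, consecutive repeats are merged, and blank labels are dropped, leaving a phoneme sequence that a phoneme-to-text model could consume. The sketch below shows only this generic rule under assumed inputs; the phoneme symbols, the blank symbol `<blank>`, and the function name `ctc_collapse` are hypothetical and do not come from the thesis.

```python
from itertools import groupby

# Hypothetical blank symbol; real systems usually use an integer id.
BLANK = "<blank>"


def ctc_collapse(frame_labels, blank=BLANK):
    """Apply the standard CTC collapsing rule to per-frame labels.

    Step 1: merge runs of identical consecutive labels into one label.
    Step 2: remove blank labels.
    A blank between two identical labels keeps them as separate outputs.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]
```

With the illustrative frame labels `["n", "n", "<blank>", "i", "i", "<blank>", "h", "ao", "ao"]`, the function returns `["n", "i", "h", "ao"]`. Because the phoneme sequence is the only thing passed between stages, the acoustic stage can be trained on speech data and the phoneme-to-text stage on text alone, which is the separation the abstract describes.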

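Keywords: end-to-end speech recognition; Chinese-English code-switching; joint language and speech modeling; multilingual context; semantic consistency
Language: Chinese
Document type: Thesis
Item identifier: http://ir.ia.ac.cn/handle/173211/48812
Collection: State Key Laboratory of Multimodal Artificial Intelligence Systems_Intelligent Interaction
Graduates_Doctoral Dissertations
Recommended citation
GB/T 7714
张帅. 端到端中英混合语音识别方法研究[D]. 中国科学院自动化研究所. 中国科学院大学人工智能学院,2022. (Zhang Shuai. Research on End-to-End Chinese-English Code-Switching Speech Recognition Methods [D]. Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences, 2022.)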
Files in this item
File: 博士论文终版.pdf (2551KB); Document type: Thesis; Access type: Open access; License: CC BY-NC-SA