会议场景智能语音处理技术研究 (Research on Intelligent Speech Processing Technology for Meeting Scenarios)
范志赟
2022-08-16
Pages: 104
Degree type: Doctoral (PhD)
Chinese Abstract

With the development of information technology and the mobile internet, meetings have become an indispensable part of people's work and study. Online meetings in particular have seen wide adoption during the period in which the pandemic has swept the world. As one of the primary means of communication in meetings, speech and its associated processing technologies play a crucial role in conducting and documenting meetings. Consequently, speech processing technologies for meeting scenarios have attracted a large number of researchers in recent years. Driven by the development of deep learning, speech processing technologies such as speech recognition, speaker recognition, and speech synthesis have made great progress, yet the meeting scenario, as one of the most complex settings for speech processing, still presents many unsolved problems. In particular, existing speaker change detection and speech recognition models fail to produce satisfactory results on rapidly alternating multi-party conversations or overlapping speech, and integrating multiple speech processing modules suffers from complicated pipelines and accumulated, propagated errors. These issues greatly degrade the experience of speech processing technology in meeting scenarios. This dissertation focuses on improving and innovating the key speech processing technologies in meeting scenarios. The main contributions are as follows:

1. To address speaker variability in meeting scenarios, this dissertation proposes a speaker adaptation method for speech recognition models. In meetings, a speech recognition model must handle speakers unseen during training; owing to the scarcity of labeled meeting data, the model suffers a certain degree of performance degradation when decoding speech from unknown speakers. This dissertation proposes an effective solution to improving the robustness of speech recognition models to unknown speakers. Taking the speech-transformer as a representative end-to-end speech recognition model, a speaker adaptation scheme is designed: an attention mechanism learns, online, a linear combination of i-vectors as a "soft" speaker representation, which is then used for speaker-aware training of the speech-transformer. Extensive experiments explore the optimal configuration of this speaker adaptation method and demonstrate its effect under different conditions. Furthermore, this dissertation attempts to use a pretrained wav2vec 2.0 model to learn speaker representations, confirms the feasibility and effectiveness of this unsupervised pretraining approach for speaker representation learning, and uses the speaker embeddings extracted by this model in place of i-vectors for speaker-adaptive training of the speech-transformer. Finally, the effectiveness of the proposed speaker adaptation method for speech recognition is verified on a real recorded meeting dataset.
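As a rough illustration of the attention-based "soft" speaker representation described above, the following PyTorch sketch attends over a dictionary of training-speaker i-vectors with a mean-pooled acoustic query; the module layout, dimensions, and the choice of query are illustrative assumptions, not the dissertation's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftIVectorAttention(nn.Module):
    """Hypothetical sketch: learn a soft combination of training-speaker
    i-vectors as a speaker representation for speaker-aware training."""

    def __init__(self, ivector_dim: int, acoustic_dim: int, num_train_speakers: int):
        super().__init__()
        # Placeholder table; in practice this would hold precomputed i-vectors
        # of the training speakers (assumption).
        self.register_buffer("ivector_table",
                             torch.randn(num_train_speakers, ivector_dim))
        self.query_proj = nn.Linear(acoustic_dim, ivector_dim)

    def forward(self, encoder_states: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, time, acoustic_dim)
        # Mean-pooled encoder output as the attention query (illustrative choice).
        query = self.query_proj(encoder_states.mean(dim=1))          # (B, D)
        scores = query @ self.ivector_table.t()                      # (B, N_spk)
        weights = F.softmax(scores / self.ivector_table.size(1) ** 0.5, dim=-1)
        soft_ivector = weights @ self.ivector_table                  # (B, D)
        return soft_ivector  # fed to the ASR model for speaker-aware training
```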

2. To address the problem of segmenting long recordings in meeting scenarios, this dissertation proposes a speaker change detection method based on sequence transduction. A difference-based integrate-and-fire (DCIF) mechanism is designed to build a sequence-level speaker change detection model. The DCIF segments frame-level speaker representations according to the speaker change points it detects automatically and generates a speaker representation for each segment. It thus connects the frame-level speaker encoder with the segment-level speaker classifier, completing the transduction from the input feature sequence to the output sequence of speaker identities. Training the sequence-level model relies only on speaker identity sequence labels and does not require precise timestamps of speaker changes, which greatly reduces the annotation effort compared with most previous speaker change detection methods. In addition, to help the whole sequence-level model converge, several training techniques are proposed: (1) length normalization alleviates the large numerical fluctuation of the segment-level speaker representations output by the DCIF; (2) a multi-label focal loss enables the model to handle overlapping speakers and alleviates the imbalance between positive and negative training samples; (3) the encoder uses time delay neural networks (TDNN) to reduce the frame rate, thereby lowering the learning difficulty of the DCIF mechanism. Finally, comparative experiments on a real recorded meeting dataset demonstrate the effectiveness of the sequence-level speaker change detection scheme and its advantages over frame-level methods.
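The DCIF mechanism is defined in the dissertation itself; the following is only a minimal sketch, assuming that a cosine-distance difference between consecutive frame-level speaker representations is accumulated and that a segment is emitted (with length normalization) each time the accumulator crosses a threshold.

```python
import torch
import torch.nn.functional as F

def dcif_segment(frames: torch.Tensor, threshold: float = 1.0):
    """Minimal sketch of a difference-based integrate-and-fire (DCIF) step.

    frames: (time, dim) frame-level speaker representations for one recording.
    Accumulates a frame-to-frame change score; when the accumulator crosses
    `threshold`, a change is "fired" and the frames integrated so far are pooled
    into one segment-level speaker embedding (assumed design, not the
    dissertation's exact formulation).
    """
    segments, changes = [], []
    acc, start = 0.0, 0
    for t in range(1, frames.size(0)):
        # Assumption: cosine distance between consecutive frame-level
        # representations serves as the per-frame "amount of change".
        acc += 1.0 - F.cosine_similarity(frames[t], frames[t - 1], dim=0).item()
        if acc >= threshold:
            seg = frames[start:t].mean(dim=0)          # segment-level representation
            segments.append(F.normalize(seg, dim=0))   # length normalization
            changes.append(t)                          # detected change frame
            acc, start = 0.0, t
    segments.append(F.normalize(frames[start:].mean(dim=0), dim=0))
    return torch.stack(segments), changes
```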

3. To address the shortcomings of cascading speaker change detection and speech recognition in meeting scenarios, this dissertation proposes a method that jointly models the two modules. Speech data in meetings consists of long recordings containing multiple speakers, whereas most current speech recognition methods target single utterances from a single speaker and cannot handle overly long audio. Cascading a speaker change detection module with a speech recognition module is therefore one feasible solution for meeting speech, but such a cascade suffers from a complicated pipeline and accumulated, propagated errors. This dissertation proposes a method to jointly model the two modules. Within the joint framework, the speech recognition part adopts a model structure based on the continuous integrate-and-fire (CIF) mechanism [1]; the speaker change detection part uses the CIF mechanism and shared information weights to obtain character-level speaker representations, and fuses semantic and speaker information to perform character-level speaker change detection at the acoustic boundary of each character. In turn, the character-level speaker representations are used for speaker-adaptive training of the speech recognition part. The joint model can directly process speech containing multiple speakers, outputting the character sequence along with the times of speaker changes. Finally, experiments on a real recorded meeting dataset show that jointly modeling speech recognition and speaker change detection in this simple and efficient way yields gains on both tasks.
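For reference, a simplified inference-time sketch of the continuous integrate-and-fire (CIF) mechanism that the joint model builds on: per-frame information weights are accumulated until they reach a threshold, at which point the weighted frames integrated so far are emitted as one character-level acoustic embedding. This is a condensed paraphrase of the published CIF idea, not the dissertation's exact implementation.

```python
import torch

def cif_fire(encoder_out: torch.Tensor, alphas: torch.Tensor, beta: float = 1.0):
    """Simplified continuous integrate-and-fire (CIF) at inference time.

    encoder_out: (time, dim) acoustic encoder states.
    alphas:      (time,) non-negative per-frame information weights.
    Returns character-level acoustic embeddings, one per fired boundary.
    """
    fired = []
    integrated = torch.zeros(encoder_out.size(1))
    acc = 0.0
    for h, a in zip(encoder_out, alphas.tolist()):
        if acc + a < beta:
            acc += a
            integrated = integrated + a * h
        else:
            # Fire at this frame: spend only the weight needed to reach the
            # threshold, and carry the remainder into the next integration.
            needed = beta - acc
            fired.append(integrated + needed * h)
            remainder = a - needed
            acc = remainder
            integrated = remainder * h
    return torch.stack(fired) if fired else torch.empty(0, encoder_out.size(1))
```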

English Abstract

With the development of information technology and the mobile internet, meetings have become an indispensable part of people's work and study. Online meetings in particular have seen wide adoption during the period in which the pandemic has swept the world. Since speech is one of the most important communication media in meetings, speech processing technology plays a vital role in conducting meetings and producing meeting minutes. In recent years, speech processing technologies for meeting scenarios have therefore attracted a large number of researchers. Although, driven by the development of deep learning, speech processing technologies such as speech recognition, speaker recognition, and speech synthesis have made great progress, many problems remain to be solved in meeting scenarios, one of the most complex settings for speech processing. In particular, existing speaker change detection and speech recognition models cannot produce satisfactory results on rapidly alternating conversations or overlapping speech, and integrating multiple speech processing systems suffers from complicated pipelines and error propagation. These problems greatly affect the effectiveness of speech processing technology in meeting scenarios. This dissertation mainly focuses on the improvement and innovation of key speech processing technologies in meeting scenarios. The main contributions are as follows:

1. To address the variability of speakers in meeting scenarios, this dissertation proposes a speaker adaptation method for the speech recognition model. In meeting scenarios, the speech recognition model needs to deal with speakers not seen during training. Due to the scarcity of labeled data in meeting scenarios, the speech recognition model suffers a certain performance degradation when decoding speech from unknown speakers. This dissertation proposes an effective solution to improving the robustness of speech recognition models to unknown speakers. A speaker adaptation technique suitable for the speech-transformer is proposed: an attention mechanism learns a linear combination of i-vectors as a "soft" speaker representation, which is then used for speaker-aware training. Experiments are carried out to explore the optimal configuration of the speaker adaptation method, and its effect under different conditions is also shown. Furthermore, this dissertation uses a pretrained wav2vec 2.0 model to learn speaker representations, confirming the effectiveness of this unsupervised pretraining approach for speaker representation learning, and uses the speaker embeddings extracted by wav2vec 2.0 in place of i-vectors for speaker-adaptive training of the speech-transformer. Finally, the effectiveness of the proposed speaker adaptation method for speech recognition is verified on a real recorded meeting dataset.
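A minimal sketch of how a fixed-dimensional speaker embedding could be pooled from a pretrained wav2vec 2.0 encoder is given below; the encoder is assumed to be available as a torch module returning frame-level features, and the mean-pooling plus linear projection head are illustrative assumptions rather than the configuration used in the dissertation.

```python
import torch
import torch.nn as nn

class Wav2Vec2SpeakerEmbedder(nn.Module):
    """Hypothetical pooling head on top of a pretrained wav2vec 2.0 encoder."""

    def __init__(self, wav2vec2_encoder: nn.Module, feature_dim: int, embed_dim: int):
        super().__init__()
        self.encoder = wav2vec2_encoder        # pretrained encoder (assumed given)
        self.proj = nn.Linear(feature_dim, embed_dim)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples); the encoder is assumed to return
        # frame-level features of shape (batch, time, feature_dim).
        with torch.no_grad():                  # keep the pretrained encoder frozen
            features = self.encoder(waveform)
        # Mean-pool over time to obtain an utterance-level embedding used in
        # place of the i-vector during speaker-adaptive training (illustrative).
        return self.proj(features.mean(dim=1))
```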

2. To address the problem of long-speech segmentation in meeting scenarios, this dissertation proposes a speaker change detection method based on sequence transduction. A difference-based integrate-and-fire (DCIF) mechanism is proposed to perform the sequence transduction. The DCIF segments the frame-level speaker representations according to the speaker change points it detects automatically and generates a speaker representation for each segment. DCIF thus connects the frame-level speaker encoder and the segment-level speaker classifier, completing the transformation from the input feature sequence to the output speaker identity sequence. Training the sequence-level speaker change detection model relies only on the labels of the speaker identity sequence and does not require precise timestamp annotation of speaker change points; compared with most previous speaker change detection methods, the requirement for manual annotation is greatly reduced. In addition, to help the entire sequence-level speaker change detection model converge, this dissertation also proposes several training techniques: (1) length normalization alleviates the large numerical fluctuations of the segment-level speaker representations output by the DCIF; (2) a multi-label focal loss enables the model to handle multi-speaker overlap and alleviates the imbalance between positive and negative samples; (3) the encoder uses time delay neural networks (TDNN) to reduce the frame rate, thereby reducing the learning difficulty of the DCIF. Finally, comparative experiments on a real recorded meeting dataset demonstrate the effectiveness of the sequence-level speaker change detection method and its advantages over frame-level methods.
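The multi-label focal loss mentioned in training technique (2) can be written in its commonly used binary cross-entropy form roughly as follows; the gamma and alpha values are the usual defaults from the focal-loss literature, not necessarily those used in the dissertation.

```python
import torch
import torch.nn.functional as F

def multilabel_focal_loss(logits: torch.Tensor, targets: torch.Tensor,
                          gamma: float = 2.0, alpha: float = 0.25) -> torch.Tensor:
    """Focal loss over independent per-speaker sigmoid outputs.

    logits:  (batch, num_speakers) raw scores, one sigmoid per speaker so that
             overlapping speakers can all be active at the same time.
    targets: (batch, num_speakers) multi-hot 0/1 labels.
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)        # probability of the true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # Down-weight easy examples, which also softens the positive/negative imbalance.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```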

3. To address the shortcomings of cascading speaker change detection and speech recognition in meeting scenarios, this dissertation proposes a joint model for the two modules. Speech data in meeting scenarios consists of long recordings containing multiple speakers, whereas most speech recognition methods are designed for single-speaker utterances whose duration cannot be too long. For meeting scenarios, cascading speaker change detection and speech recognition is therefore one feasible solution; however, the cascade suffers from defects such as a complicated pipeline and error propagation. This dissertation proposes a method to jointly model the two modules. Within the joint framework, the speech recognition part adopts an encoder-decoder structure based on the continuous integrate-and-fire (CIF) mechanism. The speaker change detection part obtains character-level speaker representations by using the CIF mechanism and by sharing the information weights between speech recognition and speaker identification, and it fuses semantic and speaker information to perform character-level speaker change detection at each character's acoustic boundary. In turn, the character-level speaker representations are used for speaker-adaptive training of the speech recognition part. The joint model can directly process speech containing multiple speakers and outputs the character sequence together with the speaker change times. Finally, experiments on a real recorded meeting dataset show that jointly modeling speaker change detection and speech recognition in this simple and efficient way yields gains on both tasks.
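To make the character-level decision concrete, a plausible sketch of the fusion step is shown below: at each character's acoustic boundary, the CIF-pooled speaker representation and the corresponding semantic state are concatenated and passed to a small binary classifier that outputs a change / no-change logit. The concatenation and the two-layer head are illustrative assumptions, not the dissertation's exact architecture.

```python
import torch
import torch.nn as nn

class CharChangeDetector(nn.Module):
    """Hypothetical character-level speaker change head fusing semantic and
    speaker information at each character's acoustic boundary."""

    def __init__(self, semantic_dim: int, speaker_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(semantic_dim + speaker_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, semantic_states: torch.Tensor,
                speaker_reprs: torch.Tensor) -> torch.Tensor:
        # semantic_states: (batch, num_chars, semantic_dim) decoder/semantic states
        # speaker_reprs:   (batch, num_chars, speaker_dim) CIF-pooled speaker vectors
        fused = torch.cat([semantic_states, speaker_reprs], dim=-1)
        # One logit per character: whether a speaker change occurs at this
        # character's acoustic boundary.
        return self.head(fused).squeeze(-1)
```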

Keywords: meeting scenarios, speech recognition, speaker change detection, speaker adaptation
Language: Chinese
Sub-direction classification (seven major research directions): Speech Recognition and Synthesis
State Key Laboratory planned research direction: Speech and Language Processing
Document type: Degree thesis
Identifier: http://ir.ia.ac.cn/handle/173211/49722
Collections: 复杂系统认知与决策实验室_听觉模型与认知计算; 毕业生_博士学位论文
Recommended citation (GB/T 7714):
范志赟. 会议场景智能语音处理技术研究[D]. 中国科学院自动化研究所, 2022.
Files in this item
File name/size: 会议场景智能语音处理技术研究-22071 (3323 KB) · Document type: Degree thesis · Access: Open access · License: CC BY-NC-SA