Research on Visually Assisted Speech Separation Algorithms for the Cocktail Party Problem
张鹏 (Zhang Peng)
2021-05-27
Pages: 108
Degree type: Master's
Chinese Abstract

As one of the most natural, effective, and convenient means of exchanging information in human society, speech plays a key role in people's daily lives. At present, both speech recognition (e.g., near-field speech recognition) and speaker recognition have achieved excellent performance in simple acoustic scenes. In real life, however, we are surrounded by complex acoustic scenes at all times, where environmental noise, interfering voices, and reverberation degrade speech quality to varying degrees and thereby affect downstream speech processing. In recent years, multi-speaker speech separation, a typical task of the "cocktail party problem", has been widely studied, aiming to separate the target speech from background interference and thus improve its quality. Although audio-only speech separation algorithms achieve excellent performance on standard datasets, two problems, label permutation and the unknown number of sources, make them hard to apply in real, complex acoustic scenes. Inspired by the human auditory system, and benefiting from the strong modeling capacity of deep neural networks, visually assisted (audio-visual) speech separation algorithms have begun to flourish and have the potential to be applied in real, complex acoustic scenes, but they still suffer from problems such as unsatisfactory separation performance, the inability to process speech online, and sharp performance degradation when part of the visual information is lost. To address these issues, this thesis focuses on visual feature extraction for visually assisted speech separation, the design of online models, training strategies for robust models, and the application of advanced modeling and training methods to this task, attempting to solve several practical problems and effectively improve performance. The main contributions of this thesis include:

1. This thesis proposes AVMS (Audio-Visual speech separation Model using Speech-related visual features), a speech separation model based on dynamic visual features. It first analyzes existing dynamic visual feature extraction methods for the speech separation task and finds that the most effective dynamic visual features should be strongly correlated with speech. The thesis proposes extracting such visual features by learning a joint audio-visual representation and by domain adversarial training, and builds the model on time-domain encoding and temporal convolutional neural networks. Careful experiments on four audio-visual benchmark datasets show that AVMS outperforms the current state-of-the-art visually assisted speech separation models. Moreover, the model achieves impressive separation results in real, complex acoustic scenes and markedly improves the purity of the target speech, which indicates strong practical value. To alleviate the sharp drop in separation performance when part of the visual information (video frames) is lost, the thesis proposes a training strategy that randomly drops video frames. This strategy significantly enhances the robustness of the model, which maintains good performance both when visual information is lost at random and when it is lost in contiguous segments;

2. This thesis proposes a novel speech separation model based on static visual features. It analyzes why existing static-visual-feature speech separation models perform poorly and proposes corresponding optimization strategies. Specifically, there are two causes of the poor performance: (1) existing models all adopt time-frequency-domain encoding, which suffers from the difficulty of estimating the speech phase; (2) face images vary with many factors (e.g., illumination and pose), and the limited training data cannot cover the complete sample space, which weakens the generalization ability of the model. For the first cause, the thesis adopts time-domain encoding and builds the model with gated dual-path recurrent neural networks (GDPRNN), which not only avoids the phase estimation problem but also strengthens the model's ability to process sequential data, thereby improving performance. For the second cause, the thesis proposes to implicitly model the diversity of face images at the visual-feature level via adversarial training, thereby improving generalization. Experimental results show that these strategies significantly improve performance: compared with the baseline model, the proposed model improves the signal-to-distortion ratio (SDR) by 106%;

3. This thesis proposes an online visually assisted speech separation model trained with generative adversarial training, aiming to extend such models to online scenarios such as video calls and human-computer interaction. Specifically, the model is built with causal temporal convolutional neural networks so that it theoretically meets real-time processing requirements. In addition, the thesis proposes an online streaming inference strategy that allows the model to be deployed on GPUs, CPUs, and mobile chips while meeting online speech separation requirements, without any loss in performance. To alleviate the negative effect of the scale-invariant signal-to-noise ratio (SI-SNR) loss function, namely that the separated speech lacks the fine structure of real speech, the whole model is optimized with generative adversarial training: through the adversarial game between the generator (the online visually assisted speech separation model) and the discriminator, the separated speech is pushed closer to real speech in a high-dimensional space. Experimental results show that, without adding any model parameters, this method improves both the perceptual quality of the separated speech and speech recognition performance. This thesis is the first to explore the design of online visually assisted speech separation models for on-device deployment, an important step toward applying such models in online scenarios.

Demo videos, together with video and audio samples before and after processing, are available on our homepage: https://demo2show.github.io/Samples/.

English Abstract

As one of the most natural, effective, and convenient ways of exchanging information in human society, speech plays a key role in people's daily lives. At present, speech recognition (e.g., near-field speech recognition) and speaker verification have achieved excellent performance in simple acoustic scenarios. However, in the real world we are surrounded by complex acoustic scenes all the time, where environmental noise, interfering voices, and reverberation degrade the quality of speech signals and thus affect downstream speech processing. In recent years, multi-speaker speech separation, a typical task of the "cocktail party problem", has been widely studied with the aim of separating the target speech from the mixture and improving its quality. Although audio-only speech separation methods achieve state-of-the-art performance on standard datasets, two problems (i.e., label permutation and the unknown number of sources) make them difficult to apply in real-world complex acoustic scenarios. Inspired by the human auditory system and benefiting from the strong modeling ability of deep neural networks, audio-visual speech separation (AVSS) methods have begun to flourish and have the potential to be applied in real-world complex acoustic scenes, but they still suffer from several problems, such as poor separation performance, the inability to process speech online, and performance drops caused by partially missing visual information. This thesis therefore focuses on visual feature extraction for AVSS, the design of online models, training strategies for robust models, and the application of advanced modeling and training methods to AVSS, trying to solve some practical problems of these models and effectively improve their performance. The main contributions of this thesis include:

1. This thesis proposes AVMS, an audio-visual speech separation model using speech-related visual features. It first analyzes existing dynamic visual feature extraction methods for speech separation and finds that the most effective visual features should have a strong correlation with speech. We extract such visual features by learning a joint audio-visual representation and by domain adversarial training, and build the model on time-domain encoding and temporal convolutional neural networks. Detailed experiments on four audio-visual benchmark datasets show that the proposed model outperforms the current state-of-the-art audio-visual speech separation models. In addition, the model achieves impressive performance in real-world complex auditory scenes and significantly improves the purity of the target speech, which indicates strong practical value. To alleviate the performance drop caused by partially missing visual information, this thesis proposes a training strategy that randomly drops video frames (a rough sketch of such an augmentation follows this item). Experimental results show that the proposed strategy significantly enhances the robustness of the model, which maintains good performance when visual information is missing either randomly or in contiguous segments;
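The abstract does not spell out how the frame-dropping augmentation is implemented; the following is a minimal PyTorch sketch under the assumption that per-frame visual embeddings are simply zeroed out with some drop probability p (the function name and hyperparameter are illustrative, not the thesis's exact recipe).

```python
import torch

def drop_video_frames(visual_feats: torch.Tensor, p: float = 0.2,
                      training: bool = True) -> torch.Tensor:
    """Randomly zero out per-frame visual embeddings of shape (batch, time, dim).

    Simulating missing video frames during training forces the separator to
    fall back on audio cues; p is an assumed drop probability, not a value
    reported in the thesis.
    """
    if not training or p <= 0:
        return visual_feats
    keep = torch.rand(visual_feats.shape[:2], device=visual_feats.device) > p
    return visual_feats * keep.unsqueeze(-1).to(visual_feats.dtype)
```

Contiguous loss of frames, the other scenario mentioned above, could be simulated analogously by zeroing a random span of consecutive frames.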

2. This thesis proposes a new audio-visual speech separation model using static visual features. We analyze the reasons for the poor performance of existing models and propose corresponding improvement strategies. Specifically, there are two reasons. First, existing models all adopt time-frequency-domain encoding, which suffers from the difficulty of phase estimation. Second, face images vary with many factors (e.g., illumination and pose), and the limited training data cannot cover the complete sample space, which weakens the generalization ability of the model. For the first reason, this thesis adopts time-domain encoding (a minimal encoder/decoder sketch follows this item) and builds the model with gated dual-path recurrent neural networks, which not only avoids the phase estimation problem but also strengthens the model's ability to process time-series data, thereby improving its performance. For the second reason, we propose to implicitly model the diversity of face images at the visual-feature level via adversarial training, so as to improve the generalization ability of the model. The experimental results show that these strategies significantly improve performance: compared with the baseline model, the signal-to-distortion ratio (SDR) of our model is improved by 106%;
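As a rough illustration of why time-domain encoding avoids phase estimation, here is a minimal Conv-TasNet-style encoder/decoder sketch: the waveform is mapped to a learned latent representation, masked by the separator, and mapped back, so no STFT phase ever needs to be reconstructed. The filter count, kernel size, and stride are assumptions, and the gated dual-path RNN separator itself is omitted.

```python
import torch
import torch.nn as nn

class TimeDomainCodec(nn.Module):
    """Learned waveform encoder/decoder pair (illustrative hyperparameters only)."""

    def __init__(self, n_filters: int = 256, kernel: int = 16, stride: int = 8):
        super().__init__()
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride, bias=False)

    def forward(self, mix: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # mix: (batch, 1, samples); mask: (batch, n_filters, frames) from the separator
        feats = torch.relu(self.encoder(mix))   # non-negative latent "spectrogram"
        return self.decoder(feats * mask)       # masked latent mapped back to a waveform
```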

3. This thesis proposes an online audio-visual speech separation model trained with generative adversarial training, with the goal of extending audio-visual speech separation to online scenarios such as video calls and human-computer interaction. Specifically, the model is built with causal temporal convolutional neural networks so that it theoretically meets real-time processing requirements. In addition, an online streaming inference strategy is proposed, which enables the model to meet online speech separation requirements on GPUs, CPUs, and mobile chips without any performance drop. To alleviate the negative effect of the scale-invariant signal-to-noise ratio (SI-SNR) loss function, namely that the separated speech lacks the fine structure of clean speech (the standard SI-SNR objective is sketched after this item), this thesis optimizes the whole model with generative adversarial training, making the separated speech closer to clean speech in a high-dimensional space through the adversarial game between the generator (the online audio-visual speech separation model) and the discriminator. The experimental results show that, without adding any parameters, the proposed method improves both the auditory perception of the separated speech and speech recognition performance. This thesis is the first to explore the design of online audio-visual speech separation models for on-device deployment, an important step toward applying such models in online scenarios.
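The SI-SNR objective referred to above has a standard closed form; the sketch below follows the common zero-mean formulation (the abstract does not state which exact variant the thesis uses, so treat the details as assumptions). In the generative adversarial setup, such a reconstruction loss would typically be combined with the discriminator's adversarial loss.

```python
import torch

def neg_si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SNR (a loss to minimize), averaged over the batch.

    est, ref: (batch, samples) estimated and reference (clean) waveforms.
    """
    est = est - est.mean(dim=-1, keepdim=True)   # remove DC offset
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # scale-invariant target: projection of the estimate onto the reference
    dot = torch.sum(est * ref, dim=-1, keepdim=True)
    s_target = dot / (torch.sum(ref ** 2, dim=-1, keepdim=True) + eps) * ref
    e_noise = est - s_target
    si_snr = 10 * torch.log10(
        torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps) + eps
    )
    return -si_snr.mean()
```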

Demo and samples are available on our webpage: https://demo2show.github.io/Samples/.

Keywords: cocktail party problem; speech separation; visual assistance; online streaming processing; generative adversarial training
Subject area: Artificial Intelligence
Discipline: Engineering :: Control Science and Engineering
Language: Chinese
Document type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/44895
Collection: 数字内容技术与服务研究中心_听觉模型与认知计算 (Research Center for Digital Content Technology and Services, Auditory Model and Cognitive Computing)
Recommended citation (GB/T 7714):
张鹏. 面向鸡尾酒会问题的视觉辅助语音分离算法研究[D]. 中国科学院自动化研究所, 2021.
Files in this item:
张鹏-硕士学位论文-终版.pdf (8406 KB), thesis, open access, CC BY-NC-SA license