Research on Perceptual Clue-Assisted Speech Separation Techniques
郝云喆
2022-06-18
Pages: 121
Degree type: Doctoral
Chinese Abstract

Speech separation is the task of having a machine automatically separate and reconstruct the target speech signal from a recorded mixture; the underlying goal is to give computers the ability to listen selectively in noisy environments, as the human auditory system does. Blind speech separation requires recovering every speech stream contained in the mixture and typically faces two obstacles, the unknown source number problem and the label permutation problem, which make it hard to apply directly. With the wide adoption of smart terminal devices, scenarios such as smart homes, in-vehicle systems, and video conferencing create a large and urgent demand for speech separation targeted at a specific speaker. Against this background, how to exploit perceptual clues related to the target speaker to assist separation, and thereby bring the technology into specific application scenarios, has become an active research direction. This dissertation studies several key issues in clue-assisted speech separation models: the phase mismatch problem, low-latency streaming inference strategies, multi-perceptual-clue modulation mechanisms, and missing clues. The main contributions are summarized as follows:

1. Research on speech separation algorithms based on voiceprint clues:

Most voiceprint-assisted speech separation models use time-frequency domain encoding, which generally decouples magnitude and phase and therefore suffers from a phase mismatch problem. In addition, existing work gives little thought to model causality and streaming inference strategies for on-device deployment. This dissertation proposes a voiceprint-assisted separation framework built on time-domain encoding and explores online streaming inference strategies. Specifically, the proposed model maps the signal directly into a high-dimensional embedding space with a time-domain encoder, avoiding the phase mismatch inherent in traditional time-frequency encoding and raising the performance ceiling. Temporal Convolutional Networks (TCN) and the Dual-Path Recurrent Neural Network (DPRNN) serve as the backbone, and performance is examined under both non-causal and causal conditions. To enable streaming inference on-device, an encoder/decoder endpoint handling mechanism and a dynamic cache-and-reuse mechanism for TCN hidden-layer states are designed for the TCN-based model. Experiments on the WSJ0-2mix speech separation benchmark verify the effectiveness of the proposed model, and deployment experiments on the Huawei Kirin 990 chip verify the effectiveness of the proposed streaming strategy.

2. Research on speech separation algorithms based on voiceprint-induced onset-offset clues:

Inspired by auditory scene analysis and by onset cues in cognitive psychology, this dissertation proposes WASE (learning When to Attend for Speaker Extraction), a separation model driven by onset-offset clues that supplements attentional selection with a mechanism acting along the time dimension. Concretely, WASE derives a voiceprint representation from an enrollment utterance, relies on it to detect the onset and offset times of the target speaker in the mixture, and finally uses this timing information to guide separation. Depending on whether offset information is included, an onset clue and an onset-offset clue are proposed. The onset clue suppresses information before the onset and guides the model to begin separating the target speech at a specific time; the onset-offset clue further supplements the onset clue and guides the model to separate the target speech within a specific time interval. The dissertation also integrates the onset-offset clue with the voiceprint clue, letting the model modulate the mixture along both the time and feature dimensions. Experiments on the WSJ0-2mix benchmark show that separation driven by onset-offset clues matches the performance of voiceprint clues, and that joint dual-clue modulation outperforms either clue alone, confirming the effectiveness of onset-offset clues and the advantage of dual-clue modulation.

3. Research on speech separation algorithms based on multi-perceptual clues:

Inspired by the hierarchical modulation of spatial, visual, and voiceprint information along the ascending pathway of the human auditory system, this dissertation proposes a separation model that modulates the mixture hierarchically with azimuth, visual, and voiceprint clues in that order. Multi-channel speech is first simulated from the open-source audio-visual dataset GRID to build a full-clue separation dataset containing spatial, visual, and speaker information. On this dataset, the target speaker is characterized from three aspects, azimuth information, visual lip movement, and voiceprint features, and a multi-clue-assisted separation model is trained. To cope with clue corruption, or even unavailable clues, in real scenarios, a clue-missing training strategy is proposed that weakens the interdependence among sub-modules and improves robustness under various missing-clue conditions. Experiments show that the multi-clue model substantially outperforms any single clue and handles missing-clue scenarios both effectively and efficiently, confirming the advantage of joint multi-clue modulation and the effectiveness of the clue-missing training strategy.

English Abstract

Speech separation refers to the process of having a machine automatically separate and reconstruct the target speech signal from a recorded mixture. The basic idea is to let a computer listen selectively in noisy environments, just as the human auditory system does. The blind speech separation task requires outputting all speech streams in the mixture and usually faces two problems, the unknown source number problem and the label permutation problem, which make it difficult to apply directly. With the widespread adoption of smart devices, scenarios such as home environments, in-vehicle environments, and video conferencing have a huge and urgent need for speech separation targeted at specific speakers. In this context, how to use perceptual clues related to the target speaker to assist speech separation, and to promote the implementation of the technology in specific scenarios, has become an active research direction. In this dissertation, key issues such as phase mismatch, low-latency streaming inference strategies, multi-perceptual-clue modulation mechanisms, and missing clues in clue-assisted speech separation models are studied. The main research results of this dissertation are summarized as follows.
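The label permutation problem mentioned above is commonly handled in blind separation by permutation invariant training (PIT): the loss is computed for every pairing of estimated and reference sources, and the cheapest pairing is kept. A minimal sketch follows; the helper names are illustrative, and the dissertation itself sidesteps PIT by conditioning on the target speaker rather than separating all sources.

```python
from itertools import permutations

def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pit_loss(estimates, references):
    """Utterance-level PIT: try every assignment of estimates to
    references and keep the cheapest one, so the loss no longer
    depends on the (arbitrary) output order of the separator."""
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(len(references))):
        loss = sum(mse(estimates[i], references[p])
                   for i, p in enumerate(perm)) / len(perm)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# Two estimated sources emitted in swapped order relative to the references.
est = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
ref = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
loss, perm = pit_loss(est, ref)
# The swapped pairing is found, so the swap incurs no penalty.
```

The factorial search over permutations is why PIT becomes costly as the number of sources grows, which is one motivation for clue-conditioned extraction of a single target speaker.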

1. Research on speech separation algorithms based on voiceprint clues.

Most voiceprint-assisted speech separation models adopt time-frequency domain encoding, which generally decouples magnitude and phase and introduces a phase mismatch problem. In addition, existing work gives little consideration to model causality and streaming inference strategies for device-side deployment. This dissertation proposes a voiceprint-assisted speech separation framework based on time-domain encoding and explores online streaming inference strategies. Specifically, the proposed model uses a time-domain encoder to map the signal directly into a high-dimensional embedding space, which avoids the phase mismatch problem and raises the upper limit of model performance. The model uses Temporal Convolutional Networks (TCN) and the Dual-Path Recurrent Neural Network (DPRNN) as backbones and examines performance under non-causal and causal conditions. To enable streaming inference on the device side, this dissertation designs an encoder/decoder endpoint handling mechanism and a dynamic cache-and-reuse mechanism for TCN hidden-layer states. Experimental results on the WSJ0-2mix speech separation benchmark verify the effectiveness of the proposed model, and device-side deployment experiments on the Huawei Kirin 990 chip verify the effectiveness of the proposed streaming strategy.
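The cache-and-reuse idea behind streaming a TCN can be illustrated with a toy causal dilated convolution: each layer only needs to keep its last (kernel_size - 1) * dilation inputs between chunks for chunked output to match offline output exactly. A minimal sketch under that assumption; the class name and the plain-Python convolution are illustrative, not the dissertation's implementation.

```python
class StreamingCausalConv1d:
    """Toy causal dilated 1-D convolution that carries its left context
    across chunks, a simplified stand-in for the TCN hidden-layer
    cache-and-reuse mechanism (names and shapes are illustrative)."""

    def __init__(self, kernel, dilation=1):
        self.kernel = kernel                      # kernel[0] is the current-sample tap
        self.dilation = dilation
        self.pad = (len(kernel) - 1) * dilation   # required left context
        self.cache = [0.0] * self.pad             # state reused between chunks

    def forward(self, chunk):
        x = self.cache + list(chunk)
        # y[t] = sum_k kernel[k] * x[t - k*dilation]: a purely causal receptive field.
        out = [sum(w * x[t - k * self.dilation]
                   for k, w in enumerate(self.kernel))
               for t in range(self.pad, len(x))]
        if self.pad:
            self.cache = x[-self.pad:]            # keep only what future frames need
        return out

sig = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
offline = StreamingCausalConv1d([1.0, 0.5, 0.25], dilation=2).forward(sig)

layer = StreamingCausalConv1d([1.0, 0.5, 0.25], dilation=2)
chunked = layer.forward(sig[:3]) + layer.forward(sig[3:])
# chunk-by-chunk output equals the offline output sample-for-sample
```

Because the cached state is bounded by the receptive field, per-chunk memory and latency stay constant regardless of how long the stream runs, which is what makes on-device streaming feasible.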

2. Research on speech separation algorithms based on onset-offset clues induced by voiceprint.

Inspired by auditory scene analysis and onset cues in cognitive psychology, this dissertation proposes WASE (learning When to Attend for Speaker Extraction), a speech separation model based on onset-offset clues that supplements attentional selection with a mechanism acting along the time dimension. Specifically, the WASE model obtains a voiceprint representation from an enrollment utterance, relies on it to detect the onset and offset times of the target speaker in the mixture, and finally uses this timing information to guide separation. Depending on whether offset information is included, an onset clue and an onset-offset clue are proposed in this dissertation. The onset clue suppresses information before the onset and guides the model to separate the target speech from a specific time onward; the onset-offset clue further supplements the onset clue and guides the model to separate the target speech within a specific time interval. Furthermore, this dissertation integrates the onset-offset clue with the voiceprint clue, so the model can modulate the mixture along both the time and feature dimensions. Experimental results on the WSJ0-2mix benchmark show that the separation model based on onset-offset clues achieves performance comparable to voiceprint clues, and that joint dual-clue modulation outperforms a single clue, which verifies the effectiveness of onset-offset clues and the advantage of dual-clue modulation.
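The temporal gating performed by the onset and onset-offset clues can be sketched as a hard mask over the mixture: frames before the detected onset, and after the offset when one is available, are suppressed before separation. A toy illustration; the hard zero mask and function name are illustrative, since the actual model modulates learned features rather than raw samples.

```python
def apply_onset_offset_clue(mixture, onset, offset=None):
    """Gate the mixture in time: zero out frames before the detected onset
    (and after the offset, when one is available) so downstream separation
    only attends to the interval where the target speaker is active."""
    end = offset if offset is not None else len(mixture)
    return [x if onset <= t < end else 0.0 for t, x in enumerate(mixture)]

mix = [0.2, 0.9, 0.7, 0.1, 0.8]
# Onset clue only: suppress everything before frame 1.
onset_only = apply_onset_offset_clue(mix, onset=1)
# Onset-offset clue: additionally suppress everything from frame 4 on.
onset_offset = apply_onset_offset_clue(mix, onset=1, offset=4)
```

The onset-only case corresponds to streaming use, where the offset is not yet known; the onset-offset case is the stricter interval constraint described above.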

3. Research on speech separation algorithms based on multi-perceptual clues.

Inspired by the hierarchical modulation of spatial, visual, and voiceprint information along the ascending pathway of the human auditory system, this dissertation proposes a speech separation model that hierarchically modulates the mixture with azimuth, visual, and voiceprint multi-perceptual clues in that order. This dissertation first simulates multi-channel speech from the open-source audio-visual dataset GRID and constructs a full-clue speech separation dataset containing spatial, visual, and speaker information. Based on this dataset, the target speaker is characterized from three aspects, azimuth information, visual lip movement, and voiceprint features, and a multi-clue-assisted speech separation model is trained. To address clue corruption, or even unavailable clues, in real scenarios, this dissertation proposes a clue-missing training strategy, which weakens the interdependence between sub-modules and improves the robustness of the model in various missing-clue scenarios. The final experiments show that the multi-clue-assisted separation model performs significantly better than any single clue and handles missing-clue scenarios effectively and efficiently, which verifies the advantage of joint multi-clue modulation and the effectiveness of the clue-missing training strategy.
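The clue-missing training strategy can be sketched as a per-clue dropout applied during training: each perceptual clue is independently replaced by a neutral (here, zero) vector with some probability, so the separator cannot over-rely on any single branch. A minimal sketch; the clue names, the zero-vector convention, and the drop probability are illustrative assumptions.

```python
import random

def dropout_clues(clues, p_drop=0.3, rng=random):
    """Clue-missing training step: independently replace each perceptual
    clue (e.g. azimuth / lip movement / voiceprint) with a zero vector
    with probability p_drop, simulating corrupted or unavailable clues."""
    return {name: ([0.0] * len(vec) if rng.random() < p_drop else vec)
            for name, vec in clues.items()}

rng = random.Random(0)
clues = {"azimuth": [0.7, 0.1],
         "lips": [0.3, 0.9, 0.4],
         "voiceprint": [0.5] * 4}
# Each clue independently survives or is zeroed on every training example.
noisy = dropout_clues(clues, p_drop=0.5, rng=rng)
```

At inference time a genuinely missing clue is fed in as the same zero vector, so the model sees at test time exactly the condition it was trained to tolerate.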

Keywords: cocktail party problem; speech separation; voiceprint clues; onset-offset clues; multi-perceptual clues
Language: Chinese
Document type: Dissertation
Identifier: http://ir.ia.ac.cn/handle/173211/48872
Collection: 复杂系统认知与决策实验室_听觉模型与认知计算
Recommended citation
GB/T 7714
郝云喆. 感知线索辅助的语音分离技术研究[D]. 中国科学院自动化研究所, 2022.
Files in this item
File name/size: 郝云喆_感知线索辅助的语音分离技术研究 (5007KB); document type: dissertation; access: open access; license: CC BY-NC-SA

Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.