面向低功耗的语音增强与分离算法研究 (Research on Low-Power Speech Enhancement and Separation Algorithms)
黄雅婷 (Huang Yating)
2022-08-16
Pages: 126
Degree type: Doctoral
Chinese Abstract

Intelligent speech processing has become an important part of human-machine interaction and is used in more and more intelligent devices. Speech enhancement and separation algorithms usually serve as front-end modules in intelligent speech devices, enhancing and separating the target speech to improve the recognition performance of back-end modules such as automatic speech recognition and speaker recognition. Because of their practical value, speech enhancement and separation algorithms are widely studied and constitute an important research topic in speech signal processing. These algorithms originated from research on the cocktail party problem. Looking back at their development, mainstream research has gradually moved toward speech enhancement and separation algorithms based on deep neural networks. Although deep-neural-network-based algorithms achieve excellent performance on standard datasets, their model complexity and computational cost make it challenging to deploy them on resource-constrained, power-constrained edge devices. Compared with intelligent machines, the auditory systems of animals can often process complex auditory scenes efficiently and robustly at much lower power. Focusing on the auditory scenes of the cocktail party problem, this thesis explores and studies low-power speech enhancement and separation algorithms from the perspectives of energy efficiency, lightweight design, and robustness. The main contributions of this thesis are as follows:

1. This thesis proposes a speech enhancement and separation algorithm based on spike coding and spike-train learning. Spiking neural networks (SNNs) are the third generation of neural networks and can learn the precise spike trains of input stimuli. Their event-driven nature gives them low power consumption and high efficiency when deployed on dedicated chips. Since speech signals are rich in spatio-temporal structure, SNNs are a natural choice for learning that structure. This thesis is the first to apply supervised SNNs to modeling speech enhancement and separation. To convert speech stimuli into spike trains, inspired by coding schemes found in neuroscience, we propose two temporal coding schemes, namely temporal-rate coding and temporal-population coding. We further introduce momentum and Nesterov's accelerated gradient into the Remote Supervised Method (ReSuMe), yielding ReSuMe-M and ReSuMe-NAG respectively, which improve the convergence speed and performance of SNN training. Experimental results show that SNNs have potential for modeling speech enhancement and separation tasks.

2. This thesis proposes a speech enhancement and separation algorithm based on knowledge distillation and quantization-aware training. Although the previous work showed that spiking neural networks have some potential for speech enhancement and separation tasks, their performance remains limited by the lack of effective optimization algorithms for training them to high accuracy. Another way to reduce power consumption is to apply model compression techniques to complex models, reducing the number of parameters, the model size, and the computational cost. Accordingly, this thesis leverages model compression and proposes the Distillation-Aware Quantization (DAQ) algorithm to compress a voiceprint-assisted deep neural speech enhancement and separation model. DAQ combines quantization with knowledge distillation: it applies layer-wise non-uniform quantization functions to the model weights and 8-bit min-max linear quantization to the activations. To further improve the low-precision model, knowledge distillation treats the full-precision model as the teacher and the low-precision model as the student. DAQ can be trained end to end. We apply DAQ to our previously proposed speech enhancement and separation model WASE and obtain its low-precision version, TinyWASE. Results on WSJ0-2mix show that the proposed method achieves performance comparable to the full-precision model even with weights quantized to 3 bits, with a compression ratio of 8.97 and a model size of 2.15 MB. The results also show that TinyWASE can be combined with other model compression techniques, such as parameter sharing, achieving a compression ratio of 23.81 at the cost of some performance.

3. Building on the previous work, this thesis goes further by fusing voiceprint and visual cues and proposes a lightweight multi-modal multi-channel speech enhancement and separation algorithm based on group communication, named LiMuSE (Lightweight Multi-modal Speaker Extraction). LiMuSE introduces Group Communication (GC) modules into Temporal Convolutional Networks (TCN) to form GC-equipped TCN blocks. These blocks are used in the context codec to compress long speech sequences into much shorter ones along the time dimension, and in the audio and fusion modules to compress the model along the feature dimension, lightening the modeling burden of the backbone network. LiMuSE further applies quantization to compress the model size. Experiments on the GRID dataset show that introducing GC and the context codec into the multi-modal model achieves performance comparable to, or even slightly better than, the full-precision model with fewer parameters and lower model complexity.

4. Although the human auditory system can extract the target signal from mixed speech while restoring interrupted or missing parts, current mainstream speech enhancement and separation algorithms generally focus only on enhancing and separating the target speech and cannot restore interrupted or missing parts. This thesis goes one step further and studies the computational modeling of auditory perceptual restoration and its lightweight optimization. It proposes a speech enhancement and separation algorithm for auditory perceptual restoration, whose goal is to recover the target speech from noisy mixtures containing missing segments. We propose HCRN (Hourglass-shaped Convolutional Recurrent Networks) to suppress background noise while restoring the missing parts of the target speech. To further improve performance, we propose a spectro-temporal loss. Quantitative and qualitative experimental results show that HCRN trained with the spectro-temporal loss can suppress background noise and identify and restore the missing parts of the salient signal in the mixture from unreliable context. On this basis, the thesis further optimizes HCRN from the perspective of lightweight design and proposes HTCN (Hourglass-shaped Temporal Convolutional Networks), which achieves performance comparable to HCRN with fewer parameters and less computation.

English Abstract

Intelligent speech processing has become an important part of the interaction between humans and intelligent machines and is used in more and more intelligent devices. Speech enhancement and separation algorithms usually serve as front-end modules in intelligent speech devices, enhancing and separating the target speech in order to improve the recognition performance of back-end modules such as automatic speech recognition and speaker recognition. Due to their practical value, speech enhancement and separation algorithms are widely studied and form an important research topic in the field of speech signal processing. Research on these algorithms originated from the study of the Cocktail Party Problem. Reviewing their development, speech enhancement and separation algorithms have gradually evolved toward deep-learning-based methods in recent years. Despite their superior performance on benchmark datasets, it remains challenging to deploy these algorithms on resource-constrained, energy-constrained edge devices due to model complexity and expensive computation. Compared with intelligent machines, the auditory systems of animals process complex auditory scenes more efficiently and robustly with lower energy consumption. Therefore, focusing on the auditory scenes of the Cocktail Party Problem, this thesis explores and studies energy-efficient speech enhancement and separation algorithms that are more energy-saving, lightweight, and robust. The main contributions are listed as follows:

1. A speech enhancement and separation algorithm using spiking neurons with temporal coding and supervised learning is proposed. Spiking Neural Networks (SNNs) are regarded as the third generation of neural network models and can learn the precise spike trains of stimuli. When deployed on neuromorphic chips, SNNs offer low energy consumption and high efficiency due to their event-driven computation. As speech signals exhibit strong temporal structure, SNNs are a natural choice for learning the temporal dynamics of speech. In this work, we take a pioneering step toward using SNNs to model speech enhancement and separation. To transform auditory stimuli into spikes, we propose two temporal coding schemes, namely Temporal-Rate coding and Temporal-Population coding, which are inspired by coding schemes found in neuroscience. We further introduce momentum and Nesterov's Accelerated Gradient into the Remote Supervised Method (ReSuMe), leading to ReSuMe-M and ReSuMe-NAG, to improve the performance and speed up spike train learning (a minimal sketch of these updates appears after this list). The performance of our model demonstrates the potential of spiking neural networks in modeling speech enhancement and separation tasks.

2. A speech enhancement and separation algorithm using knowledge distillation and quantization-aware training is proposed. Although the experimental results of the previous work suggest that spiking neural networks show potential in modeling speech enhancement and separation tasks, there is a lack of efficient algorithmic optimizations for improving the accuracy of SNN computations, and the performance is yet to be improved. Another way to reduce power consumption is to use model compression techniques to compress complex models, reducing the number of parameters, the model size, and the computational cost. In this work, we utilize model compression techniques and propose Distillation-Aware Quantization (DAQ) to compress a voiceprint-assisted deep-learning-based speech enhancement and separation model. DAQ combines quantization techniques and knowledge distillation: it quantizes the weights of the model to ultra-low bits using layer-wise non-uniform quantization functions and quantizes the activations to 8 bits using min-max linear quantization. To further improve the performance of the low-precision model, DAQ introduces knowledge distillation, treating the full-precision model as the teacher and the low-precision model as the student (a sketch of the quantization and distillation objective appears after this list). DAQ can be trained in an end-to-end manner. We apply DAQ to our previously proposed speech enhancement and separation model WASE and obtain its low-precision version TinyWASE. Experiments on the WSJ0-2mix dataset show that our method achieves performance comparable to the full-precision model while quantizing the weights to 3 bits, obtaining a compression ratio of 8.97 and a model size of 2.15 MB. We further show that TinyWASE can be combined with other model compression techniques, such as parameter sharing, achieving a compression ratio of up to 23.81 with limited performance degradation.

3. Building on the previous work, we further integrate the voiceprint cue and the visual cue and propose LiMuSE (Lightweight Multi-modal Speaker Extraction), a lightweight multi-modal multi-channel speech enhancement and separation algorithm based on Group Communication. LiMuSE introduces Group Communication (GC) into Temporal Convolutional Networks (TCN) to form GC-equipped TCN blocks. GC-equipped TCN blocks are used in the Context Codec (CC) along the temporal dimension to squeeze long speech sequences into much shorter ones, and in the audio and fusion blocks along the feature dimension to lighten the modeling burden of the backbone (a minimal GC sketch appears after this list). LiMuSE further applies quantization techniques to compress the model size. Experiments on the GRID dataset show that incorporating GC and CC into the multi-modal framework achieves on-par or better performance with fewer parameters and lower model complexity.

4. Although the human auditory system can simultaneously extract the target signal from mixed speech and restore interrupted or missing parts, current popular speech enhancement and separation algorithms focus only on enhancing and separating the target speech and cannot restore interrupted or missing parts. In this work, we study the computational modeling of auditory perceptual restoration and optimize the model design to reduce model complexity. We propose a speech enhancement and separation system that models auditory perceptual restoration, that is, it reconstructs and restores the target speech signal from noisy mixtures containing missing parts. We propose Hourglass-shaped Convolutional Recurrent Networks (HCRN) to enhance the target signal and restore missing gaps simultaneously. To further improve performance, a Spectro-Temporal loss is proposed (a sketch of this loss appears after this list). Both quantitative and qualitative results show that our proposed HCRN trained with the Spectro-Temporal loss can suppress background noise and identify and restore the missing gaps of the salient signal from unreliable context information. On the basis of HCRN, we optimize the model from the perspective of lightweight design and propose Hourglass-shaped Temporal Convolutional Networks (HTCN), which achieve comparable performance with fewer parameters and lower computational cost.
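To make the ReSuMe-M and ReSuMe-NAG updates in contribution 1 concrete, here is a minimal discrete-time sketch of a ReSuMe weight update augmented with classical momentum and a Nesterov-style variant. It is an illustration under simplifying assumptions, not the thesis's exact formulation: the function name, the non-associative constant `a`, the momentum factor `mu`, and the use of a low-pass presynaptic trace in place of the learning-window integral are all hypothetical choices.

```python
import numpy as np

def resume_step(w, v, pre_trace, s_des, s_out, lr=0.01, a=0.05, mu=0.9, nesterov=False):
    """One ReSuMe update with momentum (illustrative sketch).

    w         : (n_pre,) synaptic weights onto one output neuron
    v         : (n_pre,) velocity (momentum) buffer
    pre_trace : (n_pre,) low-pass trace of presynaptic spikes, standing in
                for the integral over the ReSuMe learning window
    s_des, s_out : desired / observed output spike (0 or 1) at this step
    """
    # ReSuMe core: the spike error (desired - observed) gates a
    # non-associative constant plus a Hebbian term from the input trace.
    grad = (s_des - s_out) * (a + pre_trace)
    v = mu * v + lr * grad                        # accumulate velocity (ReSuMe-M)
    step = mu * v + lr * grad if nesterov else v  # lookahead step (ReSuMe-NAG)
    return w + step, v

# Toy usage: four presynaptic inputs; a missed desired spike strengthens weights.
w, v = np.zeros(4), np.zeros(4)
w, v = resume_step(w, v, pre_trace=np.array([0.2, 0.0, 0.5, 0.1]), s_des=1, s_out=0)
```

The momentum buffer accumulates update directions that stay consistent across time steps, which is what speeds up convergence relative to plain ReSuMe.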
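Contribution 2 combines low-bit quantization with distillation from a full-precision teacher. The sketch below illustrates two ingredients named in the abstract: min-max linear 8-bit quantization with a straight-through estimator (so training stays end-to-end), and a combined task-plus-distillation objective. The MSE loss terms, the `alpha` weighting, and the function names are assumptions for illustration; the layer-wise non-uniform weight quantizer of DAQ is not reproduced here.

```python
import torch
import torch.nn.functional as F

def minmax_quantize(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Min-max linear quantization with a straight-through estimator:
    the forward pass uses the quantized values, the backward pass is identity."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-8) / (2 ** bits - 1)
    xq = torch.round((x - lo) / scale) * scale + lo
    return x + (xq - x).detach()

def daq_style_loss(student_out, teacher_out, target, alpha=0.5):
    """Hypothetical distillation-aware objective: task loss on the ground
    truth plus a term pulling the low-precision student toward the frozen
    full-precision teacher."""
    task = F.mse_loss(student_out, target)
    distill = F.mse_loss(student_out, teacher_out.detach())
    return (1.0 - alpha) * task + alpha * distill
```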
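The Group Communication idea behind LiMuSE (contribution 3) saves parameters by splitting the feature dimension into small groups, processing all groups with one shared module, and adding a cheap mixing step across groups. The module below is a minimal stand-in written under those assumptions; the intra-group convolution and the linear inter-group mixer are illustrative choices, not LiMuSE's actual TCN blocks or context codec.

```python
import torch
import torch.nn as nn

class GroupComm(nn.Module):
    """Minimal group-communication sketch (illustrative)."""
    def __init__(self, channels: int, num_groups: int):
        super().__init__()
        assert channels % num_groups == 0
        self.k, self.d = num_groups, channels // num_groups
        # One small module shared by all groups: its parameter count scales
        # with (channels / num_groups) instead of channels.
        self.intra = nn.Conv1d(self.d, self.d, kernel_size=3, padding=1)
        # Cheap communication across the group axis.
        self.inter = nn.Linear(num_groups, num_groups)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        b, c, t = x.shape
        g = self.intra(x.reshape(b * self.k, self.d, t))          # shared intra-group pass
        g = g.reshape(b, self.k, self.d, t).permute(0, 2, 3, 1)   # (b, d, t, k)
        g = self.inter(g)                                         # mix across groups
        return g.permute(0, 3, 1, 2).reshape(b, c, t)
```

Because the intra-group module is shared, its cost grows with the squared group width rather than the squared channel count, which is where the lightweight footprint comes from.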
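Contribution 4 trains HCRN with a Spectro-Temporal loss. A common way to realize such a combined time-frequency objective is to weight a waveform-domain term against a magnitude-spectrogram term, and the sketch below follows that pattern; the L1 terms, STFT settings, and `alpha` weighting are assumptions and may differ from the loss actually defined in the thesis.

```python
import torch
import torch.nn.functional as F

def spectro_temporal_loss(est, ref, n_fft=512, hop=128, alpha=0.5):
    """Hypothetical combined time/frequency loss for waveforms of shape (batch, samples)."""
    t_loss = F.l1_loss(est, ref)  # temporal term: waveform L1
    win = torch.hann_window(n_fft, device=est.device)
    mag = lambda x: torch.stft(x, n_fft, hop_length=hop, window=win,
                               return_complex=True).abs()
    f_loss = F.l1_loss(mag(est), mag(ref))  # spectral term: magnitude L1
    return alpha * t_loss + (1.0 - alpha) * f_loss
```

The spectral term gives the network a direct error signal on the restored gaps, where a purely time-domain loss can be dominated by phase mismatch.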

Keywords: Speech Enhancement and Separation; Spiking Neural Networks; Model Compression; Auditory Perceptual Restoration
Language: Chinese
Sub-direction classification (of the seven major directions): Brain-inspired Models and Computing
State Key Laboratory research direction: Speech and Language Processing
Document type: Degree thesis
Identifier: http://ir.ia.ac.cn/handle/173211/49721
Collection: Graduates_Doctoral Theses
Affiliation: Institute of Automation, Chinese Academy of Sciences
Recommended citation (GB/T 7714):
黄雅婷. 面向低功耗的语音增强与分离算法研究[D]. 中国科学院自动化研究所, 2022.
Files in this item:
File name/size: 黄雅婷_面向低功耗的语音增强与分离算法研 (3433 KB); Document type: Degree thesis; Open access: Restricted; License: CC BY-NC-SA