English Abstract

Speech is one of the most natural ways for humans and machines to interact. Through speech processing technology, human intent can be conveyed directly to the machine. At present, near-field speech recognition and speaker recognition have achieved state-of-the-art performance. In real environments, however, the speech signal is inevitably corrupted by noise, reverberation, and interfering speakers, which degrades speech intelligibility and quality and harms the performance of downstream speech processing. Speech front-end technology is designed to suppress noise and reverberation in the acoustic environment and to focus on the target signal while preserving speech quality as much as possible. It is one of the most critical core technologies and important research topics in the field of speech signal processing.
In recent years, neural-network-based speech front-end methods have gradually become mainstream due to their excellent performance. However, problems remain, such as phase mismatch, poor model generalization, and the mismatch between simulated and real data, and they still await effective solutions. Building on fundamental theories and cutting-edge methods for the speech front end, this thesis takes deep learning as its main approach and exploits the inherent acoustic characteristics of speech, together with the acoustic properties of noisy and reverberant scenes, to study front-end models in depth. The main work and innovations are as follows:
1. This work proposes a single-channel speech dereverberation system with generative adversarial training, which can effectively enhance reverberant speech. A carefully tuned CBLDNN structure, which integrates CNN, BLSTM, and DNN layers, is adopted to improve performance. Adversarial training is applied to make the dereverberated samples indistinguishable from clean samples in a high-dimensional space. Experimental results show that the proposed model outperforms several baseline systems. The system copes with a wide range of reverberation conditions and adapts well to varying environments. In addition, the offline system is extended to an online system that achieves performance comparable to the offline one. A rough sketch of the adversarial training scheme is given below.
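The following is a minimal sketch of a CBLDNN-style generator trained adversarially against a discriminator that scores spectrograms as clean or processed. All layer sizes, the L1 reconstruction term, and the 0.01 adversarial weight are illustrative assumptions, not the thesis configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBLDNN(nn.Module):
    """CNN -> BLSTM -> DNN stack mapping a reverberant log-magnitude
    spectrogram (batch, frames, freq_bins) to a dereverberated one.
    Layer sizes are illustrative, not the thesis configuration."""
    def __init__(self, freq_bins=257, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.ReLU(),
        )
        self.blstm = nn.LSTM(freq_bins, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.dnn = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, freq_bins),
        )

    def forward(self, spec):                        # (B, T, F)
        x = self.cnn(spec.unsqueeze(1)).squeeze(1)  # (B, T, F)
        x, _ = self.blstm(x)                        # (B, T, 2H)
        return self.dnn(x)                          # (B, T, F)

class Discriminator(nn.Module):
    """Scores each spectrogram: 1 = clean, 0 = processed."""
    def __init__(self, freq_bins=257, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(freq_bins, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, spec):                        # (B, T, F)
        return self.net(spec).mean(dim=(1, 2))      # (B,)

g, d = CBLDNN(), Discriminator()
opt_g = torch.optim.Adam(g.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(d.parameters(), lr=1e-4)
bce = nn.BCELoss()

def train_step(reverb, clean):
    est = g(reverb)
    # Discriminator update: clean -> 1, enhanced -> 0.
    ones, zeros = torch.ones(len(clean)), torch.zeros(len(clean))
    d_loss = bce(d(clean), ones) + bce(d(est.detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator update: reconstruction loss plus an adversarial
    # term that rewards fooling the discriminator.
    g_loss = F.l1_loss(est, clean) + 0.01 * bce(d(est), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The key design choice is that the generator is optimized both for reconstruction accuracy and for fooling the discriminator, which pushes the enhanced spectra toward the distribution of clean speech rather than toward a blurry mean-square-error optimum.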
2. This work first investigates a two-dimensional self-attention-based speech enhancement system, which can effectively improve the intelligibility of noisy speech. The attention mechanism attends to the time and frequency dimensions simultaneously, so spectral features from both dimensions can be learned and fused jointly, and the network captures global dependencies without recurrence. An edge enhancement network is further proposed to model and restore spectral texture details and sharpen the spectrum. However, time-frequency domain methods suffer from phase mismatch, so their performance is suboptimal. This work therefore next proposes a multi-scale time-domain speech enhancement model, which performs enhancement in the time domain and avoids the phase mismatch problem through end-to-end training. A gated mechanism selects dominant features and suppresses irrelevant information. Multi-scale feature extraction learns feature representations at different scales, and multi-scale feature fusion combines features coming from different layers. Experimental results show that the model effectively removes noise from speech signals. A sketch of the two-dimensional attention block follows.
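The following is a minimal sketch of two-dimensional self-attention, assuming the spectrogram has already been projected to a d-dimensional channel embedding of shape (batch, frames, freq_bins, d); the module name and the concatenate-then-project fusion are illustrative assumptions rather than the exact thesis design.

```python
import torch
import torch.nn as nn

class TwoDimSelfAttention(nn.Module):
    """Self-attention along the time axis and along the frequency axis,
    with the two views fused by a linear projection."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):                     # x: (B, T, F, d)
        b, t, f, d = x.shape
        # Time attention: fold frequency bins into the batch dimension,
        # so each bin attends over all frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * f, t, d)
        xt, _ = self.time_attn(xt, xt, xt)
        xt = xt.reshape(b, f, t, d).permute(0, 2, 1, 3)
        # Frequency attention: fold frames into the batch dimension,
        # so each frame attends over all frequency bins.
        xf = x.reshape(b * t, f, d)
        xf, _ = self.freq_attn(xf, xf, xf)
        xf = xf.reshape(b, t, f, d)
        # Jointly fuse the two views.
        return self.fuse(torch.cat([xt, xf], dim=-1))
```

Because every frame can attend to every other frame (and likewise for frequency bins), global dependencies are captured in a single layer, without the sequential bottleneck of a recurrent network.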
3. This work proposes a speaker-independent multi-speaker speech separation method with generative adversarial training. The system aims at better speech quality rather than merely minimizing a mean square error. The model structure is first investigated in depth to better extract acoustic features. In the initial phase, log-mel filterbank and pitch features are used to warm up the CBLDNN in a multi-task manner, so that information that helps separate speech and improve speech quality is integrated into the model. Generative adversarial training is applied throughout, making the separated speech indistinguishable from real speech. However, time-frequency domain methods suffer from phase mismatch, and methods based on permutation invariant training (PIT) cannot handle an unknown number of outputs (see the sketch below). We therefore propose a time-domain dual-channel speech separation network that first infers all competing speakers together with their directions in a sequential manner, and then transforms them into source masks to separate the mixture. Separating in the time domain avoids the phase mismatch problem, and the speaker and direction information attached to each output can be exploited by subsequent tasks. Experimental results show that the network successfully separates mixtures and handles an unknown number of sources, the permutation problem, and the selection of outputs.
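To make the permutation problem concrete, here is a minimal sketch of an utterance-level PIT loss (the function name and the plain MSE criterion are illustrative assumptions). Note that it enumerates permutations over a fixed, known number of speakers, which is precisely the limitation that the sequential, direction-based inference above avoids.

```python
import itertools
import torch

def pit_mse_loss(est, ref):
    """Utterance-level permutation invariant training (PIT) loss.

    est, ref: (batch, n_spk, samples) time-domain signals.
    The MSE is evaluated under every speaker permutation and the best
    one is kept, resolving the label-permutation ambiguity -- but only
    for a known, fixed number of speakers n_spk.
    """
    b, n, _ = est.shape
    losses = []
    for perm in itertools.permutations(range(n)):
        permuted = est[:, list(perm)]                # reorder the estimates
        losses.append(((permuted - ref) ** 2).mean(dim=(1, 2)))  # (B,)
    # Per utterance, keep the permutation with the lowest error.
    return torch.stack(losses, dim=1).min(dim=1).values.mean()
```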
4. A cascaded far-field speech recognition system is proposed. For the front end, this work compares several popular beamforming methods and further proposes an omnidirectional minimum variance distortionless response (MVDR) beamformer followed by weighted prediction error (WPE) dereverberation. For the back end, several acoustic models and language models with different architectures are investigated in depth. Compared with the baseline system, the proposed method achieves significant performance improvements in both single-array and multi-array scenarios. A sketch of the per-bin MVDR weight computation is given below.
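As background for the beamforming component, the following is a minimal per-frequency-bin MVDR weight computation in NumPy. It assumes the noise spatial covariance matrices and target steering vectors have already been estimated; the omnidirectional variant and the WPE stage from the thesis are not shown, and the function names are illustrative.

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """Classic MVDR weights w = R^{-1} d / (d^H R^{-1} d), per bin.

    noise_cov: (freq, mics, mics) noise spatial covariance matrices.
    steering:  (freq, mics) steering vectors toward the target.
    Returns:   (freq, mics) complex beamforming weights.
    """
    num = np.linalg.solve(noise_cov, steering[..., None])[..., 0]  # R^{-1} d
    den = np.einsum('fm,fm->f', steering.conj(), num)              # d^H R^{-1} d
    return num / den[:, None]

def apply_beamformer(weights, stft):
    """Beamform a multichannel STFT: (freq, mics, frames) -> (freq, frames)."""
    return np.einsum('fm,fmt->ft', weights.conj(), stft)
```

The distortionless constraint w^H d = 1 keeps the target direction at unit gain, while the noise covariance in the quadratic objective minimizes the residual noise power at the beamformer output.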