Because the telephone is the only ubiquitous communications terminal in today's world, it is the largest potential application field for speech technology. Automatic speech recognition (ASR) is a core technique for such telephone-based speech applications. However, it has been shown that an ASR system that performs well in the laboratory may become very fragile in a real telephony environment, so robustness is a life-and-death issue for commercial ASR systems. In this study, we present our recent progress in improving the performance of Mandarin telephony ASR. Chinese is a tonal language, and tone information is important for Mandarin ASR. However, the filtering effect of telephone channels increases errors when traditional pitch extraction methods are applied to telephony speech, which hinders high-performance ASR. We adopt an improved anti-bias autocorrelation function (ACF) and integrate the ACF intensity with a statistical voiced/unvoiced (V/U) decision in pitch path tracking. This reduces the V/U error to 24% of that of the traditional method, and the word error rate (WER) decreases by a relative 6.5% in isolated word recognition. Robust speech features are a prerequisite for high-performance ASR. However, our limited knowledge of speech production and perception prevents us from obtaining a feature set that is independent of channel conditions, so compensation is essential when a channel mismatch exists between the training and testing stages. Channel compensation can be particularly difficult in applications where nonlinear distortion exists. In such cases, simple cepstral mean estimates and cepstral filtering methods are unreliable. To address this problem, a quasi-linear channel model is constructed. Using statistical knowledge of clean speech, we propose a maximum-likelihood channel estimation method, which reduces the character error rate (CER) by a relative 20% in telephony large-vocabulary Mandarin ASR.
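As a rough illustration of maximum-likelihood channel estimation in the cepstral domain, the sketch below assumes a much simpler setting than the quasi-linear model of this work: a purely additive cepstral bias h, so that each observed frame is y_t = x_t + h, and a clean-speech model given as a diagonal-covariance Gaussian mixture. The function name `estimate_channel_bias` and all of its parameters are hypothetical and chosen only for this example. The bias is estimated with a standard EM iteration.

```python
import numpy as np


def estimate_channel_bias(y, weights, means, variances, n_iter=20):
    """EM estimate of an additive cepstral bias h, assuming y_t = x_t + h.

    Illustrative assumption: clean speech x_t follows a diagonal-covariance
    GMM. This is not the quasi-linear channel model described in the text.

    y:         (T, D) observed cepstral frames
    weights:   (K,)   mixture weights of the clean-speech GMM
    means:     (K, D) component means
    variances: (K, D) component variances (diagonal covariances)
    """
    h = np.zeros(y.shape[1])
    for _ in range(n_iter):
        # E-step: component posteriors for the bias-compensated frames.
        diff = y[:, None, :] - h - means[None, :, :]            # (T, K, D)
        log_p = (
            np.log(weights)[None, :]
            - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
            - 0.5 * np.sum(diff ** 2 / variances[None], axis=2)
        )                                                       # (T, K)
        log_p -= log_p.max(axis=1, keepdims=True)               # stability
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: precision-weighted average of (y_t - mu_k) per dimension.
        resid = y[:, None, :] - means[None, :, :]               # (T, K, D)
        num = np.sum(gamma[:, :, None] * resid / variances[None], axis=(0, 1))
        den = np.sum(gamma[:, :, None] / variances[None], axis=(0, 1))
        h = num / den
    return h
```

With a single Gaussian component, this update reduces to the observed cepstral mean minus the clean-speech mean, that is, cepstral mean subtraction against a known clean-speech mean.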
To solve the data sparsity problem that occurs in fast compensation, we extend the previous method by introducing a phone-conditioned prior channel distribution and using Bayesian techniques for estimation, which provides an additional 7% relative CER decrease. Unlike previous methods, the novel algorithm works well for both fixed-line channels and compressed wireless channels. Acoustic adaptation is an essential part of a state-of-the-art ASR system. Based on cascaded linear transform adaptation, we propose a novel parameterization. It effectively reduces the number of transform parameters while maintaining the high precision of a full matrix, which yields more robust estimation. A full transform can be constructed on smaller regression classes, so higher resolution is achieved. It outperforms the previous cascade method across varying amounts of data. Finally, we discuss strategies for noise rejection and out-of-vocabulary (OOV) rejection for continuous natural speech input. We use a syllable-based filler model and