In the past few years, Deep Neural Networks (DNNs) have been widely used in large vocabulary continuous speech recognition (LVCSR). DNN-based acoustic models achieve significant improvements over traditional GMM-based models, and have pushed speech recognition systems toward meeting the requirements of practical applications. With the development of the Internet, the amount of speech training data has grown explosively, from dozens of hours to thousands of hours. Hence, efficiently exploiting such large-scale speech data to train a high-performance recognition system has become an urgent problem. In this thesis, we study the issues of large-scale DNN-based acoustic model training, together with several specific application problems of speech recognition technology. The main work and contributions include:

1. For the DNN pre-training problem, we propose to apply Deep Boltzmann Machine (DBM) pre-training to the DNN training procedure in LVCSR (the energy function that distinguishes the DBM is recalled after this list). On the TIMIT phone recognition task, the DBM-DNN achieves a 3.8% relative phone error rate (PER) reduction on the core test set compared with the Deep Belief Network based DNN (DBN-DNN).

2. To train DNNs on multiple GPUs within a single server, we propose to apply a one-pass learning algorithm based on averaged stochastic gradient descent (ASGD) to the DNN training procedure (a sketch of the iterate averaging follows this list). By further combining it with an asynchronous parallel mode, the one-pass learning algorithm runs on multiple GPUs of a single server. The asynchronous ASGD algorithm accelerates DNN training by a factor of 5.3 compared with the asynchronous stochastic gradient descent algorithm.

3. For the distributed DNN training problem, we propose a novel GPU-cluster training scheme based on the Stochastic Hessian-Free (SHF) algorithm (its conjugate-gradient inner loop is sketched below), which effectively avoids the high inter-machine communication bandwidth demanded by asynchronous parallel algorithms. In particular, the SHF algorithm speeds up DNN training on the GPU cluster markedly compared with the asynchronous parallel algorithm.

4. For the mixed-bandwidth training problem, we propose a DNN adaptation approach for training DNNs on mixed-bandwidth speech data, which achieves better performance than mixed-bandwidth training methods based on feature zero-padding. Moreover, by exploiting the singular value decomposition (SVD) algorithm (a minimal factorization sketch follows), we manage to train a DNN on a GPU cluster of 24 GPUs with 7,500 hours of mixed-bandwidth speech data within seven days.
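For context on item 1: unlike a DBN, a two-hidden-layer DBM defines a single joint energy over the visible vector and both hidden layers, so inference for the first hidden layer combines bottom-up and top-down input. In the standard formulation (generic notation, which may differ from the thesis's),

\[
E(\mathbf{v},\mathbf{h}^{(1)},\mathbf{h}^{(2)}) = -\mathbf{v}^{\top}\mathbf{W}^{(1)}\mathbf{h}^{(1)} - \mathbf{h}^{(1)\top}\mathbf{W}^{(2)}\mathbf{h}^{(2)} - \mathbf{b}^{\top}\mathbf{v} - \mathbf{c}^{(1)\top}\mathbf{h}^{(1)} - \mathbf{c}^{(2)\top}\mathbf{h}^{(2)},
\]

and pre-training fits this energy (typically with mean-field inference and persistent sampling) before the weights initialize the DNN.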
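A minimal sketch of the iterate averaging behind ASGD in item 2 (the function names and toy objective are illustrative assumptions, not the thesis's implementation): alongside the ordinary SGD iterate, a running average of the iterates is maintained during a single pass over the data and returned as the final model.

```python
import numpy as np

def asgd(grad_fn, w0, data, lr=0.01, avg_start=0):
    """One-pass averaged SGD: sweep the data once, keeping a running
    average of the SGD iterates; the average is returned as the model.
    grad_fn(w, x) must return the stochastic gradient at sample x."""
    w = w0.copy()          # ordinary SGD iterate
    w_avg = w0.copy()      # running average of iterates
    n_avg = 0
    for t, x in enumerate(data):
        w -= lr * grad_fn(w, x)           # plain SGD step
        if t >= avg_start:                # optionally skip a burn-in period
            n_avg += 1
            w_avg += (w - w_avg) / n_avg  # incremental mean update
    return w_avg

# Toy usage: one-pass least-squares fit on synthetic data (illustrative only).
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
samples = [(a, a @ true_w + 0.1 * rng.standard_normal())
           for a in rng.standard_normal((1000, 2))]
grad = lambda w, s: 2 * (s[0] @ w - s[1]) * s[0]
print(asgd(grad, np.zeros(2), samples, lr=0.05))  # approaches [2, -1]
```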
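Hessian-free methods such as the SHF algorithm in item 3 never form the curvature matrix explicitly: each update solves G d = -g by conjugate gradient (CG), where only curvature-vector products Gv are needed, and the stochastic variant estimates g and Gv on mini-batches. Below is a hedged Python sketch of the generic CG inner loop; the explicit SPD matrix stands in for a mini-batch Gauss-Newton product and is an assumption for illustration, not the thesis's code.

```python
import numpy as np

def conjugate_gradient(Gv, g, max_iters=50, tol=1e-6):
    """Solve G d = -g by CG, where Gv(v) returns the curvature-vector
    product G v (e.g. a mini-batch Gauss-Newton product in SHF)."""
    d = np.zeros_like(g)
    r = -g - Gv(d)          # residual of G d = -g at the initial point
    p = r.copy()
    rs = r @ r
    for _ in range(max_iters):
        Gp = Gv(p)
        alpha = rs / (p @ Gp)
        d += alpha * p      # move along the conjugate direction
        r -= alpha * Gp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

# Toy check on an explicit SPD matrix standing in for the
# mini-batch Gauss-Newton matrix (illustrative only).
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
G = A @ A.T + 5 * np.eye(5)        # SPD curvature surrogate
g = rng.standard_normal(5)         # mini-batch gradient surrogate
d = conjugate_gradient(lambda v: G @ v, g)
print(np.allclose(G @ d, -g, atol=1e-5))  # the update direction solves G d = -g
```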
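The SVD restructuring mentioned in item 4 replaces a large weight matrix W (m x n) with two smaller layers of shapes m x k and k x n, cutting parameters and per-step computation when k is much smaller than min(m, n). A minimal NumPy sketch follows; splitting the singular values evenly between the two factors is one common choice, not necessarily the thesis's.

```python
import numpy as np

def svd_factorize(W, k):
    """Approximate W (m x n) by A @ B with A (m x k) and B (k x n),
    keeping the k largest singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * np.sqrt(s[:k])            # m x k factor
    B = np.sqrt(s[:k])[:, None] * Vt[:k]     # k x n factor
    return A, B

# Usage: a rank-256 factorization of a 2048 x 2048 layer.
W = np.random.default_rng(2).standard_normal((2048, 2048))
A, B = svd_factorize(W, 256)
print(W.size, A.size + B.size)  # 4194304 vs 1048576: 4x fewer parameters
```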