A language model (LM) is an important component of a speech recognition system. Language model adaptation techniques use a large amount of source-domain data and a limited amount of target-domain data to improve the performance of language models in the target domain. Even though text datasets are easy to obtain, there is no public Chinese text dataset for language model adaptation tasks. This paper presents a language model adaptation dataset consisting of news data from four different domains: sports, stock, fashion, and finance. The discrepancy between the data domains is evaluated. Model-combination-based adaptation of n-gram language models is evaluated on the dataset, and three different fine-tuning adaptation methods for recurrent neural network language models (RNNLMs) are evaluated as well. WER results on AIShell speech data with language models trained on this dataset are also provided. Lattice rescoring with the adapted RNNLM achieves an absolute WER reduction of 4.74%.