CASIA OpenIR
A Public Chinese Dataset for Language Model Adaptation
Bai, Ye1,2; Yi, Jiangyan1; Tao, Jianhua1,2,3; Wen, Zhengqi1; Fan, Cunhang1,2
Source Publication: JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY
ISSN: 1939-8018
2019-10-16
Pages: 13
Corresponding Author: Yi, Jiangyan (jiangyan.yi@nlpr.ia.ac.cn)
Abstract: A language model (LM) is an important part of a speech recognition system. The performance of an LM degrades when the domains of the training data and the test data differ. Language model adaptation aims to compensate for this mismatch. However, there is no public dataset in Chinese for evaluating language model adaptation. In this paper, we present a public Chinese dataset called CLMAD for language model adaptation. The dataset consists of four domains: sport, stock, fashion, and finance. The differences among these four domains are evaluated. We present baselines for two commonly used adaptation techniques: interpolation for n-gram models, and fine-tuning for recurrent neural network language models (RNNLMs). For n-gram interpolation, the adapted model improves when the source domain and the target domain are relatively similar, but interpolating LMs of very different domains yields no improvement. For RNNLMs, fine-tuning the whole network achieves a larger improvement than fine-tuning only the softmax layer or only the embedding layer. When the domain difference is large, the improvement of the adapted RNNLM is significant. We also provide speech recognition results on AISHELL-1 with the LMs trained on CLMAD. CLMAD can be freely downloaded at http://www.openslr.org/55/.
Keyword: Chinese dataset; Language model adaptation; Speech recognition; N-gram; RNNLM
DOI: 10.1007/s11265-019-01482-5
Indexed By: SCI
Language: English
Funding Project: National Key R&D Program of China [2017YFB1002802]
Funding Organization: National Key R&D Program of China
WOS Research Area: Computer Science; Engineering
WOS Subject: Computer Science, Information Systems; Engineering, Electrical & Electronic
WOS ID: WOS:000490530600001
Publisher: SPRINGER
Document Type: Journal article
Identifier: http://ir.ia.ac.cn/handle/173211/26604
Collection: Institute of Automation, Chinese Academy of Sciences
Affiliation: 1. Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
2. Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing, Peoples R China
3. Chinese Acad Sci, CAS Ctr Excellence Brain Sci & Intelligence Techn, Inst Automat, Beijing, Peoples R China
First Author Affiliation: Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
Corresponding Author Affiliation: Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing 100190, Peoples R China
Recommended Citation
GB/T 7714: Bai, Ye, Yi, Jiangyan, Tao, Jianhua, et al. A Public Chinese Dataset for Language Model Adaptation[J]. JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 2019: 13.
APA: Bai, Ye, Yi, Jiangyan, Tao, Jianhua, Wen, Zhengqi, & Fan, Cunhang. (2019). A Public Chinese Dataset for Language Model Adaptation. JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY, 13.
MLA: Bai, Ye, et al. "A Public Chinese Dataset for Language Model Adaptation". JOURNAL OF SIGNAL PROCESSING SYSTEMS FOR SIGNAL IMAGE AND VIDEO TECHNOLOGY (2019): 13.
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.