CASIA OpenIR > Graduates > Master's Theses







Other Abstract

Named entity recognition (NER) aims to identify entity spans and their types in unstructured text and is a fundamental task in natural language processing. NER for Chinese social media matters for both practical applications and research. Compared with formal text, social media text changes quickly, and novel words and expressions pose a challenge for NER. However, existing research rarely takes this dynamic property into account, so both an evaluation of its effects and corresponding solutions are lacking. This thesis therefore focuses on the dynamic property of social media data, and its research contents fall into the following two aspects:

(1) A Chinese social media NER method for temporal drift

One important manifestation of the dynamic property is that data is updated over time. However, existing corpora cannot support research on time-related problems because they lack temporal information. To study the effects of temporal order, this thesis constructs a new Chinese social media NER corpus that retains day-level timestamps and then conducts experiments on the effects of temporal drift. Results show that temporal drift leads to a large number of unseen entities at inference time, and that existing methods do not recognize unseen entities well. This thesis therefore proposes a data augmentation method called “temporal sampling substitution” to improve the ability of models to learn context features: it sets sampling probabilities according to temporal order and performs entity substitution on the training set to produce new samples. Compared with the baseline method, the proposed method improves performance effectively and is not sensitive to the choice of model or language.
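The abstract does not specify how the sampling probability depends on temporal order, so the following is only a minimal sketch of the general idea, not the thesis's implementation. It assumes each training sample carries a day stamp and non-overlapping entity spans, and it assumes (hypothetically) that more recent entities receive linearly larger sampling weights. The names `build_pool`, `temporal_weights`, and `substitute` are illustrative.

```python
import random
from datetime import date

# A sample is (day, text, entities); each entity is (start, end, type), end exclusive.
samples = [
    (date(2021, 1, 1), "小明去了北京", [(0, 2, "PER"), (4, 6, "LOC")]),
    (date(2021, 1, 2), "小红喜欢上海", [(0, 2, "PER"), (4, 6, "LOC")]),
    (date(2021, 1, 3), "张小华在哈尔滨", [(0, 3, "PER"), (4, 7, "LOC")]),
]

def build_pool(samples):
    """Collect (surface form, day) pairs for every entity type in the training set."""
    pool = {}
    for day, text, entities in samples:
        for start, end, etype in entities:
            pool.setdefault(etype, []).append((text[start:end], day))
    return pool

def temporal_weights(days):
    """Assumed scheme: weight equals the recency rank of the day (oldest day = 1)."""
    rank = {d: i + 1 for i, d in enumerate(sorted(set(days)))}
    return [rank[d] for d in days]

def substitute(sample, pool, rng):
    """Replace each entity with a same-type entity drawn by temporal weight."""
    day, text, entities = sample
    pieces, new_entities, cursor, out_len = [], [], 0, 0
    for start, end, etype in sorted(entities):
        candidates = pool[etype]
        weights = temporal_weights([d for _, d in candidates])
        surface, _ = rng.choices(candidates, weights=weights, k=1)[0]
        # Copy the untouched context, then the substituted entity.
        pieces.append(text[cursor:start])
        out_len += start - cursor
        new_entities.append((out_len, out_len + len(surface), etype))
        pieces.append(surface)
        out_len += len(surface)
        cursor = end
    pieces.append(text[cursor:])
    return day, "".join(pieces), new_entities

pool = build_pool(samples)
print(substitute(samples[0], pool, random.Random(0)))
```

Because only the entity mentions change while the surrounding characters are kept, a model trained on such samples sees the same context paired with different entities, which is one plausible reading of "learning context features" in the paragraph above.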

(2) A method based on contrastive learning for Chinese social media NER

Fusing character features with word features drawn from an external gazetteer has proved effective for improving Chinese NER performance. However, this thesis finds that this kind of method brings little improvement under temporal drift. This thesis attributes this phenomenon to a “static-dynamic” mismatch between the gazetteer and social media text. The mismatch produces many not-full-match words in the input word sequence, which cause boundary conflicts and semantic conflicts with gold entities. To solve this problem, this thesis proposes a contrastive-learning-based method for Chinese social media NER, which uses fine-grained boundary supervision signals in not-full-match words to alleviate boundary conflicts and internal global supervision signals to alleviate semantic conflicts. Experimental results show that the proposed method achieves state-of-the-art results on multiple datasets, with especially large improvements on social media text.
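To make the notion of not-full-match words concrete, the sketch below matches gazetteer words against a sentence and labels each match relative to the gold entity spans. The abstract gives no formal definition, so the rule used here is an assumption: a matched word counts as "not-full" when it overlaps a gold entity without coinciding with it exactly. The function names, the toy gazetteer, and the example sentence are illustrative and not taken from the thesis; the contrastive objective itself is not shown.

```python
def match_words(text, gazetteer, max_len=4):
    """Return every (start, end, word) whose surface form appears in the gazetteer."""
    matches = []
    for start in range(len(text)):
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in gazetteer:
                matches.append((start, end, word))
    return matches

def classify(match, gold_spans):
    """Label a matched word by how it relates to the gold entity spans."""
    start, end, _ = match
    if (start, end) in gold_spans:
        return "full"        # boundaries coincide with a gold entity
    if any(start < g_end and g_start < end for g_start, g_end in gold_spans):
        return "not-full"    # overlaps a gold entity but the boundaries differ
    return "outside"         # does not touch any gold entity

text = "南京市长江大桥"
gold_spans = {(0, 3), (3, 7)}  # 南京市, 长江大桥
gazetteer = {"南京", "南京市", "市长", "长江", "大桥", "长江大桥"}

labels = [(word, classify((s, e, word), gold_spans))
          for s, e, word in match_words(text, gazetteer)]
print(labels)
```

In this toy case the word 市长 straddles the boundary between the two gold entities, which is the kind of boundary conflict the paragraph above describes.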

The two methods proposed above can work independently because they operate on the data dimension and the model dimension, respectively. Experimental results on social media data show that performance improves further when the two methods are combined.

Keyword: Named Entity Recognition; Social Media Analysis; Information Extraction; Temporal Drift; Contrastive Learning
Document Type: Thesis
Recommended Citation
GB/T 7714
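江洲钰. 面向中文社交媒体的命名实体识别方法研究 [Research on Named Entity Recognition Methods for Chinese Social Media][D]. Institute of Automation, Chinese Academy of Sciences, 2022.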
Files in This Item:
File Name/Size | DocType | Version | Access | License
江洲钰_毕业论文_最终版.pdf (2118KB) | Thesis | - | Restricted access | CC BY-NC-SA

Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.