Research on Named Entity Recognition Methods for Chinese Social Media (面向中文社交媒体的命名实体识别方法研究)
江洲钰
2022-08-19
Pages: 58
Subtype: Master's
Abstract

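Named entity recognition (NER) aims to identify entity spans and their types in unstructured text and is a fundamental natural language processing task. NER for Chinese social media has both practical and research value. Compared with formal text, social media data turns over quickly, and new words and new forms of expression pose challenges for NER. Existing work, however, rarely considers this dynamic nature, and lacks both an evaluation of its impact and targeted solutions. This thesis takes the dynamic nature of social media data as its starting point, and its main contributions can be summarized in the following two points: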

(1) A Chinese social media NER method oriented toward temporal drift

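Data turnover over time is an important manifestation of the dynamic nature of social media, but current evaluation datasets carry no temporal information and so cannot support research on time-related questions. To verify the effect of time, this thesis builds a Chinese social media NER dataset with day-granularity timestamps and analyzes the temporal drift problem in depth. The results show that a major characteristic of temporal drift is the large number of unseen entities in newly arriving data, and that current mainstream NER methods recognize unseen entities poorly. The thesis then proposes a data augmentation method called “temporal sampling substitution”: within the training samples, sampling probabilities are set according to temporal distance and entities are substituted to obtain new samples, which improves the model's ability to capture the contextual semantic features of entities. Experiments show that, compared with the baseline methods, this augmentation method yields effective performance gains and is insensitive to the specific model and the specific language.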
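The unseen-entity observation above can be made concrete with a small measurement. The following is a minimal sketch, not the thesis's evaluation code: the function name `unseen_entity_rate` and the representation of an entity mention as a `(text, type)` tuple are assumptions chosen for illustration.

```python
def unseen_entity_rate(train_entities, test_entities):
    """Fraction of test entity mentions that never occur in the training data.

    Both arguments are lists of (text, type) tuples (an assumed layout).
    """
    if not test_entities:
        return 0.0
    seen = set(train_entities)
    unseen = sum(1 for entity in test_entities if entity not in seen)
    return unseen / len(test_entities)


# Toy data: the training split is older, the test split is newer.
train = [("张三", "PER"), ("北京", "LOC")]
test = [("北京", "LOC"), ("李四", "PER"), ("王五", "PER"), ("张三", "PER")]
rate = unseen_entity_rate(train, test)
```

With a timestamped corpus, one would split by date and track how this rate changes as the gap between the training and test periods grows.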

(2) A contrastive-learning-based NER method for Chinese social media

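Fusing character and word features with the help of an external lexicon is an effective way to improve Chinese NER performance, yet in the temporal drift evaluation above such methods bring little benefit. This thesis argues that the cause is a “static-dynamic” mismatch between the static external lexicon and the dynamic social media data. The mismatch leaves the input word sequence with many words that do not fully match the gold entities, causing boundary conflicts and semantic conflicts that ultimately hurt model performance. To address this, the thesis proposes a contrastive-learning-based NER method for Chinese social media that makes full use of the fine-grained boundary supervision carried by not-full-match words to alleviate boundary conflicts, and uses global supervision from within the data to alleviate semantic conflicts. Experimental results show that the proposed method reaches the state of the art on multiple datasets, with especially pronounced gains on social media data.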
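To illustrate what a not-full-match word looks like, the sketch below matches a character sequence against a lexicon and labels each matched word by how it relates to the gold entity spans. It is only an illustration under assumptions: the abstract does not define the matching procedure, so the brute-force matcher, the labels `full`, `partial`, and `outside`, the `max_len` parameter, and the toy lexicon are all hypothetical.

```python
def match_lexicon(chars, lexicon, max_len=4):
    """Return all (start, end) spans, end exclusive, whose text is in the lexicon.

    Only words of 2 to max_len characters are considered.
    """
    spans = []
    n = len(chars)
    for i in range(n):
        for j in range(i + 2, min(n, i + max_len) + 1):
            if "".join(chars[i:j]) in lexicon:
                spans.append((i, j))
    return spans


def classify_match(span, gold_spans):
    """Label a matched word relative to the gold entity spans."""
    start, end = span
    if (start, end) in gold_spans:
        return "full"      # identical to a gold entity
    for gold_start, gold_end in gold_spans:
        if start < gold_end and gold_start < end:
            return "partial"   # overlaps a gold entity without matching it
    return "outside"       # touches no gold entity


chars = list("南京市长江大桥")
gold = {(0, 3), (3, 7)}    # 南京市, 长江大桥
lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
labels = {span: classify_match(span, gold) for span in match_lexicon(chars, lexicon)}
```

Here the lexicon word 市长 spans characters 2 to 4 and straddles the boundary between the two gold entities, which is the kind of boundary conflict the paragraph describes.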

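The two methods improve the data dimension and the model dimension respectively, and act independently. Experiments show that combining them yields better results on social media data.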

Other Abstract

Named entity recognition (NER) aims to recognize entity spans and entity types in unstructured text and is a fundamental task in natural language processing. NER for Chinese social media is valuable for both practical applications and research. Compared with formal text, social media text changes quickly, and novel words and expressions pose challenges for NER. However, existing research rarely takes this dynamic property into account, so both an evaluation of its effects and corresponding solutions are lacking. This thesis therefore focuses on the dynamic property of social media data, and its contents can be summarized in the following two aspects:

(1) A Chinese social media NER method for temporal drift

One important manifestation of the dynamic property is that data is updated over time. However, existing corpora cannot support research on time-related problems because they lack temporal information. To examine the effects of temporal order, this thesis constructs a novel Chinese social media NER corpus that retains daily timestamps, and then conducts experiments to study the effects of temporal drift. The results show that temporal drift leads to a large number of unseen entities at inference time, and that existing methods do not recognize unseen entities well. The thesis therefore proposes a data augmentation method called “temporal sampling substitution”, which sets sampling probabilities according to temporal distance and performs entity substitution on the training set to obtain new samples, improving the model's ability to learn context features. Compared with the baseline methods, the proposed method improves performance effectively and is not sensitive to the choice of model or language.
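The idea of temporal sampling substitution can be sketched as follows. This is a hedged illustration, not the thesis's implementation: the abstract only says that sampling probabilities depend on temporal distance, so the exponential decay, the `decay` parameter, the choice to favor entities that are closer in time, the restriction to same-type replacements, and the sample layout are all assumptions. Tokens are single characters, as is common for Chinese NER, and entities are assumed not to overlap.

```python
import math
import random
from datetime import date


def temporal_weights(anchor_day, candidate_days, decay=0.1):
    """Weight each candidate by exp(-decay * day gap), normalised to sum to 1."""
    raw = [math.exp(-decay * abs((anchor_day - d).days)) for d in candidate_days]
    total = sum(raw)
    return [w / total for w in raw]


def substitute_entity(sample, pool, rng, decay=0.1):
    """Return a new sample with one entity replaced by a same-type pool entity.

    sample: {"day": date, "tokens": [char, ...], "entities": [(start, end, type)]}
            with end exclusive (assumed layout).
    pool:   list of (text, type, day) tuples collected from the training set.
    """
    if not sample["entities"]:
        return sample
    idx = rng.randrange(len(sample["entities"]))
    start, end, etype = sample["entities"][idx]
    candidates = [p for p in pool if p[1] == etype]
    if not candidates:
        return sample
    weights = temporal_weights(sample["day"], [c[2] for c in candidates], decay)
    new_text = rng.choices(candidates, weights=weights, k=1)[0][0]
    new_tokens = sample["tokens"][:start] + list(new_text) + sample["tokens"][end:]
    # Shift the spans of entities that come after the replaced one.
    shift = len(new_text) - (end - start)
    new_entities = []
    for i, (s, e, t) in enumerate(sample["entities"]):
        if i == idx:
            new_entities.append((s, s + len(new_text), t))
        elif s >= end:
            new_entities.append((s + shift, e + shift, t))
        else:
            new_entities.append((s, e, t))
    return {"day": sample["day"], "tokens": new_tokens, "entities": new_entities}


sample = {
    "day": date(2021, 1, 10),
    "tokens": list("我在北京见到张三"),
    "entities": [(2, 4, "LOC"), (6, 8, "PER")],
}
pool = [("上海市", "LOC", date(2021, 1, 2)), ("李四", "PER", date(2021, 1, 1))]
augmented = substitute_entity(sample, pool, random.Random(0))
```

Because only the entity changes while its context stays fixed, the model sees the same context paired with different entity strings, which is one plausible reading of how the method encourages reliance on context features.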

(2) A method based on contrastive learning for Chinese social media NER

Fusing character features and word features with the help of an external gazetteer has proved effective for improving Chinese NER performance. However, this thesis finds that such methods bring little improvement in the temporal drift setting. The thesis attributes this phenomenon to a “static-dynamic” mismatch between the gazetteer and social media text. The mismatch leads to many not-full-match words in the input word sequence, which cause boundary conflicts and semantic conflicts with the gold entities. To solve this problem, the thesis proposes a contrastive-learning-based method for Chinese social media NER, which uses the fine-grained boundary supervision signals in not-full-match words to alleviate boundary conflicts, and global supervision signals from within the data to alleviate semantic conflicts. Experimental results show that the proposed method achieves state-of-the-art results on multiple datasets, with especially large improvements on social media text.
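The abstract does not give the contrastive objective, so the following sketch shows only a generic InfoNCE-style loss, one common form of contrastive learning, to make the idea concrete. The function names, the cosine similarity, the `temperature` value, and the toy vectors are assumptions and should not be read as the thesis's formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: -log softmax score of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_denominator - logits[0]


anchor = [1.0, 0.0]
positive = [1.0, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_aligned = info_nce(anchor, positive, negatives)
loss_misaligned = info_nce(anchor, negatives[0], [positive, negatives[1]])
```

The loss is small when the anchor is closer to its positive than to the negatives and grows when a dissimilar vector is treated as the positive. In an NER setting, the anchor and positive might be representations that should agree (for example, a word and the gold entity it belongs to), though how the thesis builds such pairs is not stated in the abstract.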

The two methods work independently because they address the data dimension and the model dimension respectively. Experimental results on social media data show that combining the two methods improves performance further.

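Keyword: named entity recognition; social media analysis; information extraction; temporal drift; contrastive learning
Language: Chinese
Document Type: Thesis
Identifier: http://ir.ia.ac.cn/handle/173211/49698
Collection: Graduates_Master's Theses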
Recommended Citation
GB/T 7714
江洲钰. 面向中文社交媒体的命名实体识别方法研究[D]. 中国科学院自动化研究所. 中国科学院自动化研究所,2022.
Files in This Item:
File Name/Size: 江洲钰_毕业论文_最终版.pdf (2118KB)
DocType: Thesis
Access: Restricted
License: CC BY-NC-SA
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.